Sample records for accurate automatic speech

  1. The Suitability of Cloud-Based Speech Recognition Engines for Language Learning

    ERIC Educational Resources Information Center

    Daniels, Paul; Iwago, Koji

    2017-01-01

    As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with CALL software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines--Apple's…

  2. Automatic detection of Parkinson's disease in running speech spoken in three different languages.

    PubMed

    Orozco-Arroyave, J R; Hönig, F; Arias-Londoño, J D; Vargas-Bonilla, J F; Daqrouq, K; Skodda, S; Rusz, J; Nöth, E

    2016-01-01

    The aim of this study is the analysis of continuous speech signals of people with Parkinson's disease (PD) considering recordings in different languages (Spanish, German, and Czech). A method for the characterization of the speech signals, based on the automatic segmentation of utterances into voiced and unvoiced frames, is addressed here. The energy content of the unvoiced sounds is modeled using 12 Mel-frequency cepstral coefficients and 25 bands scaled according to the Bark scale. Four speech tasks comprising isolated words, rapid repetition of the syllables /pa/-/ta/-/ka/, sentences, and read texts are evaluated. The method proves to be more accurate than classical approaches in the automatic classification of speech of people with PD and healthy controls. The accuracies range from 85% to 99% depending on the language and the speech task. Cross-language experiments are also performed, confirming the robustness and generalization capability of the method, with accuracies ranging from 60% to 99%. This work represents a step forward in the development of computer-aided tools for the automatic assessment of dysarthric speech signals in multiple languages.
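    A minimal sketch of the kind of unvoiced-frame feature extraction summarized above, assuming a librosa-based pipeline with pYIN voicing decisions and 12 MFCCs per frame; this is illustrative only, not the authors' implementation, and the frame parameters are guesses.

```python
# Illustrative sketch (not the study's code): keep frames judged unvoiced by
# the pYIN pitch tracker and describe them with 12 MFCCs per frame.
import numpy as np
import librosa

def unvoiced_mfcc_features(wav_path, sr=16000, n_mfcc=12,
                           frame_length=2048, hop_length=512):
    y, sr = librosa.load(wav_path, sr=sr)

    # Frame-level voicing decision (True = voiced) from pYIN.
    _, voiced_flag, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr,
                                     frame_length=frame_length,
                                     hop_length=hop_length)

    # 12 MFCCs per frame, computed with the same framing.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_length, hop_length=hop_length)

    # Align lengths defensively, keep only unvoiced frames, and summarize them.
    n = min(mfcc.shape[1], voiced_flag.shape[0])
    unvoiced = mfcc[:, :n][:, ~voiced_flag[:n]]
    return np.concatenate([unvoiced.mean(axis=1), unvoiced.std(axis=1)])
```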

  3. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review

    NASA Astrophysics Data System (ADS)

    Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH

    2017-09-01

    This paper reviews the state of the art in automatic speech recognition (ASR) based approaches for speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from a speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR based solutions are increasingly being researched for speech and language therapy. ASR is a technology that converts human speech into transcript text by matching it against the system's library. This is particularly useful in speech rehabilitation therapies, as it provides accurate, real-time evaluation of speech input from an individual with a speech disorder. ASR based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback on their mistakes. However, the accuracy of ASR depends on many factors such as phoneme recognition, speech continuity, speaker and environmental differences, as well as our depth of knowledge of human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.

  4. "Rate My Therapist": Automated Detection of Empathy in Drug and Alcohol Counseling via Speech and Language Processing.

    PubMed

    Xiao, Bo; Imel, Zac E; Georgiou, Panayiotis G; Atkins, David C; Narayanan, Shrikanth S

    2015-01-01

    The technology for evaluating patient-provider interactions in psychotherapy, observational coding, has not changed in 70 years. It is labor-intensive, error-prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks, including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally derived empathy ratings was evaluated against human ratings for each provider. Computationally derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and an F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracy, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies.
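    To make the two evaluation metrics above concrete, a small hedged sketch: Pearson correlation for the continuous empathy scores and an F-score for the high/low split, computed on placeholder numbers rather than study data.

```python
# Minimal sketch of the two reported metrics; arrays are placeholders.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

human = np.array([3.2, 4.5, 2.1, 5.0, 3.8, 1.9])   # observer-rated empathy
model = np.array([3.0, 4.1, 2.6, 4.7, 3.5, 2.4])   # computationally derived

r, _ = pearsonr(human, model)

# "High" empathy defined here, for illustration, as above the human median.
cut = np.median(human)
f1 = f1_score(human > cut, model > cut)

print(f"Pearson r = {r:.2f}, F-score (high vs. low) = {f1:.2f}")
```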

  5. "Rate My Therapist": Automated Detection of Empathy in Drug and Alcohol Counseling via Speech and Language Processing

    PubMed Central

    Xiao, Bo; Imel, Zac E.; Georgiou, Panayiotis G.; Atkins, David C.; Narayanan, Shrikanth S.

    2015-01-01

    The technology for evaluating patient-provider interactions in psychotherapy, observational coding, has not changed in 70 years. It is labor-intensive, error-prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks, including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally derived empathy ratings was evaluated against human ratings for each provider. Computationally derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and an F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracy, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies. PMID:26630392

  6. Automatic concept extraction from spoken medical reports.

    PubMed

    Happe, André; Pouliquen, Bruno; Burgun, Anita; Cuggia, Marc; Le Beux, Pierre

    2003-07-01

    The objective of this project is to investigate methods whereby a combination of speech recognition and automated indexing substitutes for current transcription and indexing practices. We based our study on existing speech recognition software programs and on NOMINDEX, a tool that extracts MeSH concepts from medical text in natural language and that is mainly based on a French medical lexicon and on the UMLS. For each document, the process consists of three steps: (1) dictation and digital audio recording, (2) speech recognition, (3) automatic indexing. The evaluation consisted of a comparison between the set of concepts extracted by NOMINDEX after the speech recognition phase and the set of keywords manually extracted from the initial document. The method was evaluated on a set of 28 patient discharge summaries extracted from the MENELAS corpus in French, corresponding to in-patients admitted for coronarography. The overall precision was 73% and the overall recall was 90%. Indexing errors were mainly due to word sense ambiguity and abbreviations. A specific issue was the fact that the standard French translation of MeSH terms lacks diacritics. A preliminary evaluation of speech recognition tools showed that the rate of accurate recognition was higher than 98%. Only 3% of the indexing errors were generated by inadequate speech recognition. We discuss several areas to focus on to improve this prototype. However, the very low rate of indexing errors due to speech recognition errors highlights the potential benefits of combining speech recognition techniques and automatic indexing.

  7. The transition to increased automaticity during finger sequence learning in adult males who stutter.

    PubMed

    Smits-Bandstra, Sarah; De Nil, Luc; Rochon, Elizabeth

    2006-01-01

    The present study compared the automaticity levels of persons who stutter (PWS) and persons who do not stutter (PNS) on a practiced finger sequencing task under dual task conditions. Automaticity was defined as the amount of attention required for task performance. Twelve PWS and 12 control subjects practiced finger tapping sequences under single and then dual task conditions. Control subjects performed the sequencing task significantly faster and less variably under single versus dual task conditions while PWS' performance was consistently slow and variable (comparable to the dual task performance of control subjects) under both conditions. Control subjects were significantly more accurate on a colour recognition distracter task than PWS under dual task conditions. These results suggested that control subjects transitioned to quick, accurate and increasingly automatic performance on the sequencing task after practice, while PWS did not. Because most stuttering treatment programs for adults include practice and automatization of new motor speech skills, findings of this finger sequencing study and future studies of speech sequence learning may have important implications for how to maximize stuttering treatment effectiveness. As a result of this activity, the participant will be able to: (1) Define automaticity and explain the importance of dual task paradigms to investigate automaticity; (2) Relate the proposed relationship between motor learning and automaticity as stated by the authors; (3) Summarize the reviewed literature concerning the performance of PWS on dual tasks; and (4) Explain why the ability to transition to automaticity during motor learning may have important clinical implications for stuttering treatment effectiveness.

  8. An overview of neural function and feedback control in human communication.

    PubMed

    Hood, L J

    1998-01-01

    The speech and hearing mechanisms depend on accurate sensory information and intact feedback mechanisms to facilitate communication. This article provides a brief overview of some components of the nervous system important for human communication and some electrophysiological methods used to measure cortical function in humans. An overview of automatic control and feedback mechanisms in general and as they pertain to the speech motor system and control of the hearing periphery is also presented, along with a discussion of how the speech and auditory systems interact.

  9. Accurate visible speech synthesis based on concatenating variable length motion capture data.

    PubMed

    Ma, Jiyong; Cole, Ron; Pellom, Bryan; Ward, Wayne; Wise, Barbara

    2006-01-01

    We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is described. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective evaluation and objective evaluation are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
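    The unit-selection search described above can be illustrated with a generic Viterbi-style dynamic program that trades off a target cost against a concatenation (join) cost; the cost matrices below are stand-ins, not the paper's actual distance measures.

```python
# Generic unit-selection sketch: pick one candidate unit per target position so
# that target cost plus join (concatenation) cost is minimised, via the usual
# Viterbi-style dynamic programme. Costs here are stand-ins, not the paper's.
import numpy as np

def select_units(target_cost, join_cost):
    """target_cost: (T, K) cost of candidate k at position t.
       join_cost:   (T-1, K, K) cost of joining candidate j at t with k at t+1.
       Returns the index of the chosen candidate at each position."""
    T, K = target_cost.shape
    best = target_cost[0].copy()          # best total cost ending in each candidate
    back = np.zeros((T, K), dtype=int)    # backpointers

    for t in range(1, T):
        # total[j, k] = best path ending in j at t-1, then joining to k at t
        total = best[:, None] + join_cost[t - 1] + target_cost[t][None, :]
        back[t] = np.argmin(total, axis=0)
        best = np.min(total, axis=0)

    # Trace back the optimal sequence of units.
    path = [int(np.argmin(best))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```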

  10. Specific acoustic models for spontaneous and dictated style in Indonesian speech recognition

    NASA Astrophysics Data System (ADS)

    Vista, C. B.; Satriawan, C. H.; Lestari, D. P.; Widyantoro, D. H.

    2018-03-01

    The performance of an automatic speech recognition system is affected by differences in speech style between the data the model is originally trained upon and incoming speech to be recognized. In this paper, the usage of GMM-HMM acoustic models for specific speech styles is investigated. We develop two systems for the experiments; the first employs a speech style classifier to predict the speech style of incoming speech, either spontaneous or dictated, then decodes this speech using an acoustic model specifically trained for that speech style. The second system uses both acoustic models to recognize incoming speech and decides upon a final result by calculating a confidence score of decoding. Results show that training specific acoustic models for spontaneous and dictated speech styles confers a slight recognition advantage as compared to a baseline model trained on a mixture of spontaneous and dictated training data. In addition, the speech style classifier approach of the first system produced slightly more accurate results than the confidence scoring employed in the second system.
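    A minimal sketch of the second system's decision rule, assuming two hypothetical style-specific decoders that each return a transcript together with a decoding confidence; the callables are placeholders, not a real toolkit API.

```python
# Hedged sketch of confidence-based selection between two style-specific
# acoustic models. decode_spontaneous / decode_dictated are hypothetical
# callables returning (transcript, confidence); any recogniser that exposes
# a decoding confidence could be plugged in.
def recognize(utterance, decode_spontaneous, decode_dictated):
    hyp_spon, conf_spon = decode_spontaneous(utterance)
    hyp_dict, conf_dict = decode_dictated(utterance)
    return hyp_spon if conf_spon >= conf_dict else hyp_dict
```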

  11. Difficulties in Automatic Speech Recognition of Dysarthric Speakers and Implications for Speech-Based Applications Used by the Elderly: A Literature Review

    ERIC Educational Resources Information Center

    Young, Victoria; Mihailidis, Alex

    2010-01-01

    Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…

  12. Use of Speech Analyses within a Mobile Application for the Assessment of Cognitive Impairment in Elderly People.

    PubMed

    Konig, Alexandra; Satt, Aharon; Sorin, Alex; Hoory, Ran; Derreumaux, Alexandre; David, Renaud; Robert, Phillippe H

    2018-01-01

    Various types of dementia and Mild Cognitive Impairment (MCI) are manifested as irregularities in human speech and language, which have proven to be strong predictors for the disease presence and progression. Therefore, automatic speech analytics provided by a mobile application may be a useful tool for providing additional indicators for assessment and detection of early stage dementia and MCI. A total of 165 participants (subjects with subjective cognitive impairment (SCI), MCI patients, Alzheimer's disease (AD) and mixed dementia (MD) patients) were recorded with a mobile application while performing several short vocal cognitive tasks during a regular consultation. These tasks included verbal fluency, picture description, counting down and a free speech task. The voice recordings were processed in two steps: in the first step, vocal markers were extracted using speech signal processing techniques; in the second, the vocal markers were tested to assess their 'power' to distinguish between SCI, MCI, AD and MD. The second step included training automatic classifiers for detecting MCI and AD, based on machine learning methods, and testing the detection accuracy. The fluency and free speech tasks obtained the highest accuracy rates for classifying AD vs. MD vs. MCI vs. SCI. Using the data, we demonstrated classification accuracy as follows: SCI vs. AD = 92% accuracy; SCI vs. MD = 92% accuracy; SCI vs. MCI = 86% accuracy and MCI vs. AD = 86%. Our results indicate the potential value of vocal analytics and the use of a mobile application for accurate automatic differentiation between SCI, MCI and AD. This tool can provide the clinician with meaningful information for assessment and monitoring of people with MCI and AD based on a non-invasive, simple and low-cost method. Copyright © Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  13. MARTI: man-machine animation real-time interface

    NASA Astrophysics Data System (ADS)

    Jones, Christian M.; Dlay, Satnam S.

    1997-05-01

    The research introduces MARTI (man-machine animation real-time interface) for the realization of natural human-machine interfacing. The system uses simple vocal sound-tracks of human speakers to provide lip synchronization of computer graphical facial models. We present novel research in a number of engineering disciplines, which include speech recognition, facial modeling, and computer animation. This interdisciplinary research utilizes the latest hybrid connectionist/hidden Markov model speech recognition system to provide very accurate phone recognition and timing for speaker independent continuous speech, and expands on knowledge from the animation industry in the development of accurate facial models and automated animation. The research has many real-world applications, which include the provision of a highly accurate and 'natural' man-machine interface to assist user interactions with computer systems and communication with one another using human idiosyncrasies; a complete special effects and animation toolbox providing automatic lip synchronization without the normal constraints of head-sets, joysticks, and skilled animators; compression of video data to well below standard telecommunication channel bandwidth for video communications and multi-media systems; assisting speech training and aids for the handicapped; and facilitating player interaction for 'video gaming' and 'virtual worlds.' MARTI has introduced a new level of realism to man-machine interfacing and special effect animation which has been previously unseen.

  14. Speech systems research at Texas Instruments

    NASA Technical Reports Server (NTRS)

    Doddington, George R.

    1977-01-01

    An assessment of automatic speech processing technology is presented. Fundamental problems in the development and the deployment of automatic speech processing systems are defined and a technology forecast for speech systems is presented.

  15. Using Automatic Speech Recognition to Dictate Mathematical Expressions: The Development of the "TalkMaths" Application at Kingston University

    ERIC Educational Resources Information Center

    Wigmore, Angela; Hunter, Gordon; Pflugel, Eckhard; Denholm-Price, James; Binelli, Vincent

    2009-01-01

    Speech technology--especially automatic speech recognition--has now advanced to a level where it can be of great benefit both to able-bodied people and those with various disabilities. In this paper we describe an application "TalkMaths" which, using the output from a commonly-used conventional automatic speech recognition system,…

  16. Text as a Supplement to Speech in Young and Older Adults

    PubMed Central

    Krull, Vidya; Humes, Larry E.

    2015-01-01

    Objective The purpose of this experiment was to quantify the contribution of visual text to auditory speech recognition in background noise. Specifically, we tested the hypothesis that partially accurate visual text from an automatic speech recognizer could be used successfully to supplement speech understanding in difficult listening conditions in older adults, with normal or impaired hearing. Our working hypotheses were based on what is known regarding audiovisual speech perception in the elderly from speechreading literature. We hypothesized that: 1) combining auditory and visual text information will result in improved recognition accuracy compared to auditory or visual text information alone; 2) benefit from supplementing speech with visual text (auditory and visual enhancement) in young adults will be greater than that in older adults; and 3) individual differences in performance on perceptual measures would be associated with cognitive abilities. Design Fifteen young adults with normal hearing, fifteen older adults with normal hearing, and fifteen older adults with hearing loss participated in this study. All participants completed sentence recognition tasks in auditory-only, text-only, and combined auditory-text conditions. The auditory sentence stimuli were spectrally shaped to restore audibility for the older participants with impaired hearing. All participants also completed various cognitive measures, including measures of working memory, processing speed, verbal comprehension, perceptual and cognitive speed, processing efficiency, inhibition, and the ability to form wholes from parts. Group effects were examined for each of the perceptual and cognitive measures. Audiovisual benefit was calculated relative to performance on auditory-only and visual-text-only conditions. Finally, the relationships between perceptual measures and other independent measures were examined using principal-component factor analyses, followed by regression analyses. Results Both young and older adults performed similarly on nine out of ten perceptual measures (auditory, visual, and combined measures). Combining degraded speech with partially correct text from an automatic speech recognizer improved the understanding of speech in both young and older adults, relative to both auditory- and text-only performance. In all subjects, cognition emerged as a key predictor for a general speech-text integration ability. Conclusions These results suggest that neither age nor hearing loss affected the ability of subjects to benefit from text when used to support speech, after ensuring audibility through spectral shaping. These results also suggest that the benefit obtained by supplementing auditory input with partially accurate text is modulated by cognitive ability, specifically lexical and verbal skills. PMID:26458131

  17. Automatic feedback to promote safe walking and speech loudness control in persons with multiple disabilities: two single-case studies.

    PubMed

    Lancioni, Giulio E; Singh, Nirbhay N; O'Reilly, Mark F; Green, Vanessa A; Alberti, Gloria; Boccasini, Adele; Smaldone, Angela; Oliva, Doretta; Bosco, Andrea

    2014-08-01

    These studies assessed automatic feedback technologies for promoting safe travel and speech loudness control, respectively, in two men with multiple disabilities. The men were involved in two single-case studies. In Study I, the technology involved a microprocessor, two photocells, and a verbal feedback device. The man received verbal alerting/feedback when the photocells spotted an obstacle in front of him. In Study II, the technology involved a sound-detecting unit connected to a throat and an airborne microphone, and to a vibration device. Vibration occurred when the man's speech loudness exceeded a preset level. The man included in Study I succeeded in using the automatic feedback as a substitute for caregivers' alerting/feedback for safe travel. The man in Study II used the automatic feedback to successfully reduce his speech loudness. Automatic feedback can be highly effective in helping persons with multiple disabilities improve their travel and speech performance.

  18. Integrating hidden Markov model and PRAAT: a toolbox for robust automatic speech transcription

    NASA Astrophysics Data System (ADS)

    Kabir, A.; Barker, J.; Giurgiu, M.

    2010-09-01

    An automatic time-aligned phone transcription toolbox for English speech corpora has been developed. The toolbox is particularly useful for generating robust automatic transcriptions and can produce phone-level transcriptions using speaker-independent as well as speaker-dependent models without manual intervention. The system is based on the standard Hidden Markov Model (HMM) approach and was successfully tested on a large audiovisual speech corpus, namely the GRID corpus. One of the most powerful features of the toolbox is its increased flexibility in speech processing: the speech community can import the automatic transcription generated by the HMM Toolkit (HTK) into the popular transcription software PRAAT, and vice versa. The toolbox has been evaluated through statistical analysis on GRID data, which shows that the automatic transcription deviates from manual transcription by an average of 20 ms.
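    The HTK-to-PRAAT direction mentioned above can be pictured as a small converter that reads an HTK-style label file (start and end times in 100 ns units plus a label per line) and writes a single-tier Praat TextGrid. This is a sketch under those assumptions, not the toolbox's own code; file paths, the three-column format, and the tier name are illustrative.

```python
# Sketch: convert an HTK-style label file into a one-tier Praat TextGrid.
def htk_lab_to_textgrid(lab_path, textgrid_path, tier_name="phones"):
    with open(lab_path) as f:
        rows = [line.split() for line in f if line.strip()]
    # HTK times are integers in 100 ns units; convert to seconds.
    intervals = [(int(s) / 1e7, int(e) / 1e7, lab) for s, e, lab in rows]
    xmax = intervals[-1][1]

    lines = [
        'File type = "ooTextFile"', 'Object class = "TextGrid"', '',
        'xmin = 0', f'xmax = {xmax}', 'tiers? <exists>', 'size = 1',
        'item []:', '    item [1]:', '        class = "IntervalTier"',
        f'        name = "{tier_name}"', '        xmin = 0',
        f'        xmax = {xmax}', f'        intervals: size = {len(intervals)}',
    ]
    for i, (s, e, lab) in enumerate(intervals, 1):
        lines += [f'        intervals [{i}]:', f'            xmin = {s}',
                  f'            xmax = {e}', f'            text = "{lab}"']
    with open(textgrid_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```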

  19. Automatic detection of articulation disorders in children with cleft lip and palate.

    PubMed

    Maier, Andreas; Hönig, Florian; Bocklet, Tobias; Nöth, Elmar; Stelzle, Florian; Nkenke, Emeka; Schuster, Maria

    2009-11-01

    Speech of children with cleft lip and palate (CLP) is sometimes still disordered even after adequate surgical and nonsurgical therapies. Such speech shows complex articulation disorders, which are usually assessed perceptually, consuming time and manpower. Hence, there is a need for an easy-to-apply and reliable automatic method. To create a reference for an automatic system, speech data of 58 children with CLP were assessed perceptually by experienced speech therapists for characteristic phonetic disorders at the phoneme level. The first part of the article aims to detect such characteristics by a semiautomatic procedure, and the second to evaluate a fully automatic, and thus simple, procedure. The methods are based on a combination of speech processing algorithms. The semiautomatic method achieves moderate to good agreement (kappa approximately 0.6) for the detection of all phonetic disorders. On the speaker level, a significant correlation of 0.89 between the perceptual evaluation and the automatic system is obtained. The fully automatic system yields a correlation of 0.81 with the perceptual evaluation on the speaker level. This correlation is in the range of the inter-rater correlation of the listeners. The automatic speech evaluation is able to detect phonetic disorders at an expert level without any additional human postprocessing.

  20. Bridging automatic speech recognition and psycholinguistics: Extending Shortlist to an end-to-end model of human speech recognition (L)

    NASA Astrophysics Data System (ADS)

    Scharenborg, Odette; ten Bosch, Louis; Boves, Lou; Norris, Dennis

    2003-12-01

    This letter evaluates potential benefits of combining human speech recognition (HSR) and automatic speech recognition by building a joint model of an automatic phone recognizer (APR) and a computational model of HSR, viz., Shortlist [Norris, Cognition 52, 189-234 (1994)]. Experiments based on "real-life" speech highlight critical limitations posed by some of the simplifying assumptions made in models of human speech recognition. These limitations could be overcome by avoiding hard phone decisions at the output side of the APR, and by using a match between the input and the internal lexicon that flexibly copes with deviations from canonical phonemic representations.

  1. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system.

    PubMed

    Zekveld, Adriana A; Kramer, Sophia E; Kessens, Judith M; Vlaming, Marcel S M G; Houtgast, Tammo

    2009-04-01

    The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system, improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that speech comprehension in noise by young listeners with normal hearing improves when presenting partly incorrect, automatically generated subtitles. The current study focused on the effects of age, hearing loss, visual working memory capacity, and linguistic skills on the benefit obtained from automatically generated subtitles during listening to speech in noise. In order to investigate the effects of age and hearing loss, three groups of participants were included: 22 young persons with normal hearing (YNH, mean age = 21 years), 22 middle-aged adults with normal hearing (MA-NH, mean age = 55 years) and 30 middle-aged adults with hearing impairment (MA-HI, mean age = 57 years). The benefit from automatic subtitling was measured by Speech Reception Threshold (SRT) tests (Plomp & Mimpen, 1979). Both unimodal auditory and bimodal audiovisual SRT tests were performed. In the audiovisual tests, the subtitles were presented simultaneously with the speech, whereas in the auditory test, only speech was presented. The difference between the auditory and audiovisual SRT was defined as the audiovisual benefit. Participants additionally rated the listening effort. We examined the influences of ASR accuracy level and text delay on the audiovisual benefit and the listening effort using a repeated measures General Linear Model analysis. In a correlation analysis, we evaluated the relationships between age, auditory SRT, visual working memory capacity and the audiovisual benefit and listening effort. The automatically generated subtitles improved speech comprehension in noise for all ASR accuracies and delays covered by the current study. Higher ASR accuracy levels resulted in more benefit obtained from the subtitles. Speech comprehension improved even for relatively low ASR accuracy levels; for example, participants obtained about 2 dB SNR audiovisual benefit for ASR accuracies around 74%. Delaying the presentation of the text reduced the benefit and increased the listening effort. Participants with relatively low unimodal speech comprehension obtained greater benefit from the subtitles than participants with better unimodal speech comprehension. We observed an age-related decline in the working-memory capacity of the listeners with normal hearing. A higher age and a lower working memory capacity were associated with increased effort required to use the subtitles to improve speech comprehension. Participants were able to use partly incorrect and delayed subtitles to increase their comprehension of speech in noise, regardless of age and hearing loss. This supports the further development and evaluation of an assistive listening system that displays automatically recognized speech to aid speech comprehension by listeners with hearing impairment.
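    A worked example of the benefit measure used above (illustrative numbers, not the reported data): the audiovisual benefit is the auditory-only SRT minus the audiovisual SRT, so a lower (better) SRT with subtitles shows up as a positive benefit in dB.

```python
# Audiovisual benefit = SRT(auditory only) - SRT(auditory + subtitles).
# Values are illustrative, chosen to mirror the ~2 dB benefit reported for
# subtitles generated at roughly 74% ASR accuracy.
srt_auditory = -2.0      # dB SNR needed for 50% correct, speech only
srt_audiovisual = -4.0   # dB SNR needed for 50% correct, speech + subtitles
benefit = srt_auditory - srt_audiovisual
print(f"Audiovisual benefit: {benefit:.1f} dB SNR")  # 2.0 dB
```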

  2. Automatic Speech Recognition from Neural Signals: A Focused Review.

    PubMed

    Herff, Christian; Schultz, Tanja

    2016-01-01

    Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, the concern of bothering bystanders, or the inability to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable not to speak but simply to envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques applied to neural signals, we discuss the Brain-to-text system.

  3. Military applications of automatic speech recognition and future requirements

    NASA Technical Reports Server (NTRS)

    Beek, Bruno; Cupples, Edward J.

    1977-01-01

    An updated summary of the state-of-the-art of automatic speech recognition and its relevance to military applications is provided. A number of potential systems for military applications are under development. These include: (1) digital narrowband communication systems; (2) automatic speech verification; (3) on-line cartographic processing unit; (4) word recognition for militarized tactical data system; and (5) voice recognition and synthesis for aircraft cockpit.

  4. Automatic speech recognition technology development at ITT Defense Communications Division

    NASA Technical Reports Server (NTRS)

    White, George M.

    1977-01-01

    An assessment of the applications of automatic speech recognition to defense communication systems is presented. Future research efforts include investigations into the following areas: (1) dynamic programming; (2) recognition of speech degraded by noise; (3) speaker independent recognition; (4) large vocabulary recognition; (5) word spotting and continuous speech recognition; and (6) isolated word recognition.

  5. Automatic Method of Pause Measurement for Normal and Dysarthric Speech

    ERIC Educational Resources Information Center

    Rosen, Kristin; Murdoch, Bruce; Folker, Joanne; Vogel, Adam; Cahill, Louise; Delatycki, Martin; Corben, Louise

    2010-01-01

    This study proposes an automatic method for the detection of pauses and identification of pause types in conversational speech for the purpose of measuring the effects of Friedreich's Ataxia (FRDA) on speech. Speech samples of [approximately] 3 minutes were recorded from 13 speakers with FRDA and 18 healthy controls. Pauses were measured from the…

  6. Automatic Speech Recognition Predicts Speech Intelligibility and Comprehension for Listeners with Simulated Age-Related Hearing Loss

    ERIC Educational Resources Information Center

    Fontan, Lionel; Ferrané, Isabelle; Farinas, Jérôme; Pinquier, Julien; Tardieu, Julien; Magnen, Cynthia; Gaillard, Pascal; Aumont, Xavier; Füllgrabe, Christian

    2017-01-01

    Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist…

  7. Speech Processing and Recognition (SPaRe)

    DTIC Science & Technology

    2011-01-01

    results in the areas of automatic speech recognition (ASR), speech processing, machine translation (MT), natural language processing (NLP), and... Natural Language Processing (NLP), Information Retrieval (IR)... Figure 9, the IOC was only expected to provide document submission and search; automatic speech recognition (ASR) for English, Spanish, Arabic, and

  8. Effective Prediction of Errors by Non-native Speakers Using Decision Tree for Speech Recognition-Based CALL System

    NASA Astrophysics Data System (ADS)

    Wang, Hongcui; Kawahara, Tatsuya

    CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it still remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or the ASR grammar network. However, this approach quickly runs into a trade-off between error coverage and increased perplexity. To solve the problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, to achieve both better coverage of errors and smaller perplexity, resulting in significant improvement in ASR accuracy.
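    A hedged sketch of the decision-tree idea described above: from simple features of a target word and a candidate mispronunciation pattern, predict whether the pattern is likely enough to be worth adding to the ASR grammar network. The features, toy data, and threshold are assumptions for illustration only.

```python
# Hedged sketch: a decision tree gates which candidate error patterns are
# added to the grammar network, limiting perplexity growth. All features,
# data, and the 0.5 threshold are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

# Feature vector per (word, candidate error pattern), e.g.
# [contains long vowel, contains geminate, word length, pattern seen in training]
X = [[1, 0, 3, 1], [0, 1, 4, 1], [0, 0, 2, 0], [1, 1, 5, 1], [0, 0, 3, 0]]
y = [1, 1, 0, 1, 0]   # 1 = learners actually produced this error pattern

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def keep_pattern(features, threshold=0.5):
    """Add the candidate error pattern to the grammar only if the tree
    predicts it is sufficiently likely."""
    return tree.predict_proba([features])[0][1] >= threshold

print(keep_pattern([1, 0, 3, 1]))
```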

  9. The Effect of Intensified Language Exposure on Accommodating Talker Variability.

    PubMed

    Antoniou, Mark; Wong, Patrick C M; Wang, Suiping

    2015-06-01

    This study systematically examined the role of intensified exposure to a second language on accommodating talker variability. English native listeners (n = 37) were compared with Mandarin listeners who had either lived in the United States for an extended period of time (n = 33) or had lived only in China (n = 44). Listeners responded to target words in an English word-monitoring task in which sequences of words were randomized. Half of the sequences were spoken by a single talker and the other half by multiple talkers. Mandarin listeners living in China were slower and less accurate than both English listeners and Mandarin listeners living in the United States. Mandarin listeners living in the United States were less accurate than English natives only in the more cognitively demanding mixed-talker condition. Mixed-talker speech affects processing in native and nonnative listeners alike, although the decrement is larger in nonnatives and further exaggerated in less proficient listeners. Language immersion improves listeners' ability to resolve talker variability, and this suggests that immersion may automatize nonnative processing, freeing cognitive resources that may play a crucial role in speech perception. These results lend support to the active control model of speech perception.

  10. Creating speech-synchronized animation.

    PubMed

    King, Scott A; Parent, Richard E

    2005-01-01

    We present a facial model designed primarily to support animated speech. Our facial model takes facial geometry as input and transforms it into a parametric deformable model. The facial model uses a muscle-based parameterization, allowing for easier integration between speech synchrony and facial expressions. Our facial model has a highly deformable lip model that is grafted onto the input facial geometry to provide the necessary geometric complexity needed for creating lip shapes and high-quality renderings. Our facial model also includes a highly deformable tongue model that can represent the shapes the tongue undergoes during speech. We add teeth, gums, and upper palate geometry to complete the inner mouth. To decrease the processing time, we hierarchically deform the facial surface. We also present a method to animate the facial model over time to create animated speech using a model of coarticulation that blends visemes together using dominance functions. We treat visemes as a dynamic shaping of the vocal tract by describing visemes as curves instead of keyframes. We show the utility of the techniques described in this paper by implementing them in a text-to-audiovisual-speech system that creates animation of speech from unrestricted text. The facial and coarticulation models must first be interactively initialized. The system then automatically creates accurate real-time animated speech from the input text. It is capable of cheaply producing tremendous amounts of animated speech with very low resource requirements.

  11. Developing and Evaluating an Oral Skills Training Website Supported by Automatic Speech Recognition Technology

    ERIC Educational Resources Information Center

    Chen, Howard Hao-Jan

    2011-01-01

    Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…

  12. Leveraging Automatic Speech Recognition Errors to Detect Challenging Speech Segments in TED Talks

    ERIC Educational Resources Information Center

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya

    2016-01-01

    This study investigates the use of Automatic Speech Recognition (ASR) systems to epitomize second language (L2) listeners' problems in perception of TED talks. ASR-generated transcripts of videos often involve recognition errors, which may indicate difficult segments for L2 listeners. This paper aims to discover the root-causes of the ASR errors…

  13. A speech-controlled environmental control system for people with severe dysarthria.

    PubMed

    Hawley, Mark S; Enderby, Pam; Green, Phil; Cunningham, Stuart; Brownsell, Simon; Carmichael, James; Parker, Mark; Hatzis, Athanassios; O'Neill, Peter; Palmer, Rebecca

    2007-06-01

    Automatic speech recognition (ASR) can provide a rapid means of controlling electronic assistive technology. Off-the-shelf ASR systems function poorly for users with severe dysarthria because of the increased variability of their articulations. We have developed a limited vocabulary speaker dependent speech recognition application which has greater tolerance to variability of speech, coupled with a computerised training package which assists dysarthric speakers to improve the consistency of their vocalisations and provides more data for recogniser training. These applications, and their implementation as the interface for a speech-controlled environmental control system (ECS), are described. The results of field trials to evaluate the training program and the speech-controlled ECS are presented. The user-training phase increased the recognition rate from 88.5% to 95.4% (p<0.001). Recognition rates were good for people with even the most severe dysarthria in everyday usage in the home (mean word recognition rate 86.9%). Speech-controlled ECS were less accurate (mean task completion accuracy 78.6% versus 94.8%) but were faster to use than switch-scanning systems, even taking into account the need to repeat unsuccessful operations (mean task completion time 7.7s versus 16.9s, p<0.001). It is concluded that a speech-controlled ECS is a viable alternative to switch-scanning systems for some people with severe dysarthria and would lead, in many cases, to more efficient control of the home.

  14. Relationships among Rapid Digit Naming, Phonological Processing, Motor Automaticity, and Speech Perception in Poor, Average, and Good Readers and Spellers

    ERIC Educational Resources Information Center

    Savage, Robert S.; Frederickson, Norah; Goodwin, Roz; Patni, Ulla; Smith, Nicola; Tuersley, Louise

    2005-01-01

    In this article, we explore the relationship between rapid automatized naming (RAN) and other cognitive processes among below-average, average, and above-average readers and spellers. Nonsense word reading, phonological awareness, RAN, automaticity of balance, speech perception, and verbal short-term and working memory were measured. Factor…

  15. Assessing Children's Home Language Environments Using Automatic Speech Recognition Technology

    ERIC Educational Resources Information Center

    Greenwood, Charles R.; Thiemann-Bourque, Kathy; Walker, Dale; Buzhardt, Jay; Gilkerson, Jill

    2011-01-01

    The purpose of this research was to replicate and extend some of the findings of Hart and Risley using automatic speech processing instead of human transcription of language samples. The long-term goal of this work is to make the current approach to speech processing feasible for researchers and clinicians working on a daily basis with families and…

  16. An Exploration of the Potential of Automatic Speech Recognition to Assist and Enable Receptive Communication in Higher Education

    ERIC Educational Resources Information Center

    Wald, Mike

    2006-01-01

    The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents students and staff to provide captioning of speech online or in classrooms for deaf or hard of hearing students and assist blind, visually impaired or dyslexic learners to read and search…

  17. Multilevel Analysis in Analyzing Speech Data

    ERIC Educational Resources Information Center

    Guddattu, Vasudeva; Krishna, Y.

    2011-01-01

    The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…

  18. Accelerometer-based automatic voice onset detection in speech mapping with navigated repetitive transcranial magnetic stimulation.

    PubMed

    Vitikainen, Anne-Mari; Mäkelä, Elina; Lioumis, Pantelis; Jousmäki, Veikko; Mäkelä, Jyrki P

    2015-09-30

    The use of navigated repetitive transcranial magnetic stimulation (rTMS) in mapping of speech-related brain areas has recently been shown to be useful in the preoperative workflow of epilepsy and tumor patients. However, substantial inter- and intraobserver variability and non-optimal replicability of the rTMS results have been reported, and a need for additional development of the methodology is recognized. In TMS motor cortex mappings the evoked responses can be quantitatively monitored by electromyographic recordings; however, no such easily available setup exists for speech mappings. We present an accelerometer-based setup for detection of vocalization-related larynx vibrations combined with an automatic routine for voice onset detection for rTMS speech mapping applying naming. The results produced by the automatic routine were compared with the manually reviewed video-recordings. The new method was applied in the routine navigated rTMS speech mapping for 12 consecutive patients during preoperative workup for epilepsy or tumor surgery. The automatic routine correctly detected 96% of the voice onsets, resulting in 96% sensitivity and 71% specificity. The majority (63%) of the misdetections were related to visible throat movements, extra voices before the response, or delayed naming of the previous stimuli. The no-response errors were correctly detected in 88% of events. The proposed setup for automatic detection of voice onsets provides quantitative additional data for analysis of the rTMS-induced speech response modifications. The objectively defined speech response latencies increase the repeatability, reliability and stratification of the rTMS results. Copyright © 2015 Elsevier B.V. All rights reserved.
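    The published routine is not reproduced here, but a minimal threshold-style sketch of accelerometer-based voice onset detection conveys the idea: smooth the rectified larynx signal and report the first post-stimulus sample whose envelope exceeds a multiple of the pre-stimulus baseline. The 20 ms window and the factor of 5 are assumptions, not the authors' parameters.

```python
# Illustrative voice-onset detection from a larynx accelerometer trace.
import numpy as np

def voice_onset_latency(acc, fs, stim_idx, win_ms=20, k=5.0):
    """Return the voice-onset latency (s) after the stimulus at sample stim_idx,
    or None if the envelope never exceeds k times the pre-stimulus baseline."""
    n = max(1, int(fs * win_ms / 1000))
    env = np.convolve(np.abs(acc), np.ones(n) / n, mode="same")  # smoothed envelope
    baseline = env[:stim_idx].mean()
    above = np.nonzero(env[stim_idx:] > k * baseline)[0]
    return above[0] / fs if above.size else None
```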

  19. How Does Reading Performance Modulate the Impact of Orthographic Knowledge on Speech Processing? A Comparison of Normal Readers and Dyslexic Adults

    ERIC Educational Resources Information Center

    Pattamadilok, Chotiga; Nelis, Aubéline; Kolinsky, Régine

    2014-01-01

    Studies on proficient readers showed that speech processing is affected by knowledge of the orthographic code. Yet, the automaticity of the orthographic influence depends on task demand. Here, we addressed this automaticity issue in normal and dyslexic adult readers by comparing the orthographic effects obtained in two speech processing tasks that…

  20. Female voice communications in high level aircraft cockpit noises--part II: vocoder and automatic speech recognition systems.

    PubMed

    Nixon, C; Anderson, T; Morris, L; McCavitt, A; McKinley, R; Yeager, D; McDaniel, M

    1998-11-01

    The intelligibility of female and male speech is equivalent under most ordinary living conditions. However, due to small differences between their acoustic speech signals, called speech spectra, one can be more or less intelligible than the other in certain situations such as high levels of noise. Anecdotal information, supported by some empirical observations, suggests that some of the high intensity noise spectra of military aircraft cockpits may degrade the intelligibility of female speech more than that of male speech. In an applied research study, the intelligibility of female and male speech was measured in several high level aircraft cockpit noise conditions experienced in military aviation. In Part I (Nixon CW, et al. Aviat Space Environ Med 1998; 69:675-83), female speech intelligibility measured in the spectra and levels of aircraft cockpit noises and with noise-canceling microphones was lower than that of male speech in all conditions. However, the differences were small and only those at some of the highest noise levels were significant. Although speech intelligibility of both genders was acceptable during normal cruise noises, improvements are required in most of the highest levels of noise created during maximum aircraft operating conditions. These results are discussed in the Part I technical report. This Part II report examines the intelligibility in the same aircraft cockpit noises of vocoded female and male speech and the accuracy with which female and male speech in some of the cockpit noises was understood by automatic speech recognition systems. The intelligibility of vocoded female speech was generally the same as that of vocoded male speech. No significant differences were measured between the recognition accuracy of male and female speech by the automatic speech recognition systems. The intelligibility of female and male speech was equivalent for these conditions.

  1. Objective voice and speech analysis of persons with chronic hoarseness by prosodic analysis of speech samples.

    PubMed

    Haderlein, Tino; Döllinger, Michael; Matoušek, Václav; Nöth, Elmar

    2016-10-01

    Automatic voice assessment is often performed using sustained vowels. In contrast, speech analysis of read-out texts can be applied to voice and speech assessment. Automatic speech recognition and prosodic analysis were used to find regression formulae between automatic and perceptual assessment of four voice and four speech criteria. The regression was trained with 21 men and 62 women (average age 49.2 years) and tested with another set of 24 men and 49 women (48.3 years), all suffering from chronic hoarseness. They read the text 'Der Nordwind und die Sonne' ('The North Wind and the Sun'). Five voice and speech therapists evaluated the data on 5-point Likert scales. Ten prosodic and recognition accuracy measures (features) were identified which describe all the examined criteria. Inter-rater correlation within the expert group was between r = 0.63 for the criterion 'match of breath and sense units' and r = 0.87 for the overall voice quality. Human-machine correlation was between r = 0.40 for the match of breath and sense units and r = 0.82 for intelligibility. The perceptual ratings of different criteria were highly correlated with each other. Likewise, the feature sets modeling the criteria were very similar. The automatic method is suitable for assessing chronic hoarseness in general and for subgroups of functional and organic dysphonia. In its current version, it is almost as reliable as a randomly picked rater from a group of voice and speech therapists.

  2. Evaluation of an Expert System for the Generation of Speech and Language Therapy Plans.

    PubMed

    Robles-Bykbaev, Vladimir; López-Nores, Martín; García-Duque, Jorge; Pazos-Arias, José J; Arévalo-Lucero, Daysi

    2016-07-01

    Speech and language pathologists (SLPs) deal with a wide spectrum of disorders, arising from many different conditions, that affect voice, speech, language, and swallowing capabilities in different ways. Therefore, the outcomes of Speech and Language Therapy (SLT) are highly dependent on the accurate, consistent, and complete design of personalized therapy plans. However, SLPs often have very limited time to work with their patients and to browse the large (and growing) catalogue of activities and specific exercises that can be put into therapy plans. As a consequence, many plans are suboptimal and fail to address the specific needs of each patient. We aimed to evaluate an expert system that automatically generates plans for speech and language therapy, containing semiannual activities in the five areas of hearing, oral structure and function, linguistic formulation, expressive language and articulation, and receptive language. The goal was to assess whether the expert system speeds up the SLPs' work and leads to more accurate, consistent, and complete therapy plans for their patients. We examined the evaluation results of the SPELTA expert system in supporting the decision making of 4 SLPs treating children in three special education institutions in Ecuador. The expert system was first trained with data from 117 cases, including medical data; diagnosis for voice, speech, language and swallowing capabilities; and therapy plans created manually by the SLPs. It was then used to automatically generate new therapy plans for 13 new patients. The SLPs were finally asked to evaluate the accuracy, consistency, and completeness of those plans. A four-fold cross-validation experiment was also run on the original corpus of 117 cases in order to assess the significance of the results. The evaluation showed that 87% of the outputs provided by the SPELTA expert system were considered valid therapy plans for the different areas. The SLPs rated the overall accuracy, consistency, and completeness of the proposed activities with 4.65, 4.6, and 4.6 points (to a maximum of 5), respectively. The ratings for the subplans generated for the areas of hearing, oral structure and function, and linguistic formulation were nearly perfect, whereas the subplans for expressive language and articulation and for receptive language failed to deal properly with some of the subject cases. Overall, the SLPs indicated that over 90% of the subplans generated automatically were "better than" or "as good as" what the SLPs would have created manually if given the average time they can devote to the task. The cross-validation experiment yielded very similar results. The results show that the SPELTA expert system provides valuable input for SLPs to design proper therapy plans for their patients, in a shorter time and considering a larger set of activities than proceeding manually. The algorithms worked well even in the presence of a sparse corpus, and the evidence suggests that the system will become more reliable as it is trained with more subjects.

  3. Evaluation of an Expert System for the Generation of Speech and Language Therapy Plans

    PubMed Central

    Robles-Bykbaev, Vladimir; López-Nores, Martín; García-Duque, Jorge; Pazos-Arias, José J; Arévalo-Lucero, Daysi

    2016-01-01

    Background Speech and language pathologists (SLPs) deal with a wide spectrum of disorders, arising from many different conditions, that affect voice, speech, language, and swallowing capabilities in different ways. Therefore, the outcomes of Speech and Language Therapy (SLT) are highly dependent on the accurate, consistent, and complete design of personalized therapy plans. However, SLPs often have very limited time to work with their patients and to browse the large (and growing) catalogue of activities and specific exercises that can be put into therapy plans. As a consequence, many plans are suboptimal and fail to address the specific needs of each patient. Objective We aimed to evaluate an expert system that automatically generates plans for speech and language therapy, containing semiannual activities in the five areas of hearing, oral structure and function, linguistic formulation, expressive language and articulation, and receptive language. The goal was to assess whether the expert system speeds up the SLPs’ work and leads to more accurate, consistent, and complete therapy plans for their patients. Methods We examined the evaluation results of the SPELTA expert system in supporting the decision making of 4 SLPs treating children in three special education institutions in Ecuador. The expert system was first trained with data from 117 cases, including medical data; diagnosis for voice, speech, language and swallowing capabilities; and therapy plans created manually by the SLPs. It was then used to automatically generate new therapy plans for 13 new patients. The SLPs were finally asked to evaluate the accuracy, consistency, and completeness of those plans. A four-fold cross-validation experiment was also run on the original corpus of 117 cases in order to assess the significance of the results. Results The evaluation showed that 87% of the outputs provided by the SPELTA expert system were considered valid therapy plans for the different areas. The SLPs rated the overall accuracy, consistency, and completeness of the proposed activities with 4.65, 4.6, and 4.6 points (to a maximum of 5), respectively. The ratings for the subplans generated for the areas of hearing, oral structure and function, and linguistic formulation were nearly perfect, whereas the subplans for expressive language and articulation and for receptive language failed to deal properly with some of the subject cases. Overall, the SLPs indicated that over 90% of the subplans generated automatically were “better than” or “as good as” what the SLPs would have created manually if given the average time they can devote to the task. The cross-validation experiment yielded very similar results. Conclusions The results show that the SPELTA expert system provides valuable input for SLPs to design proper therapy plans for their patients, in a shorter time and considering a larger set of activities than proceeding manually. The algorithms worked well even in the presence of a sparse corpus, and the evidence suggests that the system will become more reliable as it is trained with more subjects. PMID:27370070

  4. A real-time phoneme counting algorithm and application for speech rate monitoring.

    PubMed

    Aharonson, Vered; Aharonson, Eran; Raichlin-Levi, Katia; Sotzianu, Aviv; Amir, Ofer; Ovadia-Blechman, Zehava

    2017-03-01

    Adults who stutter can learn to control and improve their speech fluency by modifying their speaking rate. Existing speech therapy technologies can assist this practice by monitoring speaking rate and providing feedback to the patient, but cannot provide an accurate, quantitative measurement of speaking rate. Moreover, most technologies are too complex and costly to be used for home practice. We developed an algorithm and a smartphone application that monitor a patient's speaking rate in real time and provide user-friendly feedback to both patient and therapist. Our speaking rate computation is performed by a phoneme counting algorithm which implements spectral transition measure extraction to estimate phoneme boundaries. The algorithm is implemented in real time in a mobile application that presents its results in a user-friendly interface. The application incorporates two modes: one provides the patient with visual feedback on his/her speech rate for self-practice, and the other provides the speech therapist with recordings, speech rate analysis, and tools to manage the patient's practice. The algorithm's phoneme counting accuracy was validated on ten healthy subjects who read a paragraph at slow, normal and fast paces, and was compared to manual counting by speech experts. Test-retest and intra-counter reliability were assessed. Preliminary results indicate differences of -4% to 11% between automatic and human phoneme counting. Differences were largest for slow speech. The application can thus provide reliable, user-friendly, real-time feedback for speaking rate control practice. Copyright © 2017 Elsevier Inc. All rights reserved.
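
    To make the rate computation above concrete, the sketch below estimates phonemes per second by counting peaks in a spectral transition measure computed from MFCC trajectories. It is a minimal illustration, not the authors' implementation: the use of librosa and scipy, the 10 ms hop, the regression window, and the peak threshold are all assumptions.

```python
# Minimal sketch: estimate speaking rate by counting spectral-transition peaks.
# Assumes librosa and scipy are available; thresholds are illustrative only.
import numpy as np
import librosa
from scipy.signal import find_peaks

def spectral_transition_measure(mfcc, half_win=2):
    """Mean squared regression slope of each MFCC coefficient over a short window."""
    n_coef, n_frames = mfcc.shape
    taps = np.arange(-half_win, half_win + 1, dtype=float)
    denom = np.sum(taps ** 2)
    stm = np.zeros(n_frames)
    for t in range(half_win, n_frames - half_win):
        window = mfcc[:, t - half_win:t + half_win + 1]    # (n_coef, 2*half_win+1)
        slopes = window @ taps / denom                      # regression slope per coefficient
        stm[t] = np.mean(slopes ** 2)
    return stm

def estimate_speech_rate(wav_path, peak_quantile=0.6):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160)  # 10 ms hop
    stm = spectral_transition_measure(mfcc)
    height = np.quantile(stm, peak_quantile)                # illustrative threshold
    peaks, _ = find_peaks(stm, height=height, distance=5)   # >= 50 ms between boundaries
    duration_s = len(y) / sr
    return len(peaks) / duration_s                          # approximate phonemes per second

# Usage (hypothetical file): print(estimate_speech_rate("sample.wav"))
```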

  5. A Speech Recognition-based Solution for the Automatic Detection of Mild Cognitive Impairment from Spontaneous Speech

    PubMed Central

    Tóth, László; Hoffmann, Ildikó; Gosztolya, Gábor; Vincze, Veronika; Szatlóczki, Gréta; Bánréti, Zoltán; Pákáski, Magdolna; Kálmán, János

    2018-01-01

    Background: Even today the reliable diagnosis of the prodromal stages of Alzheimer’s disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method based on the analysis of spontaneous speech production while performing a memory task. In the future, this can form the basis of Internet-based interactive screening software for the recognition of MCI. Methods: Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. We provoked spontaneous speech by asking the patients to recall the content of 2 short black-and-white films (one direct, one delayed), and by asking them to answer one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. Results: The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process (that is, using the ASR-based features in combination with machine learning) was able to separate the two classes with an F1-score of 78.8%. Conclusion: The temporal analysis of spontaneous speech can be exploited in implementing a new, automatic detection-based tool for screening MCI for the community. PMID:29165085

  6. A Speech Recognition-based Solution for the Automatic Detection of Mild Cognitive Impairment from Spontaneous Speech.

    PubMed

    Toth, Laszlo; Hoffmann, Ildiko; Gosztolya, Gabor; Vincze, Veronika; Szatloczki, Greta; Banreti, Zoltan; Pakaski, Magdolna; Kalman, Janos

    2018-01-01

    Even today the reliable diagnosis of the prodromal stages of Alzheimer's disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method based on the analysis of spontaneous speech production while performing a memory task. In the future, this can form the basis of Internet-based interactive screening software for the recognition of MCI. Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. We provoked spontaneous speech by asking the patients to recall the content of 2 short black-and-white films (one direct, one delayed), and by asking them to answer one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process (that is, using the ASR-based features in combination with machine learning) was able to separate the two classes with an F1-score of 78.8%. The temporal analysis of spontaneous speech can be exploited in implementing a new, automatic detection-based tool for screening MCI for the community. Copyright © Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
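
    As a rough illustration of the classification step described in the two records above (temporal speech parameters fed to a machine learning model, scored with F1), the sketch below trains a generic classifier on placeholder feature vectors. The feature names, the random data, and the choice of a random forest are assumptions for illustration only; the papers do not specify this particular learner.

```python
# Minimal sketch: temporal speech features + a generic classifier for MCI screening.
# Feature extraction from real recordings is assumed to happen upstream (e.g., via an
# ASR-based segmenter); random placeholder data stands in for the real corpus here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n_control, n_mci = 38, 48

# Hypothetical feature columns: speech tempo, articulation rate, silent-pause ratio,
# hesitation ratio, mean utterance length, pauses per utterance.
X = np.vstack([
    rng.normal([4.5, 5.0, 0.15, 0.20, 9.0, 1.2], 0.8, size=(n_control, 6)),
    rng.normal([3.8, 4.4, 0.25, 0.35, 7.0, 1.8], 0.8, size=(n_mci, 6)),
])
y = np.array([0] * n_control + [1] * n_mci)   # 0 = control, 1 = MCI

clf = RandomForestClassifier(n_estimators=200, random_state=0)
pred = cross_val_predict(clf, X, y, cv=5)     # 5-fold cross-validated predictions
print("F1 (MCI class):", round(f1_score(y, pred), 3))
```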

  7. On the context dependency of implicit self-esteem in social anxiety disorder.

    PubMed

    Hiller, Thomas S; Steffens, Melanie C; Ritter, Viktoria; Stangier, Ulrich

    2017-12-01

    Cognitive models assume that negative self-evaluations are automatically activated in individuals with Social Anxiety Disorder (SAD) during social situations, increasing their individual level of anxiety. This study examined automatic self-evaluations (i.e., implicit self-esteem) and state anxiety in a group of individuals with SAD (n = 45) and a non-clinical comparison group (NC; n = 46). Participants were randomly assigned to either a speech condition with social threat induction (giving an impromptu speech) or to a no-speech condition without social threat induction. We measured implicit self-esteem with an Implicit Association Test (IAT). Implicit self-esteem differed significantly between SAD and NC groups under the speech condition but not under the no-speech condition. The SAD group showed lower implicit self-esteem than the NC group under the speech-condition. State anxiety was significantly higher under the speech condition than under the no-speech condition in the SAD group but not in the NC group. Mediation analyses supported the idea that for the SAD group, the effect of experimental condition on state anxiety was mediated by implicit self-esteem. The causal relation between implicit self-esteem and state anxiety could not be determined. The findings corroborate hypotheses derived from cognitive models of SAD: Automatic self-evaluations were negatively biased in individuals with SAD facing social threat and showed an inverse relationship to levels of state anxiety. However, automatic self-evaluations in individuals with SAD can be unbiased (similar to NC) in situations without social threat. Copyright © 2017 Elsevier Ltd. All rights reserved.

  8. Automatic Title Generation for Spoken Broadcast News

    DTIC Science & Technology

    2001-01-01

    degrades much less with speech-recognized transcripts. Meanwhile, even though KNN does not perform as well as TF.IDF and NBL in terms of the F1 metric, it...test corpus of 1006 broadcast news documents, comparing the results over manual transcription to the results over automatically recognized speech. We...use both F1 and the average number of correct title words in the correct order as metrics. Overall, the results show that title generation for speech

  9. Automatic lip reading by using multimodal visual features

    NASA Astrophysics Data System (ADS)

    Takahashi, Shohei; Ohya, Jun

    2013-12-01

    Speech recognition has been researched for a long time, though it does not work well in noisy places such as in a car or on a train. In addition, people who are hearing-impaired or have difficulty hearing cannot benefit from speech recognition. Visual information is therefore also important for recognizing speech automatically. People understand speech not only from audio information, but also from visual information such as temporal changes in lip shape. A vision-based speech recognition method could work well in noisy places, and could also be useful for people with hearing disabilities. In this paper, we propose an automatic lip-reading method that recognizes speech using multimodal visual information, without any audio information. First, the Active Shape Model (ASM) is used to track and detect the face and lips in a video sequence. Second, shape, optical flow, and spatial frequency features are extracted from the lips detected by the ASM. Next, the extracted multimodal features are ordered chronologically, and a Support Vector Machine is trained to learn and classify the spoken words. Experiments on classifying several words show promising results for the proposed method.

  10. An investigation of articulatory setting using real-time magnetic resonance imaging

    PubMed Central

    Ramanarayanan, Vikram; Goldstein, Louis; Byrd, Dani; Narayanan, Shrikanth S.

    2013-01-01

    This paper presents an automatic procedure to analyze articulatory setting in speech production using real-time magnetic resonance imaging of the moving human vocal tract. The procedure extracts frames corresponding to inter-speech pauses, speech-ready intervals and absolute rest intervals from magnetic resonance imaging sequences of read and spontaneous speech elicited from five healthy speakers of American English and uses automatically extracted image features to quantify vocal tract posture during these intervals. Statistical analyses show significant differences between vocal tract postures adopted during inter-speech pauses and those at absolute rest before speech; the latter also exhibits a greater variability in the adopted postures. In addition, the articulatory settings adopted during inter-speech pauses in read and spontaneous speech are distinct. The results suggest that adopted vocal tract postures differ on average during rest positions, ready positions and inter-speech pauses, and might, in that order, involve an increasing degree of active control by the cognitive speech planning mechanism. PMID:23862826

  11. Assessment of voice, speech, and related quality of life in advanced head and neck cancer patients 10-years+ after chemoradiotherapy.

    PubMed

    Kraaijenga, S A C; Oskam, I M; van Son, R J J H; Hamming-Vrieze, O; Hilgers, F J M; van den Brekel, M W M; van der Molen, L

    2016-04-01

    Assessment of long-term objective and subjective voice, speech, articulation, and quality of life in patients with head and neck cancer (HNC) treated with concurrent chemoradiotherapy (CRT) for advanced, stage IV disease. Twenty-two disease-free survivors, treated with cisplatin-based CRT for inoperable HNC (1999-2004), were evaluated at 10 years post-treatment. A standard Dutch text was recorded. Perceptual analysis of voice, speech, and articulation was conducted by two expert listeners (SLPs). An experimental expert system based on automatic speech recognition was also used. Patients' perception of voice and speech and related quality of life was assessed with the Voice Handicap Index (VHI) and Speech Handicap Index (SHI) questionnaires. At a median follow-up of 11 years, perceptual evaluation showed abnormal scores in up to 64% of cases, depending on the outcome parameter analyzed. Automatic assessment of voice and speech parameters correlated moderately to strongly with perceptual outcome scores. Patient-reported problems with voice (VHI>15) and speech (SHI>6) in daily life were present in 68% and 77% of patients, respectively. Patients treated with IMRT showed significantly less impairment compared to those treated with conventional radiotherapy. More than 10 years after organ-preservation treatment, voice and speech problems are common in this patient cohort, as assessed with perceptual evaluation, automatic speech recognition, and validated structured questionnaires. There were fewer complaints in patients treated with IMRT than with conventional radiotherapy. Copyright © 2016 Elsevier Ltd. All rights reserved.

  12. Speech fluency profile on different tasks for individuals with Parkinson's disease.

    PubMed

    Juste, Fabiola Staróbole; Andrade, Claudia Regina Furquim de

    2017-07-20

    To characterize the speech fluency profile of patients with Parkinson's disease. Study participants were 40 individuals of both genders aged 40 to 80 years divided into 2 groups: Research Group - RG (20 individuals with diagnosis of Parkinson's disease) and Control Group - CG (20 individuals with no communication or neurological disorders). For all of the participants, three speech samples involving different tasks were collected: monologue, individual reading, and automatic speech. The RG presented a significantly larger number of speech disruptions, both stuttering-like and typical dysfluencies, and a higher percentage of speech discontinuity in the monologue and individual reading tasks compared with the CG. Both groups presented a reduced number of speech disruptions (stuttering-like and typical dysfluencies) in the automatic speech task, and the groups presented similar performance in this task. Regarding speech rate, individuals in the RG presented a lower number of words and syllables per minute compared with those in the CG in all speech tasks. Participants in the RG presented altered parameters of speech fluency compared with those of the CG; however, this change in fluency cannot be considered a stuttering disorder.

  13. Action Unit Models of Facial Expression of Emotion in the Presence of Speech

    PubMed Central

    Shah, Miraj; Cooper, David G.; Cao, Houwei; Gur, Ruben C.; Nenkova, Ani; Verma, Ragini

    2014-01-01

    Automatic recognition of emotion using facial expressions in the presence of speech poses a unique challenge because talking reveals clues for the affective state of the speaker but distorts the canonical expression of emotion on the face. We introduce a corpus of acted emotion expression where speech is either present (talking) or absent (silent). The corpus is uniquely suited for analysis of the interplay between the two conditions. We use a multimodal decision level fusion classifier to combine models of emotion from talking and silent faces as well as from audio to recognize five basic emotions: anger, disgust, fear, happy and sad. Our results strongly indicate that emotion prediction in the presence of speech from action unit facial features is less accurate when the person is talking. Modeling talking and silent expressions separately and fusing the two models greatly improves accuracy of prediction in the talking setting. The advantages are most pronounced when silent and talking face models are fused with predictions from audio features. In this multi-modal prediction both the combination of modalities and the separate models of talking and silent facial expression of emotion contribute to the improvement. PMID:25525561
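
    The decision-level fusion described above can be illustrated as a weighted average of class posteriors from separate modality-specific models. In the sketch below, the five emotion labels come from the abstract; the model outputs, weights, and function names are hypothetical placeholders rather than the paper's classifiers.

```python
# Minimal sketch of decision-level fusion: average class posteriors from separate
# modality models (talking face, silent face, audio) and take the argmax.
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happy", "sad"]

def fuse(posteriors, weights=None):
    """posteriors: list of (n_samples, n_classes) probability arrays, one per model."""
    stacked = np.stack(posteriors)                  # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(posteriors)) / len(posteriors)
    fused = np.tensordot(weights, stacked, axes=1)  # weighted average over models
    return fused.argmax(axis=1)                     # predicted class index per sample

# Toy example: three models, two samples each.
p_talking = np.array([[0.4, 0.1, 0.1, 0.3, 0.1], [0.2, 0.2, 0.2, 0.2, 0.2]])
p_silent  = np.array([[0.5, 0.1, 0.1, 0.2, 0.1], [0.1, 0.1, 0.1, 0.1, 0.6]])
p_audio   = np.array([[0.3, 0.2, 0.1, 0.3, 0.1], [0.1, 0.1, 0.2, 0.1, 0.5]])
labels = fuse([p_talking, p_silent, p_audio])
print([EMOTIONS[i] for i in labels])   # -> ['anger', 'sad']
```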

  14. Automatic speech recognition in air traffic control

    NASA Technical Reports Server (NTRS)

    Karlsson, Joakim

    1990-01-01

    Automatic Speech Recognition (ASR) technology and its application to the Air Traffic Control system are described. The advantages of applying ASR to Air Traffic Control, as well as criteria for choosing a suitable ASR system are presented. Results from previous research and directions for future work at the Flight Transportation Laboratory are outlined.

  15. Automatic Speech Recognition: Reliability and Pedagogical Implications for Teaching Pronunciation

    ERIC Educational Resources Information Center

    Kim, In-Seok

    2006-01-01

    This study examines the reliability of automatic speech recognition (ASR) software used to teach English pronunciation, focusing on one particular piece of software, "FluSpeak," as a typical example. Thirty-six Korean English as a Foreign Language (EFL) college students participated in an experiment in which they listened to 15 sentences…

  16. Automatic Speech Recognition Technology as an Effective Means for Teaching Pronunciation

    ERIC Educational Resources Information Center

    Elimat, Amal Khalil; AbuSeileek, Ali Farhan

    2014-01-01

    This study aimed to explore the effect of using automatic speech recognition technology (ASR) on the third grade EFL students' performance in pronunciation, whether teaching pronunciation through ASR is better than regular instruction, and the most effective teaching technique (individual work, pair work, or group work) in teaching pronunciation…

  17. Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.

    ERIC Educational Resources Information Center

    Makhoul, John I.

    The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. A fixed vocabulary of 55 syllables is used in the recognition system which contains 11 stops and fricatives and five tense vowels. The results of an experiment on speaker…

  18. Investigating Prompt Difficulty in an Automatically Scored Speaking Performance Assessment

    ERIC Educational Resources Information Center

    Cox, Troy L.

    2013-01-01

    Speaking assessments for second language learners have traditionally been expensive to administer because of the cost of rating the speech samples. To reduce the cost, many researchers are investigating the potential of using automatic speech recognition (ASR) as a means to score examinee responses to open-ended prompts. This study examined the…

  19. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    ERIC Educational Resources Information Center

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  20. The performance of an automatic acoustic-based program classifier compared to hearing aid users' manual selection of listening programs.

    PubMed

    Searchfield, Grant D; Linford, Tania; Kobayashi, Kei; Crowhen, David; Latzel, Matthias

    2018-03-01

    To compare preference for and performance of manually selected programmes to an automatic sound classifier, the Phonak AutoSense OS. A single-blind repeated-measures study. Participants were fitted with Phonak Virto V90 ITE aids; preferences for different listening programmes were compared across four different sound scenarios (speech in: quiet, noise, loud noise and a car). Following a 4-week trial, preferences were reassessed and the user's preferred programme was compared to the automatic classifier for sound quality and hearing in noise (HINT test) using a 12-loudspeaker array. Twenty-five participants with symmetrical moderate-to-severe sensorineural hearing loss. Participants' manual programme preferences for the scenarios varied considerably between and within sessions. A HINT Speech Reception Threshold (SRT) advantage was observed for the automatic classifier over the participants' manual selections for speech in quiet, loud noise and car noise. Sound quality ratings were similar for both manual and automatic selections. The use of a sound classifier is a viable alternative to manual programme selection.

  1. Development of A Two-Stage Procedure for the Automatic Recognition of Dysfluencies in the Speech of Children Who Stutter: I. Psychometric Procedures Appropriate for Selection of Training Material for Lexical Dysfluency Classifiers

    PubMed Central

    Howell, Peter; Sackin, Stevie; Glenn, Kazan

    2007-01-01

    This program of work is intended to develop automatic recognition procedures to locate and assess stuttered dysfluencies. This and the following article together develop and test recognizers for repetitions and prolongations. The automatic recognizers classify the speech in two stages: in the first, the speech is segmented, and in the second, the segments are categorized. The units that are segmented are words. Here, assessments by human judges of the speech of 12 children who stutter are described using a corresponding procedure. The accuracy of word boundary placement across judges, categorization of the words as fluent, repetition or prolongation, and duration of the different fluency categories are reported. These measures allow reliable instances of repetitions and prolongations to be selected for training and assessing the recognizers in the subsequent paper. PMID:9328878

  2. Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques

    NASA Astrophysics Data System (ADS)

    Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.

    2009-12-01

    This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.
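
    A bare-bones version of the GMM pattern-recognition step described above might train one Gaussian mixture per class on MFCC frames and compare average log-likelihoods at test time. The sketch assumes librosa and scikit-learn; the file names, mixture sizes, and features are illustrative and much simpler than the study's carefully designed acoustic search.

```python
# Minimal sketch: per-class GMMs over MFCC frames, classification by log-likelihood.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T    # (n_frames, 13)

def train_gmm(paths, n_components=16):
    frames = np.vstack([mfcc_frames(p) for p in paths])
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=0).fit(frames)

def classify(path, gmm_apnoea, gmm_control):
    frames = mfcc_frames(path)
    # Average frame log-likelihood under each model; the higher one wins.
    return "apnoea" if gmm_apnoea.score(frames) > gmm_control.score(frames) else "control"

# Usage (hypothetical file lists):
# gmm_a = train_gmm(["apnoea_01.wav", "apnoea_02.wav"])
# gmm_c = train_gmm(["control_01.wav", "control_02.wav"])
# print(classify("new_speaker.wav", gmm_a, gmm_c))
```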

  3. Inter-speaker speech variability assessment using statistical deformable models from 3.0 tesla magnetic resonance images.

    PubMed

    Vasconcelos, Maria J M; Ventura, Sandra M R; Freitas, Diamantino R S; Tavares, João Manuel R S

    2012-03-01

    The morphological and dynamic characterisation of the vocal tract during speech production has been gaining greater attention, motivated by the latest improvements in magnetic resonance (MR) imaging, namely the use of higher magnetic fields such as 3.0 Tesla. In this work, the automatic study of the vocal tract from 3.0 Tesla MR images was assessed through the application of statistical deformable models. The primary goal was the analysis of the shape of the vocal tract during the articulation of European Portuguese sounds, followed by the evaluation of the results concerning the automatic segmentation, i.e. identification of the vocal tract in new MR images. As far as speech production is concerned, this is the first attempt to automatically characterise and reconstruct the vocal tract shape from 3.0 Tesla MR images using deformable models, in particular active shape and appearance models. The results clearly show the adequacy and advantage of applying these deformable models to the automatic analysis of 3.0 Tesla MR images in order to extract the vocal tract shape and assess the articulatory movements involved. Such capabilities are needed, for example, for a better understanding of speech production, particularly in patients suffering from articulatory disorders, and for building enhanced speech synthesizer models.

  4. I Hear You Eat and Speak: Automatic Recognition of Eating Condition and Food Type, Use-Cases, and Impact on ASR Performance

    PubMed Central

    Hantke, Simone; Weninger, Felix; Kurle, Richard; Ringeval, Fabien; Batliner, Anton; Mousa, Amr El-Desoky; Schuller, Björn

    2016-01-01

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i. e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i. e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i. e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient. PMID:27176486
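
    The evaluation protocol above (an SVM over utterance-level acoustic features, scored by average recall in a leave-one-speaker-out scheme) can be mimicked with scikit-learn as follows. The synthetic data, feature dimensionality, and kernel choice are placeholders; only the cross-validation structure mirrors the description.

```python
# Minimal sketch: leave-one-speaker-out SVM evaluation on utterance-level features.
# Real feature extraction (e.g., a large brute-forced acoustic set) is assumed upstream.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
n_speakers, utts_per_speaker, n_feats = 30, 10, 40
n_utts = n_speakers * utts_per_speaker

groups = np.repeat(np.arange(n_speakers), utts_per_speaker)   # speaker id per utterance
y = np.tile([0, 1], n_utts // 2)                               # 0 = not eating, 1 = eating
X = rng.normal(size=(n_utts, n_feats)) + 0.5 * y[:, None]      # small synthetic class shift

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut(),
                         scoring="balanced_accuracy")          # unweighted average recall
print("Mean recall over held-out speakers:", round(scores.mean(), 3))
```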

  5. Speech Intelligibility Predicted from Neural Entrainment of the Speech Envelope.

    PubMed

    Vanthornhout, Jonas; Decruy, Lien; Wouters, Jan; Simon, Jonathan Z; Francart, Tom

    2018-04-01

    Speech intelligibility is currently measured by scoring how well a person can identify a speech signal. The results of such behavioral measures reflect neural processing of the speech signal, but are also influenced by language processing, motivation, and memory. Very often, electrophysiological measures of hearing give insight into the neural processing of sound. However, in most methods, non-speech stimuli are used, making it hard to relate the results to behavioral measures of speech intelligibility. The use of natural running speech as a stimulus in electrophysiological measures of hearing is a paradigm shift which makes it possible to bridge the gap between behavioral and electrophysiological measures. Here, by decoding the speech envelope from the electroencephalogram, and correlating it with the stimulus envelope, we demonstrate an electrophysiological measure of neural processing of running speech. We show that behaviorally measured speech intelligibility is strongly correlated with our electrophysiological measure. Our results pave the way towards objective and automatic assessment of neural processing of speech presented through auditory prostheses, reducing confounds such as attention and cognitive capabilities. We anticipate that our electrophysiological measure will allow better differential diagnosis of the auditory system, and will allow the development of closed-loop auditory prostheses that automatically adapt to individual users.
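
    A common way to realise the envelope-decoding idea above is a linear backward model: regress the speech envelope on time-lagged EEG channels and report the Pearson correlation between reconstructed and actual envelopes. The sketch below does this on synthetic signals; the sampling rate, lag range, regularisation, and data are assumptions, not the authors' pipeline.

```python
# Minimal sketch: a linear backward decoder reconstructing the speech envelope from
# time-lagged EEG, scored with Pearson correlation. Synthetic signals stand in for data.
import numpy as np
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

def lag_matrix(eeg, max_lag):
    """Stack each EEG channel at lags 0..max_lag samples after the stimulus time."""
    cols = [np.roll(eeg, -lag, axis=0) for lag in range(max_lag + 1)]
    X = np.concatenate(cols, axis=1)
    X[X.shape[0] - max_lag:] = 0.0        # zero the rows that wrapped around
    return X

fs = 64                                   # Hz, assumed post-downsampling rate
rng = np.random.default_rng(2)
envelope = np.abs(rng.normal(size=fs * 60))                      # 1 minute, toy envelope
eeg = envelope[:, None] * 0.1 + rng.normal(size=(fs * 60, 32))   # 32 noisy channels

split = fs * 45                                                  # 45 s train / 15 s test
X = lag_matrix(eeg, max_lag=int(0.25 * fs))                      # lags up to 250 ms
decoder = Ridge(alpha=10.0).fit(X[:split], envelope[:split])
r, _ = pearsonr(decoder.predict(X[split:]), envelope[split:])
print("Reconstruction correlation:", round(r, 3))
```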

  6. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    NASA Astrophysics Data System (ADS)

    Heracleous, Panikos; Kaino, Tomomi; Saruwatari, Hiroshi; Shikano, Kiyohiro

    2006-12-01

    We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transformation, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved good word accuracy on a 20 k dictation task for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone, with very promising results.

  7. The Effect of Automatic Speech Recognition Eyespeak Software on Iraqi Students' English Pronunciation: A Pilot Study

    ERIC Educational Resources Information Center

    Sidgi, Lina Fathi Sidig; Shaari, Ahmad Jelani

    2017-01-01

    Technology such as computer-assisted language learning (CALL) is used in teaching and learning in foreign language classrooms where it is most needed. One promising emerging technology that supports language learning is automatic speech recognition (ASR). Integrating such technology, especially in the instruction of pronunciation…

  8. Using Automatic Speech Recognition Technology with Elicited Oral Response Testing

    ERIC Educational Resources Information Center

    Cox, Troy L.; Davies, Randall S.

    2012-01-01

    This study examined the use of automatic speech recognition (ASR) scored elicited oral response (EOR) tests to assess the speaking ability of English language learners. It also examined the relationship between ASR-scored EOR and other language proficiency measures and the ability of the ASR to rate speakers without bias to gender or native…

  9. Automatic evaluation of hypernasality based on a cleft palate speech database.

    PubMed

    He, Ling; Zhang, Jing; Liu, Qi; Yin, Heng; Lech, Margaret; Huang, Yunzhi

    2015-05-01

    Hypernasality is one of the most typical characteristics of cleft palate (CP) speech. The evaluation outcome of hypernasality grading decides the necessity of follow-up surgery. Currently, the evaluation of CP speech is carried out by experienced speech therapists. However, the result strongly depends on their clinical experience and subjective judgment. This work aims to propose an automatic evaluation system for hypernasality grading in CP speech. The database tested in this work was collected by the Hospital of Stomatology, Sichuan University, which has the largest number of CP patients in China. Based on the production process of hypernasality, source sound pulse and vocal tract filter features are presented. These features include pitch, the first and second energy-amplified frequency bands, cepstrum-based features, MFCCs, and short-time energy in sub-bands. These features, combined with a KNN classifier, are applied to automatically classify four grades of hypernasality: normal, mild, moderate, and severe. The experimental results show that the proposed system achieves good performance. The classification rates for the four hypernasality grades reach up to 80.4%. The sensitivity of the proposed features to gender is also discussed.
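
    A stripped-down version of the classification stage described above (acoustic features plus a KNN classifier over four hypernasality grades) could look like the following. The MFCC-summary features are a simplification of the source/filter feature set in the paper, and the file names, labels, and value of k are illustrative.

```python
# Minimal sketch: four-class hypernasality grading with MFCC-summary features and KNN.
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

GRADES = ["normal", "mild", "moderate", "severe"]

def utterance_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Summarise the frame sequence by per-coefficient mean and standard deviation.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train(paths, grade_indices, k=5):
    X = np.vstack([utterance_features(p) for p in paths])
    return KNeighborsClassifier(n_neighbors=k).fit(X, grade_indices)

# Usage (hypothetical files and labels):
# model = train(["cp_001.wav", "cp_002.wav", "cp_003.wav"], [0, 2, 3])
# print(GRADES[model.predict([utterance_features("cp_new.wav")])[0]])
```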

  10. Primary Progressive Speech Abulia.

    PubMed

    Milano, Nicholas J; Heilman, Kenneth M

    2015-01-01

    Primary progressive aphasia (PPA) is a neurodegenerative disorder characterized by progressive language impairment. The three variants of PPA include the nonfluent/agrammatic, semantic, and logopenic types. The goal of this report is to describe two patients with a loss of speech initiation that was associated with bilateral medial frontal atrophy. Two patients with progressive speech deficits were evaluated, and their examinations revealed a paucity of spontaneous speech; however, their naming, repetition, reading, and writing were all normal. The patients had no evidence of agrammatism or apraxia of speech but did have impaired speech fluency. In addition to impaired production of propositional spontaneous speech, these patients had impaired production of automatic speech (e.g., reciting the Lord's Prayer) and singing. Structural brain imaging revealed bilateral medial frontal atrophy in both patients. These patients' language deficits are consistent with a PPA, but follow the pattern of a dynamic aphasia. Although the signs and symptoms of dynamic aphasia have been described previously, to our knowledge these are the first cases associated with predominantly bilateral medial frontal atrophy that impaired both propositional and automatic speech. Thus, this profile may represent a new variant of PPA.

  11. Use of Computer Speech Technologies To Enhance Learning.

    ERIC Educational Resources Information Center

    Ferrell, Joe

    1999-01-01

    Discusses the design of an innovative learning system that uses new technologies for the man-machine interface, incorporating a combination of Automatic Speech Recognition (ASR) and Text To Speech (TTS) synthesis. Highlights include using speech technologies to mimic the attributes of the ideal tutor and design features. (AEF)

  12. Visual speech influences speech perception immediately but not automatically.

    PubMed

    Mitterer, Holger; Reinisch, Eva

    2017-02-01

    Two experiments examined the time course of the use of auditory and visual speech cues to spoken word recognition using an eye-tracking paradigm. Results of the first experiment showed that the use of visual speech cues from lipreading is reduced if concurrently presented pictures require a division of attentional resources. This reduction was evident even when listeners' eye gaze was on the speaker rather than the (static) pictures. Experiment 2 used a deictic hand gesture to foster attention to the speaker. At the same time, the visual processing load was reduced by keeping the visual display constant over a fixed number of successive trials. Under these conditions, the visual speech cues from lipreading were used. Moreover, the eye-tracking data indicated that visual information was used immediately and even earlier than auditory information. In combination, these data indicate that visual speech cues are not used automatically, but if they are used, they are used immediately.

  13. Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers

    NASA Astrophysics Data System (ADS)

    Caballero Morales, Santiago Omar; Cox, Stephen J.

    2009-12-01

    Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.

  14. Automatic speech recognition using a predictive echo state network classifier.

    PubMed

    Skowronski, Mark D; Harris, John G

    2007-04-01

    We have combined an echo state network (ESN) with a competitive state machine framework to create a classification engine called the predictive ESN classifier. We derive the expressions for training the predictive ESN classifier and show that the model was significantly more noise robust than a hidden Markov model in noisy speech classification experiments, by 8 ± 1 dB signal-to-noise ratio. The simple training algorithm and noise robustness of the predictive ESN classifier make it an attractive classification engine for automatic speech recognition.
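
    The reservoir mechanics underlying an ESN can be sketched briefly: a fixed random recurrent matrix, scaled to a chosen spectral radius, maps an input sequence to a sequence of states on which a simple linear readout is trained. The sketch below shows only this generic reservoir; the competitive, predictive classification scheme of the paper is not reproduced, and all sizes and constants are illustrative.

```python
# Minimal sketch of an echo state network reservoir driven by a feature sequence.
import numpy as np

class EchoStateNetwork:
    def __init__(self, n_in, n_res=200, spectral_radius=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
        W = rng.normal(size=(n_res, n_res))
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # scale to radius
        self.W = W

    def states(self, inputs):
        """inputs: (T, n_in) feature sequence -> (T, n_res) reservoir states."""
        x = np.zeros(self.W.shape[0])
        out = []
        for u in inputs:
            x = np.tanh(self.W_in @ u + self.W @ x)   # leaky term omitted for brevity
            out.append(x.copy())
        return np.array(out)

# Toy usage: map a random 13-dim "MFCC" sequence to reservoir states; any linear
# readout (e.g., ridge regression) can then be trained on these states.
esn = EchoStateNetwork(n_in=13)
states = esn.states(np.random.default_rng(1).normal(size=(50, 13)))
print(states.shape)   # (50, 200)
```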

  15. Temporal attractors for speech onsets

    NASA Astrophysics Data System (ADS)

    Port, Robert; Oglesbee, Eric

    2003-10-01

    When subjects say a single syllable like da in time with a metronome, what is the easiest relationship? Superimposed on the metronome pulse, of course. The second easiest way is probably to locate the syllable halfway between pulses. We tested these hypotheses by having subjects repeat da at both phase angles at a range of metronome rates. The vowel onset (or P-center) was automatically obtained for each token. In-phase targets were produced close to the metronome onset for rates as fast as 3 per second. Antiphase targets were accurate at slow rates (~2/s) but tended to slip to inphase timing with faster metronomes. These results resemble the findings of Haken et al. [Biol. Cybern. 51, 347-356 (1985)] for oscillatory finger motions. Results suggest a strong attractor for speech onsets at zero phase and a weaker attractor at phase 0.5 that may disappear as rate is increased.

  16. Noise-robust speech recognition through auditory feature detection and spike sequence decoding.

    PubMed

    Schafer, Phillip B; Jin, Dezhe Z

    2014-03-01

    Speech recognition in noisy conditions is a major challenge for computer systems, but the human brain performs it routinely and accurately. Automatic speech recognition (ASR) systems that are inspired by neuroscience can potentially bridge the performance gap between humans and machines. We present a system for noise-robust isolated word recognition that works by decoding sequences of spikes from a population of simulated auditory feature-detecting neurons. Each neuron is trained to respond selectively to a brief spectrotemporal pattern, or feature, drawn from the simulated auditory nerve response to speech. The neural population conveys the time-dependent structure of a sound by its sequence of spikes. We compare two methods for decoding the spike sequences--one using a hidden Markov model-based recognizer, the other using a novel template-based recognition scheme. In the latter case, words are recognized by comparing their spike sequences to template sequences obtained from clean training data, using a similarity measure based on the length of the longest common sub-sequence. Using isolated spoken digits from the AURORA-2 database, we show that our combined system outperforms a state-of-the-art robust speech recognizer at low signal-to-noise ratios. Both the spike-based encoding scheme and the template-based decoding offer gains in noise robustness over traditional speech recognition methods. Our system highlights potential advantages of spike-based acoustic coding and provides a biologically motivated framework for robust ASR development.
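
    The template-matching decoder described above scores a test utterance against stored templates by the length of the longest common subsequence (LCS) of their spike sequences. The sketch below shows that similarity measure on toy sequences of neuron indices; the normalisation and the toy data are assumptions, not the paper's exact scheme.

```python
# Minimal sketch: LCS-based template matching over toy spike sequences (neuron ids).
def lcs_length(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming LCS length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def recognise(test_seq, templates):
    """templates: dict word -> list of spike sequences; similarity normalised by length."""
    def score(t):
        return lcs_length(test_seq, t) / max(len(test_seq), len(t))
    return max(templates, key=lambda word: max(score(t) for t in templates[word]))

templates = {
    "three": [[4, 7, 7, 2, 9, 1], [4, 7, 2, 9, 9, 1]],
    "eight": [[5, 3, 8, 8, 6], [5, 3, 8, 6, 6]],
}
print(recognise([4, 7, 2, 9, 1], templates))   # -> "three"
```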

  17. The Use of an Autonomous Pedagogical Agent and Automatic Speech Recognition for Teaching Sight Words to Students with Autism Spectrum Disorder

    ERIC Educational Resources Information Center

    Saadatzi, Mohammad Nasser; Pennington, Robert C.; Welch, Karla C.; Graham, James H.; Scott, Renee E.

    2017-01-01

    In the current study, we examined the effects of an instructional package comprised of an autonomous pedagogical agent, automatic speech recognition, and constant time delay during the instruction of reading sight words aloud to young adults with autism spectrum disorder. We used a concurrent multiple baseline across participants design to…

  18. Voice technology and BBN

    NASA Technical Reports Server (NTRS)

    Wolf, Jared J.

    1977-01-01

    The following research was discussed: (1) speech signal processing; (2) automatic speech recognition; (3) continuous speech understanding; (4) speaker recognition; (5) speech compression; (6) subjective and objective evaluation of speech communication systems; (7) measurement of the intelligibility and quality of speech when degraded by noise or other masking stimuli; (8) speech synthesis; (9) instructional aids for second-language learning and for training of the deaf; and (10) investigation of speech correlates of psychological stress. Experimental psychology, control systems, and human factors engineering, which are often relevant to the proper design and operation of speech systems, are also described.

  19. Presentation video retrieval using automatically recovered slide and spoken text

    NASA Astrophysics Data System (ADS)

    Cooper, Matthew

    2013-03-01

    Video is becoming a prevalent medium for e-learning. Lecture videos contain text information in both the presentation slides and lecturer's speech. This paper examines the relative utility of automatically recovered text from these sources for lecture video retrieval. To extract the visual information, we automatically detect slides within the videos and apply optical character recognition to obtain their text. Automatic speech recognition is used similarly to extract spoken text from the recorded audio. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Results reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Experiments demonstrate that automatically extracted slide text enables higher precision video retrieval than automatically recovered spoken text.
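
    Once slide text and spoken text have been recovered, retrieval itself can be as simple as TF-IDF indexing of the concatenated texts and cosine ranking against a query, as sketched below. The example documents and the single combined index are illustrative assumptions; the paper compares the two text sources separately rather than prescribing this pipeline.

```python
# Minimal sketch: index lecture videos by concatenated slide text + spoken text,
# then rank them against a query with TF-IDF and cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

videos = {
    "lec01": {"slides": "gradient descent convergence", "speech": "today we derive the update rule"},
    "lec02": {"slides": "convolutional networks pooling", "speech": "filters slide over the image"},
}
docs = {vid: parts["slides"] + " " + parts["speech"] for vid, parts in videos.items()}

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs.values())

def search(query):
    sims = cosine_similarity(vectorizer.transform([query]), doc_matrix).ravel()
    return sorted(zip(docs.keys(), sims), key=lambda kv: -kv[1])

print(search("gradient descent update"))   # lec01 should rank first
```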

  20. User Evaluation of a Communication System That Automatically Generates Captions to Improve Telephone Communication

    PubMed Central

    Zekveld, Adriana A.; Kramer, Sophia E.; Kessens, Judith M.; Vlaming, Marcel S. M. G.; Houtgast, Tammo

    2009-01-01

    This study examined the subjective benefit obtained from automatically generated captions during telephone-speech comprehension in the presence of babble noise. Short stories were presented by telephone either with or without captions that were generated offline by an automatic speech recognition (ASR) system. To simulate online ASR, the word accuracy (WA) level of the captions was 60% or 70%, and the text was presented with a delay relative to the speech. After each test, the hearing-impaired participants (n = 20) completed the NASA Task Load Index and several rating scales evaluating the support from the captions. Participants indicated that using the erroneous text in speech comprehension was difficult, and the reported task load did not differ between the audio + text and audio-only conditions. In a follow-up experiment (n = 10), the perceived benefit of presenting captions increased with an increase of the WA levels to 80% and 90%, and with elimination of the text delay. However, in general, the task load did not decrease when captions were presented. These results suggest that the extra effort required to process the text could have been compensated for by less effort required to comprehend the speech. Future research should aim at reducing the complexity of the task to increase the willingness of hearing-impaired persons to use an assistive communication system that automatically provides captions. The current results underline the need for obtaining both objective and subjective measures of benefit when evaluating assistive communication systems. PMID:19126551
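
    The word accuracy (WA) figure used above is conventionally derived from a Levenshtein alignment between the reference transcript and the caption, one common definition being WA = 1 - (S + D + I) / N. A minimal sketch of that computation follows; the example sentences are illustrative and no tie-breaking between error types is attempted.

```python
# Minimal sketch: word accuracy of a caption against a reference transcript,
# computed from word-level Levenshtein edit distance.
def word_accuracy(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return 1.0 - dp[-1][-1] / len(ref)

print(word_accuracy("the story was read over the phone",
                    "the story was red over phone"))   # ~0.71
```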

  1. Second Language Learners and Speech Act Comprehension

    ERIC Educational Resources Information Center

    Holtgraves, Thomas

    2007-01-01

    Recognizing the specific speech act (Searle, 1969) that a speaker performs with an utterance is a fundamental feature of pragmatic competence. Past research has demonstrated that native speakers of English automatically recognize speech acts when they comprehend utterances (Holtgraves & Ashley, 2001). The present research examined whether this…

  2. Sentence Recognition Prediction for Hearing-impaired Listeners in Stationary and Fluctuation Noise With FADE

    PubMed Central

    Schädler, Marc René; Warzybok, Anna; Meyer, Bernd T.; Brand, Thomas

    2016-01-01

    To characterize the individual patient’s hearing impairment as obtained with the matrix sentence recognition test, a simulation Framework for Auditory Discrimination Experiments (FADE) is extended here using the Attenuation and Distortion (A+D) approach by Plomp as a blueprint for setting the individual processing parameters. FADE has been shown to predict the outcome of both speech recognition tests and psychoacoustic experiments based on simulations using an automatic speech recognition system requiring only few assumptions. It builds on the closed-set matrix sentence recognition test which is advantageous for testing individual speech recognition in a way comparable across languages. Individual predictions of speech recognition thresholds in stationary and in fluctuating noise were derived using the audiogram and an estimate of the internal level uncertainty for modeling the individual Plomp curves fitted to the data with the Attenuation (A-) and Distortion (D-) parameters of the Plomp approach. The “typical” audiogram shapes from Bisgaard et al with or without a “typical” level uncertainty and the individual data were used for individual predictions. As a result, the individualization of the level uncertainty was found to be more important than the exact shape of the individual audiogram to accurately model the outcome of the German Matrix test in stationary or fluctuating noise for listeners with hearing impairment. The prediction accuracy of the individualized approach also outperforms the (modified) Speech Intelligibility Index approach which is based on the individual threshold data only. PMID:27604782

  3. Two Methods of Automatic Evaluation of Speech Signal Enhancement Recorded in the Open-Air MRI Environment

    NASA Astrophysics Data System (ADS)

    Přibil, Jiří; Přibilová, Anna; Frollo, Ivan

    2017-12-01

    The paper focuses on two methods for evaluating the success of enhancement of speech signals recorded in an open-air magnetic resonance imager during phonation for 3D human vocal tract modeling. The first approach enables a comparison based on statistical analysis with ANOVA and hypothesis tests. The second method is based on classification by Gaussian mixture models (GMM). The experiments performed have confirmed that the proposed ANOVA and GMM classifiers for automatic evaluation of speech quality are functional and produce results fully comparable with the standard evaluation based on the listening test method.

  4. Children's Perception of Conversational and Clear American-English Vowels in Noise

    ERIC Educational Resources Information Center

    Leone, Dorothy; Levy, Erika S.

    2015-01-01

    Purpose: Much of a child's day is spent listening to speech in the presence of background noise. Although accurate vowel perception is important for listeners' accurate speech perception and comprehension, little is known about children's vowel perception in noise. "Clear speech" is a speech style frequently used by talkers in the…

  5. Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems.

    PubMed

    Greene, Beth G; Logan, John S; Pisoni, David B

    1986-03-01

    We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered.

  6. Performance of Automated Speech Scoring on Different Low- to Medium-Entropy Item Types for Low-Proficiency English Learners. Research Report. ETS RR-17-12

    ERIC Educational Resources Information Center

    Loukina, Anastassia; Zechner, Klaus; Yoon, Su-Youn; Zhang, Mo; Tao, Jidong; Wang, Xinhao; Lee, Chong Min; Mulholland, Matthew

    2017-01-01

    This report presents an overview of the "SpeechRater" automated scoring engine model building and evaluation process for several item types with a focus on a low-English-proficiency test-taker population. We discuss each stage of speech scoring, including automatic speech recognition, filtering models for nonscorable responses, and…

  7. Biologically-Inspired Spike-Based Automatic Speech Recognition of Isolated Digits Over a Reproducing Kernel Hilbert Space

    PubMed Central

    Li, Kan; Príncipe, José C.

    2018-01-01

    This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime. PMID:29666568

  8. Biologically-Inspired Spike-Based Automatic Speech Recognition of Isolated Digits Over a Reproducing Kernel Hilbert Space.

    PubMed

    Li, Kan; Príncipe, José C

    2018-01-01

    This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime.
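
    The front end described in the two records above converts audio into multi-channel spike trains with leaky integrate-and-fire (LIF) units. The sketch below drives one LIF unit per mel band and records threshold crossings as spikes; the filterbank, time constants, thresholds, and scaling are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of a leaky integrate-and-fire (LIF) front end: each band-energy
# channel drives one LIF unit, and threshold crossings become spike times.
import numpy as np
import librosa

def lif_spikes(drive, dt=0.01, tau=0.05, threshold=1.0):
    """drive: (T,) non-negative input; returns indices of frames where a spike fires."""
    v, spikes = 0.0, []
    for t, i_in in enumerate(drive):
        v += dt * (-v / tau + i_in)        # leaky integration
        if v >= threshold:
            spikes.append(t)
            v = 0.0                        # reset after firing
    return spikes

def audio_to_spike_trains(path, n_mels=20):
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=160)
    mel = mel / (mel.max() + 1e-12)        # normalise drive to [0, 1]
    return [lif_spikes(channel * 50.0) for channel in mel]   # one spike train per band

# Usage (hypothetical file): trains = audio_to_spike_trains("digit.wav")
```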

  9. Automatic translation among spoken languages

    NASA Technical Reports Server (NTRS)

    Walter, Sharon M.; Costigan, Kelly

    1994-01-01

    The Machine Aided Voice Translation (MAVT) system was developed in response to the shortage of experienced military field interrogators with both foreign language proficiency and interrogation skills. Combining speech recognition, machine translation, and speech generation technologies, the MAVT accepts an interrogator's spoken English question and translates it into spoken Spanish. The spoken Spanish response of the potential informant can then be translated into spoken English. Potential military and civilian applications for automatic spoken language translation technology are discussed in this paper.

  10. Speech Recognition Software for Language Learning: Toward an Evaluation of Validity and Student Perceptions

    ERIC Educational Resources Information Center

    Cordier, Deborah

    2009-01-01

    A renewed focus on foreign language (FL) learning and speech for communication has resulted in computer-assisted language learning (CALL) software developed with Automatic Speech Recognition (ASR). ASR features for FL pronunciation (Lafford, 2004) are functional components of CALL designs used for FL teaching and learning. The ASR features…

  11. The Path to Language: Toward Bilingual Education for Deaf Children.

    ERIC Educational Resources Information Center

    Bouvet, Danielle

    Discussion of speech instruction in bilingual education for deaf children refutes the assumption that speech is acquired automatically by hearing children and examines a program in which deaf children are taught alongside hearing children. The first part looks at how speech functions and how children acquire it: including the nature of the…

  12. Cognitive Load in Voice Therapy Carry-Over Exercises

    ERIC Educational Resources Information Center

    Iwarsson, Jenny; Morris, David Jackson; Balling, Laura Winther

    2017-01-01

    Purpose: The cognitive load generated by online speech production may vary with the nature of the speech task. This article examines 3 speech tasks used in voice therapy carry-over exercises, in which a patient is required to adopt and automatize new voice behaviors, ultimately in daily spontaneous communication. Method: Twelve subjects produced…

  13. Automatic Speech Acquisition and Recognition for Spacesuit Audio Systems

    NASA Technical Reports Server (NTRS)

    Ye, Sherry

    2015-01-01

    NASA has a widely recognized but unmet need for novel human-machine interface technologies that can facilitate communication during astronaut extravehicular activities (EVAs), when loud noises and strong reverberations inside spacesuits make communication challenging. WeVoice, Inc., has developed a multichannel signal-processing method for speech acquisition in noisy and reverberant environments that enables automatic speech recognition (ASR) technology inside spacesuits. The technology reduces noise by exploiting differences between the statistical nature of signals (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, ASR accuracy can be improved to the level at which crewmembers will find the speech interface useful. System components and features include beam forming/multichannel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, and ASR decoding. Arithmetic complexity models were developed and will help designers of real-time ASR systems select proper tasks when confronted with constraints in computational resources. In Phase I of the project, WeVoice validated the technology. The company further refined the technology in Phase II and developed a prototype for testing and use by suited astronauts.

  14. Automatic Analysis of Pronunciations for Children with Speech Sound Disorders.

    PubMed

    Dudy, Shiran; Bedrick, Steven; Asgari, Meysam; Kain, Alexander

    2018-07-01

    Computer-Assisted Pronunciation Training (CAPT) systems aim to help a child learn the correct pronunciations of words. However, while there are many online commercial CAPT apps, there is no consensus among Speech Language Therapists (SLPs) or non-professionals about which CAPT systems, if any, work well. The prevailing assumption is that practicing with such programs is less reliable and thus does not provide the feedback necessary to allow children to improve their performance. The most common method for assessing pronunciation performance is the Goodness of Pronunciation (GOP) technique. Our paper proposes two new GOP techniques. We have found that pronunciation models that use explicit knowledge about pronunciation error patterns can lead to more accurate classification of whether a phoneme was correctly pronounced. We evaluate the proposed pronunciation assessment methods against a baseline state-of-the-art GOP approach, and show that the proposed techniques lead to classification performance that is closer to that of a human expert.

  15. Performance of wavelet analysis and neural networks for pathological voices identification

    NASA Astrophysics Data System (ADS)

    Salhi, Lotfi; Talbi, Mourad; Abid, Sabeur; Cherif, Adnane

    2011-09-01

    Within the medical environment, diverse techniques exist to assess the state of a patient's voice. The inspection technique is inconvenient for a number of reasons, such as its high cost, the duration of the inspection, and above all, the fact that it is an invasive technique. This study focuses on a robust, rapid and accurate system for automatic identification of pathological voices. This system employs a non-invasive, inexpensive and fully automated method based on a hybrid approach: wavelet transform analysis and a neural network classifier. First, we present the results obtained in our previous study while using classic feature parameters. These results allow visual identification of pathological voices. Second, quantified parameters derived from the wavelet analysis are proposed to characterise the speech sample. In addition, a system of multilayer neural networks (MNNs) has been developed which carries out the automatic detection of pathological voices. The developed method was evaluated using a voice database composed of recorded voice samples (continuous speech) from normophonic or dysphonic speakers. The dysphonic speakers were patients of the National Hospital 'RABTA' of Tunis, Tunisia, and a University Hospital in Brussels, Belgium. Experimental results indicate a success rate ranging between 75% and 98.61% for discrimination of normal and pathological voices using the proposed parameters and neural network classifier. We also compared the average classification rate based on the MNN, Gaussian mixture model and support vector machines.
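
    As a rough illustration of the wavelet/neural-network pipeline described above (not the authors' code), one might use the relative energies of the discrete wavelet subbands as features and feed them to a small multilayer perceptron; the wavelet family, decomposition depth, and network size here are assumptions.

      import numpy as np
      import pywt
      from sklearn.neural_network import MLPClassifier

      def wavelet_energy_features(signal, wavelet="db4", level=5):
          """Relative energy of each wavelet subband as a compact voice descriptor."""
          coeffs = pywt.wavedec(signal, wavelet, level=level)
          energies = np.array([np.sum(c ** 2) for c in coeffs])
          return energies / energies.sum()

      def train_pathology_classifier(samples, labels):
          """samples: list of 1-D voice signals; labels: 0 = normophonic, 1 = dysphonic."""
          feats = np.array([wavelet_energy_features(s) for s in samples])
          clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
          clf.fit(feats, labels)
          return clf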

  16. Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications.

    PubMed

    VanDam, Mark; Silbert, Noah H

    2016-01-01

    Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 classification judgements of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP systems, such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output.

  17. Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications

    PubMed Central

    2016-01-01

    Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 classification judgements of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP systems, such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output. PMID:27529813

  18. Massively-Parallel Architectures for Automatic Recognition of Visual Speech Signals

    DTIC Science & Technology

    1988-10-12

    Terrence J... …characteristics of speech from the visual speech signals. Neural networks have been trained on a database of vowels. The raw images of faces, aligned and preprocessed, were used as input to these networks, which were trained to estimate the corresponding envelope of the…

  19. Speech Recognition as a Support Service for Deaf and Hard of Hearing Students: Adaptation and Evaluation. Final Report to Spencer Foundation.

    ERIC Educational Resources Information Center

    Stinson, Michael; Elliot, Lisa; McKee, Barbara; Coyne, Gina

    This report discusses a project that adapted new automatic speech recognition (ASR) technology to provide real-time speech-to-text transcription as a support service for students who are deaf and hard of hearing (D/HH). In this system, as the teacher speaks, a hearing intermediary, or captionist, dictates into the speech recognition system in a…

  20. Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems

    PubMed Central

    GREENE, BETH G.; LOGAN, JOHN S.; PISONI, DAVID B.

    2012-01-01

    We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered. PMID:23225916

  1. Fully Automatic Speech-Based Analysis of the Semantic Verbal Fluency Task.

    PubMed

    König, Alexandra; Linz, Nicklas; Tröger, Johannes; Wolters, Maria; Alexandersson, Jan; Robert, Phillipe

    2018-06-08

    Semantic verbal fluency (SVF) tests are routinely used in screening for mild cognitive impairment (MCI). In this task, participants name as many items as possible of a semantic category under a time constraint. Clinicians measure task performance manually by summing the number of correct words and errors. More fine-grained variables add valuable information to clinical assessment, but are time-consuming to derive manually. Therefore, the aim of this study is to investigate whether automatic analysis of the SVF could provide these measures as accurately as manual annotation and thus support qualitative screening of neurocognitive impairment. SVF data were collected from 95 older people with MCI (n = 47), Alzheimer's or related dementias (ADRD; n = 24), and healthy controls (HC; n = 24). All data were annotated manually and automatically with clusters and switches. The obtained metrics were validated using a classifier to distinguish HC, MCI, and ADRD. Automatically extracted clusters and switches were highly correlated (r = 0.9) with manually established values, and performed as well on the classification task separating HC from persons with ADRD (area under curve [AUC] = 0.939) and MCI (AUC = 0.758). The results show that it is possible to automate fine-grained analyses of SVF data for the assessment of cognitive decline. © 2018 S. Karger AG, Basel.
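
    A hedged sketch of the kind of cluster/switch bookkeeping such a pipeline automates is shown below; the toy subcategory lexicon and the simple adjacency rule are illustrative assumptions, not the published method, which a real system would replace with a curated taxonomy or word-embedding similarity.

      # Count semantic clusters and switches in a verbal-fluency word list.
      SUBCATEGORY = {                      # toy lexicon for the "animals" category
          "dog": "pets", "cat": "pets", "hamster": "pets",
          "lion": "wild", "tiger": "wild", "elephant": "wild",
          "cow": "farm", "sheep": "farm", "goat": "farm",
      }

      def clusters_and_switches(words):
          labels = [SUBCATEGORY.get(w.lower(), "unknown") for w in words]
          switches = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
          clusters = switches + 1 if labels else 0
          return clusters, switches

      print(clusters_and_switches(["dog", "cat", "lion", "tiger", "cow"]))
      # -> (3, 2): three semantic runs separated by two switches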

  2. Sentence Recognition Prediction for Hearing-impaired Listeners in Stationary and Fluctuation Noise With FADE: Empowering the Attenuation and Distortion Concept by Plomp With a Quantitative Processing Model.

    PubMed

    Kollmeier, Birger; Schädler, Marc René; Warzybok, Anna; Meyer, Bernd T; Brand, Thomas

    2016-09-07

    To characterize the individual patient's hearing impairment as obtained with the matrix sentence recognition test, a simulation Framework for Auditory Discrimination Experiments (FADE) is extended here using the Attenuation and Distortion (A+D) approach by Plomp as a blueprint for setting the individual processing parameters. FADE has been shown to predict the outcome of both speech recognition tests and psychoacoustic experiments based on simulations using an automatic speech recognition system requiring only few assumptions. It builds on the closed-set matrix sentence recognition test which is advantageous for testing individual speech recognition in a way comparable across languages. Individual predictions of speech recognition thresholds in stationary and in fluctuating noise were derived using the audiogram and an estimate of the internal level uncertainty for modeling the individual Plomp curves fitted to the data with the Attenuation (A-) and Distortion (D-) parameters of the Plomp approach. The "typical" audiogram shapes from Bisgaard et al with or without a "typical" level uncertainty and the individual data were used for individual predictions. As a result, the individualization of the level uncertainty was found to be more important than the exact shape of the individual audiogram to accurately model the outcome of the German Matrix test in stationary or fluctuating noise for listeners with hearing impairment. The prediction accuracy of the individualized approach also outperforms the (modified) Speech Intelligibility Index approach which is based on the individual threshold data only. © The Author(s) 2016.

  3. An Evaluation of Output Signal to Noise Ratio as a Predictor of Cochlear Implant Speech Intelligibility.

    PubMed

    Watkins, Greg D; Swanson, Brett A; Suaning, Gregg J

    2018-02-22

    Cochlear implant (CI) sound processing strategies are usually evaluated in clinical studies involving experienced implant recipients. Metrics which estimate the capacity to perceive speech for a given set of audio and processing conditions provide an alternative means to assess the effectiveness of processing strategies. The aim of this research was to assess the ability of the output signal to noise ratio (OSNR) to accurately predict speech perception. It was hypothesized that, compared with the other metrics evaluated in this study, (1) OSNR would have equivalent or better accuracy and (2) OSNR would be the most accurate in the presence of variable levels of speech presentation. For the first time, the accuracy of OSNR as a metric which predicts speech intelligibility was compared, in a retrospective study, with that of the input signal to noise ratio (ISNR) and the short-term objective intelligibility (STOI) metric. Because STOI measured audio quality at the input to a CI sound processor, a vocoder was applied to the sound processor output and STOI was also calculated for the reconstructed audio signal (vocoder short-term objective intelligibility [VSTOI] metric). The figures of merit calculated for each metric were the Pearson correlation of the metric and a psychometric function fitted to sentence scores at each predictor value (Pearson sigmoidal correlation [PSIG]), the epsilon-insensitive root mean square error (RMSE*) of the psychometric function and the sentence scores, and the statistical deviance of the fitted curve to the sentence scores (D). Sentence scores were taken from three existing data sets of Australian Sentence Test in Noise (AuSTIN) results. The AuSTIN tests were conducted with experienced users of the Nucleus CI system. The score for each sentence was the proportion of morphemes the participant correctly repeated. In data set 1, all sentences were presented at 65 dB sound pressure level (SPL) in the presence of four-talker babble noise. Each block of sentences used an adaptive procedure, with the speech presented at a fixed level and the ISNR varied. In data set 2, sentences were presented at 65 dB SPL in the presence of stationary speech-weighted noise, street-side city noise, and cocktail party noise. An adaptive ISNR procedure was used. In data set 3, sentences were presented at levels ranging from 55 to 89 dB SPL with two automatic gain control configurations and two fixed ISNRs. For data set 1, the ISNR and OSNR were equally most accurate. STOI was significantly different for deviance (p = 0.045) and RMSE* (p < 0.001). VSTOI was significantly different for RMSE* (p < 0.001). For data set 2, ISNR and OSNR had an equivalent accuracy which was significantly better than that of STOI for PSIG (p = 0.029) and VSTOI for deviance (p = 0.001), RMSE*, and PSIG (both p < 0.001). For data set 3, OSNR was the most accurate metric and was significantly more accurate than VSTOI for deviance, RMSE*, and PSIG (all p < 0.001). ISNR and STOI were unable to predict the sentence scores for this data set. The study results supported the hypotheses. OSNR was found to have an accuracy equivalent to or better than ISNR, STOI, and VSTOI for tests conducted at a fixed presentation level and variable ISNR. OSNR was a more accurate metric than VSTOI for tests with fixed ISNRs and variable presentation levels. Overall, OSNR was the most accurate metric across the three data sets. OSNR holds promise as a prediction metric which could potentially improve the effectiveness of sound processor research and CI fitting.
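
    As a loose illustration of the PSIG figure of merit described above, the sketch below fits a logistic psychometric function to sentence scores as a function of a predictor metric and correlates the fit with the scores; the logistic form, the starting values, and the plain RMSE in place of the epsilon-insensitive RMSE* are assumptions of this sketch, not the study's exact procedure.

      import numpy as np
      from scipy.optimize import curve_fit
      from scipy.stats import pearsonr

      def psychometric(x, x50, slope):
          """Logistic psychometric function returning proportion correct."""
          return 1.0 / (1.0 + np.exp(-slope * (x - x50)))

      def psig_and_rmse(metric, scores):
          """metric: predictor values (e.g., OSNR in dB); scores: proportion correct."""
          params, _ = curve_fit(psychometric, metric, scores,
                                p0=[float(np.median(metric)), 1.0])
          fitted = psychometric(np.asarray(metric), *params)
          psig = pearsonr(fitted, scores)[0]                  # "Pearson sigmoidal" correlation
          rmse = float(np.sqrt(np.mean((fitted - np.asarray(scores)) ** 2)))
          return psig, rmse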

  4. Intra- and Inter-database Study for Arabic, English, and German Databases: Do Conventional Speech Features Detect Voice Pathology?

    PubMed

    Ali, Zulfiqar; Alsulaiman, Mansour; Muhammad, Ghulam; Elamvazuthi, Irraivan; Al-Nasheri, Ahmed; Mesallam, Tamer A; Farahat, Mohamed; Malki, Khalid H

    2017-05-01

    A large population around the world has voice complications. Various approaches for subjective and objective evaluations have been suggested in the literature. The subjective approach strongly depends on the experience and area of expertise of a clinician, and human error cannot be neglected. On the other hand, the objective or automatic approach is noninvasive. Automatically developed systems can provide complementary information that may be helpful for a clinician in the early screening of a voice disorder. At the same time, automatic systems can be deployed in remote areas where a general practitioner can use them and may refer the patient to a specialist to avoid complications that may be life threatening. Many automatic systems for disorder detection have been developed by applying different types of conventional speech features such as the linear prediction coefficients, linear prediction cepstral coefficients, and Mel-frequency cepstral coefficients (MFCCs). This study aims to ascertain whether conventional speech features detect voice pathology reliably, and whether they can be correlated with voice quality. To investigate this, an automatic detection system based on MFCC was developed, and three different voice disorder databases were used in this study. The experimental results suggest that the accuracy of the MFCC-based system varies from database to database. The detection rate for intra-database experiments ranges from 72% to 95%, and that for inter-database experiments from 47% to 82%. The results indicate that conventional speech features are not correlated with voice quality, and hence are not reliable in pathology detection. Copyright © 2017 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
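
    A minimal sketch of an MFCC-based detector along the lines described above is given below; the library choice, the recording-level statistics, and the support vector machine classifier are assumptions made here for illustration, not details taken from the paper.

      import numpy as np
      import librosa
      from sklearn.svm import SVC

      def mfcc_features(path, n_mfcc=13):
          """Mean and standard deviation of frame-level MFCCs for one recording."""
          y, sr = librosa.load(path, sr=None)
          mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
          return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

      def intra_or_inter_accuracy(train_paths, train_labels, test_paths, test_labels):
          """Train on one database and test on the same (intra) or another (inter)."""
          Xtr = np.array([mfcc_features(p) for p in train_paths])
          Xte = np.array([mfcc_features(p) for p in test_paths])
          clf = SVC(kernel="rbf").fit(Xtr, train_labels)
          return clf.score(Xte, test_labels)      # detection accuracy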

  5. Reference-free automatic quality assessment of tracheoesophageal speech.

    PubMed

    Huang, Andy; Falk, Tiago H; Chan, Wai-Yip; Parsa, Vijay; Doyle, Philip

    2009-01-01

    Evaluation of the quality of tracheoesophageal (TE) speech using machines instead of human experts can enhance the voice rehabilitation process for patients who have undergone total laryngectomy and voice restoration. Towards the goal of devising a reference-free TE speech quality estimation algorithm, we investigate the efficacy of speech signal features that are used in standard telephone-speech quality assessment algorithms, in conjunction with a recently introduced speech modulation spectrum measure. Tests performed on two TE speech databases demonstrate that the modulation spectral measure and a subset of features in the standard ITU-T P.563 algorithm estimate TE speech quality with better correlation (up to 0.9) than previously proposed features.

  6. On the recognition of emotional vocal expressions: motivations for a holistic approach.

    PubMed

    Esposito, Anna; Esposito, Antonietta M

    2012-10-01

    Human beings seem to be able to recognize emotions from speech very well and information communication technology aims to implement machines and agents that can do the same. However, to be able to automatically recognize affective states from speech signals, it is necessary to solve two main technological problems. The former concerns the identification of effective and efficient processing algorithms capable of capturing emotional acoustic features from speech sentences. The latter focuses on finding computational models able to classify, with an approximation as good as human listeners, a given set of emotional states. This paper will survey these topics and provide some insights for a holistic approach to the automatic analysis, recognition and synthesis of affective states.

  7. [The endpoint detection of cough signal in continuous speech].

    PubMed

    Yang, Guoqing; Mo, Hongqiang; Li, Wen; Lian, Lianfang; Zheng, Zeguang

    2010-06-01

    The endpoint detection of cough signals in continuous speech was studied in order to improve the efficiency and accuracy of manual recognition and computer-based automatic recognition. First, the short-time zero crossing rate (ZCR) was used to identify suspicious coughs, and the short-time energy threshold was derived from the acoustic characteristics of coughs. The short-time energy was then combined with the short-time ZCR to implement endpoint detection of coughs in continuous speech. To evaluate the method, the reference number of coughs in each recording was first established by two experienced doctors using a graphical user interface (GUI). Second, the recordings were analyzed by the automatic endpoint detection program under Matlab 7.0. Finally, comparison of the two results showed that the rate of undetected coughs was 2.18%, and that 98.13% of the noise, silence and speech was removed. The way of setting the short-time energy threshold is robust, and the endpoint detection program removes most speech and noise while maintaining a low error rate.
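
    A simplified sketch of the two-feature scheme described above follows; the frame sizes, the fixed energy threshold, and the cough-like ZCR range are illustrative assumptions rather than the published thresholding procedure.

      import numpy as np

      def frame_signal(x, frame_len, hop):
          n = max(1, 1 + (len(x) - frame_len) // hop)
          return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

      def cough_candidates(x, fs, energy_thresh, zcr_lo=0.05, zcr_hi=0.35):
          """Flag frames whose short-time energy and zero-crossing rate fall in a
          cough-like range; runs of flagged frames become candidate endpoints."""
          frames = frame_signal(x, int(0.02 * fs), int(0.01 * fs))    # 20 ms / 10 ms
          energy = np.mean(frames ** 2, axis=1)
          zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
          flagged = (energy > energy_thresh) & (zcr > zcr_lo) & (zcr < zcr_hi)
          edges = np.flatnonzero(np.diff(np.concatenate(([0], flagged.astype(int), [0]))))
          return list(zip(edges[::2], edges[1::2]))   # (start, end) frame indices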

  8. [Repetitive phenomenona in the spontaneous speech of aphasic patients: perseveration, stereotypy, echolalia, automatism and recurring utterance].

    PubMed

    Wallesch, C W; Brunner, R J; Seemüller, E

    1983-12-01

    Repetitive phenomena in spontaneous speech were investigated in 30 patients with chronic infarctions of the left hemisphere which included Broca's and/or Wernicke's area and/or the basal ganglia. Perseverations, stereotypies, and echolalias occurred with all types of brain lesions; automatisms and recurring utterances occurred only in those patients whose infarctions involved Wernicke's area and the basal ganglia. These patients also showed more echolalic responses. The results are discussed in view of the role of the basal ganglia as motor program generators.

  9. An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hogden, J.

    Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.

  10. Automatic initial and final segmentation in cleft palate speech of Mandarin speakers

    PubMed Central

    Liu, Yin; Yin, Heng; Zhang, Junpeng; Zhang, Jing; Zhang, Jiang

    2017-01-01

    The speech unit segmentation is an important pre-processing step in the analysis of cleft palate speech. In Mandarin, one syllable is composed of two parts: initial and final. In cleft palate speech, the resonance disorders occur at the finals and the voiced initials, while the articulation disorders occur at the unvoiced initials. Thus, the initials and finals are the minimum speech units which could reflect the characteristics of cleft palate speech disorders. In this work, an automatic initial/final segmentation method is proposed. It is an important preprocessing step in cleft palate speech signal processing. The tested cleft palate speech utterances are collected from the Cleft Palate Speech Treatment Center in the Hospital of Stomatology, Sichuan University, which treats the largest number of cleft palate patients in China. The cleft palate speech data includes 824 speech segments, and the control samples contain 228 speech segments. First, the syllables are extracted from the speech utterances. The proposed syllable extraction method avoids the training stage and achieves a good performance for both voiced and unvoiced speech. Then, the syllables are classified into those with “quasi-unvoiced” initials and those with “quasi-voiced” initials. Respective initial/final segmentation methods are proposed for these two types of syllables. Moreover, a two-step segmentation method is proposed: the rough locations of the syllable and initial/final boundaries are refined in the second segmentation step, in order to improve the robustness of the segmentation accuracy. The experiments show that the initial/final segmentation accuracies for syllables with quasi-unvoiced initials are higher than those for syllables with quasi-voiced initials. For the cleft palate speech, the mean time error is 4.4 ms for syllables with quasi-unvoiced initials and 25.7 ms for syllables with quasi-voiced initials, and the correct segmentation accuracy P30 for all the syllables is 91.69%. For the control samples, P30 for all the syllables is 91.24%. PMID:28926572

  11. Automatic initial and final segmentation in cleft palate speech of Mandarin speakers.

    PubMed

    He, Ling; Liu, Yin; Yin, Heng; Zhang, Junpeng; Zhang, Jing; Zhang, Jiang

    2017-01-01

    The speech unit segmentation is an important pre-processing step in the analysis of cleft palate speech. In Mandarin, one syllable is composed of two parts: initial and final. In cleft palate speech, the resonance disorders occur at the finals and the voiced initials, while the articulation disorders occur at the unvoiced initials. Thus, the initials and finals are the minimum speech units which could reflect the characteristics of cleft palate speech disorders. In this work, an automatic initial/final segmentation method is proposed. It is an important preprocessing step in cleft palate speech signal processing. The tested cleft palate speech utterances are collected from the Cleft Palate Speech Treatment Center in the Hospital of Stomatology, Sichuan University, which treats the largest number of cleft palate patients in China. The cleft palate speech data includes 824 speech segments, and the control samples contain 228 speech segments. First, the syllables are extracted from the speech utterances. The proposed syllable extraction method avoids the training stage and achieves a good performance for both voiced and unvoiced speech. Then, the syllables are classified into those with "quasi-unvoiced" initials and those with "quasi-voiced" initials. Respective initial/final segmentation methods are proposed for these two types of syllables. Moreover, a two-step segmentation method is proposed: the rough locations of the syllable and initial/final boundaries are refined in the second segmentation step, in order to improve the robustness of the segmentation accuracy. The experiments show that the initial/final segmentation accuracies for syllables with quasi-unvoiced initials are higher than those for syllables with quasi-voiced initials. For the cleft palate speech, the mean time error is 4.4 ms for syllables with quasi-unvoiced initials and 25.7 ms for syllables with quasi-voiced initials, and the correct segmentation accuracy P30 for all the syllables is 91.69%. For the control samples, P30 for all the syllables is 91.24%.

  12. The Mechanical Recognition of Speech: Prospects for Use in the Teaching of Languages.

    ERIC Educational Resources Information Center

    Pulliam, Robert

    1970-01-01

    This paper begins with a brief account of the development of automatic speech recognition (ASR) and then proceeds to an examination of ASR systems typical of the kind now in operation. It is stressed that such systems, although highly developed, do not recognize speech in the same sense as the human being does, and that they cannot deal with a…

  13. Development and Perceptual Evaluation of Amplitude-Based F0 Control in Electrolarynx Speech

    ERIC Educational Resources Information Center

    Saikachi, Yoko; Stevens, Kenneth N.; Hillman, Robert E.

    2009-01-01

    Purpose: Current electrolarynx (EL) devices produce a mechanical speech quality that has been largely attributed to the lack of natural fundamental frequency (F0) variation. In order to improve the quality of EL speech, in the present study the authors aimed to develop and evaluate an automatic F0 control scheme, in which F0 was modulated based on…

  14. Segmentation and Characterization of Chewing Bouts by Monitoring Temporalis Muscle Using Smart Glasses With Piezoelectric Sensor.

    PubMed

    Farooq, Muhammad; Sazonov, Edward

    2017-11-01

    Several methods have been proposed for automatic and objective monitoring of food intake, but their performance suffers in the presence of speech and motion artifacts. This paper presents a novel sensor system and algorithms for detection and characterization of chewing bouts from a piezoelectric strain sensor placed on the temporalis muscle. The proposed data acquisition device was incorporated into the temple of eyeglasses. The system was tested by ten participants in two experiments, one under controlled laboratory conditions and the other in unrestricted free-living. The proposed food intake recognition method first performed an energy-based segmentation to isolate candidate chewing segments (instead of using epochs of fixed duration commonly reported in research literature), with the subsequent classification of the segments by linear support vector machine models. At the participant level (combining data from both laboratory and free-living experiments), with ten-fold leave-one-out cross-validation, chewing was recognized with an average F-score of 96.28% and the resultant area under the curve was 0.97, which are higher than any of the previously reported results. A multivariate regression model was used to estimate chew counts from segments classified as chewing, with an average mean absolute error of 3.83% at the participant level. These results suggest that the proposed system is able to identify chewing segments in the presence of speech and motion artifacts, as well as automatically and accurately quantify chewing behavior, both under controlled laboratory conditions and unrestricted free-living.
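
    A hedged outline of the classify-then-count pipeline described above is sketched below; the per-segment feature vector, the linear SVM settings, and the regression target are simplified assumptions rather than the published models.

      import numpy as np
      from sklearn.svm import LinearSVC
      from sklearn.linear_model import LinearRegression

      def fit_chewing_models(seg_features, is_chewing, chew_counts):
          """seg_features: (n_segments, n_features) summaries of candidate segments
          (e.g., duration, RMS, peak rate); is_chewing: 0/1 label per segment;
          chew_counts: manually counted chews for the segments labeled as chewing."""
          clf = LinearSVC(C=1.0, max_iter=10000).fit(seg_features, is_chewing)
          chewing_rows = np.asarray(is_chewing, dtype=bool)
          reg = LinearRegression().fit(seg_features[chewing_rows], chew_counts)
          return clf, reg

      def estimate_total_chews(clf, reg, new_segments):
          """Classify new candidate segments, then sum predicted chews over chewing ones."""
          chewing = clf.predict(new_segments).astype(bool)
          if not chewing.any():
              return 0.0
          return float(np.sum(reg.predict(new_segments[chewing])))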

  15. Speech recognition for embedded automatic positioner for laparoscope

    NASA Astrophysics Data System (ADS)

    Chen, Xiaodong; Yin, Qingyun; Wang, Yi; Yu, Daoyin

    2014-07-01

    In this paper a novel speech recognition methodology based on the Hidden Markov Model (HMM) is proposed for an embedded Automatic Positioner for Laparoscope (APL), which uses a fixed-point ARM processor as its core. The APL system is designed to assist the doctor in laparoscopic surgery by implementing the specific doctor's vocal control of the laparoscope. Real-time response to voice commands calls for a more efficient speech recognition algorithm for the APL. In order to reduce the computation cost without significant loss in recognition accuracy, both arithmetic and algorithmic optimizations are applied in the presented method. First, relying mostly on arithmetic optimizations, a fixed-point front-end for speech feature analysis is built to match the characteristics of the ARM processor. Then a fast likelihood computation algorithm is used to reduce the computational complexity of the HMM-based recognition algorithm. The experimental results show that the method keeps the recognition time within 0.5 s while achieving accuracy higher than 99%, demonstrating its ability to provide real-time vocal control of the APL.

  16. Hidden Markov models in automatic speech recognition

    NASA Astrophysics Data System (ADS)

    Wrzoskowicz, Adam

    1993-11-01

    This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMM's designed to operate in a noisy environment. The author also describes a language for human-robot communications which was defined as a complex multilevel network from an HMM model of speech sounds geared towards Polish inflections. The author also added mandatory lexical and syntactic rules to the system for its communications vocabulary.

  17. Objective Prediction of Hearing Aid Benefit Across Listener Groups Using Machine Learning: Speech Recognition Performance With Binaural Noise-Reduction Algorithms.

    PubMed

    Schädler, Marc R; Warzybok, Anna; Kollmeier, Birger

    2018-01-01

    The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than -20 dB could not be predicted.

  18. Objective Prediction of Hearing Aid Benefit Across Listener Groups Using Machine Learning: Speech Recognition Performance With Binaural Noise-Reduction Algorithms

    PubMed Central

    Schädler, Marc R.; Warzybok, Anna; Kollmeier, Birger

    2018-01-01

    The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than −20 dB could not be predicted. PMID:29692200

  19. Recognizing intentions in infant-directed speech: evidence for universals.

    PubMed

    Bryant, Gregory A; Barrett, H Clark

    2007-08-01

    In all languages studied to date, distinct prosodic contours characterize different intention categories of infant-directed (ID) speech. This vocal behavior likely exists universally as a species-typical trait, but little research has examined whether listeners can accurately recognize intentions in ID speech using only vocal cues, without access to semantic information. We recorded native-English-speaking mothers producing four intention categories of utterances (prohibition, approval, comfort, and attention) as both ID and adult-directed (AD) speech, and we then presented the utterances to Shuar adults (South American hunter-horticulturalists). Shuar subjects were able to reliably distinguish ID from AD speech and were able to reliably recognize the intention categories in both types of speech, although performance was significantly better with ID speech. This is the first demonstration that adult listeners in an indigenous, nonindustrialized, and nonliterate culture can accurately infer intentions from both ID speech and AD speech in a language they do not speak.

  20. Advances in audio source separation and multisource audio content retrieval

    NASA Astrophysics Data System (ADS)

    Vincent, Emmanuel

    2012-06-01

    Audio source separation aims to extract the signals of individual sound sources from a given recording. In this paper, we review three recent advances which improve the robustness of source separation in real-world challenging scenarios and enable its use for multisource content retrieval tasks, such as automatic speech recognition (ASR) or acoustic event detection (AED) in noisy environments. We present a Flexible Audio Source Separation Toolkit (FASST) and discuss its advantages compared to earlier approaches such as independent component analysis (ICA) and sparse component analysis (SCA). We explain how cues as diverse as harmonicity, spectral envelope, temporal fine structure or spatial location can be jointly exploited by this toolkit. We subsequently present the uncertainty decoding (UD) framework for the integration of audio source separation and audio content retrieval. We show how the uncertainty about the separated source signals can be accurately estimated and propagated to the features. Finally, we explain how this uncertainty can be efficiently exploited by a classifier, both at the training and the decoding stage. We illustrate the resulting performance improvements in terms of speech separation quality and speaker recognition accuracy.

  1. Analysis of Factors Affecting System Performance in the ASpIRE Challenge

    DTIC Science & Technology

    2015-12-13

    performance in the ASpIRE (Automatic Speech recognition In Reverberant Environments) challenge. In particular, overall word error rate (WER) of the solver...systems is analyzed as a function of room, distance between talker and microphone, and microphone type. We also analyze speech activity detection...analysis will inform the design of future challenges and provide insight into the efficacy of current solutions addressing noisy reverberant speech

  2. Identifying Speech Acts in E-Mails: Toward Automated Scoring of the "TOEIC"® E-Mail Task. Research Report. ETS RR-12-16

    ERIC Educational Resources Information Center

    De Felice, Rachele; Deane, Paul

    2012-01-01

    This study proposes an approach to automatically score the "TOEIC"® Writing e-mail task. We focus on one component of the scoring rubric, which notes whether the test-takers have used particular speech acts such as requests, orders, or commitments. We developed a computational model for automated speech act identification and tested it…

  3. Automated Intelligibility Assessment of Pathological Speech Using Phonological Features

    NASA Astrophysics Data System (ADS)

    Middag, Catherine; Martens, Jean-Pierre; Van Nuffelen, Gwen; De Bodt, Marc

    2009-12-01

    It is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. People have therefore put a lot of effort in the design of perceptual intelligibility rating tests. These tests usually have the drawback that they employ unnatural speech material (e.g., nonsense words) and that they cannot fully exclude errors due to listener bias. Therefore, there is a growing interest in the application of objective automatic speech recognition technology to automate the intelligibility assessment. Current research is headed towards the design of automated methods which can be shown to produce ratings that correspond well with those emerging from a well-designed and well-performed perceptual test. In this paper, a novel methodology that is built on previous work (Middag et al., 2008) is presented. It utilizes phonological features, automatic speech alignment based on acoustic models that were trained on normal speech, context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system reveals that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.

  4. Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.

    PubMed

    Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger

    2011-01-01

    The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.

  5. Tuning time-frequency methods for the detection of metered HF speech

    NASA Astrophysics Data System (ADS)

    Nelson, Douglas J.; Smith, Lawrence H.

    2002-12-01

    Speech is metered if the stresses occur at a nearly regular rate. Metered speech is common in poetry, and it can occur naturally in speech if the speaker is spelling a word or reciting words or numbers from a list. In radio communications, the CQ request, call sign and other codes are frequently metered. In tactical communications and air traffic control, location, heading and identification codes may be metered. Moreover, metering may be expected to survive even in HF communications, which are corrupted by noise, interference and mistuning. For this environment, speech recognition and conventional machine-based methods are not effective. We describe time-frequency methods which have been adapted successfully to the problem of mitigation of HF signal conditions and detection of metered speech. These methods are based on modeled time and frequency correlation properties of nearly harmonic functions. We derive these properties and demonstrate a performance gain over conventional correlation and spectral methods. Finally, for HF single-sideband (SSB) communications, the problems of carrier mistuning, interfering signals such as manual Morse, and fast automatic gain control (AGC) must be addressed. We demonstrate simple methods which may be used to blindly mitigate mistuning and narrowband interference, and effectively invert the fast automatic gain function.
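
    The regularity cue itself can be pictured with a generic envelope-autocorrelation sketch; this is a textbook-style approach under assumed parameters (roughly 100 Hz envelope rate, 1-5 Hz stress range), not the time-frequency correlation methods of the paper.

      import numpy as np
      from scipy.signal import hilbert

      def stress_rate_regularity(x, fs, min_rate=1.0, max_rate=5.0):
          """Return (dominant stress rate in Hz, normalized autocorrelation peak) of
          the speech envelope; a strong peak in the 1-5 Hz lag range is taken here as
          evidence of metered (regularly stressed) speech. Assumes x spans several
          seconds of audio."""
          env = np.abs(hilbert(x))                        # amplitude envelope
          step = max(1, int(fs // 100))                   # decimate to roughly 100 Hz
          env = env[: (len(env) // step) * step].reshape(-1, step).mean(axis=1)
          env_rate = fs / step
          env = env - env.mean()
          ac = np.correlate(env, env, mode="full")[len(env) - 1:]
          ac = ac / (ac[0] + 1e-12)                       # normalize to the zero-lag value
          lags = np.arange(len(ac)) / env_rate            # lag in seconds
          mask = (lags >= 1.0 / max_rate) & (lags <= 1.0 / min_rate)
          idx = np.flatnonzero(mask)
          best = idx[np.argmax(ac[idx])]
          return 1.0 / lags[best], ac[best]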

  6. Automatic Speech Recognition Predicts Speech Intelligibility and Comprehension for Listeners With Simulated Age-Related Hearing Loss.

    PubMed

    Fontan, Lionel; Ferrané, Isabelle; Farinas, Jérôme; Pinquier, Julien; Tardieu, Julien; Magnen, Cynthia; Gaillard, Pascal; Aumont, Xavier; Füllgrabe, Christian

    2017-09-18

    The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids. Sixty young participants with normal hearing listened to speech materials mimicking the perceptual consequences of ARHL at different levels of severity. Two intelligibility tests (repetition of words and sentences) and 1 comprehension test (responding to oral commands by moving virtual objects) were administered. Several language models were developed and used by the ASR system in order to fit human performances. Strong significant positive correlations were observed between human and ASR scores, with coefficients up to .99. However, the spectral smearing used to simulate losses in frequency selectivity caused larger declines in ASR performance than in human performance. Both intelligibility and comprehension scores for listeners with simulated ARHL are highly correlated with the performances of an ASR-based system. In the future, it needs to be determined if the ASR system is similarly successful in predicting speech processing in noise and by older people with ARHL.

  7. Is automatic speech-to-text transcription ready for use in psychological experiments?

    PubMed

    Ziman, Kirsten; Heusser, Andrew C; Fitzpatrick, Paxton C; Field, Campbell E; Manning, Jeremy R

    2018-04-23

    Verbal responses are a convenient and naturalistic way for participants to provide data in psychological experiments (Salzinger, The Journal of General Psychology, 61(1),65-94:1959). However, audio recordings of verbal responses typically require additional processing, such as transcribing the recordings into text, as compared with other behavioral response modalities (e.g., typed responses, button presses, etc.). Further, the transcription process is often tedious and time-intensive, requiring human listeners to manually examine each moment of recorded speech. Here we evaluate the performance of a state-of-the-art speech recognition algorithm (Halpern et al., 2016) in transcribing audio data into text during a list-learning experiment. We compare transcripts made by human annotators to the computer-generated transcripts. Both sets of transcripts matched to a high degree and exhibited similar statistical properties, in terms of the participants' recall performance and recall dynamics that the transcripts captured. This proof-of-concept study suggests that speech-to-text engines could provide a cheap, reliable, and rapid means of automatically transcribing speech data in psychological experiments. Further, our findings open the door for verbal response experiments that scale to thousands of participants (e.g., administered online), as well as a new generation of experiments that decode speech on the fly and adapt experimental parameters based on participants' prior responses.

  8. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    NASA Technical Reports Server (NTRS)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

    A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It can enjoy a fast rate of data/text entry, small overall size, and can be lightweight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaption, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMMbased ASR components were developed. They can help real-time ASR system designers select proper tasks when in the face of constraints in computational resources.
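
    The beam forming/multi-channel noise reduction step mentioned above can be illustrated with a basic delay-and-sum beamformer; the geometry-derived steering delays and the wrap-around shift are simplifying assumptions of this sketch, not the system's actual multichannel algorithm.

      import numpy as np

      def delay_and_sum(mics, fs, delays_s):
          """Align and average synchronized microphone channels.

          mics     : (n_channels, n_samples) array of microphone signals
          delays_s : per-channel steering delays (seconds) toward the talker
          """
          n_ch, n_samp = mics.shape
          out = np.zeros(n_samp)
          for ch in range(n_ch):
              shift = int(round(delays_s[ch] * fs))
              out += np.roll(mics[ch], -shift)   # advance each channel (wrap-around ignored)
          return out / n_ch                      # speech adds coherently; diffuse noise averages down

    In the full pipeline this stage would be followed by the single-channel noise reduction, feature extraction, and ASR decoding steps listed in the abstract.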

  9. Parametric Representation of the Speaker's Lips for Multimodal Sign Language and Speech Recognition

    NASA Astrophysics Data System (ADS)

    Ryumin, D.; Karpov, A. A.

    2017-05-01

    In this article, we propose a new method for the parametric representation of a human's lips region. The functional diagram of the method is described, and implementation details with an explanation of its key stages and features are given. The results of automatic detection of the regions of interest are illustrated. The speed of the method on several computers with different performance levels is reported. This universal method allows applying a parametric representation of the speaker's lips to the tasks of biometrics, computer vision, machine learning, and automatic recognition of faces, elements of sign languages, and audio-visual speech, including lip-reading.

  10. Objective speech quality evaluation of real-time speech coders

    NASA Astrophysics Data System (ADS)

    Viswanathan, V. R.; Russell, W. H.; Huggins, A. W. F.

    1984-02-01

    This report describes the work performed in two areas: subjective testing of a real-time 16 kbit/s adaptive predictive coder (APC) and objective speech quality evaluation of real-time coders. The speech intelligibility of the APC coder was tested using the Diagnostic Rhyme Test (DRT), and the speech quality was tested using the Diagnostic Acceptability Measure (DAM) test, under eight operating conditions involving channel error, acoustic background noise, and a tandem link with two other coders. The test results showed that the DRT and DAM scores of the APC coder equalled or exceeded the corresponding test scores of the 32 kbit/s CVSD coder. In the area of objective speech quality evaluation, the report describes the development, testing, and validation of a procedure for automatically computing several objective speech quality measures, given only the tape-recordings of the input speech and the corresponding output speech of a real-time speech coder.
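
    One of the simplest objective measures computable from such paired input/output recordings is segmental SNR; the sketch below assumes the two signals are already time-aligned and identically sampled, and it stands in for the report's broader set of measures.

      import numpy as np

      def segmental_snr(clean, coded, fs, frame_ms=20.0, floor_db=-10.0, ceil_db=35.0):
          """Average per-frame SNR (dB) between a coder's input and output signals."""
          n = int(fs * frame_ms / 1000.0)
          snrs = []
          for start in range(0, min(len(clean), len(coded)) - n + 1, n):
              s = np.asarray(clean[start:start + n], dtype=float)
              e = s - np.asarray(coded[start:start + n], dtype=float)
              snr = 10.0 * np.log10((np.sum(s ** 2) + 1e-12) / (np.sum(e ** 2) + 1e-12))
              snrs.append(np.clip(snr, floor_db, ceil_db))   # clamp outlier frames
          return float(np.mean(snrs))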

  11. A Quantitative and Qualitative Evaluation of an Automatic Occlusion Device for Tracheoesophageal Speech: The Provox FreeHands HME

    ERIC Educational Resources Information Center

    Hamade, Rachel; Hewlett, Nigel; Scanlon, Emer

    2006-01-01

    This study aimed to evaluate a new automatic tracheostoma valve: the Provox FreeHands HME (manufactured by Atos Medical AB, Sweden). Data from four laryngectomee participants using automatic and also manual occlusion were subjected to acoustic and perceptual analysis. The main results were a significant decrease, from the manual to automatic…

  12. Thai Automatic Speech Recognition

    DTIC Science & Technology

    2005-01-01

    used in an external DARPA evaluation involving medical scenarios between an American Doctor and a naïve monolingual Thai patient. 2. Thai Language... dictionary generation more challenging, and (3) the lack of word segmentation, which calls for automatic segmentation approaches to make n-gram language...requires a dictionary and provides various segmentation algorithms to automatically select suitable segmentations. Here we used a maximal matching

  13. The Relationship Between Speech Production and Speech Perception Deficits in Parkinson's Disease.

    PubMed

    De Keyser, Kim; Santens, Patrick; Bockstael, Annelies; Botteldooren, Dick; Talsma, Durk; De Vos, Stefanie; Van Cauwenberghe, Mieke; Verheugen, Femke; Corthals, Paul; De Letter, Miet

    2016-10-01

    This study investigated the possible relationship between hypokinetic speech production and speech intensity perception in patients with Parkinson's disease (PD). Participants included 14 patients with idiopathic PD and 14 matched healthy controls (HCs) with normal hearing and cognition. First, speech production was objectified through a standardized speech intelligibility assessment, acoustic analysis, and speech intensity measurements. Second, an overall estimation task and an intensity estimation task were addressed to evaluate overall speech perception and speech intensity perception, respectively. Finally, correlation analysis was performed between the speech characteristics of the overall estimation task and the corresponding acoustic analysis. The interaction between speech production and speech intensity perception was investigated by an intensity imitation task. Acoustic analysis and speech intensity measurements demonstrated significant differences in speech production between patients with PD and the HCs. A different pattern in the auditory perception of speech and speech intensity was found in the PD group. Auditory perceptual deficits may influence speech production in patients with PD. The present results suggest a disturbed auditory perception related to an automatic monitoring deficit in PD.

  14. Voice Preprocessor for Digital Voice Applications

    DTIC Science & Technology

    1989-09-11

    ... transformers are marketed for use with multitone ... and are acceptable for voice applications. 3. Automatic Gain Control: ... variations of speech spectral tilt to improve the quality of the extracted speech parameters. More importantly, the only analog circuit we use is a

  15. Voice-processing technologies--their application in telecommunications.

    PubMed Central

    Wilpon, J G

    1995-01-01

    As the telecommunications industry evolves over the next decade to provide the products and services that people will desire, several key technologies will become commonplace. Two of these, automatic speech recognition and text-to-speech synthesis, will provide users with more freedom on when, where, and how they access information. While these technologies are currently in their infancy, their capabilities are rapidly increasing and their deployment in today's telephone network is expanding. The economic impact of just one application, the automation of operator services, is well over $100 million per year. Yet there still are many technical challenges that must be resolved before these technologies can be deployed ubiquitously in products and services throughout the worldwide telephone network. These challenges include: (i) High level of accuracy. The technology must be perceived by the user as highly accurate, robust, and reliable. (ii) Easy to use. Speech is only one of several possible input/output modalities for conveying information between a human and a machine, much like a computer terminal or Touch-Tone pad on a telephone. It is not the final product. Therefore, speech technologies must be hidden from the user. That is, the burden of using the technology must be on the technology itself. (iii) Quick prototyping and development of new products and services. The technology must support the creation of new products and services based on speech in an efficient and timely fashion. In this paper I present a vision of the voice-processing industry with a focus on the areas with the broadest base of user penetration: speech recognition, text-to-speech synthesis, natural language processing, and speaker recognition technologies. The current and future applications of these technologies in the telecommunications industry will be examined in terms of their strengths, limitations, and the degree to which user needs have been or have yet to be met. Although noteworthy gains have been made in areas with potentially small user bases and in the more mature speech-coding technologies, these subjects are outside the scope of this paper. PMID:7479815

  16. Automatic measurement of prosody in behavioral variant FTD.

    PubMed

    Nevler, Naomi; Ash, Sharon; Jester, Charles; Irwin, David J; Liberman, Mark; Grossman, Murray

    2017-08-15

    To help understand speech changes in behavioral variant frontotemporal dementia (bvFTD), we developed and implemented automatic methods of speech analysis for quantification of prosody, and evaluated clinical and anatomical correlations. We analyzed semi-structured, digitized speech samples from 32 patients with bvFTD (21 male, mean age 63 ± 8.5, mean disease duration 4 ± 3.1 years) and 17 matched healthy controls (HC). We automatically extracted fundamental frequency (f0, the physical property of sound most closely correlating with perceived pitch) and computed pitch range on a logarithmic scale (semitone) that controls for individual and sex differences. We correlated f0 range with neuropsychiatric tests, and related f0 range to gray matter (GM) atrophy using 3T T1 MRI. We found significantly reduced f0 range in patients with bvFTD (mean 4.3 ± 1.8 ST) compared to HC (5.8 ± 2.1 ST; p = 0.03). Regression related reduced f0 range in bvFTD to GM atrophy in bilateral inferior and dorsomedial frontal as well as left anterior cingulate and anterior insular regions. Reduced f0 range reflects impaired prosody in bvFTD. This is associated with neuroanatomic networks implicated in language production and social disorders centered in the frontal lobe. These findings support the feasibility of automated speech analysis in frontotemporal dementia and other disorders. © 2017 American Academy of Neurology.
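
    As a rough, hedged illustration of the measure described (automatic f0 extraction followed by a log-scale range), the sketch below uses librosa's pYIN tracker and a 10th-90th percentile range expressed in semitones; the tracker, the percentile choice, and the file name are assumptions rather than the study's exact pipeline.

```python
import numpy as np
import librosa

def f0_range_semitones(wav_path, fmin=60.0, fmax=400.0):
    """Estimate a speaker's f0 range in semitones from a mono recording."""
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    f0 = f0[np.isfinite(f0)]                 # keep voiced frames only
    lo, hi = np.percentile(f0, [10, 90])     # robust range; percentile choice is an assumption
    return 12.0 * np.log2(hi / lo)           # distance between the two in semitones

# print(f0_range_semitones("speech_sample.wav"))  # hypothetical file name
```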

  17. Automatically Detecting Likely Edits in Clinical Notes Created Using Automatic Speech Recognition

    PubMed Central

    Lybarger, Kevin; Ostendorf, Mari; Yetisgen, Meliha

    2017-01-01

    The use of automatic speech recognition (ASR) to create clinical notes has the potential to reduce costs associated with note creation for electronic medical records, but at current system accuracy levels, post-editing by practitioners is needed to ensure note quality. Aiming to reduce the time required to edit ASR transcripts, this paper investigates novel methods for automatic detection of edit regions within the transcripts, including both putative ASR errors and regions that are targets for cleanup or rephrasing. We create detection models using logistic regression and conditional random field models, exploring a variety of text-based features that consider the structure of clinical notes and exploit the medical context. Different medical text resources are used to improve feature extraction. Experimental results on a large corpus of practitioner-edited clinical notes show that 67% of sentence-level edits and 45% of word-level edits can be detected with a false detection rate of 15%. PMID:29854187
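
    As a loose, hedged illustration of a sentence-level edit detector of this kind (not the paper's feature set or corpus), the sketch below trains a bag-of-words logistic regression to flag ASR sentences that a practitioner later edited; the toy sentences and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: ASR sentences labeled 1 if the practitioner edited them.
sentences = ["patient denies chest pain", "start met forman 500 mg daily",
             "no acute distress", "follow up in too weeks"]
edited = [0, 1, 0, 1]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(sentences, edited)
# Probability that a new ASR sentence will need editing.
print(model.predict_proba(["continue met forman as prescribed"])[:, 1])
```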

  18. Factors influencing relative speech intelligibility in patients with oral squamous cell carcinoma: a prospective study using automatic, computer-based speech analysis.

    PubMed

    Stelzle, F; Knipfer, C; Schuster, M; Bocklet, T; Nöth, E; Adler, W; Schempf, L; Vieler, P; Riemann, M; Neukam, F W; Nkenke, E

    2013-11-01

    Oral squamous cell carcinoma (OSCC) and its treatment impair speech intelligibility by alteration of the vocal tract. The aim of this study was to identify the factors of oral cancer treatment that influence speech intelligibility by means of an automatic, standardized speech-recognition system. The study group comprised 71 patients (mean age 59.89, range 35-82 years) with OSCC ranging from stage T1 to T4 (TNM staging). Tumours were located on the tongue (n=23), lower alveolar crest (n=27), and floor of the mouth (n=21). Reconstruction was conducted through local tissue plasty or microvascular transplants. Adjuvant radiotherapy was performed in 49 patients. Speech intelligibility was evaluated before, and at 3, 6, and 12 months after tumour resection, and compared to that of a healthy control group (n=40). Postoperatively, significant influences on speech intelligibility were tumour localization (P=0.010) and resection volume (P=0.019). Additionally, adjuvant radiotherapy (P=0.049) influenced intelligibility at 3 months after surgery. At 6 months after surgery, influences were resection volume (P=0.028) and adjuvant radiotherapy (P=0.034). The influence of tumour localization (P=0.001) and adjuvant radiotherapy (P=0.022) persisted after 12 months. Tumour localization, resection volume, and radiotherapy are crucial factors for speech intelligibility. Radiotherapy significantly impaired word recognition rate (WR) values with a progression of the impairment for up to 12 months after surgery. Copyright © 2013 International Association of Oral and Maxillofacial Surgeons. Published by Elsevier Ltd. All rights reserved.

  19. Slowed Speech Input has a Differential Impact on On-line and Off-line Processing in Children’s Comprehension of Pronouns

    PubMed Central

    Walenski, Matthew; Swinney, David

    2009-01-01

    The central question underlying this study revolves around how children process co-reference relationships—such as those evidenced by pronouns (him) and reflexives (himself)—and how a slowed rate of speech input may critically affect this process. Previous studies of child language processing have demonstrated that typical language developing (TLD) children as young as 4 years of age process co-reference relations in a manner similar to adults on-line. In contrast, off-line measures of pronoun comprehension suggest a developmental delay for pronouns (relative to reflexives). The present study examines dependency relations in TLD children (ages 5–13) and investigates how a slowed rate of speech input affects the unconscious (on-line) and conscious (off-line) parsing of these constructions. For the on-line investigations (using a cross-modal picture priming paradigm), results indicate that at a normal rate of speech TLD children demonstrate adult-like syntactic reflexes. At a slowed rate of speech the typical language developing children displayed a breakdown in automatic syntactic parsing (again, similar to the pattern seen in unimpaired adults). As demonstrated in the literature, our off-line investigations (sentence/picture matching task) revealed that these children performed much better on reflexives than on pronouns at a regular speech rate. However, at the slow speech rate, performance on pronouns was substantially improved, whereas performance on reflexives was not different than at the regular speech rate. We interpret these results in light of a distinction between fast automatic processes (relied upon for on-line processing in real time) and conscious reflective processes (relied upon for off-line processing), such that slowed speech input disrupts the former, yet improves the latter. PMID:19343495

  20. Automatic analysis of slips of the tongue: Insights into the cognitive architecture of speech production.

    PubMed

    Goldrick, Matthew; Keshet, Joseph; Gustafson, Erin; Heller, Jordana; Needle, Jeremy

    2016-04-01

    Traces of the cognitive mechanisms underlying speaking can be found within subtle variations in how we pronounce sounds. While speech errors have traditionally been seen as categorical substitutions of one sound for another, acoustic/articulatory analyses show they partially reflect the intended sound. When "pig" is mispronounced as "big," the resulting /b/ sound differs from correct productions of "big," moving towards intended "pig"-revealing the role of graded sound representations in speech production. Investigating the origins of such phenomena requires detailed estimation of speech sound distributions; this has been hampered by reliance on subjective, labor-intensive manual annotation. Computational methods can address these issues by providing for objective, automatic measurements. We develop a novel high-precision computational approach, based on a set of machine learning algorithms, for measurement of elicited speech. The algorithms are trained on existing manually labeled data to detect and locate linguistically relevant acoustic properties with high accuracy. Our approach is robust, is designed to handle mis-productions, and overall matches the performance of expert coders. It allows us to analyze a very large dataset of speech errors (containing far more errors than the total in the existing literature), illuminating properties of speech sound distributions previously impossible to reliably observe. We argue that this provides novel evidence that two sources both contribute to deviations in speech errors: planning processes specifying the targets of articulation and articulatory processes specifying the motor movements that execute this plan. These findings illustrate how a much richer picture of speech provides an opportunity to gain novel insights into language processing. Copyright © 2016 Elsevier B.V. All rights reserved.

  1. Automatic intelligibility classification of sentence-level pathological speech

    PubMed Central

    Kim, Jangwon; Kumar, Naveen; Tsiartas, Andreas; Li, Ming; Narayanan, Shrikanth S.

    2014-01-01

    Pathological speech usually refers to the condition of speech distortion resulting from atypicalities in voice and/or in the articulatory mechanisms owing to disease, illness or other physical or biological insult to the production system. Although automatic evaluation of speech intelligibility and quality could come in handy in these scenarios to assist experts in diagnosis and treatment design, the many sources and types of variability often make it a very challenging computational processing problem. In this work we propose novel sentence-level features to capture abnormal variation in the prosodic, voice quality and pronunciation aspects in pathological speech. In addition, we propose a post-classification posterior smoothing scheme which refines the posterior of a test sample based on the posteriors of other test samples. Finally, we perform feature-level fusions and subsystem decision fusion for arriving at a final intelligibility decision. The performances are tested on two pathological speech datasets, the NKI CCRT Speech Corpus (advanced head and neck cancer) and the TORGO database (cerebral palsy or amyotrophic lateral sclerosis), by evaluating classification accuracy without overlapping subjects’ data among training and test partitions. Results show that the feature sets of each of the voice quality subsystem, prosodic subsystem, and pronunciation subsystem, offer significant discriminating power for binary intelligibility classification. We observe that the proposed posterior smoothing in the acoustic space can further reduce classification errors. The smoothed posterior score fusion of subsystems shows the best classification performance (73.5% for unweighted, and 72.8% for weighted, average recalls of the binary classes). PMID:25414544
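
    The paper defines its own smoothing scheme; as a generic, hedged illustration of the idea of refining a test sample's posterior using other test samples, the sketch below averages each sample's posterior with those of its k nearest neighbours in feature space and then fuses subsystem scores by simple averaging. The neighbourhood size, mixing weight, and uniform fusion are assumptions.

```python
import numpy as np

def smooth_posteriors(features, posteriors, k=5):
    """Refine each test sample's posterior using its k nearest neighbours.

    features: (n_samples, n_dims) acoustic feature vectors of the test set.
    posteriors: (n_samples,) classifier posteriors for the positive class.
    """
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                       # exclude the sample itself
    neighbours = np.argsort(d, axis=1)[:, :k]
    return 0.5 * posteriors + 0.5 * posteriors[neighbours].mean(axis=1)

def fuse_subsystems(*posterior_sets):
    """Late fusion: average the (smoothed) posteriors of several subsystems."""
    return np.mean(np.stack(posterior_sets), axis=0)
```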

  2. How Accurately Can the Google Web Speech API Recognize and Transcribe Japanese L2 English Learners' Oral Production?

    ERIC Educational Resources Information Center

    Ashwell, Tim; Elam, Jesse R.

    2017-01-01

    The ultimate aim of our research project was to use the Google Web Speech API to automate scoring of elicited imitation (EI) tests. However, in order to achieve this goal, we had to take a number of preparatory steps. We needed to assess how accurate this speech recognition tool is in recognizing native speakers' production of the test items; we…
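
    For readers who want to run a similar check, one common way to reach the Google Web Speech API from Python is the third-party SpeechRecognition package; the hedged sketch below transcribes a single WAV file under that assumption and is not the instrument used in the study (the file name is hypothetical).

```python
import speech_recognition as sr

def transcribe(wav_path):
    """Send a WAV recording to the Google Web Speech API and return the transcript."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio, language="en-US")
    except sr.UnknownValueError:
        return ""   # the API could not understand the audio

# print(transcribe("learner_item_01.wav"))  # hypothetical EI test recording
```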

  3. Transcribe Your Class: Using Speech Recognition to Improve Access for At-Risk Students

    ERIC Educational Resources Information Center

    Bain, Keith; Lund-Lucas, Eunice; Stevens, Janice

    2012-01-01

    Through a project supported by Canada's Social Development Partnerships Program, a team of leading National Disability Organizations, universities, and industry partners are piloting a prototype Hosted Transcription Service that uses speech recognition to automatically create multimedia transcripts that can be used by students for study purposes.…

  4. The Promise of NLP and Speech Processing Technologies in Language Assessment

    ERIC Educational Resources Information Center

    Chapelle, Carol A.; Chung, Yoo-Ree

    2010-01-01

    Advances in natural language processing (NLP) and automatic speech recognition and processing technologies offer new opportunities for language testing. Despite their potential uses on a range of language test item types, relatively little work has been done in this area, and it is therefore not well understood by test developers, researchers or…

  5. Validation of Automated Scoring of Oral Reading

    ERIC Educational Resources Information Center

    Balogh, Jennifer; Bernstein, Jared; Cheng, Jian; Van Moere, Alistair; Townshend, Brent; Suzuki, Masanori

    2012-01-01

    A two-part experiment is presented that validates a new measurement tool for scoring oral reading ability. Data collected by the U.S. government in a large-scale literacy assessment of adults were analyzed by a system called VersaReader that uses automatic speech recognition and speech processing technologies to score oral reading fluency. In the…

  6. Voice Interactive Analysis System Study. Final Report, August 28, 1978 through March 23, 1979.

    ERIC Educational Resources Information Center

    Harry, D. P.; And Others

    The Voice Interactive Analysis System study continued research and development of the LISTEN real-time, minicomputer based connected speech recognition system, within NAVTRAEQUIPCEN'S program of developing automatic speech technology in support of training. An attempt was made to identify the most effective features detected by the TTI-500 model…

  7. Audio-video feature correlation: faces and speech

    NASA Astrophysics Data System (ADS)

    Durand, Gwenael; Montacie, Claude; Caraty, Marie-Jose; Faudemay, Pascal

    1999-08-01

    This paper presents a study of the correlation of features automatically extracted from the audio stream and the video stream of audiovisual documents. In particular, we were interested in finding out whether speech analysis tools could be combined with face detection methods, and to what extent they should be combined. A generic audio signal partitioning algorithm was first used to detect Silence/Noise/Music/Speech segments in a full-length movie. A generic object detection method was applied to the keyframes extracted from the movie in order to detect the presence or absence of faces. The correlation between the presence of a face in the keyframes and of the corresponding voice in the audio stream was studied. A third stream, which is the script of the movie, is warped on the speech channel in order to automatically label faces appearing in the keyframes with the name of the corresponding character. We naturally found that extracted audio and video features were related in many cases, and that significant benefits can be obtained from the joint use of audio and video analysis methods.

  8. Automatic Speech Recognition in Air Traffic Control: a Human Factors Perspective

    NASA Technical Reports Server (NTRS)

    Karlsson, Joakim

    1990-01-01

    The introduction of Automatic Speech Recognition (ASR) technology into the Air Traffic Control (ATC) system has the potential to improve overall safety and efficiency. However, because ASR technology is inherently a part of the man-machine interface between the user and the system, the human factors issues involved must be addressed. Here, some of the human factors problems are identified and related methods of investigation are presented. Research at M.I.T.'s Flight Transportation Laboratory is being conducted from a human factors perspective, focusing on intelligent parser design, presentation of feedback, error correction strategy design, and optimal choice of input modalities.

  9. Multimodal Translation System Using Texture-Mapped Lip-Sync Images for Video Mail and Automatic Dubbing Applications

    NASA Astrophysics Data System (ADS)

    Morishima, Shigeo; Nakamura, Satoshi

    2004-12-01

    We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. This system also introduces both a face synthesis technique that can generate any viseme lip shape and a face tracking technique that can estimate the original position and rotation of a speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the speech organ's image with the synthesized one, which is made by a 3D wire-frame model that is adaptable to any speaker. Our approach provides translated image synthesis with an extremely small database. The tracking motion of the face from a video image is performed by template matching. In this system, the translation and rotation of the face are detected by using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model by using our GUI tool. By combining these techniques and the translated voice synthesis technique, an automatic multimodal translation can be achieved that is suitable for video mail or automatic dubbing systems into other languages.

  10. Children's perception of their synthetically corrected speech production.

    PubMed

    Strömbergsson, Sofia; Wengelin, Asa; House, David

    2014-06-01

    We explore children's perception of their own speech - in its online form, in its recorded form, and in synthetically modified forms. Children with phonological disorder (PD) and children with typical speech and language development (TD) performed tasks of evaluating accuracy of the different types of speech stimuli, either immediately after having produced the utterance or after a delay. In addition, they performed a task designed to assess their ability to detect synthetic modification. Both groups showed high performance in tasks involving evaluation of other children's speech, whereas in tasks of evaluating one's own speech, the children with PD were less accurate than their TD peers. The children with PD were less sensitive to misproductions in immediate conjunction with their production of an utterance, and more accurate after a delay. Within-category modification often passed undetected, indicating a satisfactory quality of the generated speech. Potential clinical benefits of using corrective re-synthesis are discussed.

  11. Learner Attention to Form in ACCESS Task-Based Interaction

    ERIC Educational Resources Information Center

    Dao, Phung; Iwashita, Noriko; Gatbonton, Elizabeth

    2017-01-01

    This study explored the potential effects of communicative tasks developed using a reformulation of a task-based language teaching called Automatization in Communicative Contexts of Essential Speech Sequences (ACCESS) that includes automatization of language elements as one of its goals on learner attention to form in task-based interaction. The…

  12. Auditory-Motor Processing of Speech Sounds

    PubMed Central

    Möttönen, Riikka; Dutton, Rebekah; Watkins, Kate E.

    2013-01-01

    The motor regions that control movements of the articulators activate during listening to speech and contribute to performance in demanding speech recognition and discrimination tasks. Whether the articulatory motor cortex modulates auditory processing of speech sounds is unknown. Here, we aimed to determine whether the articulatory motor cortex affects the auditory mechanisms underlying discrimination of speech sounds in the absence of demanding speech tasks. Using electroencephalography, we recorded responses to changes in sound sequences, while participants watched a silent video. We also disrupted the lip or the hand representation in left motor cortex using transcranial magnetic stimulation. Disruption of the lip representation suppressed responses to changes in speech sounds, but not piano tones. In contrast, disruption of the hand representation had no effect on responses to changes in speech sounds. These findings show that disruptions within, but not outside, the articulatory motor cortex impair automatic auditory discrimination of speech sounds. The findings provide evidence for the importance of auditory-motor processes in efficient neural analysis of speech sounds. PMID:22581846

  13. Alternative Speech Communication System for Persons with Severe Speech Disorders

    NASA Astrophysics Data System (ADS)

    Selouani, Sid-Ahmed; Sidi Yakoub, Mohammed; O'Shaughnessy, Douglas

    2009-12-01

    Assistive speech-enabled systems are proposed to help both French and English speaking persons with various speech disorders. The proposed assistive systems use automatic speech recognition (ASR) and speech synthesis in order to enhance the quality of communication. These systems aim at improving the intelligibility of pathologic speech making it as natural as possible and close to the original voice of the speaker. The resynthesized utterances use new basic units, a new concatenating algorithm and a grafting technique to correct the poorly pronounced phonemes. The ASR responses are uttered by the new speech synthesis system in order to convey an intelligible message to listeners. Experiments involving four American speakers with severe dysarthria and two Acadian French speakers with sound substitution disorders (SSDs) are carried out to demonstrate the efficiency of the proposed methods. An improvement of the Perceptual Evaluation of the Speech Quality (PESQ) value of 5% and more than 20% is achieved by the speech synthesis systems that deal with SSD and dysarthria, respectively.

  14. Evaluation of synthesized voice approach callouts /SYNCALL/

    NASA Technical Reports Server (NTRS)

    Simpson, C. A.

    1981-01-01

    The two basic approaches to generating 'synthesized' speech are the use of analog-recorded human speech and the construction of speech entirely from algorithms applied to constants describing speech sounds. Given the availability of synthesized speech displays for man-machine systems, research is needed to study suggested applications for speech and design principles for speech displays. The present investigation is a study for which new performance measures were developed. A number of air carrier approach and landing accidents during low or impaired visibility have been associated with the absence of approach callouts. The purpose of the study was to compare a pilot-not-flying (PNF) approach callout system with a system composed of PNF callouts augmented by an automatic synthesized voice callout system (SYNCALL). Pilots were found to favor the use of a SYNCALL system containing certain modifications.

  15. Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated Parkinson's disease.

    PubMed

    Rusz, J; Cmejla, R; Ruzickova, H; Ruzicka, E

    2011-01-01

    An assessment of vocal impairment is presented for separating healthy people from persons with early untreated Parkinson's disease (PD). This study's main purpose was to (a) determine whether voice and speech disorder are present from early stages of PD before starting dopaminergic pharmacotherapy, (b) ascertain the specific characteristics of the PD-related vocal impairment, (c) identify PD-related acoustic signatures for the major part of traditional clinically used measurement methods with respect to their automatic assessment, and (d) design new automatic measurement methods of articulation. The varied speech data were collected from 46 Czech native speakers, 23 with PD. Subsequently, 19 representative measurements were pre-selected, and Wald sequential analysis was then applied to assess the efficiency of each measure and the extent of vocal impairment of each subject. It was found that measurement of the fundamental frequency variations applied to two selected tasks was the best method for separating healthy from PD subjects. On the basis of objective acoustic measures, statistical decision-making theory, and validation from practicing speech therapists, it has been demonstrated that 78% of early untreated PD subjects indicate some form of vocal impairment. The speech defects thus uncovered differ individually in various characteristics including phonation, articulation, and prosody.

  16. Automatic processing of tones and speech stimuli in children with specific language impairment.

    PubMed

    Uwer, Ruth; Albrecht, Ronald; von Suchodoletz, W

    2002-08-01

    It is well known from behavioural experiments that children with specific language impairment (SLI) have difficulties discriminating consonant-vowel (CV) syllables such as /ba/, /da/, and /ga/. Mismatch negativity (MMN) is an auditory event-related potential component that represents the outcome of an automatic comparison process. It could, therefore, be a promising tool for assessing central auditory processing deficits for speech and non-speech stimuli in children with SLI. MMN is typically evoked by occasionally occurring 'deviant' stimuli in a sequence of identical 'standard' sounds. In this study MMN was elicited using simple tone stimuli, which differed in frequency (1000 versus 1200 Hz) and duration (175 versus 100 ms) and to digitized CV syllables which differed in place of articulation (/ba/, /da/, and /ga/) in children with expressive and receptive SLI and healthy control children (n=21 in each group, 46 males and 17 females; age range 5 to 10 years). Mean MMN amplitudes between groups were compared. Additionally, the behavioural discrimination performance was assessed. Children with SLI had attenuated MMN amplitudes to speech stimuli, but there was no significant difference between the two diagnostic subgroups. MMN to tone stimuli did not differ between the groups. Children with SLI made more errors in the discrimination task, but discrimination scores did not correlate with MMN amplitudes. The present data suggest that children with SLI show a specific deficit in automatic discrimination of CV syllables differing in place of articulation, whereas the processing of simple tone differences seems to be unimpaired.

  17. Preliminary Analysis of Automatic Speech Recognition and Synthesis Technology.

    DTIC Science & Technology

    1983-05-01

    ... INDUSTRIAL/MILITARY SPEECH SYNTHESIS PRODUCTS ... The SC-01 Speech Synthesizer contains 64 different phonemes which are accessed by a 6-bit code. ... the proper sequential combinations of those ... connected speech input with widely differing emotional states, diverse accents, and substantial nonperiodic background noise input. As noted previously

  18. An acoustic comparison of two women's infant- and adult-directed speech

    NASA Astrophysics Data System (ADS)

    Andruski, Jean; Katz-Gershon, Shiri

    2003-04-01

    In addition to having prosodic characteristics that are attractive to infant listeners, infant-directed (ID) speech shares certain characteristics of adult-directed (AD) clear speech, such as increased acoustic distance between vowels, that might be expected to make ID speech easier for adults to perceive in noise than AD conversational speech. However, perceptual tests of two women's ID productions by Andruski and Bessega [J. Acoust. Soc. Am. 112, 2355] showed that is not always the case. In a word identification task that compared ID speech with AD clear and conversational speech, one speaker's ID productions were less well-identified than AD clear speech, but better identified than AD conversational speech. For the second woman, ID speech was the least accurately identified of the three speech registers. For both speakers, hard words (infrequent words with many lexical neighbors) were also at an increased disadvantage relative to easy words (frequent words with few lexical neighbors) in speech registers that were less accurately perceived. This study will compare several acoustic properties of these women's productions, including pitch and formant-frequency characteristics. Results of the acoustic analyses will be examined with the original perceptual results to suggest reasons for differences in listener's accuracy in identifying these two women's ID speech in noise.

  19. Automatic speech recognition and training for severely dysarthric users of assistive technology: the STARDUST project.

    PubMed

    Parker, Mark; Cunningham, Stuart; Enderby, Pam; Hawley, Mark; Green, Phil

    2006-01-01

    The STARDUST project developed robust computer speech recognizers for use by eight people with severe dysarthria and concomitant physical disability to access assistive technologies. Independent computer speech recognizers trained with normal speech are of limited functional use by those with severe dysarthria due to limited and inconsistent proximity to "normal" articulatory patterns. Severe dysarthric output may also be characterized by a small mass of distinguishable phonetic tokens making the acoustic differentiation of target words difficult. Speaker dependent computer speech recognition using Hidden Markov Models was achieved by the identification of robust phonetic elements within the individual speaker output patterns. A new system of speech training using computer generated visual and auditory feedback reduced the inconsistent production of key phonetic tokens over time.
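
    As a generic, hedged illustration of speaker-dependent HMM recognition over a small vocabulary (not the STARDUST system itself), the sketch below trains one Gaussian HMM per target word on MFCC sequences with hmmlearn and picks the highest-scoring model at test time; the packages, MFCC settings, and state count are assumptions.

```python
import numpy as np
import librosa
from hmmlearn import hmm

def mfcc_seq(wav_path, n_mfcc=13):
    """Load a recording and return its MFCC sequence as (frames, coefficients)."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_word_model(training_wavs, n_states=5):
    """Fit one Gaussian HMM on all training examples of a single target word."""
    feats = [mfcc_seq(p) for p in training_wavs]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(np.vstack(feats), lengths=[len(f) for f in feats])
    return model

def recognize(wav_path, word_models):
    """Return the word whose HMM assigns the utterance the highest log-likelihood."""
    feats = mfcc_seq(wav_path)
    return max(word_models, key=lambda w: word_models[w].score(feats))
```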

  20. Measuring Prevalence of Other-Oriented Transactive Contributions Using an Automated Measure of Speech Style Accommodation

    ERIC Educational Resources Information Center

    Gweon, Gahgene; Jain, Mahaveer; McDonough, John; Raj, Bhiksha; Rose, Carolyn P.

    2013-01-01

    This paper contributes to a theory-grounded methodological foundation for automatic collaborative learning process analysis. It does this by illustrating how insights from the social psychology and sociolinguistics of speech style provide a theoretical framework to inform the design of a computational model. The purpose of that model is to detect…

  1. Detecting and Understanding the Impact of Cognitive and Interpersonal Conflict in Computer Supported Collaborative Learning Environments

    ERIC Educational Resources Information Center

    Prata, David Nadler; Baker, Ryan S. J. d.; Costa, Evandro d. B.; Rose, Carolyn P.; Cui, Yue; de Carvalho, Adriana M. J. B.

    2009-01-01

    This paper presents a model which can automatically detect a variety of student speech acts as students collaborate within a computer supported collaborative learning environment. In addition, an analysis is presented which gives substantial insight as to how students' learning is associated with students' speech acts, knowledge that will…

  2. Evaluation of an Enhanced Stimulus-Stimulus Pairing Procedure to Increase Early Vocalizations of Children with Autism

    ERIC Educational Resources Information Center

    Esch, Barbara E.; Carr, James E.; Grow, Laura L.

    2009-01-01

    Evidence to support stimulus-stimulus pairing (SSP) in speech acquisition is less than robust, calling into question the ability of SSP to reliably establish automatically reinforcing properties of speech and limiting the procedure's clinical utility for increasing vocalizations. We evaluated the effects of a modified SSP procedure on…

  3. Yaounde French Speech Corpus

    DTIC Science & Technology

    2017-03-01

    the Center for Technology Enhanced Language Learning (CTELL), a research cell in the Department of Foreign Languages, United States Military Academy...models for automatic speech recognition (ASR), and to thereby investigate the utility of ASR in pedagogical technology. The corpus is a sample of...

  4. User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning

    ERIC Educational Resources Information Center

    Ahn, Tae youn; Lee, Sangmin-Michelle

    2016-01-01

    With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…

  5. Are the Literacy Difficulties That Characterize Developmental Dyslexia Associated with a Failure to Integrate Letters and Speech Sounds?

    ERIC Educational Resources Information Center

    Nash, Hannah M.; Gooch, Debbie; Hulme, Charles; Mahajan, Yatin; McArthur, Genevieve; Steinmetzger, Kurt; Snowling, Margaret J.

    2017-01-01

    The "automatic letter-sound integration hypothesis" (Blomert, [Blomert, L., 2011]) proposes that dyslexia results from a failure to fully integrate letters and speech sounds into automated audio-visual objects. We tested this hypothesis in a sample of English-speaking children with dyslexic difficulties (N = 13) and samples of…

  6. EduSpeak[R]: A Speech Recognition and Pronunciation Scoring Toolkit for Computer-Aided Language Learning Applications

    ERIC Educational Resources Information Center

    Franco, Horacio; Bratt, Harry; Rossier, Romain; Rao Gadde, Venkata; Shriberg, Elizabeth; Abrash, Victor; Precoda, Kristin

    2010-01-01

    SRI International's EduSpeak[R] system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to…

  7. Learning L2 Pronunciation with a Mobile Speech Recognizer: French /y/

    ERIC Educational Resources Information Center

    Liakin, Denis; Cardoso, Walcir; Liakina, Natallia

    2015-01-01

    This study investigates the acquisition of the L2 French vowel /y/ in a mobile-assisted learning environment, via the use of automatic speech recognition (ASR). Particularly, it addresses the question of whether ASR-based pronunciation instruction using a mobile device can improve the production and perception of French /y/. Forty-two elementary…

  8. Transcription and Annotation of a Japanese Accented Spoken Corpus of L2 Spanish for the Development of CAPT Applications

    ERIC Educational Resources Information Center

    Carranza, Mario

    2016-01-01

    This paper addresses the process of transcribing and annotating spontaneous non-native speech with the aim of compiling a training corpus for the development of Computer Assisted Pronunciation Training (CAPT) applications, enhanced with Automatic Speech Recognition (ASR) technology. To better adapt ASR technology to CAPT tools, the recognition…

  9. DETECTION AND IDENTIFICATION OF SPEECH SOUNDS USING CORTICAL ACTIVITY PATTERNS

    PubMed Central

    Centanni, T.M.; Sloan, A.M.; Reed, A.C.; Engineer, C.T.; Rennaker, R.; Kilgard, M.P.

    2014-01-01

    We have developed a classifier capable of locating and identifying speech sounds using activity from rat auditory cortex with an accuracy equivalent to behavioral performance without the need to specify the onset time of the speech sounds. This classifier can identify speech sounds from a large speech set within 40 ms of stimulus presentation. To compare the temporal limits of the classifier to behavior, we developed a novel task that requires rats to identify individual consonant sounds from a stream of distracter consonants. The classifier successfully predicted the ability of rats to accurately identify speech sounds for syllable presentation rates up to 10 syllables per second (up to 17.9 ± 1.5 bits/sec), which is comparable to human performance. Our results demonstrate that the spatiotemporal patterns generated in primary auditory cortex can be used to quickly and accurately identify consonant sounds from a continuous speech stream without prior knowledge of the stimulus onset times. Improved understanding of the neural mechanisms that support robust speech processing in difficult listening conditions could improve the identification and treatment of a variety of speech processing disorders. PMID:24286757

  10. Common cues to emotion in the dynamic facial expressions of speech and song.

    PubMed

    Livingstone, Steven R; Thompson, William F; Wanderley, Marcelo M; Palmer, Caroline

    2015-01-01

    Speech and song are universal forms of vocalization that may share aspects of emotional expression. Research has focused on parallels in acoustic features, overlooking facial cues to emotion. In three experiments, we compared moving facial expressions in speech and song. In Experiment 1, vocalists spoke and sang statements each with five emotions. Vocalists exhibited emotion-dependent movements of the eyebrows and lip corners that transcended speech-song differences. Vocalists' jaw movements were coupled to their acoustic intensity, exhibiting differences across emotion and speech-song. Vocalists' emotional movements extended beyond vocal sound to include large sustained expressions, suggesting a communicative function. In Experiment 2, viewers judged silent videos of vocalists' facial expressions prior to, during, and following vocalization. Emotional intentions were identified accurately for movements during and after vocalization, suggesting that these movements support the acoustic message. Experiment 3 compared emotional identification in voice-only, face-only, and face-and-voice recordings. Emotions in voice-only singing were identified poorly, yet were identified accurately in all other conditions, confirming that facial expressions conveyed emotion more accurately than the voice in song and equivalently in speech. Collectively, these findings highlight broad commonalities in the facial cues to emotion in speech and song, while also revealing differences in perception and acoustic-motor production.

  11. Minimal Pair Distinctions and Intelligibility in Preschool Children with and without Speech Sound Disorders

    ERIC Educational Resources Information Center

    Hodge, Megan M.; Gotzke, Carrie L.

    2011-01-01

    Listeners' identification of young children's productions of minimally contrastive words and predictive relationships between accurately identified words and intelligibility scores obtained from a 100-word spontaneous speech sample were determined for 36 children with typically developing speech (TDS) and 36 children with speech sound disorders…

  12. Methods and apparatus for non-acoustic speech characterization and recognition

    DOEpatents

    Holzrichter, John F.

    1999-01-01

    By simultaneously recording EM wave reflections and acoustic speech information, the positions and velocities of the speech organs as speech is articulated can be defined for each acoustic speech unit. Well defined time frames and feature vectors describing the speech, to the degree required, can be formed. Such feature vectors can uniquely characterize the speech unit being articulated each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit recognition, and organ mechanical parameters can be determined.

  13. Methods and apparatus for non-acoustic speech characterization and recognition

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Holzrichter, J.F.

    By simultaneously recording EM wave reflections and acoustic speech information, the positions and velocities of the speech organs as speech is articulated can be defined for each acoustic speech unit. Well defined time frames and feature vectors describing the speech, to the degree required, can be formed. Such feature vectors can uniquely characterize the speech unit being articulated each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit recognition, and organ mechanical parameters can be determined.

  14. Automatic Classification of Question & Answer Discourse Segments from Teacher's Speech in Classrooms

    ERIC Educational Resources Information Center

    Blanchard, Nathaniel; D'Mello, Sidney; Olney, Andrew M.; Nystrand, Martin

    2015-01-01

    Question-answer (Q&A) is fundamental for dialogic instruction, an important pedagogical technique based on the free exchange of ideas and open-ended discussion. Automatically detecting Q&A is key to providing teachers with feedback on appropriate use of dialogic instructional strategies. In line with this, this paper studies the…

  15. Automatic Online Lecture Highlighting Based on Multimedia Analysis

    ERIC Educational Resources Information Center

    Che, Xiaoyin; Yang, Haojin; Meinel, Christoph

    2018-01-01

    Textbook highlighting is widely considered to be beneficial for students. In this paper, we propose a comprehensive solution to highlight the online lecture videos in both sentence- and segment-level, just as is done with paper books. The solution is based on automatic analysis of multimedia lecture materials, such as speeches, transcripts, and…

  16. Learning diagnostic models using speech and language measures.

    PubMed

    Peintner, Bart; Jarrold, William; Vergyriy, Dimitra; Richey, Colleen; Tempini, Maria Luisa Gorno; Ogar, Jennifer

    2008-01-01

    We describe results that show the effectiveness of machine learning in the automatic diagnosis of certain neurodegenerative diseases, several of which alter speech and language production. We analyzed audio from 9 control subjects and 30 patients diagnosed with one of three subtypes of Frontotemporal Lobar Degeneration. From this data, we extracted features of the audio signal and the words the patient used, which were obtained using our automated transcription technologies. We then automatically learned models that predict the diagnosis of the patient using these features. Our results show that learned models over these features predict diagnosis with accuracy significantly better than random. Future studies using higher quality recordings will likely improve these results.
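
    As a generic, hedged sketch of this kind of diagnostic pipeline (not the authors' features or classifier), the fragment below fits a logistic regression on per-subject speech and language features with leave-one-out cross-validation, a common choice for cohorts of this size; the feature matrix is random placeholder data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical per-subject features, e.g. speech rate, pause ratio, type/token ratio.
rng = np.random.default_rng(1)
X = rng.standard_normal((39, 3))       # 9 controls + 30 patients, as in the study
y = np.array([0] * 9 + [1] * 30)       # 0 = control, 1 = patient (subtypes collapsed)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", scores.mean())
```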

  17. Estimating psycho-physiological state of a human by speech analysis

    NASA Astrophysics Data System (ADS)

    Ronzhin, A. L.

    2005-05-01

    Adverse effects of intoxication, fatigue and boredom could degrade performance of highly trained operators of complex technical systems with potentially catastrophic consequences. Existing physiological fitness-for-duty tests are time consuming, costly, invasive, and highly unpopular. Known non-physiological tests constitute a secondary task and interfere with the busy workload of the tested operator. Various attempts to assess the current status of the operator by processing of "normal operational data" often lead to excessive amounts of computation, poorly justified metrics, and ambiguity of results. At the same time, speech analysis presents a natural, non-invasive approach based upon well-established, efficient data processing. In addition, it supports both behavioral and physiological biometrics. This paper presents an approach facilitating a robust speech analysis/understanding process in spite of natural speech variability and background noise. Automatic speech recognition is suggested as a technique for the detection of changes in the psycho-physiological state of a human that typically manifest themselves as changes in the characteristics of the vocal tract and in the semantic-syntactic connectivity of conversation. Preliminary tests have confirmed that a statistically significant correlation between the error rate of automatic speech recognition and the extent of alcohol intoxication does exist. In addition, the obtained data allowed exploring some interesting correlations and establishing some quantitative models. It is proposed to utilize this approach as part of a fitness-for-duty test and to compare its efficiency with analyses of iris, face geometry, thermography and other popular non-invasive biometric techniques.

  18. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1

    NASA Astrophysics Data System (ADS)

    Garofolo, J. S.; Lamel, L. F.; Fisher, W. M.; Fiscus, J. G.; Pallett, D. S.

    1993-02-01

    The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing 8 major dialect divisions of American English, each speaking 10 phonetically-rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence. The release of TIMIT contains several improvements over the Prototype CD-ROM released in December, 1988: (1) full 630-speaker corpus, (2) checked and corrected transcriptions, (3) word-alignment transcriptions, (4) NIST SPHERE-headered waveform files and header manipulation software, (5) phonemic dictionary, (6) new test and training subsets balanced for dialectal and phonetic coverage, and (7) more extensive documentation.
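
    TIMIT's time-aligned phonetic transcriptions are stored as plain-text .PHN files with one "start end label" line per phone and boundaries given in samples at 16 kHz; a minimal parser is sketched below (the example path and the conversion to seconds are illustrative).

```python
def read_phn(path, sample_rate=16000):
    """Parse a TIMIT .PHN file into (start_sec, end_sec, phone) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            start, end, phone = line.split()
            segments.append((int(start) / sample_rate, int(end) / sample_rate, phone))
    return segments

# for seg in read_phn("TRAIN/DR1/FCJF0/SA1.PHN"):  # illustrative path layout
#     print(seg)
```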

  19. Acoustic Event Detection and Classification

    NASA Astrophysics Data System (ADS)

    Temko, Andrey; Nadeu, Climent; Macho, Dušan; Malkin, Robert; Zieger, Christian; Omologo, Maurizio

    The human activity that takes place in meeting rooms or classrooms is reflected in a rich variety of acoustic events (AE), produced either by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity. Indeed, speech is usually the most informative sound, but other kinds of AEs may also carry useful information, for example, clapping or laughing inside a speech, a strong yawn in the middle of a lecture, a chair moving or a door slam when the meeting has just started. Additionally, detection and classification of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition.

  20. Speech to Text: Today and Tomorrow. Proceedings of a Conference at Gallaudet University (Washington, D.C., September, 1988). GRI Monograh Series B, No. 2.

    ERIC Educational Resources Information Center

    Harkins, Judith E., Ed.; Virvan, Barbara M., Ed.

    The conference proceedings contains 23 papers on telephone relay service, real-time captioning, and automatic speech recognition, and a glossary. The keynote address, by Representative Major R. Owens, examines current issues in federal legislation. Other papers have the following titles and authors: "Telephone Relay Service: Rationale and…

  1. Acoustic landmarks contain more information about the phone string than other frames for automatic speech recognition with deep neural network acoustic model

    NASA Astrophysics Data System (ADS)

    He, Di; Lim, Boon Pang; Yang, Xuesong; Hasegawa-Johnson, Mark; Chen, Deming

    2018-06-01

    Most mainstream Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory is based on a contradictory idea: that some frames are more important than others. Acoustic landmark theory exploits quantal non-linearities in the articulatory-acoustic and acoustic-perceptual relations to define landmark times at which the speech spectrum abruptly changes or reaches an extremum; frames overlapping landmarks have been demonstrated to be sufficient for speech perception. In this work, we conduct experiments on the TIMIT corpus, with both GMM and DNN based ASR systems, and find that frames containing landmarks are more informative for ASR than others. We find that altering the level of emphasis on landmarks by re-weighting acoustic likelihood tends to reduce the phone error rate (PER). Furthermore, by leveraging the landmark as a heuristic, one of our hybrid DNN frame dropping strategies maintained a PER within 0.44% of optimal when scoring less than half (45.8% to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing computation.
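
    As a schematic, hedged illustration of the two manipulations discussed (re-weighting acoustic likelihoods near landmarks, and dropping frames far from them), the sketch below operates on a per-frame log-likelihood array given a list of landmark frame indices; the window half-width and weight are placeholders, not the paper's tuned values.

```python
import numpy as np

def landmark_mask(n_frames, landmark_frames, halfwidth=2):
    """Boolean mask marking frames within +/- halfwidth of any landmark."""
    mask = np.zeros(n_frames, dtype=bool)
    for t in landmark_frames:
        mask[max(0, t - halfwidth): t + halfwidth + 1] = True
    return mask

def reweight_loglik(loglik, landmark_frames, weight=1.5, halfwidth=2):
    """Scale the acoustic log-likelihoods of landmark frames by a constant."""
    mask = landmark_mask(len(loglik), landmark_frames, halfwidth)
    out = loglik.copy()
    out[mask] *= weight
    return out

def drop_non_landmark_frames(loglik, landmark_frames, halfwidth=2):
    """Keep only frames near landmarks, discarding the rest before decoding."""
    mask = landmark_mask(len(loglik), landmark_frames, halfwidth)
    return loglik[mask]
```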

  2. The Role of Categorical Speech Perception and Phonological Processing in Familial Risk Children With and Without Dyslexia.

    PubMed

    Hakvoort, Britt; de Bree, Elise; van der Leij, Aryan; Maassen, Ben; van Setten, Ellie; Maurits, Natasha; van Zuijen, Titia L

    2016-12-01

    This study assessed whether a categorical speech perception (CP) deficit is associated with dyslexia or familial risk for dyslexia, by exploring a possible cascading relation from speech perception to phonology to reading and by identifying whether speech perception distinguishes familial risk (FR) children with dyslexia (FRD) from those without dyslexia (FRND). Data were collected from 9-year-old FRD (n = 37) and FRND (n = 41) children and age-matched controls (n = 49) on CP identification and discrimination and on the phonological processing measures rapid automatized naming, phoneme awareness, and nonword repetition. The FRD group performed more poorly on CP than the FRND and control groups. Findings on phonological processing align with the literature in that (a) phonological processing related to reading and (b) the FRD group showed the lowest phonological processing outcomes. Furthermore, CP correlated weakly with reading, but this relationship was fully mediated by rapid automatized naming. Although CP phonological skills are related to dyslexia, there was no strong evidence for a cascade from CP to phonology to reading. Deficits in CP at the behavioral level are not directly associated with dyslexia.

  3. An attention-gating recurrent working memory architecture for emergent speech representation

    NASA Astrophysics Data System (ADS)

    Elshaw, Mark; Moore, Roger K.; Klein, Michael

    2010-06-01

    This paper describes an attention-gating recurrent self-organising map approach for emergent speech representation. Inspired by evidence from human cognitive processing, the architecture combines two main neural components. The first component, the attention-gating mechanism, uses actor-critic learning to perform selective attention towards speech. Through this selective attention approach, the attention-gating mechanism controls access to working memory processing. The second component, the recurrent self-organising map memory, develops a temporal-distributed representation of speech using phone-like structures. Representing speech in terms of phonetic features in an emergent self-organised fashion, according to research on child cognitive development, recreates the approach found in infants. Using this representational approach, in a fashion similar to infants, should improve the performance of automatic recognition systems through aiding speech segmentation and fast word learning.

  4. Strategies for distant speech recognition in reverberant environments

    NASA Astrophysics Data System (ADS)

    Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro

    2015-12-01

    Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.

  5. Slowed articulation rate is a sensitive diagnostic marker for identifying non-fluent primary progressive aphasia

    PubMed Central

    Cordella, Claire; Dickerson, Bradford C.; Quimby, Megan; Yunusova, Yana; Green, Jordan R.

    2016-01-01

    Background Primary progressive aphasia (PPA) is a neurodegenerative aphasic syndrome with three distinct clinical variants: non-fluent (nfvPPA), logopenic (lvPPA), and semantic (svPPA). Speech (non-) fluency is a key diagnostic marker used to aid identification of the clinical variants, and researchers have been actively developing diagnostic tools to assess speech fluency. Current approaches reveal coarse differences in fluency between subgroups, but often fail to clearly differentiate nfvPPA from the variably fluent lvPPA. More robust subtype differentiation may be possible with finer-grained measures of fluency. Aims We sought to identify the quantitative measures of speech rate—including articulation rate and pausing measures—that best differentiated PPA subtypes, specifically the non-fluent group (nfvPPA) from the more fluent groups (lvPPA, svPPA). The diagnostic accuracy of the quantitative speech rate variables was compared to that of a speech fluency impairment rating made by clinicians. Methods and Procedures Automatic estimates of pause and speech segment durations and rate measures were derived from connected speech samples of participants with PPA (N=38; 11 nfvPPA, 14 lvPPA, 13 svPPA) and healthy age-matched controls (N=8). Clinician ratings of fluency impairment were made using a previously validated clinician rating scale developed specifically for use in PPA. Receiver operating characteristic (ROC) analyses enabled a quantification of diagnostic accuracy. Outcomes and Results Among the quantitative measures, articulation rate was the most effective for differentiating between nfvPPA and the more fluent lvPPA and svPPA groups. The diagnostic accuracy of both speech and articulation rate measures was markedly better than that of the clinician rating scale, and articulation rate was the best classifier overall. Area under the curve (AUC) values for articulation rate were good to excellent for identifying nfvPPA from both svPPA (AUC=.96) and lvPPA (AUC=.86). Cross-validation of accuracy results for articulation rate showed good generalizability outside the training dataset. Conclusions Results provide empirical support for (1) the efficacy of quantitative assessments of speech fluency and (2) a distinct non-fluent PPA subtype characterized, at least in part, by an underlying disturbance in speech motor control. The trend toward improved classifier performance for quantitative rate measures demonstrates the potential for a more accurate and reliable approach to subtyping in the fluency domain, and suggests that articulation rate may be a useful input variable as part of a multi-dimensional clinical subtyping approach. PMID:28757671
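
    As a simple illustration of how such rate measures and their diagnostic accuracy can be computed, the sketch below derives articulation rate (syllables per second of speaking time, pauses excluded) from hypothetical segment annotations and evaluates it as a binary classifier with an ROC analysis. The data and threshold choices are illustrative placeholders, not the study's measurements.

    ```python
    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    def articulation_rate(speech_segments, n_syllables):
        """Syllables per second of speaking time (pause time excluded).

        speech_segments: list of (start_s, end_s) tuples for detected speech.
        n_syllables: total syllable count in the sample.
        """
        speaking_time = sum(end - start for start, end in speech_segments)
        return n_syllables / speaking_time

    # Hypothetical samples: 1 = nfvPPA, 0 = more fluent variants (lvPPA/svPPA).
    labels = np.array([1, 1, 1, 0, 0, 0, 0, 0])
    rates = np.array([2.1, 2.4, 2.6, 4.0, 4.3, 3.8, 4.6, 3.5])  # syllables/s

    # Lower articulation rate indicates nfvPPA, so negate the score so that
    # higher values correspond to the positive (nfvPPA) class.
    auc = roc_auc_score(labels, -rates)
    fpr, tpr, thresholds = roc_curve(labels, -rates)
    print(f"AUC = {auc:.2f}")
    ```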

  6. Automatic measurement of voice onset time using discriminative structured prediction.

    PubMed

    Sonderegger, Morgan; Keshet, Joseph

    2012-12-01

    A discriminative large-margin algorithm for automatic measurement of voice onset time (VOT) is described, considered as a case of predicting structured output from speech. Manually labeled data are used to train a function that takes as input a speech segment of an arbitrary length containing a voiceless stop, and outputs its VOT. The function is explicitly trained to minimize the difference between predicted and manually measured VOT; it operates on a set of acoustic feature functions designed based on spectral and temporal cues used by human VOT annotators. The algorithm is applied to initial voiceless stops from four corpora, representing different types of speech. Using several evaluation methods, the algorithm's performance is near human intertranscriber reliability, and compares favorably with previous work. Furthermore, the algorithm's performance is minimally affected by training and testing on different corpora, and remains essentially constant as the amount of training data is reduced to 50-250 manually labeled examples, demonstrating the method's practical applicability to new datasets.
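
    The paper's discriminative large-margin predictor is not reproduced here, but a much simpler energy-based heuristic conveys what a VOT measurement operates on: locate the stop burst as a sharp rise in high-frequency energy and the voicing onset as a later rise in low-frequency energy, and report the interval between them. Everything below (frame size, band edges, thresholds) is an illustrative assumption, not the cited algorithm.

    ```python
    import numpy as np
    from scipy.signal import butter, sosfilt

    def heuristic_vot(x, sr, frame_ms=5):
        """Rough VOT estimate for a segment starting near a voiceless stop burst.

        Burst onset: first frame whose high-pass (>3 kHz) energy jumps well above
        the preceding frames. Voicing onset: first later frame whose low-pass
        (<500 Hz) energy exceeds a fraction of its maximum. Returns seconds.
        """
        hp = sosfilt(butter(4, 3000, btype="high", fs=sr, output="sos"), x)
        lp = sosfilt(butter(4, 500, btype="low", fs=sr, output="sos"), x)
        hop = int(sr * frame_ms / 1000)
        frames = range(0, len(x) - hop, hop)
        e_hp = np.array([np.sum(hp[i:i + hop] ** 2) for i in frames])
        e_lp = np.array([np.sum(lp[i:i + hop] ** 2) for i in frames])

        burst = int(np.argmax(e_hp > 10 * (np.median(e_hp[:5]) + 1e-12)))
        after = e_lp[burst:]
        voicing = burst + int(np.argmax(after > 0.3 * after.max()))
        return (voicing - burst) * frame_ms / 1000.0
    ```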

  7. On the Selection of Non-Invasive Methods Based on Speech Analysis Oriented to Automatic Alzheimer Disease Diagnosis

    PubMed Central

    López-de-Ipiña, Karmele; Alonso, Jesus-Bernardino; Travieso, Carlos Manuel; Solé-Casals, Jordi; Egiraun, Harkaitz; Faundez-Zanuy, Marcos; Ezeiza, Aitzol; Barroso, Nora; Ecay-Torres, Miriam; Martinez-Lage, Pablo; de Lizardui, Unai Martinez

    2013-01-01

    The work presented here is part of a larger study to identify novel technologies and biomarkers for early Alzheimer disease (AD) detection and it focuses on evaluating the suitability of a new approach for early AD diagnosis by non-invasive methods. The purpose is to examine in a pilot study the potential of applying intelligent algorithms to speech features obtained from suspected patients in order to contribute to the improvement of diagnosis of AD and its degree of severity. In this sense, Artificial Neural Networks (ANN) have been used for the automatic classification of the two classes (AD and control subjects). Two human issues have been analyzed for feature selection: Spontaneous Speech and Emotional Response. Not only linear features but also non-linear ones, such as Fractal Dimension, have been explored. The approach is non-invasive, low-cost, and free of side effects. The experimental results obtained were very satisfactory and promising for early diagnosis and classification of AD patients. PMID:23698268
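
    One of the non-linear features mentioned, fractal dimension, can be estimated from a speech frame with Higuchi's algorithm; the sketch below is a generic implementation of that algorithm (the frame length and k_max are arbitrary choices, not the study's settings).

    ```python
    import numpy as np

    def higuchi_fd(x, k_max=10):
        """Estimate the fractal dimension of a 1-D signal with Higuchi's method."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        lengths = []
        ks = range(1, k_max + 1)
        for k in ks:
            lk = []
            for m in range(k):
                idx = np.arange(m, n, k)
                if len(idx) < 2:
                    continue
                # Normalized curve length for this offset and scale.
                dist = np.sum(np.abs(np.diff(x[idx])))
                norm = (n - 1) / ((len(idx) - 1) * k)
                lk.append(dist * norm / k)
            lengths.append(np.mean(lk))
        # Slope of log(length) versus log(1/k) estimates the fractal dimension.
        slope, _ = np.polyfit(np.log(1.0 / np.array(list(ks))), np.log(lengths), 1)
        return slope

    # Example: white noise has a fractal dimension close to 2.
    print(higuchi_fd(np.random.default_rng(1).standard_normal(2000)))
    ```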

  8. The Automation System Censor Speech for the Indonesian Rude Swear Words Based on Support Vector Machine and Pitch Analysis

    NASA Astrophysics Data System (ADS)

    Endah, S. N.; Nugraheni, D. M. K.; Adhy, S.; Sutikno

    2017-04-01

    According to Law No. 32 of 2002 and Indonesian Broadcasting Commission Regulations No. 02/P/KPI/12/2009 and No. 03/P/KPI/12/2009, broadcast programs must not scold with harsh words or harass, insult, or demean minorities and marginalized groups. However, no suitable tools exist to censor such words automatically, so research on intelligent software for automatic censoring is needed. To perform censoring, the system must first be able to recognize the words in question. This research proposes classifying speech into two classes using a Support Vector Machine (SVM): a class of rude words and a class of acceptable words. Pitch values extracted from the speech serve as the SVM input in developing the censoring system for Indonesian rude swear words. The experimental results show that the SVM performs well for this system.
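
    A minimal sketch of the classification step described above, assuming pitch (F0) contours have already been extracted and are summarized into fixed-length feature vectors; the feature summary and training data below are placeholders, not the study's corpus.

    ```python
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def pitch_features(f0_contour):
        """Summarize a pitch (F0) contour into a small fixed-length vector."""
        f0 = np.asarray(f0_contour, dtype=float)
        voiced = f0[f0 > 0]                      # ignore unvoiced frames
        return [voiced.mean(), voiced.std(), voiced.max(), voiced.min()]

    # Placeholder training data: F0 contours labeled 1 (rude) or 0 (acceptable).
    contours = [np.full(50, 180.0) + np.random.default_rng(i).normal(0, s, 50)
                for i, s in enumerate([5, 8, 30, 35, 6, 32])]
    labels = [0, 0, 1, 1, 0, 1]

    X = np.array([pitch_features(c) for c in contours])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X, labels)
    print(clf.predict([pitch_features(contours[2])]))
    ```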

  9. Behavioral and electrophysiological evidence for early and automatic detection of phonological equivalence in variable speech inputs.

    PubMed

    Kharlamov, Viktor; Campbell, Kenneth; Kazanina, Nina

    2011-11-01

    Speech sounds are not always perceived in accordance with their acoustic-phonetic content. For example, an early and automatic process of perceptual repair, which ensures conformity of speech inputs to the listener's native language phonology, applies to individual input segments that do not exist in the native inventory or to sound sequences that are illicit according to the native phonotactic restrictions on sound co-occurrences. The present study with Russian and Canadian English speakers shows that listeners may perceive phonetically distinct and licit sound sequences as equivalent when the native language system provides robust evidence for mapping multiple phonetic forms onto a single phonological representation. In Russian, due to an optional but productive t-deletion process that affects /stn/ clusters, the surface forms [sn] and [stn] may be phonologically equivalent and map to a single phonological form /stn/. In contrast, [sn] and [stn] clusters are usually phonologically distinct in (Canadian) English. Behavioral data from identification and discrimination tasks indicated that [sn] and [stn] clusters were more confusable for Russian than for English speakers. The EEG experiment employed an oddball paradigm with nonwords [asna] and [astna] used as the standard and deviant stimuli. A reliable mismatch negativity response was elicited approximately 100 msec postchange in the English group but not in the Russian group. These findings point to a perceptual repair mechanism that is engaged automatically at a prelexical level to ensure immediate encoding of speech inputs in phonological terms, which in turn enables efficient access to the meaning of a spoken utterance.

  10. A social feedback loop for speech development and its reduction in autism

    PubMed Central

    Warlaumont, Anne S.; Richards, Jeffrey A.; Gilkerson, Jill; Oller, D. Kimbrough

    2014-01-01

    We analyze the microstructure of child-adult interaction during naturalistic, daylong, automatically labeled audio recordings (13,836 hours total) of children (8- to 48-month-olds) with and without autism. We find that adult responses are more likely when child vocalizations are speech-related. In turn, a child vocalization is more likely to be speech-related if the previous speech-related child vocalization received an immediate adult response. Taken together, these results are consistent with the idea that there is a social feedback loop between child and caregiver that promotes speech-language development. Although this feedback loop applies in both typical development and autism, children with autism produce proportionally fewer speech-related vocalizations and the responses they receive are less contingent on whether their vocalizations are speech-related. We argue that such differences will diminish the strength of the social feedback loop with cascading effects on speech development over time. Differences related to socioeconomic status are also reported. PMID:24840717

  11. Random Deep Belief Networks for Recognizing Emotions from Speech Signals.

    PubMed

    Wen, Guihua; Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang

    2017-01-01

    Human emotions can now be recognized from speech signals using machine learning methods; however, these methods suffer from low recognition accuracy in real applications because they lack rich representational ability. Deep belief networks (DBNs) can automatically discover multiple levels of representation in speech signals. To exploit this advantage, this paper presents an ensemble method of random deep belief networks (RDBN) for speech emotion recognition. It first extracts low-level features from the input speech signal and uses them to construct many random subspaces. Each random subspace is fed to a DBN, which yields higher-level features that a classifier maps to an emotion label. The emotion labels from all subspaces are then fused by majority voting to decide the final emotion label for the input speech signal. Experimental results on benchmark speech emotion databases show that RDBN achieves better accuracy than the compared methods for speech emotion recognition.
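
    The random-subspace-plus-majority-vote structure can be sketched with scikit-learn; here an MLP stands in for each DBN (scikit-learn provides no DBN), so this illustrates the ensemble scheme rather than RDBN itself, and the data are placeholders.

    ```python
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    class RandomSubspaceEnsemble:
        """Train one classifier per random feature subspace; vote at prediction."""

        def __init__(self, n_members=10, subspace_size=20, seed=0):
            self.n_members = n_members
            self.subspace_size = subspace_size
            self.rng = np.random.default_rng(seed)
            self.members = []          # list of (feature_indices, fitted model)

        def fit(self, X, y):
            for _ in range(self.n_members):
                idx = self.rng.choice(X.shape[1], self.subspace_size, replace=False)
                model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
                model.fit(X[:, idx], y)
                self.members.append((idx, model))
            return self

        def predict(self, X):
            votes = np.stack([m.predict(X[:, idx]) for idx, m in self.members])
            # Majority vote across ensemble members for each sample.
            return np.array([np.bincount(col).argmax() for col in votes.T])

    # Placeholder data: 200 utterances x 120 low-level features, 4 emotion classes.
    rng = np.random.default_rng(42)
    X, y = rng.standard_normal((200, 120)), rng.integers(0, 4, 200)
    ensemble = RandomSubspaceEnsemble().fit(X, y)
    print(ensemble.predict(X[:5]))
    ```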

  12. Random Deep Belief Networks for Recognizing Emotions from Speech Signals

    PubMed Central

    Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang

    2017-01-01

    Human emotions can now be recognized from speech signals using machine learning methods; however, these methods suffer from low recognition accuracy in real applications because they lack rich representational ability. Deep belief networks (DBNs) can automatically discover multiple levels of representation in speech signals. To exploit this advantage, this paper presents an ensemble method of random deep belief networks (RDBN) for speech emotion recognition. It first extracts low-level features from the input speech signal and uses them to construct many random subspaces. Each random subspace is fed to a DBN, which yields higher-level features that a classifier maps to an emotion label. The emotion labels from all subspaces are then fused by majority voting to decide the final emotion label for the input speech signal. Experimental results on benchmark speech emotion databases show that RDBN achieves better accuracy than the compared methods for speech emotion recognition. PMID:28356908

  13. A Construction System for CALL Materials from TV News with Captions

    NASA Astrophysics Data System (ADS)

    Kobayashi, Satoshi; Tanaka, Takashi; Mori, Kazumasa; Nakagawa, Seiichi

    Many language learning materials have been published. Although repetition training is clearly necessary in language learning, it is difficult to maintain the learner's interest and motivation with existing materials, because those materials are limited in scope and content. In addition, we doubt whether the speech sounds used in most materials are natural in various situations. Nowadays, some TV news programs (CNN, ABC, PBS, NHK, etc.) have closed/open captions corresponding to the announcer's speech. We have developed a system that makes Computer Assisted Language Learning (CALL) materials for both English learning by Japanese and Japanese learning by foreign students from such captioned newscasts. This system computes the synchronization between captions and speech by using HMMs and a forced alignment algorithm. Materials made by the system have the following functions: full/partial caption display, repeated listening, consultation of an electronic dictionary, display of the user's/announcer's sound waveform and pitch contour, and automatic construction of a dictation test. The materials offer polite, natural speech on varied and timely topics, and the system further allows automatic creation of listening-comprehension tests and storage/retrieval of many materials. In this paper, we first present the organization of the system and then describe the results of questionnaires on trial use of the materials. The synchronization between captions and speech proved sufficiently accurate, and the overall feedback encouraged further development of the system.

  14. Noise Hampers Children’s Expressive Word Learning

    PubMed Central

    Riley, Kristine Grohne; McGregor, Karla K.

    2013-01-01

    Purpose To determine the effects of noise and speech style on word learning in typically developing school-age children. Method Thirty-one participants ages 9;0 (years; months) to 10;11 attempted to learn 2 sets of 8 novel words and their referents. They heard all of the words 13 times each within meaningful narrative discourse. Signal-to-noise ratio (noise vs. quiet) and speech style (plain vs. clear) were manipulated such that half of the children heard the new words in broadband white noise and half heard them in quiet; within those conditions, each child heard one set of words produced in a plain speech style and another set in a clear speech style. Results Children who were trained in quiet learned to produce the word forms more accurately than those who were trained in noise. Clear speech resulted in more accurate word form productions than plain speech, whether the children had learned in noise or quiet. Learning from clear speech in noise and plain speech in quiet produced comparable results. Conclusion Noise limits expressive vocabulary growth in children, reducing the quality of word form representation in the lexicon. Clear speech input can aid expressive vocabulary growth in children, even in noisy environments. PMID:22411494

  15. Cognitive Load in Voice Therapy Carry-Over Exercises.

    PubMed

    Iwarsson, Jenny; Morris, David Jackson; Balling, Laura Winther

    2017-01-01

    The cognitive load generated by online speech production may vary with the nature of the speech task. This article examines 3 speech tasks used in voice therapy carry-over exercises, in which a patient is required to adopt and automatize new voice behaviors, ultimately in daily spontaneous communication. Twelve subjects produced speech in 3 conditions: rote speech (weekdays), sentences in a set form, and semispontaneous speech. Subjects simultaneously performed a secondary visual discrimination task for which response times were measured. On completion of each speech task, subjects rated their experience on a questionnaire. Response times from the secondary, visual task were found to be shortest for the rote speech, longer for the semispontaneous speech, and longest for the sentences within the set framework. Principal components derived from the subjective ratings were found to be linked to response times on the secondary visual task. Acoustic measures reflecting fundamental frequency distribution and vocal fold compression varied across the speech tasks. The results indicate that consideration should be given to the selection of speech tasks during the process leading to automation of revised speech behavior and that self-reports may be a reliable index of cognitive load.

  16. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    NASA Astrophysics Data System (ADS)

    Kayasith, Prakasith; Theeramunkong, Thanaruk

    It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech with standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. With the consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce a consistent speech signal for a given word and distinct speech signals for different words. As an application, it can be used to assess speech quality and to forecast the speech recognition rate for an individual dysarthric speaker before exhaustive implementation of an automatic speech recognition system for that speaker. The effectiveness of Ψ as a predictor of recognition rate is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square difference. The evaluations were carried out by comparing its predicted recognition rates with those predicted by the standard articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were conducted on a speech corpus composed of data from eight normal speakers and eight dysarthric speakers.
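
    A plausible way to operationalize the two factors (this is an assumption; the paper's exact formula for Ψ is not reproduced here) is to measure within-word consistency as the average DTW distance between repetitions of the same word and between-word distinction as the average DTW distance between different words, and combine them as a ratio:

    ```python
    import numpy as np
    import librosa
    from itertools import combinations

    def dtw_distance(a, b, sr=16000):
        """DTW distance between the MFCC sequences of two recordings."""
        ma = librosa.feature.mfcc(y=a, sr=sr, n_mfcc=13)
        mb = librosa.feature.mfcc(y=b, sr=sr, n_mfcc=13)
        D, _ = librosa.sequence.dtw(X=ma, Y=mb, metric="euclidean")
        return D[-1, -1] / (ma.shape[1] + mb.shape[1])   # length-normalized cost

    def clarity_index(recordings_by_word, sr=16000):
        """recordings_by_word: dict mapping a word to a list of its recordings."""
        within = [dtw_distance(a, b, sr)
                  for recs in recordings_by_word.values()
                  for a, b in combinations(recs, 2)]
        between = [dtw_distance(a, b, sr)
                   for (w1, r1), (w2, r2) in combinations(recordings_by_word.items(), 2)
                   for a in r1 for b in r2]
        # Higher between-word distance (distinction) and lower within-word
        # distance (consistency) both raise the index.
        return np.mean(between) / np.mean(within)
    ```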

  17. Speech Perception Benefits of FM and Infrared Devices to Children with Hearing Aids in a Typical Classroom

    ERIC Educational Resources Information Center

    Anderson, Karen L.; Goldstein, Howard

    2004-01-01

    Children typically learn in classroom environments that have background noise and reverberation that interfere with accurate speech perception. Amplification technology can enhance the speech perception of students who are hard of hearing. Purpose: This study used a single-subject alternating treatments design to compare the speech recognition…

  18. Evidence-based occupational hearing screening II: validation of a screening methodology using measures of functional hearing ability.

    PubMed

    Soli, Sigfrid D; Amano-Kusumoto, Akiko; Clavier, Odile; Wilbur, Jed; Casto, Kristen; Freed, Daniel; Laroche, Chantal; Vaillancourt, Véronique; Giguère, Christian; Dreschler, Wouter A; Rhebergen, Koenraad S

    2018-05-01

    Validate use of the Extended Speech Intelligibility Index (ESII) for prediction of speech intelligibility in non-stationary real-world noise environments. Define a means of using these predictions for objective occupational hearing screening for hearing-critical public safety and law enforcement jobs. Analyses of predicted and measured speech intelligibility in recordings of real-world noise environments were performed in two studies using speech recognition thresholds (SRTs) and intelligibility measures. ESII analyses of the recordings were used to predict intelligibility. Noise recordings were made in prison environments and at US Army facilities for training ground and airborne forces. Speech materials included full-bandwidth sentences and bandpass-filtered sentences that simulated radio transmissions. A total of 22 adults with normal hearing (NH) and 15 with mild-moderate hearing impairment (HI) participated in the two studies. Average intelligibility predictions for individual NH and HI subjects were accurate in both studies (r² ≥ 0.94). Pooled predictions were slightly less accurate (0.78 ≤ r² ≤ 0.92). An individual's SRT and audiogram can accurately predict the likelihood of effective speech communication in noise environments with known ESII characteristics, where essential hearing-critical tasks are performed. These predictions provide an objective means of occupational hearing screening.

  19. Robot Command Interface Using an Audio-Visual Speech Recognition System

    NASA Astrophysics Data System (ADS)

    Ceballos, Alexánder; Gómez, Juan; Prieto, Flavio; Redarce, Tanneguy

    In recent years, audio-visual speech recognition has emerged as an active field of research thanks to advances in pattern recognition, signal processing, and machine vision. Its ultimate goal is to allow human-computer communication using voice, taking into account the visual information contained in the audio-visual speech signal. This paper presents an automatic command recognition system using audio-visual information. The system is intended to control the da Vinci laparoscopic robot. The audio signal is processed using the Mel Frequency Cepstral Coefficients parametrization method. In addition, features based on the points that define the outer contour of the mouth according to the MPEG-4 standard are used to extract the visual speech information.
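
    For the audio stream, the MFCC parametrization mentioned above can be computed, for instance, with librosa; the parameter values here are common defaults (25 ms windows, 10 ms hops), not necessarily those of the cited system.

    ```python
    import librosa
    import numpy as np

    def mfcc_features(wav_path, n_mfcc=13):
        """Return frame-wise MFCCs plus their deltas for a spoken command."""
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=400, hop_length=160)   # 25 ms / 10 ms
        delta = librosa.feature.delta(mfcc)
        return np.vstack([mfcc, delta]).T        # shape: (frames, 2 * n_mfcc)
    ```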

  20. Cortical activity patterns predict robust speech discrimination ability in noise

    PubMed Central

    Shetake, Jai A.; Wolf, Jordan T.; Cheung, Ryan J.; Engineer, Crystal T.; Ram, Satyananda K.; Kilgard, Michael P.

    2012-01-01

    The neural mechanisms that support speech discrimination in noisy conditions are poorly understood. In quiet conditions, spike timing information appears to be used in the discrimination of speech sounds. In this study, we evaluated the hypothesis that spike timing is also used to distinguish between speech sounds in noisy conditions that significantly degrade neural responses to speech sounds. We tested speech sound discrimination in rats and recorded primary auditory cortex (A1) responses to speech sounds in background noise of different intensities and spectral compositions. Our behavioral results indicate that rats, like humans, are able to accurately discriminate consonant sounds even in the presence of background noise that is as loud as the speech signal. Our neural recordings confirm that speech sounds evoke degraded but detectable responses in noise. Finally, we developed a novel neural classifier that mimics behavioral discrimination. The classifier discriminates between speech sounds by comparing the A1 spatiotemporal activity patterns evoked on single trials with the average spatiotemporal patterns evoked by known sounds. Unlike classifiers in most previous studies, this classifier is not provided with the stimulus onset time. Neural activity analyzed with the use of relative spike timing was well correlated with behavioral speech discrimination in quiet and in noise. Spike timing information integrated over longer intervals was required to accurately predict rat behavioral speech discrimination in noisy conditions. The similarity of neural and behavioral discrimination of speech in noise suggests that humans and rats may employ similar brain mechanisms to solve this problem. PMID:22098331
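
    The classifier's core idea, comparing a single-trial spatiotemporal response with the average pattern evoked by each known sound, can be sketched in a few lines. This is a simplified nearest-template classifier using Euclidean distance on placeholder data; the study's actual classifier details differ.

    ```python
    import numpy as np

    def train_templates(trials, labels):
        """Average the spatiotemporal activity patterns for each speech sound.

        trials: array of shape (n_trials, n_sites, n_timebins); labels: sound ids.
        """
        return {lab: trials[labels == lab].mean(axis=0) for lab in np.unique(labels)}

    def classify_trial(trial, templates):
        """Assign the single trial to the sound whose template is closest."""
        dists = {lab: np.linalg.norm(trial - tmpl) for lab, tmpl in templates.items()}
        return min(dists, key=dists.get)

    # Placeholder data: 40 trials, 32 recording sites, 100 time bins, 2 sounds.
    rng = np.random.default_rng(3)
    trials = rng.standard_normal((40, 32, 100))
    labels = np.repeat([0, 1], 20)
    trials[labels == 1] += 0.3                  # give class 1 a distinct pattern
    templates = train_templates(trials, labels)
    print(classify_trial(trials[25], templates))
    ```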

  1. Stylistic gait synthesis based on hidden Markov models

    NASA Astrophysics Data System (ADS)

    Tilmanne, Joëlle; Moinet, Alexis; Dutoit, Thierry

    2012-12-01

    In this work we present an expressive gait synthesis system based on hidden Markov models (HMMs), following and modifying a procedure originally developed for speaking style adaptation, in speech synthesis. A large database of neutral motion capture walk sequences was used to train an HMM of average walk. The model was then used for automatic adaptation to a particular style of walk using only a small amount of training data from the target style. The open source toolkit that we adapted for motion modeling also enabled us to take into account the dynamics of the data and to model accurately the duration of each HMM state. We also address the assessment issue and propose a procedure for qualitative user evaluation of the synthesized sequences. Our tests show that the style of these sequences can easily be recognized and look natural to the evaluators.

  2. Perceiving pain in others: validation of a dual processing model.

    PubMed

    McCrystal, Kalie N; Craig, Kenneth D; Versloot, Judith; Fashler, Samantha R; Jones, Daniel N

    2011-05-01

    Accurate perception of another person's painful distress would appear to be accomplished through sensitivity to both automatic (unintentional, reflexive) and controlled (intentional, purposive) behavioural expression. We examined whether observers would construe diverse behavioural cues as falling within these domains, consistent with cognitive neuroscience findings describing activation of both automatic and controlled neuroregulatory processes. Using online survey methodology, 308 research participants rated behavioural cues as "goal directed vs. non-goal directed," "conscious vs. unconscious," "uncontrolled vs. controlled," "fast vs. slow," "intentional (deliberate) vs. unintentional," "stimulus driven (obligatory) vs. self driven," and "requiring contemplation vs. not requiring contemplation." The behavioural cues were the 39 items provided by the PROMIS pain behaviour bank, constructed to be representative of the diverse possibilities for pain expression. Inter-item correlations among rating scales provided evidence of sufficient internal consistency justifying a single score on an automatic/controlled dimension (excluding the inconsistent fast vs. slow scale). An initial exploratory factor analysis on 151 participant data sets yielded factors consistent with "controlled" and "automatic" actions, as well as behaviours characterized as "ambiguous." A confirmatory factor analysis using the remaining 151 data sets replicated EFA findings, supporting theoretical predictions that observers would distinguish immediate, reflexive, and spontaneous reactions (primarily facial expression and paralinguistic features of speech) from purposeful and controlled expression (verbal behaviour, instrumental behaviour requiring ongoing, integrated responses). There are implicit dispositions to organize cues signaling pain in others into the well-defined categories predicted by dual process theory. Copyright © 2011 International Association for the Study of Pain. Published by Elsevier B.V. All rights reserved.

  3. How should a speech recognizer work?

    PubMed

    Scharenborg, Odette; Norris, Dennis; Bosch, Louis; McQueen, James M

    2005-11-12

    Although researchers studying human speech recognition (HSR) and automatic speech recognition (ASR) share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of communication follows largely from the fact that research in these related fields has focused on the mechanics of how speech can be recognized. In Marr's (1982) terms, emphasis has been on the algorithmic and implementational levels rather than on the computational level. In this article, we provide a computational-level analysis of the task of speech recognition, which reveals the close parallels between research concerned with HSR and ASR. We illustrate this relation by presenting a new computational model of human spoken-word recognition, built using techniques from the field of ASR that, in contrast to current existing models of HSR, recognizes words from real speech input. 2005 Lawrence Erlbaum Associates, Inc.

  4. A social feedback loop for speech development and its reduction in autism.

    PubMed

    Warlaumont, Anne S; Richards, Jeffrey A; Gilkerson, Jill; Oller, D Kimbrough

    2014-07-01

    We analyzed the microstructure of child-adult interaction during naturalistic, daylong, automatically labeled audio recordings (13,836 hr total) of children (8- to 48-month-olds) with and without autism. We found that an adult was more likely to respond when the child's vocalization was speech related rather than not speech related. In turn, a child's vocalization was more likely to be speech related if the child's previous speech-related vocalization had received an immediate adult response rather than no response. Taken together, these results are consistent with the idea that there is a social feedback loop between child and caregiver that promotes speech development. Although this feedback loop applies in both typical development and autism, children with autism produced proportionally fewer speech-related vocalizations, and the responses they received were less contingent on whether their vocalizations were speech related. We argue that such differences will diminish the strength of the social feedback loop and have cascading effects on speech development over time. Differences related to socioeconomic status are also reported. © The Author(s) 2014.

  5. Processing of speech signals for physical and sensory disabilities.

    PubMed Central

    Levitt, H

    1995-01-01

    Assistive technology involving voice communication is used primarily by people who are deaf, hard of hearing, or who have speech and/or language disabilities. It is also used to a lesser extent by people with visual or motor disabilities. A very wide range of devices has been developed for people with hearing loss. These devices can be categorized not only by the modality of stimulation [i.e., auditory, visual, tactile, or direct electrical stimulation of the auditory nerve (auditory-neural)] but also in terms of the degree of speech processing that is used. At least four such categories can be distinguished: assistive devices (a) that are not designed specifically for speech, (b) that take the average characteristics of speech into account, (c) that process articulatory or phonetic characteristics of speech, and (d) that embody some degree of automatic speech recognition. Assistive devices for people with speech and/or language disabilities typically involve some form of speech synthesis or symbol generation for severe forms of language disability. Speech synthesis is also used in text-to-speech systems for sightless persons. Other applications of assistive technology involving voice communication include voice control of wheelchairs and other devices for people with mobility disabilities. PMID:7479816

  6. Processing of Speech Signals for Physical and Sensory Disabilities

    NASA Astrophysics Data System (ADS)

    Levitt, Harry

    1995-10-01

    Assistive technology involving voice communication is used primarily by people who are deaf, hard of hearing, or who have speech and/or language disabilities. It is also used to a lesser extent by people with visual or motor disabilities. A very wide range of devices has been developed for people with hearing loss. These devices can be categorized not only by the modality of stimulation [i.e., auditory, visual, tactile, or direct electrical stimulation of the auditory nerve (auditory-neural)] but also in terms of the degree of speech processing that is used. At least four such categories can be distinguished: assistive devices (a) that are not designed specifically for speech, (b) that take the average characteristics of speech into account, (c) that process articulatory or phonetic characteristics of speech, and (d) that embody some degree of automatic speech recognition. Assistive devices for people with speech and/or language disabilities typically involve some form of speech synthesis or symbol generation for severe forms of language disability. Speech synthesis is also used in text-to-speech systems for sightless persons. Other applications of assistive technology involving voice communication include voice control of wheelchairs and other devices for people with mobility disabilities.

  7. Joint Spatial-Spectral Feature Space Clustering for Speech Activity Detection from ECoG Signals

    PubMed Central

    Kanas, Vasileios G.; Mporas, Iosif; Benz, Heather L.; Sgarbas, Kyriakos N.; Bezerianos, Anastasios; Crone, Nathan E.

    2014-01-01

    Brain machine interfaces for speech restoration have been extensively studied for more than two decades. The success of such a system will depend in part on selecting the best brain recording sites and signal features corresponding to speech production. The purpose of this study was to detect speech activity automatically from electrocorticographic signals based on joint spatial-frequency clustering of the ECoG feature space. For this study, the ECoG signals were recorded while a subject performed two different syllable repetition tasks. We found that the optimal frequency resolution to detect speech activity from ECoG signals was 8 Hz, achieving 98.8% accuracy by employing support vector machines (SVM) as a classifier. We also defined the cortical areas that held the most information about the discrimination of speech and non-speech time intervals. Additionally, the results shed light on the distinct cortical areas associated with the two syllable repetition tasks and may contribute to the development of portable ECoG-based communication. PMID:24658248
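
    A minimal sketch of such a detection pipeline, assuming ECoG epochs are already cut into labeled windows: band power is estimated with Welch's method at roughly 8 Hz resolution and fed to an SVM. The channel count, sampling rate, band limit, and data below are placeholders, not the study's recordings.

    ```python
    import numpy as np
    from scipy.signal import welch
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    FS = 1000                      # ECoG sampling rate (placeholder)

    def band_power_features(epoch, fs=FS, resolution_hz=8, fmax=200):
        """Concatenate per-channel spectral power in bins ~resolution_hz wide."""
        nperseg = int(fs / resolution_hz)            # sets the frequency resolution
        feats = []
        for channel in epoch:                        # epoch: (n_channels, n_samples)
            f, pxx = welch(channel, fs=fs, nperseg=nperseg)
            feats.append(np.log(pxx[f <= fmax] + 1e-12))
        return np.concatenate(feats)

    # Placeholder epochs: 100 windows, 16 channels, 1 s each; 1 = speech, 0 = rest.
    rng = np.random.default_rng(7)
    epochs = rng.standard_normal((100, 16, FS))
    labels = rng.integers(0, 2, 100)
    X = np.array([band_power_features(e) for e in epochs])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, labels)
    print(clf.score(X, labels))
    ```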

  8. [Creating a language model of the forensic medicine domain for developing an autopsy recording system by automatic speech recognition].

    PubMed

    Niijima, H; Ito, N; Ogino, S; Takatori, T; Iwase, H; Kobayashi, M

    2000-11-01

    To put speech recognition technology to practical use for recording forensic autopsies, a language model specialized for forensic autopsy was developed. A trigram (3-gram) language model for the forensic autopsy domain was created and combined with a Hidden Markov Model acoustic model for Japanese speech recognition to customize the recognition engine for forensic autopsy. A forensic vocabulary of over 10,000 words was compiled and some 300,000 sentence patterns were generated to build the forensic language model, which was then mixed with a general language model in appropriate proportions to attain high accuracy. When tested by dictating autopsy findings, the system achieved a recognition rate of about 95%, which appears to reach practical usability as recognition software, although room remains for improving the hardware and application-layer software.
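
    The key modeling step, a domain trigram model linearly interpolated ("mixed") with a general model, can be sketched as follows. This is a toy maximum-likelihood version with a single mixing weight and toy corpora; real systems add smoothing and train on far larger text collections.

    ```python
    from collections import Counter

    def trigram_probs(sentences):
        """Maximum-likelihood trigram probabilities from tokenized sentences."""
        tri, bi = Counter(), Counter()
        for toks in sentences:
            toks = ["<s>", "<s>"] + toks + ["</s>"]
            for i in range(2, len(toks)):
                tri[tuple(toks[i - 2:i + 1])] += 1
                bi[tuple(toks[i - 2:i])] += 1
        return lambda w1, w2, w3: tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0

    # Toy corpora standing in for the forensic and general training text.
    forensic = [["the", "liver", "shows", "congestion"],
                ["the", "heart", "shows", "no", "abnormality"]]
    general = [["the", "weather", "shows", "no", "change"],
               ["the", "heart", "of", "the", "city"]]

    p_dom, p_gen = trigram_probs(forensic), trigram_probs(general)
    lam = 0.8                                   # weight of the domain model

    def p_mixed(w1, w2, w3):
        """Linearly interpolated trigram probability."""
        return lam * p_dom(w1, w2, w3) + (1 - lam) * p_gen(w1, w2, w3)

    print(p_mixed("the", "heart", "shows"))
    ```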

  9. Optimal pattern synthesis for speech recognition based on principal component analysis

    NASA Astrophysics Data System (ADS)

    Korsun, O. N.; Poliyev, A. V.

    2018-02-01

    This work develops and presents an algorithm for building an optimal pattern for automatic speech recognition that increases the probability of correct recognition. The optimal pattern is formed by decomposing an initial pattern into principal components, which reduces the dimensionality of the multi-parameter optimization problem. Training samples are then introduced, and optimal estimates of the principal-component decomposition coefficients are obtained by a numerical parameter optimization algorithm. Finally, experimental results are presented that show the improvement in speech recognition achieved by the proposed optimization algorithm.
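
    The described scheme (decompose an initial pattern into principal components, then numerically optimize the low-dimensional coefficients against training samples) can be sketched as below. The objective here, the mean distance to training samples, is an illustrative stand-in for whatever recognition-probability criterion the paper optimizes, and the data are placeholders.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    training = rng.standard_normal((50, 40)) + 2.0   # placeholder feature vectors
    initial_pattern = training[0]

    # Decompose: keep a handful of principal components of the training samples.
    pca = PCA(n_components=5).fit(training)
    c0 = pca.transform(initial_pattern[None, :])[0]  # initial coefficients

    def objective(coeffs):
        """Illustrative criterion: mean distance of the pattern to the samples."""
        pattern = pca.inverse_transform(coeffs[None, :])[0]
        return np.mean(np.linalg.norm(training - pattern, axis=1))

    # Optimize in the low-dimensional coefficient space, then reconstruct.
    result = minimize(objective, c0, method="Nelder-Mead")
    optimal_pattern = pca.inverse_transform(result.x[None, :])[0]
    print(result.fun)
    ```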

  10. Auditory models for speech analysis

    NASA Astrophysics Data System (ADS)

    Maybury, Mark T.

    This paper reviews the psychophysical basis for auditory models and discusses their application to automatic speech recognition. First an overview of the human auditory system is presented, followed by a review of current knowledge gleaned from neurological and psychoacoustic experimentation. Next, a general framework describes established peripheral auditory models which are based on well-understood properties of the peripheral auditory system. This is followed by a discussion of current enhancements to these models to include nonlinearities and synchrony information as well as other higher auditory functions. Finally, the initial performance of auditory models in the task of speech recognition is examined and additional applications are mentioned.

  11. Effects and modeling of phonetic and acoustic confusions in accented speech.

    PubMed

    Fung, Pascale; Liu, Yi

    2005-11-01

    Accented speech recognition is more challenging than standard speech recognition due to the effects of phonetic and acoustic confusions. Phonetic confusion in accented speech occurs when an expected phone is pronounced as a different one, which leads to erroneous recognition. Acoustic confusion occurs when the pronounced phone is found to lie acoustically between two baseform models and can be equally recognized as either one. We propose that it is necessary to analyze and model these confusions separately in order to improve accented speech recognition without degrading standard speech recognition. Since low phonetic confusion units in accented speech do not give rise to automatic speech recognition errors, we focus on analyzing and reducing phonetic and acoustic confusability under high phonetic confusion conditions. We propose using likelihood ratio test to measure phonetic confusion, and asymmetric acoustic distance to measure acoustic confusion. Only accent-specific phonetic units with low acoustic confusion are used in an augmented pronunciation dictionary, while phonetic units with high acoustic confusion are reconstructed using decision tree merging. Experimental results show that our approach is effective and superior to methods modeling phonetic confusion or acoustic confusion alone in accented speech, with a significant 5.7% absolute WER reduction, without degrading standard speech recognition.

  12. Pathways of the inferior frontal occipital fasciculus in overt speech and reading.

    PubMed

    Rollans, Claire; Cheema, Kulpreet; Georgiou, George K; Cummine, Jacqueline

    2017-11-19

    In this study, we examined the relationship between tractography-based measures of white matter integrity (e.g., fractional anisotropy [FA]) from diffusion tensor imaging (DTI) and five reading-related tasks, including rapid automatized naming (RAN) of letters, digits, and objects, and reading of real words and nonwords. Twenty university students with no reported history of reading difficulties were tested on all five tasks and their performance was correlated with diffusion measures extracted through DTI tractography. A secondary analysis using whole-brain Tract-Based Spatial Statistics (TBSS) was also used to find clusters showing significant negative correlations between reaction time and FA. Results showed a significant relationship between the left inferior fronto-occipital fasciculus FA and performance on the RAN of objects task, as well as a strong relationship to nonword reading, which suggests a role for this tract in slower, non-automatic and/or resource-demanding speech tasks. There were no significant relationships between FA and the faster, more automatic speech tasks (RAN of letters and digits, and real word reading). These findings provide evidence for the role of the inferior fronto-occipital fasciculus in tasks that are highly demanding of orthography-phonology translation (e.g., nonword reading) and semantic processing (e.g., RAN object). This demonstrates the importance of the inferior fronto-occipital fasciculus in basic naming and suggests that this tract may be a sensitive predictor of rapid naming performance within the typical population. We discuss the findings in the context of current models of reading and speech production to further characterize the white matter pathways associated with basic reading processes. Copyright © 2017 IBRO. Published by Elsevier Ltd. All rights reserved.

  13. Adaptive method of recognition of signals for one and two-frequency signal system in the telephony on the background of speech

    NASA Astrophysics Data System (ADS)

    Kuznetsov, Michael V.

    2006-05-01

    Reliable interworking of various automatic telecommunication systems, including transmission systems of optical communication networks, requires dependable recognition of the signals of one- and two-frequency service signalling systems. Analysis of the time parameters of a received signal increases the reliability of detecting and recognizing the service signalling against a background of speech.
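
    Detecting one- and two-frequency service tones against a speech background is commonly done frequency-selectively, for example with the Goertzel algorithm plus a check that the tones persist for the required duration; the sketch below illustrates that general approach (the specific frequencies, block length, and thresholds are placeholders, not taken from the paper).

    ```python
    import numpy as np

    def goertzel_power(block, freq, fs):
        """Power of a single frequency component in a block (Goertzel algorithm)."""
        k = round(freq * len(block) / fs)
        w = 2 * np.pi * k / len(block)
        coeff = 2 * np.cos(w)
        s_prev = s_prev2 = 0.0
        for sample in block:
            s = sample + coeff * s_prev - s_prev2
            s_prev2, s_prev = s_prev, s
        return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

    def detect_tone(x, freqs, fs=8000, block=205, min_blocks=4, ratio=5.0):
        """True if all service frequencies stay dominant for min_blocks blocks."""
        consecutive = 0
        for start in range(0, len(x) - block, block):
            chunk = x[start:start + block]
            total = np.sum(chunk ** 2) + 1e-12
            strong = all(goertzel_power(chunk, f, fs) / total > ratio for f in freqs)
            consecutive = consecutive + 1 if strong else 0
            if consecutive >= min_blocks:
                return True
        return False

    # Example: a 1,000 Hz service tone buried in noise standing in for speech.
    fs = 8000
    t = np.arange(fs) / fs
    signal = np.sin(2 * np.pi * 1000 * t) + 0.3 * np.random.default_rng(0).standard_normal(fs)
    print(detect_tone(signal, freqs=[1000], fs=fs))
    ```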

  14. Open-Source Multi-Language Audio Database for Spoken Language Processing Applications

    DTIC Science & Technology

    2012-12-01

    300 passages were collected in each of three languages: English, Mandarin, and Russian. Approximately 30 hours of speech were collected for each language. Each passage has been carefully transcribed at the ... manual and automatic methods. The Russian passages have not yet been marked at the phonetic level. Another phase of the work was to explore ... YouTube.

  15. A Cross-Lingual Mobile Medical Communication System Prototype for Foreigners and Subjects with Speech, Hearing, and Mental Disabilities Based on Pictograms

    PubMed Central

    Wołk, Agnieszka; Glinkowski, Wojciech

    2017-01-01

    People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they are dependent on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition are dependent on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer. PMID:29230254

  16. A Cross-Lingual Mobile Medical Communication System Prototype for Foreigners and Subjects with Speech, Hearing, and Mental Disabilities Based on Pictograms.

    PubMed

    Wołk, Krzysztof; Wołk, Agnieszka; Glinkowski, Wojciech

    2017-01-01

    People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they are dependent on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition are dependent on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer.

  17. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    PubMed Central

    Caballero-Morales, Santiago-Omar

    2013-01-01

    An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of emotional state of Mexican Spanish speech. PMID:23935410

  18. The Psychologist as an Interlocutor in Autism Spectrum Disorder Assessment: Insights From a Study of Spontaneous Prosody

    PubMed Central

    Bone, Daniel; Lee, Chi-Chun; Black, Matthew P.; Williams, Marian E.; Lee, Sungbok; Levitt, Pat; Narayanan, Shrikanth

    2015-01-01

    Purpose The purpose of this study was to examine relationships between prosodic speech cues and autism spectrum disorder (ASD) severity, hypothesizing a mutually interactive relationship between the speech characteristics of the psychologist and the child. The authors objectively quantified acoustic-prosodic cues of the psychologist and of the child with ASD during spontaneous interaction, establishing a methodology for future large-sample analysis. Method Speech acoustic-prosodic features were semiautomatically derived from segments of semistructured interviews (Autism Diagnostic Observation Schedule, ADOS; Lord, Rutter, DiLavore, & Risi, 1999; Lord et al., 2012) with 28 children who had previously been diagnosed with ASD. Prosody was quantified in terms of intonation, volume, rate, and voice quality. Research hypotheses were tested via correlation as well as hierarchical and predictive regression between ADOS severity and prosodic cues. Results Automatically extracted speech features demonstrated prosodic characteristics of dyadic interactions. As rated ASD severity increased, both the psychologist and the child demonstrated effects for turn-end pitch slope, and both spoke with atypical voice quality. The psychologist’s acoustic cues predicted the child’s symptom severity better than did the child’s acoustic cues. Conclusion The psychologist, acting as evaluator and interlocutor, was shown to adjust his or her behavior in predictable ways based on the child’s social-communicative impairments. The results support future study of speech prosody of both interaction partners during spontaneous conversation, while using automatic computational methods that allow for scalable analysis on much larger corpora. PMID:24686340

  19. Speech recognition systems on the Cell Broadband Engine

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, Y; Jones, H; Vaidya, S

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  20. Recognizing Whispered Speech Produced by an Individual with Surgically Reconstructed Larynx Using Articulatory Movement Data

    PubMed Central

    Cao, Beiming; Kim, Myungjong; Mau, Ted; Wang, Jun

    2017-01-01

    Individuals with an impaired larynx (vocal folds) have problems controlling their glottal vibration, producing whispered speech with extreme hoarseness. Standard automatic speech recognition using only acoustic cues is typically ineffective for whispered speech because the corresponding spectral characteristics are distorted. Articulatory cues such as tongue and lip motion may help in recognizing whispered speech since articulatory motion patterns are generally not affected. In this paper, we investigated whispered speech recognition for patients with a reconstructed larynx using articulatory movement data. A data set with both acoustic and articulatory motion data was collected from a patient with a surgically reconstructed larynx using an electromagnetic articulograph. Two speech recognition systems, Gaussian mixture model-hidden Markov model (GMM-HMM) and deep neural network-HMM (DNN-HMM), were used in the experiments. Experimental results showed that adding either tongue or lip motion data to acoustic features such as mel-frequency cepstral coefficients (MFCCs) significantly reduced the phone error rates of both speech recognition systems. Adding both tongue and lip data achieved the best performance. PMID:29423453

  1. Effect of Dialect on the Identification of Speech Impairment in Indigenous Children

    ERIC Educational Resources Information Center

    Laffey, Kate; Pearce, Wendy M.; Steed, William

    2014-01-01

    The influence of dialect on child speech assessment processes is important to consider in order to ensure accurate diagnosis and appropriate intervention (teaching or therapy) for bidialectal children. In Australia, there is limited research evidence documenting the influence of dialectal variations on identification of speech impairment among…

  2. Brain Diseases

    MedlinePlus

    The brain is the control center of the body. It controls thoughts, memory, speech, and movement. It regulates the function of many organs. When the brain is healthy, it works quickly and automatically. However, ...

  3. Two distinct auditory-motor circuits for monitoring speech production as revealed by content-specific suppression of auditory cortex.

    PubMed

    Ylinen, Sari; Nora, Anni; Leminen, Alina; Hakala, Tero; Huotilainen, Minna; Shtyrov, Yury; Mäkelä, Jyrki P; Service, Elisabet

    2015-06-01

    Speech production, both overt and covert, down-regulates the activation of auditory cortex. This is thought to be due to forward prediction of the sensory consequences of speech, contributing to a feedback control mechanism for speech production. Critically, however, these regulatory effects should be specific to speech content to enable accurate speech monitoring. To determine the extent to which such forward prediction is content-specific, we recorded the brain's neuromagnetic responses to heard multisyllabic pseudowords during covert rehearsal in working memory, contrasted with a control task. The cortical auditory processing of target syllables was significantly suppressed during rehearsal compared with control, but only when they matched the rehearsed items. This critical specificity to speech content enables accurate speech monitoring by forward prediction, as proposed by current models of speech production. The one-to-one phonological motor-to-auditory mappings also appear to serve the maintenance of information in phonological working memory. Further findings of right-hemispheric suppression in the case of whole-item matches and left-hemispheric enhancement for last-syllable mismatches suggest that speech production is monitored by 2 auditory-motor circuits operating on different timescales: Finer grain in the left versus coarser grain in the right hemisphere. Taken together, our findings provide hemisphere-specific evidence of the interface between inner and heard speech. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  4. Data analysis and integration of environmental sensors to meet human needs

    NASA Astrophysics Data System (ADS)

    Santamaria, Amilcare Francesco; De Rango, Floriano; Barletta, Domenico; Falbo, Domenico; Imbrogno, Alessandro

    2014-05-01

    Nowadays, one of the main tasks of technology is to make people's lives simpler and easier. Ambient intelligence is an emerging discipline that brings intelligence to environments, making them sensitive to us. This discipline has developed with the spread of sensor devices, sensor networks, pervasive computing, and artificial intelligence. In this work, we attempt to enhance the Internet of Things (IoT) with intelligence by exploring various interactions between human beings and the environment they live in. In particular, the core of the system is an automation system made up of a domotic control unit and several sensors installed in the environment. The task of the sensors is to collect information from the environment and send it to the control unit. Once the information is collected, the core combines it in order to infer the most accurate human needs. The inferred human needs and the current environment status form the inputs of the intelligence block, whose main goal is to find the right automations to satisfy human needs in real time. The system also provides a speech recognition service that allows users to interact with the system by voice, so human speech can be considered an additional input for smart automatisms.

  5. Automatic detection of swallowing events by acoustical means for applications of monitoring of ingestive behavior.

    PubMed

    Sazonov, Edward S; Makeyev, Oleksandr; Schuckers, Stephanie; Lopez-Meyer, Paulo; Melanson, Edward L; Neuman, Michael R

    2010-03-01

    Our understanding of etiology of obesity and overweight is incomplete due to lack of objective and accurate methods for monitoring of ingestive behavior (MIB) in the free-living population. Our research has shown that frequency of swallowing may serve as a predictor for detecting food intake, differentiating liquids and solids, and estimating ingested mass. This paper proposes and compares two methods of acoustical swallowing detection from sounds contaminated by motion artifacts, speech, and external noise. Methods based on mel-scale Fourier spectrum, wavelet packets, and support vector machines are studied considering the effects of epoch size, level of decomposition, and lagging on classification accuracy. The methodology was tested on a large dataset (64.5 h with a total of 9966 swallows) collected from 20 human subjects with various degrees of adiposity. Average weighted epoch-recognition accuracy for intravisit individual models was 96.8%, which resulted in 84.7% average weighted accuracy in detection of swallowing events. These results suggest high efficiency of the proposed methodology in separation of swallowing sounds from artifacts that originate from respiration, intrinsic speech, head movements, food ingestion, and ambient noise. The recognition accuracy was not related to body mass index, suggesting that the methodology is suitable for obese individuals.

  6. Automatic Detection of Swallowing Events by Acoustical Means for Applications of Monitoring of Ingestive Behavior

    PubMed Central

    Sazonov, Edward S.; Makeyev, Oleksandr; Schuckers, Stephanie; Lopez-Meyer, Paulo; Melanson, Edward L.; Neuman, Michael R.

    2010-01-01

    Our understanding of etiology of obesity and overweight is incomplete due to lack of objective and accurate methods for Monitoring of Ingestive Behavior (MIB) in the free living population. Our research has shown that frequency of swallowing may serve as a predictor for detecting food intake, differentiating liquids and solids, and estimating ingested mass. This paper proposes and compares two methods of acoustical swallowing detection from sounds contaminated by motion artifacts, speech and external noise. Methods based on mel-scale Fourier spectrum, wavelet packets, and support vector machines are studied considering the effects of epoch size, level of decomposition and lagging on classification accuracy. The methodology was tested on a large dataset (64.5 hours with a total of 9,966 swallows) collected from 20 human subjects with various degrees of adiposity. Average weighted epoch recognition accuracy for intra-visit individual models was 96.8% which resulted in 84.7% average weighted accuracy in detection of swallowing events. These results suggest high efficiency of the proposed methodology in separation of swallowing sounds from artifacts that originate from respiration, intrinsic speech, head movements, food ingestion, and ambient noise. The recognition accuracy was not related to body mass index, suggesting that the methodology is suitable for obese individuals. PMID:19789095

  7. A Joint Time-Frequency and Matrix Decomposition Feature Extraction Methodology for Pathological Voice Classification

    NASA Astrophysics Data System (ADS)

    Ghoraani, Behnaz; Krishnan, Sridhar

    2009-12-01

    The number of people affected by speech problems is increasing as the modern world places increasing demands on the human voice via mobile telephones, voice recognition software, and interpersonal verbal communications. In this paper, we propose a novel methodology for automatic pattern classification of pathological voices. The main contribution of this paper is extraction of meaningful and unique features using Adaptive time-frequency distribution (TFD) and nonnegative matrix factorization (NMF). We construct Adaptive TFD as an effective signal analysis domain to dynamically track the nonstationarity in the speech and utilize NMF as a matrix decomposition (MD) technique to quantify the constructed TFD. The proposed method extracts meaningful and unique features from the joint TFD of the speech, and automatically identifies and measures the abnormality of the signal. Depending on the abnormality measure of each signal, we classify the signal into normal or pathological. The proposed method is applied on the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database which consists of 161 pathological and 51 normal speakers, and an overall classification accuracy of 98.6% was achieved.
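
    As a rough illustration of the TFD-plus-matrix-decomposition idea described above, the sketch below factorizes an ordinary power spectrogram (standing in for the adaptive TFD) with scikit-learn's NMF and summarizes the factors as a feature vector; the component count and the summary statistics are assumptions, not the authors' feature set.

        import numpy as np
        from scipy.signal import spectrogram
        from sklearn.decomposition import NMF

        def tfd_nmf_features(signal, sr, n_components=2):
            """Decompose a nonnegative time-frequency matrix and summarise its factors."""
            _, _, S = spectrogram(signal, fs=sr, nperseg=512)   # power spectrogram (nonnegative)
            model = NMF(n_components=n_components, init="nndsvda", max_iter=500)
            W = model.fit_transform(S)     # spectral base vectors
            H = model.components_          # corresponding temporal activations
            return np.concatenate([W.mean(axis=0), W.std(axis=0),
                                   H.mean(axis=1), H.std(axis=1)])

        # The resulting per-signal vectors can then feed any normal-vs-pathological classifier.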

  8. Speech training alters tone frequency tuning in rat primary auditory cortex

    PubMed Central

    Engineer, Crystal T.; Perez, Claudia A.; Carraway, Ryan S.; Chang, Kevin Q.; Roland, Jarod L.; Kilgard, Michael P.

    2013-01-01

    Previous studies in both humans and animals have documented improved performance following discrimination training. This enhanced performance is often associated with cortical response changes. In this study, we tested the hypothesis that long-term speech training on multiple tasks can improve primary auditory cortex (A1) responses compared to rats trained on a single speech discrimination task or experimentally naïve rats. Specifically, we compared the percent of A1 responding to trained sounds, the responses to both trained and untrained sounds, receptive field properties of A1 neurons, and the neural discrimination of pairs of speech sounds in speech trained and naïve rats. Speech training led to accurate discrimination of consonant and vowel sounds, but did not enhance A1 response strength or the neural discrimination of these sounds. Speech training altered tone responses in rats trained on six speech discrimination tasks but not in rats trained on a single speech discrimination task. Extensive speech training resulted in broader frequency tuning, shorter onset latencies, a decreased driven response to tones, and caused a shift in the frequency map to favor tones in the range where speech sounds are the loudest. Both the number of trained tasks and the number of days of training strongly predict the percent of A1 responding to a low frequency tone. Rats trained on a single speech discrimination task performed less accurately than rats trained on multiple tasks and did not exhibit A1 response changes. Our results indicate that extensive speech training can reorganize the A1 frequency map, which may have downstream consequences on speech sound processing. PMID:24344364

  9. Computational Modeling of Emotions and Affect in Social-Cultural Interaction

    DTIC Science & Technology

    2013-10-02

    acoustic and textual information sources. Second, a cross-lingual study was performed that shed light on how human perception and automatic recognition...speech is produced, a speaker’s pitch and intonational pattern, and word usage. Better feature representation and advanced approaches were used to...recognition performance, and improved our understanding of language/cultural impact on human perception of emotion and automatic classification.

  10. Automatic Barometric Updates from Ground-Based Navigational Aids

    DTIC Science & Technology

    1990-03-12

    Automatic Barometric Updates from Ground-Based Navigational Aids. US Department of Transportation, Federal Aviation Administration, Office of Safety...tighter vertical spacing controls, particularly for operations near Terminal Control Areas (TCAs), Airport Radar Service Areas (ARSAs), military climb and...E.F., Ruth, J.C., and Williges, B.H. (1987). Speech Controls and Displays. In Salvendy, G. (Ed.), Handbook of Human Factors/Ergonomics, New York, John

  11. Childhood apraxia of speech: A survey of praxis and typical speech characteristics.

    PubMed

    Malmenholt, Ann; Lohmander, Anette; McAllister, Anita

    2017-07-01

    The purpose of this study was to investigate current knowledge of the diagnosis childhood apraxia of speech (CAS) in Sweden and compare speech characteristics and symptoms to those of earlier survey findings in mainly English-speakers. In a web-based questionnaire 178 Swedish speech-language pathologists (SLPs) anonymously answered questions about their perception of typical speech characteristics for CAS. They graded own assessment skills and estimated clinical occurrence. The seven top speech characteristics reported as typical for children with CAS were: inconsistent speech production (85%), sequencing difficulties (71%), oro-motor deficits (63%), vowel errors (62%), voicing errors (61%), consonant cluster deletions (54%), and prosodic disturbance (53%). Motor-programming deficits described as lack of automatization of speech movements were perceived by 82%. All listed characteristics were consistent with the American Speech-Language-Hearing Association (ASHA) consensus-based features, Strand's 10-point checklist, and the diagnostic model proposed by Ozanne. The mode for clinical occurrence was 5%. Number of suspected cases of CAS in the clinical caseload was approximately one new patient/year and SLP. The results support and add to findings from studies of CAS in English-speaking children with similar speech characteristics regarded as typical. Possibly, these findings could contribute to cross-linguistic consensus on CAS characteristics.

  12. Segmental intelligibility of synthetic speech produced by rule.

    PubMed

    Logan, J S; Greene, B G; Pisoni, D B

    1989-08-01

    This paper reports the results of an investigation that employed the modified rhyme test (MRT) to measure the segmental intelligibility of synthetic speech generated automatically by rule. Synthetic speech produced by ten text-to-speech systems was studied and compared to natural speech. A variation of the standard MRT was also used to study the effects of response set size on perceptual confusions. Results indicated that the segmental intelligibility scores formed a continuum. Several systems displayed very high levels of performance that were close to or equal to scores obtained with natural speech; other systems displayed substantially worse performance compared to natural speech. The overall performance of the best system, DECtalk--Paul, was equivalent to the data obtained with natural speech for consonants in syllable-initial position. The findings from this study are discussed in terms of the use of a set of standardized procedures for measuring intelligibility of synthetic speech under controlled laboratory conditions. Recent work investigating the perception of synthetic speech under more severe conditions in which greater demands are made on the listener's processing resources is also considered. The wide range of intelligibility scores obtained in the present study demonstrates important differences in perception and suggests that not all synthetic speech is perceptually equivalent to the listener.

  13. Segmental intelligibility of synthetic speech produced by rule

    PubMed Central

    Logan, John S.; Greene, Beth G.; Pisoni, David B.

    2012-01-01

    This paper reports the results of an investigation that employed the modified rhyme test (MRT) to measure the segmental intelligibility of synthetic speech generated automatically by rule. Synthetic speech produced by ten text-to-speech systems was studied and compared to natural speech. A variation of the standard MRT was also used to study the effects of response set size on perceptual confusions. Results indicated that the segmental intelligibility scores formed a continuum. Several systems displayed very high levels of performance that were close to or equal to scores obtained with natural speech; other systems displayed substantially worse performance compared to natural speech. The overall performance of the best system, DECtalk—Paul, was equivalent to the data obtained with natural speech for consonants in syllable-initial position. The findings from this study are discussed in terms of the use of a set of standardized procedures for measuring intelligibility of synthetic speech under controlled laboratory conditions. Recent work investigating the perception of synthetic speech under more severe conditions in which greater demands are made on the listener’s processing resources is also considered. The wide range of intelligibility scores obtained in the present study demonstrates important differences in perception and suggests that not all synthetic speech is perceptually equivalent to the listener. PMID:2527884

  14. Speech rate in Parkinson's disease: A controlled study.

    PubMed

    Martínez-Sánchez, F; Meilán, J J G; Carro, J; Gómez Íñiguez, C; Millian-Morell, L; Pujante Valverde, I M; López-Alburquerque, T; López, D E

    2016-09-01

    Speech disturbances will affect most patients with Parkinson's disease (PD) over the course of the disease. The origin and severity of these symptoms are of clinical and diagnostic interest. To evaluate the clinical pattern of speech impairment in PD patients and identify significant differences in speech rate and articulation compared to control subjects. Speech rate and articulation in a reading task were measured using an automatic analytical method. A total of 39 PD patients in the 'on' state and 45 age- and sex-matched asymptomatic controls participated in the study. None of the patients experienced dyskinesias or motor fluctuations during the test. The patients with PD displayed a significant reduction in speech and articulation rates; there were no significant correlations between the studied speech parameters and patient characteristics such as L-dopa dose, duration of the disorder, age, UPDRS III scores, and Hoehn & Yahr scale. Patients with PD show a characteristic pattern of declining speech rate. These results suggest that in PD, disfluencies are the result of the movement disorder affecting the physiology of speech production systems. Copyright © 2014 Sociedad Española de Neurología. Published by Elsevier España, S.L.U. All rights reserved.
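
    For reference, the two rate measures named above are conventionally defined as below; whether the automatic analytical method in the study used exactly these definitions is an assumption.

        def speech_rate(n_syllables, total_duration_s):
            """Syllables per second over the whole reading, pauses included."""
            return n_syllables / total_duration_s

        def articulation_rate(n_syllables, total_duration_s, total_pause_s):
            """Syllables per second over phonation time only (pauses excluded)."""
            return n_syllables / (total_duration_s - total_pause_s)

        # e.g. 180 syllables in 60 s with 15 s of pauses: speech rate 3.0 /s, articulation rate 4.0 /s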

  15. Modular Neural Networks for Speech Recognition.

    DTIC Science & Technology

    1996-08-01

    automatic speech recognition, understanding and translation since the early 1950's. Although researchers have demonstrated impressive results with...nodes. It serves only as a data source for the following hidden layer(s). Finally, the network's output is computed by neurons in the output layer. The...following update rule for weights in the hidden layer: w_ji^(n+1) = w_ji^(n) - η ∂E/∂w_ji. It is easy to generalize the backpropagation

  16. Teacher Candidates' Mastery of Phoneme-Grapheme Correspondence: Massed versus Distributed Practice in Teacher Education

    ERIC Educational Resources Information Center

    Sayeski, Kristin L.; Earle, Gentry A.; Eslinger, R. Paige; Whitenton, Jessy N.

    2017-01-01

    Matching phonemes (speech sounds) to graphemes (letters and letter combinations) is an important aspect of decoding (translating print to speech) and encoding (translating speech to print). Yet, many teacher candidates do not receive explicit training in phoneme-grapheme correspondence. Difficulty with accurate phoneme production and/or lack of…

  17. The Effect of Feedback Schedule Manipulation on Speech Priming Patterns and Reaction Time

    ERIC Educational Resources Information Center

    Slocomb, Dana; Spencer, Kristie A.

    2009-01-01

    Speech priming tasks are frequently used to delineate stages in the speech process such as lexical retrieval and motor programming. These tasks, often measured in reaction time (RT), require fast and accurate responses, reflecting maximized participant performance, to result in robust priming effects. Encouraging speed and accuracy in responding…

  18. Relationship between listeners' nonnative speech recognition and categorization abilities

    PubMed Central

    Atagi, Eriko; Bent, Tessa

    2015-01-01

    Enhancement of the perceptual encoding of talker characteristics (indexical information) in speech can facilitate listeners' recognition of linguistic content. The present study explored this indexical-linguistic relationship in nonnative speech processing by examining listeners' performance on two tasks: nonnative accent categorization and nonnative speech-in-noise recognition. Results indicated substantial variability across listeners in their performance on both the accent categorization and nonnative speech recognition tasks. Moreover, listeners' accent categorization performance correlated with their nonnative speech-in-noise recognition performance. These results suggest that having more robust indexical representations for nonnative accents may allow listeners to more accurately recognize the linguistic content of nonnative speech. PMID:25618098

  19. Automated detection and characterization of harmonic tremor in continuous seismic data

    NASA Astrophysics Data System (ADS)

    Roman, Diana C.

    2017-06-01

    Harmonic tremor is a common feature of volcanic, hydrothermal, and ice sheet seismicity and is thus an important proxy for monitoring changes in these systems. However, no automated methods for detecting harmonic tremor currently exist. Because harmonic tremor shares characteristics with speech and music, digital signal processing techniques for analyzing these signals can be adapted. I develop a novel pitch-detection-based algorithm to automatically identify occurrences of harmonic tremor and characterize their frequency content. The algorithm is applied to seismic data from Popocatepetl Volcano, Mexico, and benchmarked against a monthlong manually detected catalog of harmonic tremor events. During a period of heightened eruptive activity from December 2014 to May 2015, the algorithm detects 1465 min of harmonic tremor, which generally precede periods of heightened explosive activity. These results demonstrate the algorithm's ability to accurately characterize harmonic tremor while highlighting the need for additional work to understand its causes and implications at restless volcanoes.
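
    A minimal sketch of a pitch-detection-style scan over a continuous seismic trace is given below, assuming an autocorrelation pitch estimator, one-minute windows, and an arbitrary harmonicity threshold; the published algorithm's actual detector and parameters may differ.

        import numpy as np

        def fundamental_frequency(window, fs, fmin=0.5, fmax=10.0, threshold=0.5):
            """Dominant 'pitch' (Hz) of one window via autocorrelation, or None if not harmonic."""
            x = window - window.mean()
            acf = np.correlate(x, x, mode="full")[len(x) - 1:]
            acf = acf / (acf[0] + 1e-12)
            lag_min, lag_max = int(fs / fmax), int(fs / fmin)
            lag = lag_min + int(np.argmax(acf[lag_min:lag_max]))
            return fs / lag if acf[lag] > threshold else None

        def detect_harmonic_tremor(trace, fs, win_s=60.0):
            """Return start times (s) of non-overlapping windows flagged as harmonic."""
            n = int(fs * win_s)
            return [i * win_s
                    for i in range(len(trace) // n)
                    if fundamental_frequency(trace[i * n:(i + 1) * n], fs) is not None]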

  20. [Speech syndrome and its course in patients who have sustained an ischemic stroke (clinico-tomographic study)].

    PubMed

    Stoliarova, L G; Varakin, Iu Ia; Vavilov, S B

    1981-01-01

    Clinical and tomographic examinations of 40 patients with aphasia that developed after an ischemic stroke were carried out. In more than half of them, no correlation was noted between the severity and character of the aphasia on the one hand and the size and localization of the ischemic focus (or foci) in the brain on the other. With a similar character and severity of the speech disorder, the size and localization of the ischemic foci may differ, and vice versa. It has been shown that the interrelations between the focal pathology of the brain and the character and severity of speech disorders are very complicated. One should take into consideration the possibility of individual organization of speech functions, the degree of automatism of speech activity before the disease, and the state of the cerebrovascular system as a whole.

  1. An experimental version of the MZT (speech-from-text) system with external F(sub 0) control

    NASA Astrophysics Data System (ADS)

    Nowak, Ignacy

    1994-12-01

    The version of the Polish speech-from-text system described in this article was developed on the basis of the existing MZT speech-from-text system. The new system has additional functions which make it possible to enter commands in the edited orthographic text to control the phrase component and accentuation parameters. This makes it possible to generate a series of modified intonation contours in the texts spoken by the system. The effects obtained are made easier to control by a graphic illustration of the fundamental frequency pattern in the phrases last 'spoken' by the system. This version of the system was designed as a test prototype which will help us expand and refine our set of rules for the automatic generation of intonation contours, which in turn will enable the fully automated speech-from-text system to generate speech with a more varied and precisely formed fundamental frequency pattern.

  2. Common cues to emotion in the dynamic facial expressions of speech and song

    PubMed Central

    Livingstone, Steven R.; Thompson, William F.; Wanderley, Marcelo M.; Palmer, Caroline

    2015-01-01

    Speech and song are universal forms of vocalization that may share aspects of emotional expression. Research has focused on parallels in acoustic features, overlooking facial cues to emotion. In three experiments, we compared moving facial expressions in speech and song. In Experiment 1, vocalists spoke and sang statements each with five emotions. Vocalists exhibited emotion-dependent movements of the eyebrows and lip corners that transcended speech–song differences. Vocalists’ jaw movements were coupled to their acoustic intensity, exhibiting differences across emotion and speech–song. Vocalists’ emotional movements extended beyond vocal sound to include large sustained expressions, suggesting a communicative function. In Experiment 2, viewers judged silent videos of vocalists’ facial expressions prior to, during, and following vocalization. Emotional intentions were identified accurately for movements during and after vocalization, suggesting that these movements support the acoustic message. Experiment 3 compared emotional identification in voice-only, face-only, and face-and-voice recordings. Emotion judgements for voice-only singing were poorly identified, yet were accurate for all other conditions, confirming that facial expressions conveyed emotion more accurately than the voice in song, yet were equivalent in speech. Collectively, these findings highlight broad commonalities in the facial cues to emotion in speech and song, yet highlight differences in perception and acoustic-motor production. PMID:25424388

  3. How much is a word? Predicting ease of articulation planning from apraxic speech error patterns.

    PubMed

    Ziegler, Wolfram; Aichert, Ingrid

    2015-08-01

    According to intuitive concepts, 'ease of articulation' is influenced by factors like word length or the presence of consonant clusters in an utterance. Imaging studies of speech motor control use these factors to systematically tax the speech motor system. Evidence from apraxia of speech, a disorder supposed to result from speech motor planning impairment after lesions to speech motor centers in the left hemisphere, supports the relevance of these and other factors in disordered speech planning and the genesis of apraxic speech errors. Yet, there is no unified account of the structural properties rendering a word easy or difficult to pronounce. To model the motor planning demands of word articulation by a nonlinear regression model trained to predict the likelihood of accurate word production in apraxia of speech. We used a tree-structure model in which vocal tract gestures are embedded in hierarchically nested prosodic domains to derive a recursive set of terms for the computation of the likelihood of accurate word production. The model was trained with accuracy data from a set of 136 words averaged over 66 samples from apraxic speakers. In a second step, the model coefficients were used to predict a test dataset of accuracy values for 96 new words, averaged over 120 samples produced by a different group of apraxic speakers. Accurate modeling of the first dataset was achieved in the training study (R(2)adj = .71). In the cross-validation, the test dataset was predicted with a high accuracy as well (R(2)adj = .67). The model shape, as reflected by the coefficient estimates, was consistent with current phonetic theories and with clinical evidence. In accordance with phonetic and psycholinguistic work, a strong influence of word stress on articulation errors was found. The proposed model provides a unified and transparent account of the motor planning requirements of word articulation. Copyright © 2015 Elsevier Ltd. All rights reserved.

  4. Text-to-audiovisual speech synthesizer for children with learning disabilities.

    PubMed

    Mendi, Engin; Bayrak, Coskun

    2013-01-01

    Learning disabilities affect the ability of children to learn, despite their having normal intelligence. Assistive tools can highly increase functional capabilities of children with learning disorders such as writing, reading, or listening. In this article, we describe a text-to-audiovisual synthesizer that can serve as an assistive tool for such children. The system automatically converts an input text to audiovisual speech, providing synchronization of the head, eye, and lip movements of the three-dimensional face model with appropriate facial expressions and word flow of the text. The proposed system can enhance speech perception and help children having learning deficits to improve their chances of success.

  5. Speaker emotion recognition: from classical classifiers to deep neural networks

    NASA Astrophysics Data System (ADS)

    Mezghani, Eya; Charfeddine, Maha; Nicolas, Henri; Ben Amar, Chokri

    2018-04-01

    Speaker emotion recognition is considered among the most challenging tasks in recent years. In fact, automatic systems for security, medicine or education can be improved when the affective state of the speech is taken into account. In this paper, a twofold approach for speech emotion classification is proposed: first, a relevant set of features is adopted; second, numerous supervised training techniques, involving classic methods as well as deep learning, are experimented with. Experimental results indicate that deep architectures can improve classification performance on two affective databases, the Berlin Dataset of Emotional Speech and the SAVEE (Surrey Audio-Visual Expressed Emotion) dataset.

  6. Speech input system for meat inspection and pathological coding used thereby

    NASA Astrophysics Data System (ADS)

    Abe, Shozo

    Meat inspection is one of the exclusive and important jobs of veterinarians, though it is not well known in general. As the inspection must be conducted skillfully during a series of continuous operations in a slaughterhouse, the development of automatic inspection systems has long been required. We employed a hands-free speech input system to record the inspection data, because inspectors have to use both hands to handle the internal organs of cattle and check their health condition with the naked eye. The data collected by the inspectors are transferred to a speech recognizer and then stored as controllable data for each animal inspected. Control of the terms to be input, such as pathological conditions, and their coding are also important in this speech input system, and practical examples are shown.

  7. Visual Benefits in Apparent Motion Displays: Automatically Driven Spatial and Temporal Anticipation Are Partially Dissociated

    PubMed Central

    Ahrens, Merle-Marie; Veniero, Domenica; Gross, Joachim; Harvey, Monika; Thut, Gregor

    2015-01-01

    Many behaviourally relevant sensory events such as motion stimuli and speech have an intrinsic spatio-temporal structure. This will engage intentional and most likely unintentional (automatic) prediction mechanisms enhancing the perception of upcoming stimuli in the event stream. Here we sought to probe the anticipatory processes that are automatically driven by rhythmic input streams in terms of their spatial and temporal components. To this end, we employed an apparent visual motion paradigm testing the effects of pre-target motion on lateralized visual target discrimination. The motion stimuli either moved towards or away from peripheral target positions (valid vs. invalid spatial motion cueing) at a rhythmic or arrhythmic pace (valid vs. invalid temporal motion cueing). Crucially, we emphasized automatic motion-induced anticipatory processes by rendering the motion stimuli non-predictive of upcoming target position (by design) and task-irrelevant (by instruction), and by creating instead endogenous (orthogonal) expectations using symbolic cueing. Our data revealed that the apparent motion cues automatically engaged both spatial and temporal anticipatory processes, but that these processes were dissociated. We further found evidence for lateralisation of anticipatory temporal but not spatial processes. This indicates that distinct mechanisms may drive automatic spatial and temporal extrapolation of upcoming events from rhythmic event streams. This contrasts with previous findings that instead suggest an interaction between spatial and temporal attention processes when endogenously driven. Our results further highlight the need for isolating intentional from unintentional processes for better understanding the various anticipatory mechanisms engaged in processing behaviourally relevant stimuli with predictable spatio-temporal structure such as motion and speech. PMID:26623650

  8. The Development of the Speaker Independent ARM Continuous Speech Recognition System

    DTIC Science & Technology

    1992-01-01

    spoken airborne reconnaissance reports using a speech recognition system based on phoneme-level hidden Markov models (HMMs). Previous versions of the ARM...will involve automatic selection from multiple model sets, corresponding to different speaker types, and that the most rudimentary partition of a...The vocabulary size for the ARM task is 497 words. These words are related to the phoneme-level symbols corresponding to the models in the model set

  9. Stochastic Modeling as a Means of Automatic Speech Recognition

    DTIC Science & Technology

    1975-04-01

    comparing the features of different speech recognition systems, attention is often focused on the control structures and the methods of communication...with no need to use secondary storage. Note that we go from a group of separate knowledge sources to an integrated network representation in...exhaust the available time or storage...On the other hand

  10. An Acquired Deficit of Audiovisual Speech Processing

    ERIC Educational Resources Information Center

    Hamilton, Roy H.; Shenton, Jeffrey T.; Coslett, H. Branch

    2006-01-01

    We report a 53-year-old patient (AWF) who has an acquired deficit of audiovisual speech integration, characterized by a perceived temporal mismatch between speech sounds and the sight of moving lips. AWF was less accurate on an auditory digit span task with vision of a speaker's face as compared to a condition in which no visual information from…

  11. Human phoneme recognition depending on speech-intrinsic variability.

    PubMed

    Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger

    2010-11-01

    The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).
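
    The 'spectral level distance' predictor mentioned above could plausibly be realized as a band-wise distance between a segment's spectrum and the noise's long-term spectrum; the RMS form below is only an assumed, illustrative definition, not necessarily the one used in the study.

        import numpy as np

        def spectral_level_distance(segment_levels_db, noise_longterm_levels_db):
            """RMS difference between band levels (dB) of a speech segment and the noise's long-term spectrum."""
            diff = np.asarray(segment_levels_db) - np.asarray(noise_longterm_levels_db)
            return float(np.sqrt(np.mean(diff ** 2)))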

  12. A voice-input voice-output communication aid for people with severe speech impairment.

    PubMed

    Hawley, Mark S; Cunningham, Stuart P; Green, Phil D; Enderby, Pam; Palmer, Rebecca; Sehgal, Siddharth; O'Neill, Peter

    2013-01-01

    A new form of augmentative and alternative communication (AAC) device for people with severe speech impairment-the voice-input voice-output communication aid (VIVOCA)-is described. The VIVOCA recognizes the disordered speech of the user and builds messages, which are converted into synthetic speech. System development was carried out employing user-centered design and development methods, which identified and refined key requirements for the device. A novel methodology for building small-vocabulary, speaker-dependent automatic speech recognizers with reduced amounts of training data was applied. Experiments showed that this method is successful in generating good recognition performance (mean accuracy 96%) on highly disordered speech, even when recognition perplexity is increased. The selected message-building technique traded off various factors including speed of message construction and range of available message outputs. The VIVOCA was evaluated in a field trial by individuals with moderate to severe dysarthria and confirmed that they can make use of the device to produce intelligible speech output from disordered speech input. The trial highlighted some issues which limit the performance and usability of the device when applied in real usage situations, with mean recognition accuracy of 67% in these circumstances. These limitations will be addressed in future work.

  13. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar

    PubMed Central

    Shin, Young Hoon; Seo, Jiwon

    2016-01-01

    People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing. PMID:27801867

  14. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar.

    PubMed

    Shin, Young Hoon; Seo, Jiwon

    2016-10-29

    People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.

  15. Speech endpoint detection with non-language speech sounds for generic speech processing applications

    NASA Astrophysics Data System (ADS)

    McClain, Matthew; Romanowski, Brian

    2009-05-01

    Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known a priori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden-Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detecting certain types of NLSS, such as breaths and clicks, while detection of other types of NLSS such as filled pauses will require future research.
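
    The classification step (acoustic features scored against competing models) can be sketched as a two-model log-likelihood test. The snippet below assumes the hmmlearn package, Gaussian HMMs, and generic frame-level features; the paper's purpose-built NLSS feature set is not reproduced here.

        from hmmlearn.hmm import GaussianHMM

        lss_hmm = GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
        nlss_hmm = GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
        # lss_hmm.fit(lss_features, lss_lengths); nlss_hmm.fit(nlss_features, nlss_lengths)

        def classify_segment(features):
            """Label a feature sequence (n_frames x n_dims) as 'LSS' or 'NLSS' by log-likelihood."""
            return "LSS" if lss_hmm.score(features) >= nlss_hmm.score(features) else "NLSS"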

  16. Institute for the Study of Human Capabilities Summary Descriptions of Research for the Period September 1988 through June 1989

    DTIC Science & Technology

    1989-06-01

    1.7 Application of the Modified Speech Transmission Index to Monaural and Binaural Speech Recognition in Normal and Impaired...describe all of the data from both groups...were obtained for materials presented to each ear separately (monaurally) and to both ears (binaurally). Results from the normal listeners are accurately

  17. Analysis and synthesis of intonation using the Tilt model.

    PubMed

    Taylor, P

    2000-03-01

    This paper introduces the Tilt intonational model and describes how this model can be used to automatically analyze and synthesize intonation. In the model, intonation is represented as a linear sequence of events, which can be pitch accents or boundary tones. Each event is characterized by continuous parameters representing amplitude, duration, and tilt (a measure of the shape of the event). The paper describes an event detector, in effect an intonational recognition system, which produces a transcription of an utterance's intonation. The features and parameters of the event detector are discussed and performance figures are shown on a variety of read and spontaneous speaker independent conversational speech databases. Given the event locations, algorithms are described which produce an automatic analysis of each event in terms of the Tilt parameters. Synthesis algorithms are also presented which generate F0 contours from Tilt representations. The accuracy of these is shown by comparing synthetic F0 contours to real F0 contours. The paper concludes with an extensive discussion on linguistic representations of intonation and gives evidence that the Tilt model goes a long way to satisfying the desired goals of such a representation in that it has the right number of degrees of freedom to be able to describe and synthesize intonation accurately.
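
    The continuous event parameters described above can be written out directly. The small function below paraphrases the usual Tilt definitions (amplitude, duration, and a combined rise/fall shape measure); signs and normalization should be checked against the original paper before serious use.

        def tilt_parameters(a_rise, a_fall, d_rise, d_fall):
            """Tilt-style description of one intonational event from its rise and fall parts."""
            amplitude = abs(a_rise) + abs(a_fall)        # total F0 excursion of the event
            duration = d_rise + d_fall                   # total event duration
            tilt_amp = (abs(a_rise) - abs(a_fall)) / (abs(a_rise) + abs(a_fall) + 1e-12)
            tilt_dur = (d_rise - d_fall) / (d_rise + d_fall + 1e-12)
            tilt = 0.5 * (tilt_amp + tilt_dur)           # about +1 for a pure rise, -1 for a pure fall
            return amplitude, duration, tilt

        # Example: a mostly rising pitch accent
        # print(tilt_parameters(a_rise=40.0, a_fall=10.0, d_rise=0.20, d_fall=0.08))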

  18. A bootstrapping method for development of Treebank

    NASA Astrophysics Data System (ADS)

    Zarei, F.; Basirat, A.; Faili, H.; Mirain, M.

    2017-01-01

    Using statistical approaches alongside the traditional methods of natural language processing can significantly improve both the quality and performance of several natural language processing (NLP) tasks. The effective use of these approaches depends on the availability of informative, accurate and detailed corpora on which the learners are trained. This article introduces a bootstrapping method for developing annotated corpora based on a complex, rich, linguistically motivated elementary structure called a supertag. To this end, a hybrid method for supertagging is proposed that combines generative and discriminative supertagging methods. The method was applied to a subset of the Wall Street Journal (WSJ) corpus in order to annotate its sentences with a set of linguistically motivated elementary structures of the English XTAG grammar, which uses a lexicalised tree-adjoining grammar formalism. The empirical results confirm that the bootstrapping method provides a satisfactory way of annotating English sentences with these structures. The experiments show that the method could automatically annotate about 20% of the WSJ with an F-measure of about 80%, which is 12% higher than the F-measure of the XTAG Treebank automatically generated by the approach proposed by Basirat and Faili [(2013). Bridge the gap between statistical and hand-crafted grammars. Computer Speech and Language, 27, 1085-1104].

  19. Missed Opportunity? Was Iran's Green Movement an Unconventional Warfare Option?

    DTIC Science & Technology

    2014-12-12

    leadership. Speeches, reports, websites, and foreign documents constituted the majority of usable research. The author assumed accurate translation of...expanding economic influence. The RAND Corporation's study compiled research from the OpenSource website, scholarly reports, and translated speeches...constructed from Mir Houssein Mousavi's speeches. Although difficult to accredit, the manifesto echoed Green Movement leadership ideologies. This work

  20. Robust Recognition of Loud and Lombard speech in the Fighter Cockpit Environment

    DTIC Science & Technology

    1988-08-01

    the latter as inter-speaker variability. According to Zue [Z85], inter-speaker variabilities can be attributed to sociolinguistic background, dialect...34 Journal of the Acoustical Society of America, Vol 50, 1971. [At74] B. S. Atal, "Linear prediction for speaker identification," Journal of the Acoustical...Society of America, Vol 55, 1974. [B77] B. Beek, E. P. Neuberg, and D. C. Hodge, "An Assessment of the Technology of Automatic Speech Recognition for

  1. A technology prototype system for rating therapist empathy from audio recordings in addiction counseling.

    PubMed

    Xiao, Bo; Huang, Chewei; Imel, Zac E; Atkins, David C; Georgiou, Panayiotis; Narayanan, Shrikanth S

    2016-04-01

    Scaling up psychotherapy services such as for addiction counseling is a critical societal need. One challenge is ensuring quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy-a key therapy quality index-from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimated therapy-session level empathy codes using utterance level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality assurance and therapist training.
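
    The utterance-to-session scoring idea can be sketched with a maximum-entropy (logistic-regression) text classifier whose per-utterance probabilities are pooled into a session-level estimate. The bag-of-n-grams features and mean-probability pooling below are assumptions for illustration, not the authors' exact models.

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                              LogisticRegression(max_iter=1000))
        # model.fit(train_utterances, train_labels)   # labels: 1 = high empathy, 0 = low empathy

        def session_empathy_score(session_utterances):
            """Mean per-utterance probability of the 'high empathy' class for one session."""
            return float(np.mean(model.predict_proba(session_utterances)[:, 1]))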

  2. A technology prototype system for rating therapist empathy from audio recordings in addiction counseling

    PubMed Central

    Xiao, Bo; Huang, Chewei; Imel, Zac E.; Atkins, David C.; Georgiou, Panayiotis; Narayanan, Shrikanth S.

    2016-01-01

    Scaling up psychotherapy services such as for addiction counseling is a critical societal need. One challenge is ensuring quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy—a key therapy quality index—from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimated therapy-session level empathy codes using utterance level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality assurance and therapist training. PMID:28286867

  3. A hypothetical neurological association between dehumanization and human rights abuses.

    PubMed

    Murrow, Gail B; Murrow, Richard

    2015-07-01

    Dehumanization is anecdotally and historically associated with reduced empathy for the pain of dehumanized individuals and groups and with psychological and legal denial of their human rights and extreme violence against them. We hypothesize that 'empathy' for the pain and suffering of dehumanized social groups is automatically reduced because, as the research we review suggests, an individual's neural mechanisms of pain empathy best respond to (or produce empathy for) the pain of people whom the individual automatically or implicitly associates with her or his own species. This theory has implications for the philosophical conception of 'human' and of 'legal personhood' in human rights jurisprudence. It further has implications for First Amendment free speech jurisprudence, including the doctrine of 'corporate personhood' and consideration of the potential harm caused by dehumanizing hate speech. We suggest that the new, social neuroscience of empathy provides evidence that both the vagaries of the legal definition or legal fiction of 'personhood' and hate speech that explicitly and implicitly dehumanizes may (in their respective capacities to artificially humanize or dehumanize) manipulate the neural mechanisms of pain empathy in ways that could pose more of a true threat to human rights and rights-based democracy than previously appreciated.

  4. Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies.

    PubMed

    Mitra, Vikramjit; Nam, Hosung; Espy-Wilson, Carol Y; Saltzman, Elliot; Goldstein, Louis

    2010-09-13

    Many different studies have claimed that articulatory information can be used to improve the performance of automatic speech recognition systems. Unfortunately, such articulatory information is not readily available in typical speaker-listener situations. Consequently, such information has to be estimated from the acoustic signal in a process which is usually termed "speech-inversion." This study aims to propose and compare various machine learning strategies for speech inversion: Trajectory mixture density networks (TMDNs), feedforward artificial neural networks (FF-ANN), support vector regression (SVR), autoregressive artificial neural network (AR-ANN), and distal supervised learning (DSL). Further, using a database generated by the Haskins Laboratories speech production model, we test the claim that information regarding constrictions produced by the distinct organs of the vocal tract (vocal tract variables) is superior to flesh-point information (articulatory pellet trajectories) for the inversion process.
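
    One of the compared strategies, support vector regression, can be sketched as a frame-wise mapping from acoustic features to tract variables. The multi-output wrapper and kernel settings below are illustrative assumptions, not the study's configuration.

        from sklearn.multioutput import MultiOutputRegressor
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVR

        # X: (n_frames, n_acoustic_dims) acoustic features
        # Y: (n_frames, n_tract_variables) vocal tract variables
        inverter = make_pipeline(
            StandardScaler(),
            MultiOutputRegressor(SVR(kernel="rbf", C=1.0, epsilon=0.01)),
        )
        # inverter.fit(X_train, Y_train); Y_pred = inverter.predict(X_test)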

  5. Exploring expressivity and emotion with artificial voice and speech technologies.

    PubMed

    Pauletto, Sandra; Balentine, Bruce; Pidcock, Chris; Jones, Kevin; Bottaci, Leonardo; Aretoulaki, Maria; Wells, Jez; Mundy, Darren P; Balentine, James

    2013-10-01

    Emotion in audio-voice signals, as synthesized by text-to-speech (TTS) technologies, was investigated to formulate a theory of expression for user interface design. Emotional parameters were specified with markup tags, and the resulting audio was further modulated with post-processing techniques. Software was then developed to link a selected TTS synthesizer with an automatic speech recognition (ASR) engine, producing a chatbot that could speak and listen. Using these two artificial voice subsystems, investigators explored both artistic and psychological implications of artificial speech emotion. Goals of the investigation were interdisciplinary, with interest in musical composition, augmentative and alternative communication (AAC), commercial voice announcement applications, human-computer interaction (HCI), and artificial intelligence (AI). The work-in-progress points towards an emerging interdisciplinary ontology for artificial voices. As one study output, HCI tools are proposed for future collaboration.

  6. The benefit of combining a deep neural network architecture with ideal ratio mask estimation in computational speech segregation to improve speech intelligibility.

    PubMed

    Bentsen, Thomas; May, Tobias; Kressner, Abigail A; Dau, Torsten

    2018-01-01

    Computational speech segregation attempts to automatically separate speech from noise. This is challenging in conditions with interfering talkers and low signal-to-noise ratios. Recent approaches have adopted deep neural networks and successfully demonstrated speech intelligibility improvements. A selection of components may be responsible for the success with these state-of-the-art approaches: the system architecture, a time frame concatenation technique and the learning objective. The aim of this study was to explore the roles and the relative contributions of these components by measuring speech intelligibility in normal-hearing listeners. A substantial improvement of 25.4 percentage points in speech intelligibility scores was found going from a subband-based architecture, in which a Gaussian Mixture Model-based classifier predicts the distributions of speech and noise for each frequency channel, to a state-of-the-art deep neural network-based architecture. Another improvement of 13.9 percentage points was obtained by changing the learning objective from the ideal binary mask, in which individual time-frequency units are labeled as either speech- or noise-dominated, to the ideal ratio mask, where the units are assigned a continuous value between zero and one. Therefore, both components play significant roles and by combining them, speech intelligibility improvements were obtained in a six-talker condition at a low signal-to-noise ratio.
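
    For concreteness, the two learning objectives contrasted above can be written per time-frequency unit as follows; the local criterion of 0 dB and the square-root form of the ratio mask are common conventions assumed here.

        import numpy as np

        def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
            """1 where the local SNR exceeds the local criterion, else 0."""
            local_snr_db = 10.0 * np.log10((speech_power + 1e-12) / (noise_power + 1e-12))
            return (local_snr_db > lc_db).astype(float)

        def ideal_ratio_mask(speech_power, noise_power):
            """Continuous value in [0, 1] per time-frequency unit."""
            return np.sqrt(speech_power / (speech_power + noise_power + 1e-12))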

  7. Speech intelligibility enhancement after maxillary denture treatment and its impact on quality of life.

    PubMed

    Knipfer, Christian; Riemann, Max; Bocklet, Tobias; Noeth, Elmar; Schuster, Maria; Sokol, Biljana; Eitner, Stephan; Nkenke, Emeka; Stelzle, Florian

    2014-01-01

    Tooth loss and its prosthetic rehabilitation significantly affect speech intelligibility. However, little is known about the influence of speech deficiencies on oral health-related quality of life (OHRQoL). The aim of this study was to investigate whether speech intelligibility enhancement through prosthetic rehabilitation significantly influences OHRQoL in patients wearing complete maxillary dentures. Speech intelligibility by means of an automatic speech recognition system (ASR) was prospectively evaluated and compared with subjectively assessed Oral Health Impact Profile (OHIP) scores. Speech was recorded in 28 edentulous patients 1 week prior to the fabrication of new complete maxillary dentures and 6 months thereafter. Speech intelligibility was computed based on the word accuracy (WA) by means of an ASR and compared with a matched control group. One week before and 6 months after rehabilitation, patients assessed themselves for OHRQoL. Speech intelligibility improved significantly after 6 months. Subjects reported a significantly higher OHRQoL after maxillary rehabilitation with complete dentures. No significant correlation was found between the OHIP sum score or its subscales and the WA. Speech intelligibility enhancement achieved through the fabrication of new complete maxillary dentures might not be at the forefront of the patients' perception of their quality of life. For the improvement of OHRQoL in patients wearing complete maxillary dentures, food intake and mastication as well as freedom from pain play a more prominent role.
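
    Word accuracy (WA) is conventionally computed from an edit-distance alignment of the recognizer output against a reference transcript, as sketched below; whether the study used exactly this variant is an assumption.

        def word_accuracy(n_ref_words, substitutions, deletions, insertions):
            """WA = (N - S - D - I) / N for N reference words."""
            return (n_ref_words - substitutions - deletions - insertions) / n_ref_words

        # e.g. 100 reference words with 8 substitutions, 4 deletions, 3 insertions -> WA = 0.85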

  8. Speech recognition-based and automaticity programs to help students with severe reading and spelling problems.

    PubMed

    Higgins, Eleanor L; Raskind, Marshall H

    2004-12-01

    This study was conducted to assess the effectiveness of two programs developed by the Frostig Center Research Department to improve the reading and spelling of students with learning disabilities (LD): a computer Speech Recognition-based Program (SRBP) and a computer and text-based Automaticity Program (AP). Twenty-eight LD students with reading and spelling difficulties (aged 8 to 18) received each program for 17 weeks and were compared with 16 students in a contrast group who did not receive either program. After adjusting for age and IQ, both the SRBP and AP groups showed significant differences over the contrast group in improving word recognition and reading comprehension. Neither program showed significant differences over contrasts in spelling. The SRBP also improved the performance of the target group when compared with the contrast group on phonological elision and nonword reading efficiency tasks. The AP showed significant differences in all process and reading efficiency measures.

  9. Do perceived context pictures automatically activate their phonological code?

    PubMed

    Jescheniak, Jörg D; Oppermann, Frank; Hantsch, Ansgar; Wagner, Valentin; Mädebach, Andreas; Schriefers, Herbert

    2009-01-01

    Morsella and Miozzo (Morsella, E., & Miozzo, M. (2002). Evidence for a cascade model of lexical access in speech production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 555-563) have reported that the to-be-ignored context pictures become phonologically activated when participants name a target picture, and took this finding as support for cascaded models of lexical retrieval in speech production. In a replication and extension of their experiment in German, we failed to obtain priming effects from context pictures phonologically related to a to-be-named target picture. By contrast, corresponding context words (i.e., the names of the respective pictures) and the same context pictures, when used in an identity condition, did reliably facilitate the naming process. This pattern calls into question the generality of the claim advanced by Morsella and Miozzo that perceptual processing of pictures in the context of a naming task automatically leads to the activation of corresponding lexical-phonological codes.

  10. LANDMARK-BASED SPEECH RECOGNITION: REPORT OF THE 2004 JOHNS HOPKINS SUMMER WORKSHOP.

    PubMed

    Hasegawa-Johnson, Mark; Baker, James; Borys, Sarah; Chen, Ken; Coogan, Emily; Greenberg, Steven; Juneja, Amit; Kirchhoff, Katrin; Livescu, Karen; Mohan, Srividya; Muller, Jennifer; Sonmez, Kemal; Wang, Tianyu

    2005-01-01

    Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.
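
    The rescoring step described above (combining log-probability scores with learned weights) reduces to a log-linear sum per lattice hypothesis; the weight and score names below are placeholders rather than the workshop systems' actual components.

        def loglinear_score(log_scores, weights):
            """Combine per-model log-probability scores for one lattice hypothesis."""
            return sum(weights[name] * log_scores[name] for name in weights)

        # Example with placeholder model names and weights:
        # weights = {"acoustic": 1.0, "landmark": 0.6, "pronunciation": 0.8, "lm": 12.0}
        # best = max(hypotheses, key=lambda h: loglinear_score(h.log_scores, weights))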

  11. Effects of emotion on different phoneme classes

    NASA Astrophysics Data System (ADS)

    Lee, Chul Min; Yildirim, Serdar; Bulut, Murtaza; Busso, Carlos; Kazemzadeh, Abe; Lee, Sungbok; Narayanan, Shrikanth

    2004-10-01

    This study investigates the effects of emotion on different phoneme classes using short-term spectral features. In the research on emotion in speech, most studies have focused on prosodic features of speech. In this study, based on the hypothesis that different emotions have varying effects on the properties of the different speech sounds, we investigate the usefulness of phoneme-class level acoustic modeling for automatic emotion classification. Hidden Markov models (HMM) based on short-term spectral features for five broad phonetic classes are used for this purpose using data obtained from recordings of two actresses. Each speaker produces 211 sentences with four different emotions (neutral, sad, angry, happy). Using the speech material we trained and compared the performances of two sets of HMM classifiers: a generic set of ``emotional speech'' HMMs (one for each emotion) and a set of broad phonetic-class based HMMs (vowel, glide, nasal, stop, fricative) for each emotion type considered. Comparison of classification results indicates that different phoneme classes were affected differently by emotional change and that the vowel sounds are the most important indicator of emotions in speech. Detailed results and their implications on the underlying speech articulation will be discussed.
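
    A minimal sketch of this kind of phonetic-class-level emotion modeling, assuming the hmmlearn and librosa packages and precomputed per-class MFCC segments (all names are illustrative, not the authors' implementation):

      import numpy as np
      import librosa
      from hmmlearn.hmm import GaussianHMM

      def mfcc_frames(wav_path, sr=16000, n_mfcc=13):
          # Frame-level MFCC features, shape (frames, n_mfcc)
          y, sr = librosa.load(wav_path, sr=sr)
          return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

      def train_emotion_hmms(segments_by_emotion, n_states=3):
          # segments_by_emotion: {'angry': [mfcc_matrix, ...], ...} for ONE
          # broad phonetic class (e.g. vowels); one HMM is trained per emotion.
          models = {}
          for emotion, segments in segments_by_emotion.items():
              X = np.vstack(segments)
              lengths = [len(s) for s in segments]
              hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
              hmm.fit(X, lengths)
              models[emotion] = hmm
          return models

      def classify_segment(models, mfcc_matrix):
          # Pick the emotion whose HMM assigns the highest log-likelihood
          return max(models, key=lambda e: models[e].score(mfcc_matrix))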

  12. Study of acoustic correlates associated with emotional speech

    NASA Astrophysics Data System (ADS)

    Yildirim, Serdar; Lee, Sungbok; Lee, Chul Min; Bulut, Murtaza; Busso, Carlos; Kazemzadeh, Ebrahim; Narayanan, Shrikanth

    2004-10-01

    This study investigates the acoustic characteristics of four different emotions expressed in speech. The aim is to obtain detailed acoustic knowledge on how a speech signal is modulated by changes from neutral to a certain emotional state. Such knowledge is necessary for automatic emotion recognition and classification and emotional speech synthesis. Speech data obtained from two semi-professional actresses are analyzed and compared. Each subject produces 211 sentences with four different emotions: neutral, sad, angry, and happy. We analyze changes in temporal and acoustic parameters such as magnitude and variability of segmental duration, fundamental frequency and the first three formant frequencies as a function of emotion. Acoustic differences among the emotions are also explored with mutual information computation, multidimensional scaling and acoustic likelihood comparison with normal speech. Results indicate that speech associated with anger and happiness is characterized by longer duration, shorter interword silence, and higher pitch and rms energy with wider ranges. Sadness is distinguished from other emotions by lower rms energy and longer interword silence. Interestingly, the differences in formant pattern between [happiness/anger] and [neutral/sadness] are better reflected in back vowels such as /a/ (as in "father") than in front vowels. Detailed results on intra- and interspeaker variability will be reported.
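
    For illustration, a small sketch of how such per-utterance correlates (duration, F0 statistics, RMS energy) can be measured with librosa; parameter values and the F0 search range are assumptions, not those of the study.

      import numpy as np
      import librosa

      def acoustic_correlates(wav_path, sr=16000):
          y, sr = librosa.load(wav_path, sr=sr)
          f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=75, fmax=500, sr=sr)
          f0 = f0[~np.isnan(f0)]                     # keep voiced frames only
          rms = librosa.feature.rms(y=y)[0]
          return {
              "duration_s": len(y) / sr,
              "f0_mean_hz": float(np.mean(f0)) if f0.size else None,
              "f0_range_hz": float(np.ptp(f0)) if f0.size else None,
              "rms_mean": float(np.mean(rms)),
              "rms_range": float(np.ptp(rms)),
          }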

  13. Toward dynamic magnetic resonance imaging of the vocal tract during speech production.

    PubMed

    Ventura, Sandra M Rua; Freitas, Diamantino Rui S; Tavares, João Manuel R S

    2011-07-01

    The most recent and significant magnetic resonance imaging (MRI) improvements allow for the visualization of the vocal tract during speech production, which has been revealed to be a powerful tool in dynamic speech research. However, a synchronization technique with enhanced temporal resolution is still required. The study design was transversal in nature. Throughout this work, a technique for the dynamic study of the vocal tract with MRI by using the heart's signal to synchronize and trigger the imaging-acquisition process is presented and described. The technique in question is then used in the measurement of four speech articulatory parameters to assess three different syllables (articulatory gestures) of European Portuguese Language. The acquired MR images are automatically reconstructed so as to result in a variable sequence of images (slices) of different vocal tract shapes in articulatory positions associated with Portuguese speech sounds. The knowledge obtained as a result of the proposed technique represents a direct contribution to the improvement of speech synthesis algorithms, thereby allowing for novel perceptions in coarticulation studies, in addition to providing further efficient clinical guidelines in the pursuit of more proficient speech rehabilitation processes. Copyright © 2011 The Voice Foundation. Published by Mosby, Inc. All rights reserved.

  14. Analysis of human scream and its impact on text-independent speaker verification.

    PubMed

    Hansen, John H L; Nandwana, Mahesh Kumar; Shokouhi, Navid

    2017-04-01

    Screams are defined as sustained, high-energy vocalizations that lack phonological structure; this lack of phonological structure is what distinguishes screams from other forms of loud vocalization, such as yells. This study investigates the acoustic aspects of screams and addresses those that are known to prevent standard speaker identification systems from recognizing the identity of screaming speakers. It is well established that speaker variability due to changes in vocal effort and the Lombard effect contributes to degraded performance in automatic speech systems (i.e., speech recognition, speaker identification, diarization, etc.). However, previous research in the general area of speaker variability has concentrated on human speech production, whereas less is known about non-speech vocalizations. The UT-NonSpeech corpus is developed here to investigate speaker verification from scream samples. This study considers a detailed analysis in terms of fundamental frequency, spectral peak shift, frame energy distribution, and spectral tilt. It is shown that traditional speaker recognition based on the Gaussian mixture model-universal background model (GMM-UBM) framework is unreliable when evaluated on screams.
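
    A simplified sketch of the GMM-UBM style of verification referred to above, using scikit-learn; it trains the target model directly on enrollment features rather than MAP-adapting the UBM as the standard framework would, and feature extraction is assumed to have happened elsewhere.

      import numpy as np
      from sklearn.mixture import GaussianMixture

      def train_gmm(features, n_components=64):
          # features: (frames, dims) array of acoustic features
          gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
          gmm.fit(features)
          return gmm

      def verification_score(speaker_gmm, ubm, test_features):
          # Average per-frame log-likelihood ratio; higher favours the claimed speaker
          return float(np.mean(speaker_gmm.score_samples(test_features)
                               - ubm.score_samples(test_features)))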

  15. Phonologically-based biomarkers for major depressive disorder

    NASA Astrophysics Data System (ADS)

    Trevino, Andrea Carolina; Quatieri, Thomas Francis; Malyska, Nicolas

    2011-12-01

    Of increasing importance in the civilian and military population is the recognition of major depressive disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we introduce vocal biomarkers that are derived automatically from phonologically-based measures of speech rate. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a 6-week duration. We find that dissecting average measures of speech rate into phone-specific characteristics and, in particular, combined phone-duration measures uncovers stronger relationships between speech rate and depression severity than global measures previously reported for a speech-rate biomarker. Results of this study are supported by correlation of our measures with depression severity and classification of depression state with these vocal measures. Our approach provides a general framework for analyzing individual symptom categories through phonological units, and supports the premise that speaking rate can be an indicator of psychomotor retardation severity.
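
    As a sketch of the kind of phone-specific speech-rate measures described here, the function below derives global and per-phone duration statistics from a phone-level alignment; the alignment format is an assumption for illustration only.

      from collections import defaultdict

      def phone_rate_measures(alignment):
          # alignment: list of (phone_label, start_seconds, end_seconds) tuples
          durations = defaultdict(list)
          total_speech = 0.0
          for phone, start, end in alignment:
              dur = end - start
              durations[phone].append(dur)
              total_speech += dur
          n_phones = sum(len(v) for v in durations.values())
          return {
              "phones_per_second": n_phones / total_speech if total_speech else 0.0,
              "mean_phone_duration": {p: sum(v) / len(v) for p, v in durations.items()},
          }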

  16. Automatic measurement and representation of prosodic features

    NASA Astrophysics Data System (ADS)

    Ying, Goangshiuan Shawn

    Effective measurement and representation of prosodic features of the acoustic signal for use in automatic speech recognition and understanding systems is the goal of this work. Prosodic features-stress, duration, and intonation-are variations of the acoustic signal whose domains are beyond the boundaries of each individual phonetic segment. Listeners perceive prosodic features through a complex combination of acoustic correlates such as intensity, duration, and fundamental frequency (F0). We have developed new tools to measure F0 and intensity features. We apply a probabilistic global error correction routine to an Average Magnitude Difference Function (AMDF) pitch detector. A new short-term frequency-domain Teager energy algorithm is used to measure the energy of a speech signal. We have conducted a series of experiments performing lexical stress detection on words in continuous English speech from two speech corpora. We have experimented with two different approaches, a segment-based approach and a rhythm unit-based approach, in lexical stress detection. The first approach uses pattern recognition with energy- and duration-based measurements as features to build Bayesian classifiers to detect the stress level of a vowel segment. In the second approach we define rhythm unit and use only the F0-based measurement and a scoring system to determine the stressed segment in the rhythm unit. A duration-based segmentation routine was developed to break polysyllabic words into rhythm units. The long-term goal of this work is to develop a system that can effectively detect the stress pattern for each word in continuous speech utterances. Stress information will be integrated as a constraint for pruning the word hypotheses in a word recognition system based on hidden Markov models.
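
    A minimal single-frame AMDF pitch estimator in the spirit of the measurement tools described above (without the probabilistic global error correction), followed by an illustrative synthetic test:

      import numpy as np

      def amdf_pitch(frame, sr, fmin=60.0, fmax=400.0):
          # Average Magnitude Difference Function over candidate lags;
          # the deepest valley corresponds to the pitch period.
          frame = frame - np.mean(frame)
          lag_min, lag_max = int(sr / fmax), int(sr / fmin)
          amdf = np.array([np.mean(np.abs(frame[lag:] - frame[:-lag]))
                           for lag in range(lag_min, lag_max + 1)])
          best_lag = lag_min + int(np.argmin(amdf))
          return sr / best_lag

      sr = 16000
      t = np.arange(int(0.04 * sr)) / sr
      frame = np.sin(2 * np.pi * 120.0 * t)          # synthetic 120 Hz tone
      print(round(amdf_pitch(frame, sr), 1))         # close to 120 Hz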

  17. Severity-Based Adaptation with Limited Data for ASR to Aid Dysarthric Speakers

    PubMed Central

    Mustafa, Mumtaz Begum; Salim, Siti Salwah; Mohamed, Noraini; Al-Qatab, Bassam; Siong, Chng Eng

    2014-01-01

    Automatic speech recognition (ASR) is currently used in many assistive technologies, such as helping individuals with speech impairment in their communication ability. One challenge in ASR for speech-impaired individuals is the difficulty in obtaining a good speech database of impaired speakers for building an effective speech acoustic model. Because there are very few existing databases of impaired speech, which are also limited in size, the obvious solution to build a speech acoustic model of impaired speech is by employing adaptation techniques. However, issues that have not been addressed in existing studies in the area of adaptation for speech impairment are as follows: (1) identifying the most effective adaptation technique for impaired speech; and (2) the use of suitable source models to build an effective impaired-speech acoustic model. This research investigates the above-mentioned two issues on dysarthria, a type of speech impairment affecting millions of people. We applied both unimpaired and impaired speech as the source model with well-known adaptation techniques like maximum likelihood linear regression (MLLR) and constrained MLLR (C-MLLR). The recognition accuracy of each impaired speech acoustic model is measured in terms of word error rate (WER), with further assessments, including phoneme insertion, substitution and deletion rates. Unimpaired speech, when combined with limited high-quality speech-impaired data, improves the performance of ASR systems in recognising severely impaired dysarthric speech. The C-MLLR adaptation technique was also found to be better than MLLR in recognising mildly and moderately impaired speech based on the statistical analysis of the WER. It was found that phoneme substitution was the biggest contributing factor in WER in dysarthric speech for all levels of severity. The results show that the speech acoustic models derived from suitable adaptation techniques improve the performance of ASR systems in recognising impaired speech with limited adaptation data. PMID:24466004
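
    Word error rate, the headline metric in this record, is an edit distance between reference and hypothesis word sequences normalized by the reference length; a compact reference implementation:

      def wer(reference, hypothesis):
          ref, hyp = reference.split(), hypothesis.split()
          # Levenshtein distance over words (substitutions, deletions, insertions)
          d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
          for i in range(len(ref) + 1):
              d[i][0] = i
          for j in range(len(hyp) + 1):
              d[0][j] = j
          for i in range(1, len(ref) + 1):
              for j in range(1, len(hyp) + 1):
                  sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                  d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
          return d[-1][-1] / max(len(ref), 1)

      print(wer("open the door please", "open door please now"))   # 0.5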

  18. Effects of low harmonics on tone identification in natural and vocoded speech.

    PubMed

    Liu, Chang; Azimi, Behnam; Tahmina, Qudsia; Hu, Yi

    2012-11-01

    This study investigated the contribution of low-frequency harmonics to identifying Mandarin tones in natural and vocoded speech in quiet and noisy conditions. Results showed that low-frequency harmonics of natural speech led to highly accurate tone identification; however, for vocoded speech, low-frequency harmonics yielded lower tone identification than stimuli with full harmonics, except for tone 4. Analysis of the correlation between tone accuracy and the amplitude-F0 correlation index suggested that "more" speech contents (i.e., more harmonics) did not necessarily yield better tone recognition for vocoded speech, especially when the amplitude contour of the signals did not co-vary with the F0 contour.

  19. Speech emotion recognition methods: A literature review

    NASA Astrophysics Data System (ADS)

    Basharirad, Babak; Moradhaseli, Mohammadreza

    2017-10-01

    Recently, research on emotional speech signals has received increasing attention in human-machine interfaces owing to the availability of high computational capability. Many systems have been proposed in the literature to identify the emotional state conveyed through speech. Selecting suitable feature sets, designing proper classification methods, and preparing an appropriate dataset are the key issues in speech emotion recognition systems. This paper critically analyzes the currently available approaches to speech emotion recognition with respect to three evaluation parameters (feature set, classification of features, and accuracy). In addition, the paper evaluates the performance and limitations of the available methods and highlights promising directions for improving speech emotion recognition systems.

  20. Computer-based planning of optimal donor sites for autologous osseous grafts

    NASA Astrophysics Data System (ADS)

    Krol, Zdzislaw; Chlebiej, Michal; Zerfass, Peter; Zeilhofer, Hans-Florian U.; Sader, Robert; Mikolajczak, Pawel; Keeve, Erwin

    2002-05-01

    Bone graft surgery is often necessary for reconstruction of craniofacial defects after trauma, tumor, infection or congenital malformation. In this operative technique the removed or missing bone segment is filled with a bone graft. The mainstay of craniofacial reconstruction rests with the replacement of the defective bone with autogenous bone grafts. To achieve sufficient incorporation of the autograft into the host bone, precise planning and simulation of the surgical intervention are required. The major problem is to determine as accurately as possible the donor site from which the graft should be dissected and to define the shape of the desired transplant. A computer-aided method for semi-automatic selection of optimal donor sites for autografts in craniofacial reconstructive surgery has been developed. The non-automatic step of graft design and constraint setting is followed by a fully automatic procedure to find the best fitting position. Extending preceding work, a new optimization approach based on the Levenberg-Marquardt method has been implemented and embedded into our computer-based surgical planning system. Once the pre-processing step has been performed, this new technique enables selection of the optimal donor site in less than one minute. The method has been applied during the surgical planning step in more than 20 cases. The postoperative observations have shown that functional results, such as speech and chewing ability as well as restoration of bony continuity, were clearly better compared to conventionally planned operations. Moreover, in most cases the duration of the surgical interventions has been distinctly reduced.
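
    Purely as an illustration of the optimization step named above, and not the authors' planning system, the sketch below uses SciPy's Levenberg-Marquardt solver to fit a rigid transform that places a graft template onto corresponding target points.

      import numpy as np
      from scipy.optimize import least_squares
      from scipy.spatial.transform import Rotation

      def residuals(params, graft_pts, target_pts):
          # params: 3 rotation-vector components followed by 3 translation components
          rot, trans = params[:3], params[3:]
          moved = Rotation.from_rotvec(rot).apply(graft_pts) + trans
          return (moved - target_pts).ravel()

      def fit_rigid_transform(graft_pts, target_pts):
          x0 = np.zeros(6)                            # identity rotation, zero shift
          sol = least_squares(residuals, x0, args=(graft_pts, target_pts), method="lm")
          return sol.x, sol.cost                      # fitted parameters, residual cost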

  1. Brain-to-text: decoding spoken phrases from phone representations in the brain.

    PubMed

    Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja

    2015-01-01

    It has long been speculated whether communication between humans and machines based on natural speech-related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech.

  2. Brain-to-text: decoding spoken phrases from phone representations in the brain

    PubMed Central

    Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja

    2015-01-01

    It has long been speculated whether communication between humans and machines based on natural speech-related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech. PMID:26124702

  3. On the Development of Speech Resources for the Mixtec Language

    PubMed Central

    2013-01-01

    The Mixtec language is one of the main native languages in Mexico. In general, due to urbanization, discrimination, and limited attempts to promote the culture, the native languages are disappearing. Most of the information available about the Mixtec language is in written form as in dictionaries which, although including examples about how to pronounce the Mixtec words, are not as reliable as listening to the correct pronunciation from a native speaker. Formal acoustic resources, as speech corpora, are almost non-existent for the Mixtec, and no speech technologies are known to have been developed for it. This paper presents the development of the following resources for the Mixtec language: (1) a speech database of traditional narratives of the Mixtec culture spoken by a native speaker (labelled at the phonetic and orthographic levels by means of spectral analysis) and (2) a native speaker-adaptive automatic speech recognition (ASR) system (trained with the speech database) integrated with a Mixtec-to-Spanish/Spanish-to-Mixtec text translator. The speech database, although small and limited to a single variant, was reliable enough to build the multiuser speech application which presented a mean recognition/translation performance up to 94.36% in experiments with non-native speakers (the target users). PMID:23710134

  4. Is talking to an automated teller machine natural and fun?

    PubMed

    Chan, F Y; Khalid, H M

    Usability and affective issues of using automatic speech recognition technology to interact with an automated teller machine (ATM) are investigated in two experiments. The first uncovered dialogue patterns of ATM users for the purpose of designing the user interface for a simulated speech ATM system. Applying the Wizard-of-Oz methodology, multiple mapping and word spotting techniques, the speech driven ATM accommodates bilingual users of Bahasa Melayu and English. The second experiment evaluates the usability of a hybrid speech ATM, comparing it with a simulated manual ATM. The aim is to investigate how natural and fun can talking to a speech ATM be for these first-time users. Subjects performed the withdrawal and balance enquiry tasks. The ANOVA was performed on the usability and affective data. The results showed significant differences between systems in the ability to complete the tasks as well as in transaction errors. Performance was measured on the time taken by subjects to complete the task and the number of speech recognition errors that occurred. On the basis of user emotions, it can be said that the hybrid speech system enabled pleasurable interaction. Despite the limitations of speech recognition technology, users are set to talk to the ATM when it becomes available for public use.

  5. Clinical evidence of the role of the cerebellum in the suppression of overt articulatory movements during reading. A study of reading in children and adolescents treated for cerebellar pilocytic astrocytoma.

    PubMed

    Ait Khelifa-Gallois, N; Puget, S; Longaud, A; Laroussinie, F; Soria, C; Sainte-Rose, C; Dellatolas, G

    2015-04-01

    It has been suggested that the cerebellum is involved in reading acquisition and in particular in the progression from automatic grapheme-phoneme conversion to the internalization of speech required for silent reading. This idea is in line with clinical and neuroimaging data showing a cerebellar role in subvocal rehearsal for printed verbalizable material and with computational "internal models" of the cerebellum suggesting its role in inner speech (i.e. covert speech without mouthing the words). However, studies examining a possible cerebellar role in the suppression of articulatory movements during silent reading acquisition in children are lacking. Here, we report clinical evidence that the cerebellum plays a part in this transition. Reading performances were compared between a group of 17 paediatric patients treated for benign cerebellar tumours and a group of controls matched for age, gender, and parental socio-educational level. The patients scored significantly lower on all reading measures, but the most striking difference concerned silent reading, which was perfectly acquired by almost all controls, in contrast with 41% of the patients who were unable to read any item silently. Silent reading was correlated with the Working Memory Index. The present findings converge with previous reports implicating the cerebellum in inner speech and in the automatization of reading. This cerebellar implication is probably not specific to reading, as it also seems to affect non-reading tasks such as counting.

  6. The development of the Athens Emotional States Inventory (AESI): collection, validation and automatic processing of emotionally loaded sentences.

    PubMed

    Chaspari, Theodora; Soldatos, Constantin; Maragos, Petros

    2015-01-01

    Ecologically valid procedures are needed for collecting reliable and unbiased emotional data for computer interfaces with social and affective intelligence targeting patients with mental disorders. To this end, the Athens Emotional States Inventory (AESI) proposes the design, recording and validation of an audiovisual database for five emotional states: anger, fear, joy, sadness and neutral. The items of the AESI consist of sentences each having content indicative of the corresponding emotion. Emotional content was assessed through a survey of 40 young participants with a questionnaire following the Latin square design. The emotional sentences that were correctly identified by 85% of the participants were recorded in a soundproof room with microphones and cameras. A preliminary validation of the AESI is performed through automatic emotion recognition experiments from speech. The resulting database contains 696 recorded utterances in the Greek language by 20 native speakers and has a total duration of approximately 28 min. Speech classification results yield accuracy up to 75.15% for automatically recognizing the emotions in the AESI. These results indicate the usefulness of our approach for collecting emotional data with reliable content, balanced across classes and with reduced environmental variability.

  7. System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech

    DOEpatents

    Burnett, Greg C [Livermore, CA; Holzrichter, John F [Berkeley, CA; Ng, Lawrence C [Danville, CA

    2006-08-08

    The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.

  8. System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech

    DOEpatents

    Burnett, Greg C.; Holzrichter, John F.; Ng, Lawrence C.

    2004-03-23

    The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.

  9. System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech

    DOEpatents

    Burnett, Greg C.; Holzrichter, John F.; Ng, Lawrence C.

    2006-02-14

    The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.

  10. Effects of stimulus response compatibility on covert imitation of vowels.

    PubMed

    Adank, Patti; Nuttall, Helen; Bekkering, Harold; Maegherman, Gwijde

    2018-03-13

    When we observe someone else speaking, we tend to automatically activate the corresponding speech motor patterns. When listening, we therefore covertly imitate the observed speech. Simulation theories of speech perception propose that covert imitation of speech motor patterns supports speech perception. Covert imitation of speech has been studied with interference paradigms, including the stimulus-response compatibility paradigm (SRC). The SRC paradigm measures covert imitation by comparing articulation of a prompt following exposure to a distracter. Responses tend to be faster for congruent than for incongruent distracters; thus, showing evidence of covert imitation. Simulation accounts propose a key role for covert imitation in speech perception. However, covert imitation has thus far only been demonstrated for a select class of speech sounds, namely consonants, and it is unclear whether covert imitation extends to vowels. We aimed to demonstrate that covert imitation effects as measured with the SRC paradigm extend to vowels, in two experiments. We examined whether covert imitation occurs for vowels in a consonant-vowel-consonant context in visual, audio, and audiovisual modalities. We presented the prompt at four time points to examine how covert imitation varied over the distracter's duration. The results of both experiments clearly demonstrated covert imitation effects for vowels, thus supporting simulation theories of speech perception. Covert imitation was not affected by stimulus modality and was maximal for later time points.

  11. Sound Classification in Hearing Aids Inspired by Auditory Scene Analysis

    NASA Astrophysics Data System (ADS)

    Büchler, Michael; Allegro, Silvia; Launer, Stefan; Dillier, Norbert

    2005-12-01

    A sound classification system for the automatic recognition of the acoustic environment in a hearing aid is discussed. The system distinguishes the four sound classes "clean speech," "speech in noise," "noise," and "music." A number of features that are inspired by auditory scene analysis are extracted from the sound signal. These features describe amplitude modulations, spectral profile, harmonicity, amplitude onsets, and rhythm. They are evaluated together with different pattern classifiers. Simple classifiers, such as rule-based and minimum-distance classifiers, are compared with more complex approaches, such as Bayes classifier, neural network, and hidden Markov model. Sounds from a large database are employed for both training and testing of the system. The achieved recognition rates are very high except for the class "speech in noise." Problems arise in the classification of compressed pop music, strongly reverberated speech, and tonal or fluctuating noises.

  12. Abnormal laughter-like vocalisations replacing speech in primary progressive aphasia

    PubMed Central

    Rohrer, Jonathan D.; Warren, Jason D.; Rossor, Martin N.

    2009-01-01

    We describe ten patients with a clinical diagnosis of primary progressive aphasia (PPA) (pathologically confirmed in three cases) who developed abnormal laughter-like vocalisations in the context of progressive speech output impairment leading to mutism. Failure of speech output was accompanied by increasing frequency of the abnormal vocalisations until ultimately they constituted the patient's only extended utterance. The laughter-like vocalisations did not show contextual sensitivity but occurred as an automatic vocal output that replaced speech. Acoustic analysis of the vocalisations in two patients revealed abnormal motor features including variable note duration and inter-note interval, loss of temporal symmetry of laugh notes and loss of the normal decrescendo. Abnormal laughter-like vocalisations may be a hallmark of a subgroup in the PPA spectrum with impaired control and production of nonverbal vocal behaviour due to disruption of fronto-temporal networks mediating vocalisation. PMID:19435636

  13. Abnormal laughter-like vocalisations replacing speech in primary progressive aphasia.

    PubMed

    Rohrer, Jonathan D; Warren, Jason D; Rossor, Martin N

    2009-09-15

    We describe ten patients with a clinical diagnosis of primary progressive aphasia (PPA) (pathologically confirmed in three cases) who developed abnormal laughter-like vocalisations in the context of progressive speech output impairment leading to mutism. Failure of speech output was accompanied by increasing frequency of the abnormal vocalisations until ultimately they constituted the patient's only extended utterance. The laughter-like vocalisations did not show contextual sensitivity but occurred as an automatic vocal output that replaced speech. Acoustic analysis of the vocalisations in two patients revealed abnormal motor features including variable note duration and inter-note interval, loss of temporal symmetry of laugh notes and loss of the normal decrescendo. Abnormal laughter-like vocalisations may be a hallmark of a subgroup in the PPA spectrum with impaired control and production of nonverbal vocal behaviour due to disruption of fronto-temporal networks mediating vocalisation.

  14. Multi-stream LSTM-HMM decoding and histogram equalization for noise robust keyword spotting.

    PubMed

    Wöllmer, Martin; Marchi, Erik; Squartini, Stefano; Schuller, Björn

    2011-09-01

    Highly spontaneous, conversational, and potentially emotional and noisy speech is known to be a challenge for today's automatic speech recognition (ASR) systems, which highlights the need for advanced algorithms that improve speech features and models. Histogram Equalization is an efficient method to reduce the mismatch between clean and noisy conditions by normalizing all moments of the probability distribution of the feature vector components. In this article, we propose to combine histogram equalization and multi-condition training for robust keyword detection in noisy speech. To better cope with conversational speaking styles, we show how contextual information can be effectively exploited in a multi-stream ASR framework that dynamically models context-sensitive phoneme estimates generated by a long short-term memory neural network. The proposed techniques are evaluated on the SEMAINE database-a corpus containing emotionally colored conversations with a cognitive system for "Sensitive Artificial Listening".
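
    A simplified per-utterance sketch of the histogram equalization idea used here: each feature component is mapped through its empirical CDF onto a reference standard-normal distribution (the reference choice and implementation details are assumptions for illustration).

      import numpy as np
      from scipy.stats import norm

      def histogram_equalize(features):
          # features: (frames, dims) array; returns a Gaussian-equalized copy
          frames, dims = features.shape
          out = np.empty_like(features, dtype=float)
          for d in range(dims):
              ranks = np.argsort(np.argsort(features[:, d]))   # 0 .. frames-1
              cdf = (ranks + 0.5) / frames                     # empirical CDF values
              out[:, d] = norm.ppf(cdf)                        # map onto N(0, 1)
          return out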

  15. A chimpanzee recognizes synthetic speech with significantly reduced acoustic cues to phonetic content.

    PubMed

    Heimbauer, Lisa A; Beran, Michael J; Owren, Michael J

    2011-07-26

    A long-standing debate concerns whether humans are specialized for speech perception, which some researchers argue is demonstrated by the ability to understand synthetic speech with significantly reduced acoustic cues to phonetic content. We tested a chimpanzee (Pan troglodytes) that recognizes 128 spoken words, asking whether she could understand such speech. Three experiments presented 48 individual words, with the animal selecting a corresponding visuographic symbol from among four alternatives. Experiment 1 tested spectrally reduced, noise-vocoded (NV) synthesis, originally developed to simulate input received by human cochlear-implant users. Experiment 2 tested "impossibly unspeechlike" sine-wave (SW) synthesis, which reduces speech to just three moving tones. Although receiving only intermittent and noncontingent reward, the chimpanzee performed well above chance level, including when hearing synthetic versions for the first time. Recognition of SW words was least accurate but improved in experiment 3 when natural words in the same session were rewarded. The chimpanzee was more accurate with NV than SW versions, as were 32 human participants hearing these items. The chimpanzee's ability to spontaneously recognize acoustically reduced synthetic words suggests that experience rather than specialization is critical for speech-perception capabilities that some have suggested are uniquely human. Copyright © 2011 Elsevier Ltd. All rights reserved.

  16. Analysis of speech sounds is left-hemisphere predominant at 100-150ms after sound onset.

    PubMed

    Rinne, T; Alho, K; Alku, P; Holi, M; Sinkkonen, J; Virtanen, J; Bertrand, O; Näätänen, R

    1999-04-06

    Hemispheric specialization of human speech processing has been found in brain imaging studies using fMRI and PET. Due to the restricted time resolution, these methods cannot, however, determine the stage of auditory processing at which this specialization first emerges. We used a dense electrode array covering the whole scalp to record the mismatch negativity (MMN), an event-related brain potential (ERP) automatically elicited by occasional changes in sounds, which ranged from non-phonetic (tones) to phonetic (vowels). MMN can be used to probe auditory central processing on a millisecond scale with no attention-dependent task requirements. Our results indicate that speech processing occurs predominantly in the left hemisphere at the early, pre-attentive level of auditory analysis.

  17. System And Method For Characterizing Voiced Excitations Of Speech And Acoustic Signals, Removing Acoustic Noise From Speech, And Synthesizing Speech

    DOEpatents

    Burnett, Greg C.; Holzrichter, John F.; Ng, Lawrence C.

    2006-04-25

    The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.

  18. Speech recognition features for EEG signal description in detection of neonatal seizures.

    PubMed

    Temko, A; Boylan, G; Marnane, W; Lightbody, G

    2010-01-01

    In this work, features which are usually employed in automatic speech recognition (ASR) are used for the detection of neonatal seizures in newborn EEG. Three conventional ASR feature sets are compared to the feature set which has been previously developed for this task. The results indicate that the thoroughly-studied spectral envelope based ASR features perform reasonably well on their own. Additionally, the SVM Recursive Feature Elimination routine is applied to all extracted features pooled together. It is shown that ASR features consistently appear among the top-rank features.
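
    The SVM Recursive Feature Elimination routine mentioned above is available in scikit-learn; a minimal sketch, assuming a pooled feature matrix X (segments by features) and seizure/non-seizure labels y:

      from sklearn.feature_selection import RFE
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      def rank_features(X, y, n_keep=20):
          # A linear SVM provides the weights RFE uses to discard the weakest features
          selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=n_keep, step=1)
          make_pipeline(StandardScaler(), selector).fit(X, y)
          return selector.ranking_   # rank 1 = retained; larger ranks were eliminated earlier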

  19. Implementation and Performance Exploration of a Cross-Genre Part of Speech Tagging Methodology to Determine Dialog Act Tags in the Chat Domain

    DTIC Science & Technology

    2010-09-01

    the flies.”) or a present tense verb when describing what an airplane does (“An airplane flies.”) This disambiguation is, in general, computationally...as part-of-speech and dialog-act tagging, and yet the volume of data created makes human analysis impractical. We present a cross-genre part-of...acceptable automatic dialog-act determination. Furthermore, we show that a simple naı̈ve Bayes classifier achieves the same performance in a fraction of

  20. Perceptual analysis of speech following traumatic brain injury in childhood.

    PubMed

    Cahill, Louise M; Murdoch, Bruce E; Theodoros, Deborah G

    2002-05-01

    To investigate perceptually the speech dimensions, oromotor function, and speech intelligibility of a group of individuals with traumatic brain injury (TBI) acquired in childhood. The speech of 24 children with TBI was analysed perceptually and compared with that of a group of non-neurologically impaired children matched for age and sex. The 16 dysarthric TBI subjects were significantly less intelligible than the control subjects, and demonstrated significant impairment in 12 of the 33 speech dimensions rated. In addition, the eight non-dysarthric TBI subjects were significantly impaired in many areas of oromotor function on the Frenchay Dysarthria Assessment, indicating some degree of pre-clinical speech impairment. The results of the perceptual analysis are discussed in terms of the possible underlying pathophysiological bases of the deviant speech features identified, and the need for a comprehensive instrumental assessment, to more accurately determine the level of breakdown in the speech production mechanism in children following TBI.

  1. A Generative Model of Speech Production in Broca’s and Wernicke’s Areas

    PubMed Central

    Price, Cathy J.; Crinion, Jenny T.; MacSweeney, Mairéad

    2011-01-01

    Speech production involves the generation of an auditory signal from the articulators and vocal tract. When the intended auditory signal does not match the produced sounds, subsequent articulatory commands can be adjusted to reduce the difference between the intended and produced sounds. This requires an internal model of the intended speech output that can be compared to the produced speech. The aim of this functional imaging study was to identify brain activation related to the internal model of speech production after activation related to vocalization, auditory feedback, and movement in the articulators had been controlled. There were four conditions: silent articulation of speech, non-speech mouth movements, finger tapping, and visual fixation. In the speech conditions, participants produced the mouth movements associated with the words “one” and “three.” We eliminated auditory feedback from the spoken output by instructing participants to articulate these words without producing any sound. The non-speech mouth movement conditions involved lip pursing and tongue protrusions to control for movement in the articulators. The main difference between our speech and non-speech mouth movement conditions is that prior experience producing speech sounds leads to the automatic and covert generation of auditory and phonological associations that may play a role in predicting auditory feedback. We found that, relative to non-speech mouth movements, silent speech activated Broca’s area in the left dorsal pars opercularis and Wernicke’s area in the left posterior superior temporal sulcus. We discuss these results in the context of a generative model of speech production and propose that Broca’s and Wernicke’s areas may be involved in predicting the speech output that follows articulation. These predictions could provide a mechanism by which rapid movement of the articulators is precisely matched to the intended speech outputs during future articulations. PMID:21954392

  2. Alterations in attention capture to auditory emotional stimuli in job burnout: an event-related potential study.

    PubMed

    Sokka, Laura; Huotilainen, Minna; Leinikka, Marianne; Korpela, Jussi; Henelius, Andreas; Alain, Claude; Müller, Kiti; Pakarinen, Satu

    2014-12-01

    Job burnout is a significant cause of work absenteeism. Evidence from behavioral studies and patient reports suggests that job burnout is associated with impairments of attention and decreased working capacity, and it has overlapping elements with depression, anxiety and sleep disturbances. Here, we examined the electrophysiological correlates of automatic sound change detection and involuntary attention allocation in job burnout using scalp recordings of event-related potentials (ERP). Volunteers with job burnout symptoms but without severe depression and anxiety disorders and their non-burnout controls were presented with natural speech sound stimuli (standard and nine deviants), as well as three rarely occurring speech sounds with strong emotional prosody. All stimuli elicited mismatch negativity (MMN) responses that were comparable in both groups. The groups differed with respect to the P3a, an ERP component reflecting involuntary shift of attention: job burnout group showed a shorter P3a latency in response to the emotionally negative stimulus, and a longer latency in response to the positive stimulus. Results indicate that in job burnout, automatic speech sound discrimination is intact, but there is an attention capture tendency that is faster for negative, and slower to positive information compared to that of controls. Copyright © 2014 Elsevier B.V. All rights reserved.

  3. Rhythm in disguise: why singing may not hold the key to recovery from aphasia

    PubMed Central

    Kotz, Sonja A.; Henseler, Ilona; Turner, Robert; Geyer, Stefan

    2011-01-01

    The question of whether singing may be helpful for stroke patients with non-fluent aphasia has been debated for many years. However, the role of rhythm in speech recovery appears to have been neglected. In the current lesion study, we aimed to assess the relative importance of melody and rhythm for speech production in 17 non-fluent aphasics. Furthermore, we systematically alternated the lyrics to test for the influence of long-term memory and preserved motor automaticity in formulaic expressions. We controlled for vocal frequency variability, pitch accuracy, rhythmicity, syllable duration, phonetic complexity and other relevant factors, such as learning effects or the acoustic setting. Contrary to some opinion, our data suggest that singing may not be decisive for speech production in non-fluent aphasics. Instead, our results indicate that rhythm may be crucial, particularly for patients with lesions including the basal ganglia. Among the patients we studied, basal ganglia lesions accounted for more than 50% of the variance related to rhythmicity. Our findings therefore suggest that benefits typically attributed to melodic intoning in the past could actually have their roots in rhythm. Moreover, our data indicate that lyric production in non-fluent aphasics may be strongly mediated by long-term memory and motor automaticity, irrespective of whether lyrics are sung or spoken. PMID:21948939

  4. A hypothetical neurological association between dehumanization and human rights abuses

    PubMed Central

    Murrow, Gail B.; Murrow, Richard

    2015-01-01

    Dehumanization is anecdotally and historically associated with reduced empathy for the pain of dehumanized individuals and groups and with psychological and legal denial of their human rights and extreme violence against them. We hypothesize that ‘empathy’ for the pain and suffering of dehumanized social groups is automatically reduced because, as the research we review suggests, an individual's neural mechanisms of pain empathy best respond to (or produce empathy for) the pain of people whom the individual automatically or implicitly associates with her or his own species. This theory has implications for the philosophical conception of ‘human’ and of ‘legal personhood’ in human rights jurisprudence. It further has implications for First Amendment free speech jurisprudence, including the doctrine of ‘corporate personhood’ and consideration of the potential harm caused by dehumanizing hate speech. We suggest that the new, social neuroscience of empathy provides evidence that both the vagaries of the legal definition or legal fiction of ‘personhood’ and hate speech that explicitly and implicitly dehumanizes may (in their respective capacities to artificially humanize or dehumanize) manipulate the neural mechanisms of pain empathy in ways that could pose more of a true threat to human rights and rights-based democracy than previously appreciated. PMID:27774198

  5. The software for automatic creation of the formal grammars used by speech recognition, computer vision, editable text conversion systems, and some new functions

    NASA Astrophysics Data System (ADS)

    Kardava, Irakli; Tadyszak, Krzysztof; Gulua, Nana; Jurga, Stefan

    2017-02-01

    For more flexible environmental perception by artificial intelligence, supporting software modules are needed that can automate the creation of language-specific syntax and perform further analysis for relevant decisions based on semantic functions. With the proposed approach, pairs of formal rules can be created for given sentences (in the case of natural languages) or statements (in the case of special languages) with the help of computer vision, speech recognition, or an editable text conversion system, and then automatically refined. In other words, the approach significantly improves the automation of the training process of an artificial intelligence system, which in turn yields a higher level of self-developing skills that are independent of the users. Based on this approach, a software demo version has been developed that includes the algorithm and code implementing all of the above-mentioned components (computer vision, speech recognition, and an editable text conversion system). The program can operate in a multi-stream mode and simultaneously create a syntax based on information received from several sources.

  6. Getting What You Want: Accurate Document Filtering in a Terabyte World

    DTIC Science & Technology

    2002-11-01

    models are used widely in speech recognition and have shown promise for ad-hoc information retrieval (Ponte and Croft, 1998; Lafferty and Zhai, 2001...tasks is focused on developing techniques similar to those used in speech recognition. However the differing requirements of speech recognition and...Conference on Research and Development in Information Retrieval. ACM. 6. T.Ault, and Y. Yang. (2001.) kNN at TREC-9: A failure analysis. In

  7. Oral Articulatory Control in Childhood Apraxia of Speech

    PubMed Central

    Moss, Aviva; Lu, Ying

    2015-01-01

    Purpose The purpose of this research was to examine spatial and temporal aspects of articulatory control in children with childhood apraxia of speech (CAS), children with speech delay characterized by an articulation/phonological impairment (SD), and controls with typical development (TD) during speech tasks that increased in word length. Method The participants included 33 children (11 CAS, 11 SD, and 11 TD) between 3 and 7 years of age. A motion capture system was used to track jaw, lower lip, and upper lip movement during a naming task. Movement duration, velocity, displacement, and variability were measured from accurate word productions. Results Movement variability was significantly higher in the children with CAS compared with participants in the SD and TD groups. Differences in temporal control were seen between both groups of children with speech impairment and the controls with TD during accurate word productions. As word length increased, movement duration and variability differed between the children with CAS and those with SD. Conclusions These findings provide evidence that movement variability distinguishes children with CAS from speakers with SD. Kinematic differences between the participants with CAS and those with SD suggest that these groups respond differently to linguistic challenges. PMID:25951237

  8. Validation of the second version of the LittlEARS® Early Speech Production Questionnaire (LEESPQ) in German-speaking children with normal hearing.

    PubMed

    Keilmann, Annerose; Friese, Barbara; Lässig, Anne; Hoffmann, Vanessa

    2018-04-01

    The introduction of neonatal hearing screening and the increasingly early age at which children can receive a cochlear implant have intensified the need for a validated questionnaire to assess the speech production of children aged 0‒18 months. Such a questionnaire has been created, the LittlEARS® Early Speech Production Questionnaire (LEESPQ). This study aimed to validate a second, revised edition of the LEESPQ. Questionnaires were returned for 362 children with normal hearing. Completed questionnaires were analysed to determine if the LEESPQ is reliable, prognostically accurate, internally consistent, and if gender or multilingualism affects total scores. Total scores correlated positively with age. The LEESPQ is reliable, accurate, and consistent, and independent of gender or lingual status. A norm curve was created. This second version of the LEESPQ is a valid tool to assess the speech production development of children with normal hearing, aged 0‒18 months, regardless of their gender. As such, the LEESPQ may be a useful tool to monitor the development of paediatric hearing device users. The second version of the LEESPQ is a valid instrument for assessing early speech production of children aged 0‒18 months.

  9. Extent of palatal lengthening after cleft palate repair as a contributing factor to the speech outcome.

    PubMed

    Bae, Yong-Chan; Choi, Soo-Jong; Lee, Jae-Woo; Seo, Hyoung-Joon

    2015-03-01

    Operative techniques in performing cleft palate repair have gradually evolved to achieve better speech ability, with their main focus on palatal lengthening and accurate approximation of the velar musculature. The authors doubted whether the extent of palatal lengthening would be directly proportional to the speech outcome. Patients with incomplete cleft palates who underwent surgery before 18 months of age were included in this study. Cases with associated syndromes, mental retardation, hearing loss, or presence of postoperative complications were excluded from the analysis. Palatal length was measured by the authors' devised method before and immediately after the cleft palate repair. Postoperative speech outcome was evaluated at around 4 years of age using a defined pronunciation scoring system. Statistical analysis between the extent of palatal lengthening and the postoperative pronunciation score was carried out using the Spearman correlation coefficient method. However, the authors could not find any significant correlation. Although the need for additional research on other variables affecting speech outcome is unequivocal, we carefully conclude that other intraoperative constituents, such as accurate reapproximation of the velar musculature, should be emphasized in cleft palate repair more than palatal lengthening itself.

  10. Tone classification of syllable-segmented Thai speech based on multilayer perceptron

    NASA Astrophysics Data System (ADS)

    Satravaha, Nuttavudh; Klinkhachorn, Powsiri; Lass, Norman

    2002-05-01

    Thai is a monosyllabic tonal language that uses tone to convey lexical information about the meaning of a syllable. Thus to completely recognize a spoken Thai syllable, a speech recognition system not only has to recognize a base syllable but also must correctly identify a tone. Hence, tone classification of Thai speech is an essential part of a Thai speech recognition system. Thai has five distinctive tones (``mid,'' ``low,'' ``falling,'' ``high,'' and ``rising'') and each tone is represented by a single fundamental frequency (F0) pattern. However, several factors, including tonal coarticulation, stress, intonation, and speaker variability, affect the F0 pattern of a syllable in continuous Thai speech. In this study, an efficient method for tone classification of syllable-segmented Thai speech, which incorporates the effects of tonal coarticulation, stress, and intonation, as well as a method to perform automatic syllable segmentation, were developed. Acoustic parameters were used as the main discriminating parameters. The F0 contour of a segmented syllable was normalized by using a z-score transformation before being presented to a tone classifier. The proposed system was evaluated on 920 test utterances spoken by 8 speakers. A recognition rate of 91.36% was achieved by the proposed system.
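
    A minimal sketch of the normalization and classification stages described above: each syllable's F0 contour is z-score transformed, resampled to a fixed length, and fed to a multilayer perceptron (dimensions and hyperparameters are illustrative only, not those of the proposed system).

      import numpy as np
      from sklearn.neural_network import MLPClassifier

      def normalize_contour(f0_contour, n_points=20):
          f0 = np.asarray(f0_contour, dtype=float)
          z = (f0 - f0.mean()) / (f0.std() + 1e-8)      # z-score transformation
          xs = np.linspace(0, len(z) - 1, n_points)
          return np.interp(xs, np.arange(len(z)), z)    # fixed-length contour

      def train_tone_classifier(contours, tone_labels):
          # tone_labels: e.g. 'mid', 'low', 'falling', 'high', 'rising'
          X = np.vstack([normalize_contour(c) for c in contours])
          clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000)
          clf.fit(X, tone_labels)
          return clf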

  11. A novel probabilistic framework for event-based speech recognition

    NASA Astrophysics Data System (ADS)

    Juneja, Amit; Espy-Wilson, Carol

    2003-10-01

    One of the reasons for unsatisfactory performance of the state-of-the-art automatic speech recognition (ASR) systems is the inferior acoustic modeling of low-level acoustic-phonetic information in the speech signal. An acoustic-phonetic approach to ASR, on the other hand, explicitly targets linguistic information in the speech signal, but such a system for continuous speech recognition (CSR) is not known to exist. A probabilistic and statistical framework for CSR based on the idea of the representation of speech sounds by bundles of binary valued articulatory phonetic features is proposed. Multiple probabilistic sequences of linguistically motivated landmarks are obtained using binary classifiers of manner phonetic features (syllabic, sonorant, and continuant) and the knowledge-based acoustic parameters (APs) that are acoustic correlates of those features. The landmarks are then used for the extraction of knowledge-based APs for source and place phonetic features and their binary classification. Probabilistic landmark sequences are constrained using manner class language models for isolated or connected word recognition. The proposed method could overcome the disadvantages encountered by the early acoustic-phonetic knowledge-based systems that led the ASR community to switch to systems highly dependent on statistical pattern analysis methods and probabilistic language or grammar models.

  12. Feasibility of automated speech sample collection with stuttering children using interactive voice response (IVR) technology.

    PubMed

    Vogel, Adam P; Block, Susan; Kefalianos, Elaina; Onslow, Mark; Eadie, Patricia; Barth, Ben; Conway, Laura; Mundt, James C; Reilly, Sheena

    2015-04-01

    To investigate the feasibility of adopting automated interactive voice response (IVR) technology for remotely capturing standardized speech samples from stuttering children. Participants were ten 6-year-old stuttering children. Their parents called a toll-free number from their homes and were prompted to elicit speech from their children using a standard protocol involving conversation, picture description and games. The automated IVR system was implemented using an off-the-shelf telephony software program and delivered by a standard desktop computer. The software infrastructure utilizes voice over internet protocol. Speech samples were automatically recorded during the calls. Video recordings were simultaneously acquired in the home at the time of the call to evaluate the fidelity of the telephone-collected samples. Key outcome measures included syllables spoken, percentage of syllables stuttered and an overall rating of stuttering severity using a 10-point scale. Data revealed a high level of relative reliability, in terms of intra-class correlation between the video and telephone acquired samples, on all outcome measures during the conversation task. Findings were less consistent for speech samples during picture description and games. Results suggest that IVR technology can be used successfully to automate remote capture of child speech samples.

  13. Automatic detection of obstructive sleep apnea using speech signals.

    PubMed

    Goldshtein, Evgenia; Tarasiuk, Ariel; Zigel, Yaniv

    2011-05-01

    Obstructive sleep apnea (OSA) is a common disorder associated with anatomical abnormalities of the upper airways that affects 5% of the population. Acoustic parameters may be influenced by the vocal tract structure and soft tissue properties. We hypothesize that speech signal properties of OSA patients will differ from those of control subjects without OSA. Using speech signal processing techniques, we explored acoustic speech features of 93 subjects who were recorded using a text-dependent speech protocol and a digital audio recorder immediately prior to a polysomnography study. Following analysis of the polysomnography, subjects were divided into OSA (n=67) and non-OSA (n=26) groups. A Gaussian mixture model-based system was developed to model and classify between the groups; discriminative features such as vocal tract length and linear prediction coefficients were selected using a feature selection technique. Specificity and sensitivity of 83% and 79% were achieved for the male OSA patients and 86% and 84% for the female OSA patients, respectively. We conclude that acoustic features from speech signals during wakefulness can detect OSA patients with good specificity and sensitivity. Such a system can be used as a basis for future development of a tool for OSA screening. © 2011 IEEE
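
    The classification scheme described (one Gaussian mixture model per group, with a recording assigned to the group whose model scores it higher) can be sketched as follows with scikit-learn. The feature matrices, mixture sizes and covariance type are placeholders, not the authors' configuration.

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_gmm(features, n_components=8):
        """Fit one GMM to the per-frame acoustic feature vectors of a group."""
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type='diag', random_state=0)
        gmm.fit(features)
        return gmm

    def classify(sample_features, gmm_osa, gmm_control):
        """Assign a recording to the group whose GMM gives the higher
        average log-likelihood over its feature vectors."""
        ll_osa = gmm_osa.score(sample_features)      # mean log-likelihood
        ll_ctrl = gmm_control.score(sample_features)
        return 'OSA' if ll_osa > ll_ctrl else 'non-OSA'

    # Placeholder data: rows are per-frame feature vectors (e.g., LPC + VTL)
    rng = np.random.default_rng(0)
    gmm_osa = train_gmm(rng.normal(0.5, 1.0, size=(500, 13)))
    gmm_control = train_gmm(rng.normal(-0.5, 1.0, size=(500, 13)))
    print(classify(rng.normal(0.5, 1.0, size=(50, 13)), gmm_osa, gmm_control))
    ```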

  14. Speech rhythm alterations in Spanish-speaking individuals with Alzheimer's disease.

    PubMed

    Martínez-Sánchez, Francisco; Meilán, Juan J G; Vera-Ferrandiz, Juan Antonio; Carro, Juan; Pujante-Valverde, Isabel M; Ivanova, Olga; Carcavilla, Nuria

    2017-07-01

    Rhythm is the speech property related to the temporal organization of sounds. Considerable evidence now suggests that dementia of the Alzheimer's type is associated with impairments in speech rhythm. The aim of this study is to assess the use of an automatic computerized system for measuring speech rhythm characteristics in an oral reading task performed by 45 patients with Alzheimer's disease (AD), compared with the same characteristics in 82 healthy older adults without a diagnosis of dementia, matched by age, sex, and cultural background. A range of rhythm-metric and clinical measurements was applied. The results show rhythmic differences between the groups, with higher variability of syllabic intervals in AD patients. Signal processing algorithms applied to oral reading recordings prove capable of differentiating between AD patients and older adults without dementia with an accuracy of 87% (specificity 81.7%, sensitivity 82.2%), based on the standard deviation of the duration of syllabic intervals. Experimental results show that syllabic variability measurements extracted from the speech signal can be used to distinguish between older adults without a diagnosis of dementia and those with AD, and may be useful as a tool for the objective study and quantification of speech deficits in AD.
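
    The key acoustic measure reported, the standard deviation of syllabic interval durations, can be computed from syllable onset times as in the sketch below. The onset values and helper name are illustrative; the paper's automatic segmentation algorithm is not reproduced.

    ```python
    import numpy as np

    def syllabic_interval_variability(onsets_sec):
        """Return the mean and standard deviation of syllabic interval
        durations, given sorted syllable onset times (seconds) from an
        automatic syllabification of an oral reading recording."""
        durations = np.diff(np.asarray(onsets_sec, dtype=float))
        return durations.mean(), durations.std()

    onsets = [0.00, 0.21, 0.45, 0.62, 0.95, 1.30, 1.48, 1.80]  # placeholder
    mean_dur, sd_dur = syllabic_interval_variability(onsets)
    print(f"mean interval {mean_dur:.3f} s, SD {sd_dur:.3f} s")
    # A simple screen could flag recordings whose SD exceeds a chosen cutoff.
    ```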

  15. When speaker identity is unavoidable: Neural processing of speaker identity cues in natural speech.

    PubMed

    Tuninetti, Alba; Chládková, Kateřina; Peter, Varghese; Schiller, Niels O; Escudero, Paola

    2017-11-01

    Speech sound acoustic properties vary largely across speakers and accents. When perceiving speech, adult listeners normally disregard non-linguistic variation caused by speaker or accent differences, in order to comprehend the linguistic message, e.g. to correctly identify a speech sound or a word. Here we tested whether the process of normalizing speaker and accent differences, facilitating the recognition of linguistic information, is found at the level of neural processing, and whether it is modulated by the listeners' native language. In a multi-deviant oddball paradigm, native and nonnative speakers of Dutch were exposed to naturally-produced Dutch vowels varying in speaker, sex, accent, and phoneme identity. Unexpectedly, the analysis of mismatch negativity (MMN) amplitudes elicited by each type of change shows a large degree of early perceptual sensitivity to non-linguistic cues. This finding on perception of naturally-produced stimuli contrasts with previous studies examining the perception of synthetic stimuli wherein adult listeners automatically disregard acoustic cues to speaker identity. The present finding bears relevance to speech normalization theories, suggesting that at an unattended level of processing, listeners are indeed sensitive to changes in fundamental frequency in natural speech tokens. Copyright © 2017 Elsevier Inc. All rights reserved.

  16. Intelligibility Evaluation of Pathological Speech through Multigranularity Feature Extraction and Optimization.

    PubMed

    Fang, Chunying; Li, Haifeng; Ma, Lin; Zhang, Mancai

    2017-01-01

    Pathological speech usually refers to speech distortion resulting from illness or other biological insults. The assessment of pathological speech plays an important role in assisting experts, while automatic evaluation of speech intelligibility is difficult because such speech is usually nonstationary and mutational. In this paper, we develop a new approach to feature extraction and reduction and describe a multigranularity combined feature scheme that is optimized by a hierarchical visual method. A novel method of generating the feature set based on the S-transform and chaotic analysis is proposed, comprising basic acoustic features (BAFS, 430), local spectral characteristics (MSCC, 84 Mel S-transform cepstrum coefficients), and chaotic features (12). Finally, a radar chart and the F-score are used to optimize the features through hierarchical visual fusion. The feature set could be reduced from 526 to 96 dimensions on the NKI-CCRT corpus and to 104 dimensions on the SVD corpus. The experimental results show that the new features, classified with a support vector machine (SVM), achieve the best performance, with a recognition rate of 84.4% on the NKI-CCRT corpus and 78.7% on the SVD corpus. The proposed method is thus shown to be effective and reliable for pathological speech intelligibility evaluation.
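
    The F-score ranking used in the optimization step can be sketched as below under the usual two-class Fisher-criterion definition; the corpora, feature sets and radar-chart fusion step are not reproduced.

    ```python
    import numpy as np

    def f_scores(X, y):
        """Two-class F-score per feature column of X:
        F(i) = ((mean_pos_i - mean_i)^2 + (mean_neg_i - mean_i)^2)
               / (var_pos_i + var_neg_i).
        Higher scores indicate features that separate the classes better."""
        X, y = np.asarray(X, float), np.asarray(y)
        pos, neg = X[y == 1], X[y == 0]
        overall = X.mean(axis=0)
        num = (pos.mean(axis=0) - overall) ** 2 + (neg.mean(axis=0) - overall) ** 2
        den = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
        return num / np.maximum(den, 1e-12)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    y = rng.integers(0, 2, 100)
    X[y == 1, 0] += 2.0                     # make feature 0 informative
    print("features ranked by F-score:", np.argsort(f_scores(X, y))[::-1])
    ```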

  17. Elizabeth Usher Memorial Lecture: How do we change our profession? Using the lens of behavioural economics to improve evidence-based practice in speech-language pathology.

    PubMed

    McCabe, Patricia J

    2018-06-01

    Evidence-based practice (EBP) is a well-accepted theoretical framework around which speech-language pathologists strive to build their clinical decisions. The profession's conceptualisation of EBP has been evolving over the last 20 years with the practice of EBP now needing to balance research evidence, clinical data and informed patient choices. However, although EBP is not a new concept, as a profession, we seem to be no closer to closing the gap between research evidence and practice than we were at the start of the movement toward EBP in the late 1990s. This paper examines why speech-language pathologists find it difficult to change our own practice when we are experts in changing the behaviour of others. Using the lens of behavioural economics to examine the heuristics and cognitive processes which facilitate and inhibit change, the paper explores research showing how inconsistency of belief and action, or cognitive dissonance, is inevitable unless we act reflectively instead of automatically. The paper argues that heuristics that prevent us changing our practice toward EBP include the sunk cost fallacy, loss aversion, social desirability bias, choice overload and inertia. These automatic cognitive processes work to inhibit change and may partially account for the slow translation of research into practice. Fortunately, understanding and using other heuristics such as the framing effect, reciprocity, social proof, consistency and commitment may help us to understand our own behaviour as speech-language pathologists and help the profession, and those we work with, move towards EBP.

  18. Speech waveform perturbation analysis: a perceptual-acoustical comparison of seven measures.

    PubMed

    Askenfelt, A G; Hammarberg, B

    1986-03-01

    The performance of seven acoustic measures of cycle-to-cycle variations (perturbations) in the speech waveform was compared. All measures were calculated automatically and applied on running speech. Three of the measures refer to the frequency of occurrence and severity of waveform perturbations in special selected parts of the speech, identified by means of the rate of change in the fundamental frequency. Three other measures refer to statistical properties of the distribution of the relative frequency differences between adjacent pitch periods. One perturbation measure refers to the percentage of consecutive pitch period differences with alternating signs. The acoustic measures were tested on tape recorded speech samples from 41 voice patients, before and after successful therapy. Scattergrams of acoustic waveform perturbation data versus an average of perceived deviant voice qualities, as rated by voice clinicians, are presented. The perturbation measures were compared with regard to the acoustic-perceptual correlation and their ability to discriminate between normal and pathological voice status. The standard deviation of the distribution of the relative frequency differences was suggested as the most useful acoustic measure of waveform perturbations for clinical applications.
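
    The distribution-based measures described (statistics of the relative frequency differences between adjacent pitch periods, and the percentage of consecutive differences with alternating signs) can be sketched as follows. The pitch periods shown are illustrative values, not data from the study.

    ```python
    import numpy as np

    def perturbation_measures(periods_s):
        """Compute perturbation statistics from consecutive pitch periods:
        the mean and standard deviation of the relative frequency differences
        between adjacent periods, and the percentage of consecutive
        differences whose signs alternate."""
        f0 = 1.0 / np.asarray(periods_s, dtype=float)
        rel_diff = np.diff(f0) / f0[:-1]             # relative F0 differences
        signs = np.sign(rel_diff)
        alternating = 100.0 * np.mean(signs[1:] * signs[:-1] < 0)
        return rel_diff.mean(), rel_diff.std(), alternating

    periods = [0.0050, 0.0052, 0.0049, 0.0053, 0.0048, 0.0051, 0.0050]  # seconds
    print(perturbation_measures(periods))
    ```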

  19. Language Experiences. Developmental Skills Series, Booklet IV.

    ERIC Educational Resources Information Center

    University City School District, MO.

    GRADES OR AGES: Not specified. It appears to be for kindergarten and primary grades. SUBJECT MATTER: Language and speech, including language patterns, accurate expression of ideas, creative expression of ideas, connection of sound with symbols, and speech improvement. ORGANIZATION AND PHYSICAL APPEARANCE: The guide is divided into five sections,…

  20. Speech intonation and melodic contour recognition in children with cochlear implants and with normal hearing.

    PubMed

    See, Rachel L; Driscoll, Virginia D; Gfeller, Kate; Kliethermes, Stephanie; Oleson, Jacob

    2013-04-01

    Cochlear implant (CI) users have difficulty perceiving some intonation cues in speech and melodic contours because of poor frequency selectivity in the cochlear implant signal. To assess perceptual accuracy of normal hearing (NH) children and pediatric CI users on speech intonation (prosody), melodic contour, and pitch ranking, and to determine potential predictors of outcomes. Does perceptual accuracy for speech intonation or melodic contour differ as a function of auditory status (NH, CI), perceptual category (falling versus rising intonation/contour), pitch perception, or individual differences (e.g., age, hearing history)? NH and CI groups were tested on recognition of falling intonation/contour versus rising intonation/contour presented in both spoken and melodic (sung) conditions. Pitch ranking was also tested. Outcomes were correlated with variables of age, hearing history, HINT, and CNC scores. The CI group was significantly less accurate than the NH group in spoken (CI, M = 63.1%; NH, M = 82.1%) and melodic (CI, M = 61.6%; NH, M = 84.2%) conditions. The CI group was more accurate in recognizing rising contour in the melodic condition compared with rising intonation in the spoken condition. Pitch ranking was a significant predictor of outcome for both groups in falling intonation and rising melodic contour; age at testing and hearing history variables were not predictive of outcomes. Children with CIs were less accurate than NH children in perception of speech intonation, melodic contour, and pitch ranking. However, the larger pitch excursions of the melodic condition may assist in recognition of the rising inflection associated with the interrogative form.

  1. Speech Intonation and Melodic Contour Recognition in Children with Cochlear Implants and with Normal Hearing

    PubMed Central

    See, Rachel L.; Driscoll, Virginia D.; Gfeller, Kate; Kliethermes, Stephanie; Oleson, Jacob

    2013-01-01

    Background Cochlear implant (CI) users have difficulty perceiving some intonation cues in speech and melodic contours because of poor frequency selectivity in the cochlear implant signal. Objectives To assess perceptual accuracy of normal hearing (NH) children and pediatric CI users on speech intonation (prosody), melodic contour, and pitch ranking, and to determine potential predictors of outcomes. Hypothesis Does perceptual accuracy for speech intonation or melodic contour differ as a function of auditory status (NH, CI), perceptual category (falling vs. rising intonation/contour), pitch perception, or individual differences (e.g., age, hearing history)? Method NH and CI groups were tested on recognition of falling intonation/contour vs. rising intonation/contour presented in both spoken and melodic (sung) conditions. Pitch ranking was also tested. Outcomes were correlated with variables of age, hearing history, HINT, and CNC scores. Results The CI group was significantly less accurate than the NH group in spoken (CI, M=63.1 %; NH, M=82.1%) and melodic (CI, M=61.6%; NH, M=84.2%) conditions. The CI group was more accurate in recognizing rising contour in the melodic condition compared with rising intonation in the spoken condition. Pitch ranking was a significant predictor of outcome for both groups in falling intonation and rising melodic contour; age at testing and hearing history variables were not predictive of outcomes. Conclusions Children with CIs were less accurate than NH children in perception of speech intonation, melodic contour, and pitch ranking. However, the larger pitch excursions of the melodic condition may assist in recognition of the rising inflection associated with the interrogative form. PMID:23442568

  2. Approximated mutual information training for speech recognition using myoelectric signals.

    PubMed

    Guo, Hua J; Chan, A D C

    2006-01-01

    A new training algorithm called approximated maximum mutual information (AMMI) is proposed to improve the accuracy of myoelectric speech recognition using hidden Markov models (HMMs). Previous studies have demonstrated that automatic speech recognition can be performed using myoelectric signals from articulatory muscles of the face. Classification of facial myoelectric signals can be performed using HMMs trained with the maximum likelihood (ML) algorithm; however, this algorithm maximizes the likelihood of the observations in the training sequence, which is not directly associated with optimal classification accuracy. The AMMI training algorithm attempts to maximize the mutual information, thereby training the HMMs to optimize their parameters for discrimination. Our results show that AMMI training consistently reduces the error rates compared to those obtained with ML training, increasing the accuracy by approximately 3% on average.

  3. Highlight summarization in golf videos using audio signals

    NASA Astrophysics Data System (ADS)

    Kim, Hyoung-Gook; Kim, Jin Young

    2008-01-01

    In this paper, we present an automatic summarization of highlights in golf videos based on audio information alone, without video information. The proposed highlight summarization system is based on semantic audio segmentation and the detection of action units from audio signals. Studio speech, field speech, music, and applause are segmented by means of sound classification. Swings are detected by impulse onset detection. Sounds such as swing and applause together form a complete action unit, while studio speech and music parts are used to anchor the program structure. With the advantage of highly precise detection of applause, highlights are extracted effectively. Our experiments achieve high classification precision on 18 golf games, demonstrating that the proposed system is effective and computationally efficient enough to apply the technology to embedded consumer electronic devices.

  4. A new time-adaptive discrete bionic wavelet transform for enhancing speech from adverse noise environment

    NASA Astrophysics Data System (ADS)

    Palaniswamy, Sumithra; Duraisamy, Prakash; Alam, Mohammad Showkat; Yuan, Xiaohui

    2012-04-01

    Automatic speech processing systems are widely used in everyday life, for example in mobile communication, speech and speaker recognition, and assisting the hearing impaired. In speech communication systems, the quality and intelligibility of speech are of utmost importance for ease and accuracy of information exchange. To obtain a speech signal that is intelligible and more pleasant to listen to, noise reduction is essential. In this paper a new Time-Adaptive Discrete Bionic Wavelet Thresholding (TADBWT) scheme is proposed. The proposed technique uses a Daubechies mother wavelet to achieve better enhancement of speech corrupted by additive non-stationary noises that occur in real life, such as street noise and factory noise. Due to the integration of a human auditory system model into the wavelet transform, the bionic wavelet transform (BWT) has great potential for speech enhancement and may lead to a new path in speech processing. In the proposed technique, discrete BWT is first applied to noisy speech to derive TADBWT coefficients. Then the adaptive nature of the BWT is captured by introducing a time-varying linear factor which updates the coefficients at each scale over time. This approach has shown better performance than existing algorithms at lower input SNR due to modified soft level-dependent thresholding on the time-adaptive coefficients. The objective and subjective test results confirmed the competency of the TADBWT technique. The effectiveness of the proposed technique is also evaluated for a speaker recognition task in a noisy environment. The recognition results show that the TADBWT technique yields better performance compared to alternate methods, specifically at lower input SNR.
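
    The bionic wavelet transform itself is not available in standard libraries, so the sketch below only illustrates the surrounding idea of level-dependent soft thresholding on a discrete wavelet decomposition (here with PyWavelets and a Daubechies wavelet); the time-adaptive factor and the paper's exact thresholding rule are not reproduced.

    ```python
    import numpy as np
    import pywt

    def soft_threshold_denoise(noisy, wavelet='db8', level=5):
        """Denoise a 1-D signal by level-dependent soft thresholding of its
        discrete wavelet coefficients (a stand-in for the paper's TADBWT)."""
        coeffs = pywt.wavedec(noisy, wavelet, level=level)
        denoised = [coeffs[0]]                          # keep approximation
        for detail in coeffs[1:]:
            sigma = np.median(np.abs(detail)) / 0.6745  # robust noise estimate
            thr = sigma * np.sqrt(2 * np.log(len(noisy)))
            denoised.append(pywt.threshold(detail, thr, mode='soft'))
        return pywt.waverec(denoised, wavelet)[:len(noisy)]

    fs = 8000
    t = np.arange(0, 1, 1 / fs)
    clean = np.sin(2 * np.pi * 200 * t) * np.hanning(len(t))
    noisy = clean + 0.3 * np.random.default_rng(0).normal(size=len(t))
    enhanced = soft_threshold_denoise(noisy)
    ```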

  5. Speech Sound Processing Deficits and Training-Induced Neural Plasticity in Rats with Dyslexia Gene Knockdown

    PubMed Central

    Centanni, Tracy M.; Chen, Fuyi; Booker, Anne M.; Engineer, Crystal T.; Sloan, Andrew M.; Rennaker, Robert L.; LoTurco, Joseph J.; Kilgard, Michael P.

    2014-01-01

    In utero RNAi of the dyslexia-associated gene Kiaa0319 in rats (KIA-) degrades cortical responses to speech sounds and increases trial-by-trial variability in onset latency. We tested the hypothesis that KIA- rats would be impaired at speech sound discrimination. KIA- rats needed twice as much training in quiet conditions to perform at control levels and remained impaired at several speech tasks. Focused training using truncated speech sounds was able to normalize speech discrimination in quiet and background noise conditions. Training also normalized trial-by-trial neural variability and temporal phase locking. Cortical activity from speech trained KIA- rats was sufficient to accurately discriminate between similar consonant sounds. These results provide the first direct evidence that assumed reduced expression of the dyslexia-associated gene KIAA0319 can cause phoneme processing impairments similar to those seen in dyslexia and that intensive behavioral therapy can eliminate these impairments. PMID:24871331

  6. Attention Is Required for Knowledge-Based Sequential Grouping: Insights from the Integration of Syllables into Words.

    PubMed

    Ding, Nai; Pan, Xunyi; Luo, Cheng; Su, Naifei; Zhang, Wen; Zhang, Jianfeng

    2018-01-31

    How the brain groups sequential sensory events into chunks is a fundamental question in cognitive neuroscience. This study investigates whether top-down attention or specific tasks are required for the brain to apply lexical knowledge to group syllables into words. Neural responses tracking the syllabic and word rhythms of a rhythmic speech sequence were concurrently monitored using electroencephalography (EEG). The participants performed different tasks, attending to either the rhythmic speech sequence or a distractor, which was another speech stream or a nonlinguistic auditory/visual stimulus. Attention to speech, but not a lexical-meaning-related task, was required for reliable neural tracking of words, even when the distractor was a nonlinguistic stimulus presented cross-modally. Neural tracking of syllables, however, was reliably observed in all tested conditions. These results strongly suggest that neural encoding of individual auditory events (i.e., syllables) is automatic, while knowledge-based construction of temporal chunks (i.e., words) crucially relies on top-down attention. SIGNIFICANCE STATEMENT Why we cannot understand speech when not paying attention is an old question in psychology and cognitive neuroscience. Speech processing is a complex process that involves multiple stages, e.g., hearing and analyzing the speech sound, recognizing words, and combining words into phrases and sentences. The current study investigates which speech-processing stage is blocked when we do not listen carefully. We show that the brain can reliably encode syllables, basic units of speech sounds, even when we do not pay attention. Nevertheless, when distracted, the brain cannot group syllables into multisyllabic words, which are basic units for speech meaning. Therefore, the process of converting speech sound into meaning crucially relies on attention. Copyright © 2018 the authors 0270-6474/18/381178-11$15.00/0.

  7. Methods for eliciting, annotating, and analyzing databases for child speech development.

    PubMed

    Beckman, Mary E; Plummer, Andrew R; Munson, Benjamin; Reidy, Patrick F

    2017-09-01

    Methods from automatic speech recognition (ASR), such as segmentation and forced alignment, have facilitated the rapid annotation and analysis of very large adult speech databases and databases of caregiver-infant interaction, enabling advances in speech science that were unimaginable just a few decades ago. This paper centers on two main problems that must be addressed in order to have analogous resources for developing and exploiting databases of young children's speech. The first problem is to understand and appreciate the differences between adult and child speech that cause ASR models developed for adult speech to fail when applied to child speech. These differences include the fact that children's vocal tracts are smaller than those of adult males and also changing rapidly in size and shape over the course of development, leading to between-talker variability across age groups that dwarfs the between-talker differences between adult men and women. Moreover, children do not achieve fully adult-like speech motor control until they are young adults, and their vocabularies and phonological proficiency are developing as well, leading to considerably more within-talker variability as well as more between-talker variability. The second problem then is to determine what annotation schemas and analysis techniques can most usefully capture relevant aspects of this variability. Indeed, standard acoustic characterizations applied to child speech reveal that adult-centered annotation schemas fail to capture phenomena such as the emergence of covert contrasts in children's developing phonological systems, while also revealing children's nonuniform progression toward community speech norms as they acquire the phonological systems of their native languages. Both problems point to the need for more basic research into the growth and development of the articulatory system (as well as of the lexicon and phonological system) that is oriented explicitly toward the construction of age-appropriate computational models.

  8. An automatic speech recognition system with speaker-independent identification support

    NASA Astrophysics Data System (ADS)

    Caranica, Alexandru; Burileanu, Corneliu

    2015-02-01

    The novelty of this work lies in the application of an open-source research software toolkit (CMU Sphinx) to train, build and evaluate a speech recognition system, with speaker-independent support, for voice-controlled hardware applications. Moreover, we propose to use the trained acoustic model to decode offline voice commands on embedded hardware, such as a low-cost ARMv6 SoC, the Raspberry Pi. This type of single-board computer, mainly used for educational and research activities, can serve as a proof-of-concept software and hardware stack for low-cost voice automation systems.
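
    Offline decoding of a recorded command with the PocketSphinx Python bindings can be sketched roughly as below. The model paths and audio file are placeholders, and the exact API names differ between PocketSphinx versions; this is an illustration of the approach, not the system described above.

    ```python
    from pocketsphinx import Decoder

    # Placeholder model paths; substitute the trained acoustic/language models.
    config = Decoder.default_config()
    config.set_string('-hmm', 'model/acoustic')     # acoustic model directory
    config.set_string('-lm', 'model/commands.lm')   # language model
    config.set_string('-dict', 'model/commands.dic')

    decoder = Decoder(config)
    decoder.start_utt()
    with open('command.raw', 'rb') as f:            # 16 kHz, 16-bit mono PCM
        while True:
            buf = f.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()

    hyp = decoder.hyp()
    print('Recognized command:', hyp.hypstr if hyp else None)
    ```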

  9. Research Directory for Manpower, Personnel, Training, and Human Factors.

    DTIC Science & Technology

    1991-01-01

    Directory listing fragment (garbled in extraction); recoverable entries include the projects "Enhance Automatic Recognition of Speech in Noisy, Highly Stressful Environments" and "Impulse Noise Hazard Information Processing R&D" (Human Engineering Lab), each listed with a contact name and telephone number.

  10. Variable frame rate transmission - A review of methodology and application to narrow-band LPC speech coding

    NASA Astrophysics Data System (ADS)

    Viswanathan, V. R.; Makhoul, J.; Schwartz, R. M.; Huggins, A. W. F.

    1982-04-01

    The variable frame rate (VFR) transmission methodology developed, implemented, and tested in the years 1973-1978 for efficiently transmitting linear predictive coding (LPC) vocoder parameters extracted from the input speech at a fixed frame rate is reviewed. With the VFR method, parameters are transmitted only when their values have changed sufficiently over the interval since their preceding transmission. Two distinct approaches to automatic implementation of the VFR method are discussed. The first bases the transmission decisions on comparisons between the parameter values of the present frame and the last transmitted frame. The second, which is based on a functional perceptual model of speech, compares the parameter values of all the frames that lie in the interval between the present frame and the last transmitted frame against a linear model of parameter variation over that interval. Also considered is the application of VFR transmission to the design of narrow-band LPC speech coders with average bit rates of 2000-2400 bits/s.
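
    The first approach (transmit a frame only when its parameters have changed sufficiently since the last transmitted frame) can be sketched as follows. The distance measure and threshold are illustrative assumptions, not the rules used in the reviewed coders.

    ```python
    import numpy as np

    def vfr_select_frames(frames, threshold=0.5):
        """Return indices of frames to transmit under a simple VFR rule:
        send a frame when its parameter vector differs from the last
        transmitted frame by more than `threshold` (Euclidean distance)."""
        transmitted = [0]                               # always send first frame
        last = np.asarray(frames[0], dtype=float)
        for i, frame in enumerate(frames[1:], start=1):
            frame = np.asarray(frame, dtype=float)
            if np.linalg.norm(frame - last) > threshold:
                transmitted.append(i)
                last = frame
        return transmitted

    # Placeholder LPC parameter trajectories (e.g., reflection coefficients)
    rng = np.random.default_rng(0)
    frames = np.cumsum(rng.normal(scale=0.2, size=(50, 10)), axis=0)
    idx = vfr_select_frames(frames)
    print(f"transmitted {len(idx)} of {len(frames)} frames")
    ```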

  11. Sound and speech detection and classification in a Health Smart Home.

    PubMed

    Fleury, A; Noury, N; Vacher, M; Glasson, H; Seri, J F

    2008-01-01

    Improvements in medicine increase life expectancy worldwide and create a new bottleneck at the entrance of specialized and equipped institutions. To allow elderly people to stay at home, researchers are working on ways to monitor them in their own environment with non-invasive sensors. To meet this goal, smart homes equipped with many sensors deliver information on the activities of the person and can help detect distress situations. In this paper, we present a global speech and sound recognition system that can be set up in a flat. We placed eight microphones in the Health Smart Home of Grenoble (a real living flat of 47 m²) and we automatically analyze and sort the different sounds recorded in the flat and the speech uttered (to detect normal or distress French sentences). We introduce the methods for sound and speech recognition, the post-processing of the data, and finally the experimental results obtained in real conditions in the flat.

  12. Study of wavelet packet energy entropy for emotion classification in speech and glottal signals

    NASA Astrophysics Data System (ADS)

    He, Ling; Lech, Margaret; Zhang, Jing; Ren, Xiaomei; Deng, Lihua

    2013-07-01

    Automatic speech emotion recognition has important applications in human-machine communication. The majority of current research in this area is focused on finding optimal feature parameters. In recent studies, several glottal features were examined as potential cues for emotion differentiation. In this study, a new type of feature parameter is proposed, which calculates an energy entropy over values within selected wavelet packet frequency bands. The modeling and classification tasks are conducted using the classical GMM algorithm. The experiments use two data sets: the Speech Under Simulated Emotion (SUSE) data set annotated with three different emotions (angry, neutral and soft) and the Berlin Emotional Speech (BES) database annotated with seven different emotions (angry, bored, disgust, fear, happy, sad and neutral). The average classification accuracy achieved for the SUSE data (74%-76%) is significantly higher than the accuracy achieved for the BES data (51%-54%). In both cases, the accuracy was significantly higher than the respective random guessing levels (33% for SUSE and 14.3% for BES).
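
    The proposed feature, an energy entropy over wavelet packet frequency bands, can be sketched with PyWavelets as below. The wavelet choice, decomposition depth and frame length are placeholders rather than the study's settings.

    ```python
    import numpy as np
    import pywt

    def wavelet_packet_energy_entropy(signal, wavelet='db4', level=4):
        """Decompose a frame into wavelet packet bands and return the Shannon
        entropy of the normalized band energies (one scalar feature)."""
        wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
        bands = wp.get_level(level, order='natural')
        energies = np.array([np.sum(np.square(node.data)) for node in bands])
        p = energies / np.sum(energies)
        p = p[p > 0]                                   # avoid log(0)
        return -np.sum(p * np.log2(p))

    frame = np.random.default_rng(0).normal(size=1024)  # placeholder speech frame
    print(wavelet_packet_energy_entropy(frame))
    ```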

  13. Velopharyngeal Dysfunction Evaluation and Treatment.

    PubMed

    Meier, Jeremy D; Muntz, Harlan R

    2016-11-01

    Velopharyngeal dysfunction (VPD) can significantly impair a child's quality of life and may have lasting consequences if inadequately treated. This article reviews the work-up and management options for patients with VPD. An accurate perceptual speech analysis, nasometry, and nasal endoscopy are helpful to appropriately evaluate patients with VPD. Treatment options include nonsurgical management with speech therapy or a speech bulb and surgical approaches including double-opposing Z-plasty, sphincter pharyngoplasty, pharyngeal flap, or posterior wall augmentation. Copyright © 2016 Elsevier Inc. All rights reserved.

  14. Modelling Relations between Sensory Processing, Speech Perception, Orthographic and Phonological Ability, and Literacy Achievement

    ERIC Educational Resources Information Center

    Boets, Bart; Wouters, Jan; van Wieringen, Astrid; De Smedt, Bert; Ghesquiere, Pol

    2008-01-01

    The general magnocellular theory postulates that dyslexia is the consequence of a multimodal deficit in the processing of transient and dynamic stimuli. In the auditory modality, this deficit has been hypothesized to interfere with accurate speech perception, and subsequently disrupt the development of phonological and later reading and spelling…

  15. Helium Speech: An Application of Standing Waves

    ERIC Educational Resources Information Center

    Wentworth, Christopher D.

    2011-01-01

    Taking a breath of helium gas and then speaking or singing to the class is a favorite demonstration for an introductory physics course, as it usually elicits appreciative laughter, which serves to energize the class session. Students will usually report that the helium speech "raises the frequency" of the voice. A more accurate description of the…

  16. Perception of Spectral Contrast by Hearing-Impaired Listeners

    ERIC Educational Resources Information Center

    Dreisbach, Laura E.; Leek, Marjorie R.; Lentz, Jennifer J.

    2005-01-01

    The ability to discriminate the spectral shapes of complex sounds is critical to accurate speech perception. Part of the difficulty experienced by listeners with hearing loss in understanding speech sounds in noise may be related to a smearing of the internal representation of the spectral peaks and valleys because of the loss of sensitivity and…

  17. Speech Recognition in Fluctuating and Continuous Maskers: Effects of Hearing Loss and Presentation Level.

    ERIC Educational Resources Information Center

    Summers, Van; Molis, Michelle R.

    2004-01-01

    Listeners with normal-hearing sensitivity recognize speech more accurately in the presence of fluctuating background sounds, such as a single competing voice, than in unmodulated noise at the same overall level. These performance differences are greatly reduced in listeners with hearing impairment, who generally receive little benefit from…

  18. Speech production in children with Down's syndrome: The effects of reading, naming and imitation.

    PubMed

    Knight, Rachael-Anne; Kurtz, Scilla; Georgiadou, Ioanna

    2015-01-01

    People with DS are known to have difficulties with expressive language, and often have difficulties with intelligibility. They often have stronger visual than verbal short-term memory skills and, therefore, reading has often been suggested as an intervention for speech and language in this population. However, there is as yet no firm evidence that reading can improve speech outcomes. This study aimed to compare reading, picture naming and repetition for the same 10 words, to identify if the speech of eight children with DS (aged 11-14 years) was more accurate, consistent and intelligible when reading. Results show that children were slightly, yet significantly, more accurate and intelligible when they read words compared with when they produced those words in naming or imitation conditions although the reduction in inconsistency was non-significant. The results of this small-scale study provide tentative support for previous claims about the benefits of reading for children with DS. The mechanisms behind a facilitatory effect of reading are considered, and directions are identified for future research.

  19. Digitised evaluation of speech intelligibility using vowels in maxillectomy patients.

    PubMed

    Sumita, Y I; Hattori, M; Murase, M; Elbashti, M E; Taniguchi, H

    2018-03-01

    Among the functional disabilities that patients face following maxillectomy, speech impairment is a major factor influencing quality of life. Proper rehabilitation of speech, which may include prosthodontic and surgical treatments and speech therapy, requires accurate evaluation of speech intelligibility (SI). A simple, less time-consuming yet accurate evaluation is desirable both for maxillectomy patients and the various clinicians providing maxillofacial treatment. This study sought to determine the utility of digital acoustic analysis of vowels for the prediction of SI in maxillectomy patients, based on a comprehensive understanding of speech production in the vocal tract of maxillectomy patients and its perception. Speech samples were collected from 33 male maxillectomy patients (mean age 57.4 years) in two conditions, without and with a maxillofacial prosthesis, and formant data for the vowels /a/, /e/, /i/, /o/, and /u/ were calculated based on linear predictive coding. The range of formant 2 (F2) was determined as the difference between the minimum and maximum F2 frequencies. An SI test was also conducted to reveal the relationship between SI score and F2 range. Statistical analyses were applied. F2 range and SI score were significantly different between the conditions without and with a prosthesis (both P < .0001). F2 range was significantly correlated with SI score in both conditions (Spearman's r = .843, P < .0001; r = .832, P < .0001, respectively). These findings indicate that calculating the F2 range from 5 vowels has clinical utility for the prediction of SI after maxillectomy. © 2017 John Wiley & Sons Ltd.
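
    LPC-based formant estimation and the F2-range computation can be sketched roughly as below (using librosa for the LPC fit). The LPC order and the simple formant-picking heuristic are assumptions for illustration; clinical use would require more careful formant tracking.

    ```python
    import numpy as np
    import librosa

    def estimate_formants(segment, sr, order=12):
        """Estimate formant frequencies (Hz) from an LPC fit to a vowel segment."""
        y = np.asarray(segment, dtype=float)
        a = librosa.lpc(y, order=order)
        roots = np.roots(a)
        roots = roots[np.imag(roots) > 0]            # keep upper half-plane
        freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
        return freqs[freqs > 90]                     # drop near-DC roots; F1 ~ [0], F2 ~ [1]

    def f2_range(vowel_segments, sr):
        """F2 range = max(F2) - min(F2) across the five vowel segments."""
        f2_values = [estimate_formants(seg, sr)[1] for seg in vowel_segments]
        return max(f2_values) - min(f2_values)

    # Usage sketch (file names are placeholders):
    # segments = [librosa.load(f"vowel_{v}.wav", sr=16000)[0] for v in "aeiou"]
    # print(f2_range(segments, sr=16000))
    ```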

  20. The interlanguage speech intelligibility benefit for native speakers of Mandarin: Production and perception of English word-final voicing contrasts

    PubMed Central

    Hayes-Harb, Rachel; Smith, Bruce L.; Bent, Tessa; Bradlow, Ann R.

    2009-01-01

    This study investigated the intelligibility of native and Mandarin-accented English speech for native English and native Mandarin listeners. The word-final voicing contrast was considered (as in minimal pairs such as `cub' and `cup') in a forced-choice word identification task. For these particular talkers and listeners, there was evidence of an interlanguage speech intelligibility benefit for listeners (i.e., native Mandarin listeners were more accurate than native English listeners at identifying Mandarin-accented English words). However, there was no evidence of an interlanguage speech intelligibility benefit for talkers (i.e., native Mandarin listeners did not find Mandarin-accented English speech more intelligible than native English speech). When listener and talker phonological proficiency (operationalized as accentedness) was taken into account, it was found that the interlanguage speech intelligibility benefit for listeners held only for the low phonological proficiency listeners and low phonological proficiency speech. The intelligibility data were also considered in relation to various temporal-acoustic properties of native English and Mandarin-accented English speech in effort to better understand the properties of speech that may contribute to the interlanguage speech intelligibility benefit. PMID:19606271

  1. Language-independent talker-specificity in first-language and second-language speech production by bilingual talkers: L1 speaking rate predicts L2 speaking rate

    PubMed Central

    Bradlow, Ann R.; Kim, Midam; Blasingame, Michael

    2017-01-01

    Second-language (L2) speech is consistently slower than first-language (L1) speech, and L1 speaking rate varies within- and across-talkers depending on many individual, situational, linguistic, and sociolinguistic factors. It is asked whether speaking rate is also determined by a language-independent talker-specific trait such that, across a group of bilinguals, L1 speaking rate significantly predicts L2 speaking rate. Two measurements of speaking rate were automatically extracted from recordings of read and spontaneous speech by English monolinguals (n = 27) and bilinguals from ten L1 backgrounds (n = 86): speech rate (syllables/second), and articulation rate (syllables/second excluding silent pauses). Replicating prior work, L2 speaking rates were significantly slower than L1 speaking rates both across-groups (monolinguals' L1 English vs bilinguals' L2 English), and across L1 and L2 within bilinguals. Critically, within the bilingual group, L1 speaking rate significantly predicted L2 speaking rate, suggesting that a significant portion of inter-talker variation in L2 speech is derived from inter-talker variation in L1 speech, and that individual variability in L2 spoken language production may be best understood within the context of individual variability in L1 spoken language production. PMID:28253679
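
    The two rate measures differ only in whether silent pauses are included in the denominator, as the small sketch below illustrates; the numbers are placeholder values.

    ```python
    def speaking_rates(n_syllables, total_dur_s, pause_durs_s):
        """Return (speech rate, articulation rate) in syllables per second.

        Speech rate divides by the full duration; articulation rate excludes
        silent pauses from the denominator."""
        speech_rate = n_syllables / total_dur_s
        articulation_rate = n_syllables / (total_dur_s - sum(pause_durs_s))
        return speech_rate, articulation_rate

    # Example: 120 syllables in 40 s of read speech containing 6 s of pauses
    print(speaking_rates(120, 40.0, [1.5, 2.0, 2.5]))   # -> (3.0, ~3.53)
    ```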

  2. Restoring the missing features of the corrupted speech using linear interpolation methods

    NASA Astrophysics Data System (ADS)

    Rassem, Taha H.; Makbol, Nasrin M.; Hasan, Ali Muttaleb; Zaki, Siti Syazni Mohd; Girija, P. N.

    2017-10-01

    One of the main challenges in automatic speech recognition (ASR) is noise. The performance of an ASR system degrades significantly if the speech is corrupted by noise. In the spectrogram representation of a speech signal, deleting low signal-to-noise ratio (SNR) elements leaves an incomplete spectrogram. The speech recognizer must therefore either modify the spectrogram to restore the missing elements, or restore the missing elements caused by deleting the low-SNR elements before performing recognition. This can be done using different spectrogram reconstruction methods. In this paper, the geometrical spectrogram reconstruction methods suggested by previous researchers are implemented as a toolbox. In these geometrical reconstruction methods, linear interpolation along time or along frequency is used to predict the missing elements between adjacent observed elements in the spectrogram. Moreover, a new linear interpolation method using time and frequency together is presented. The CMU Sphinx III software is used in the experiments to test the performance of the linear interpolation reconstruction methods. The experiments are conducted under different conditions, such as different window lengths and different utterance lengths. The speech corpus consists of 20 male and 20 female speakers, each contributing two different utterances. As a result, 80% recognition accuracy is achieved at an SNR of 25%.
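
    The interpolation-along-time variant can be sketched with NumPy as below. The mask convention (True for reliable elements) and the data are placeholders, not the toolbox's interface.

    ```python
    import numpy as np

    def interpolate_along_time(spectrogram, reliable_mask):
        """Fill unreliable spectrogram elements by linear interpolation in time.

        `spectrogram` is (n_freq, n_frames); `reliable_mask` is True where the
        element survived the low-SNR deletion and False where it is missing."""
        restored = spectrogram.copy()
        frames = np.arange(spectrogram.shape[1])
        for f in range(spectrogram.shape[0]):
            good = reliable_mask[f]
            if good.any() and not good.all():
                restored[f, ~good] = np.interp(frames[~good], frames[good],
                                               spectrogram[f, good])
        return restored

    rng = np.random.default_rng(0)
    spec = rng.random((4, 10))
    mask = rng.random((4, 10)) > 0.3                 # ~30% elements missing
    print(interpolate_along_time(spec, mask))
    ```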

  3. Relating dynamic brain states to dynamic machine states: Human and machine solutions to the speech recognition problem

    PubMed Central

    Liu, Xunying; Zhang, Chao; Woodland, Phil; Fonteneau, Elisabeth

    2017-01-01

    There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental ‘machine states’, generated as the ASR analysis progresses over time, to the incremental ‘brain states’, measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain. PMID:28945744

  4. Technological evaluation of gesture and speech interfaces for enabling dismounted soldier-robot dialogue

    NASA Astrophysics Data System (ADS)

    Kattoju, Ravi Kiran; Barber, Daniel J.; Abich, Julian; Harris, Jonathan

    2016-05-01

    With increasing necessity for intuitive Soldier-robot communication in military operations and advancements in interactive technologies, autonomous robots have transitioned from assistance tools to functional and operational teammates able to service an array of military operations. Despite improvements in gesture and speech recognition technologies, their effectiveness in supporting Soldier-robot communication is still uncertain. The purpose of the present study was to evaluate the performance of gesture and speech interface technologies to facilitate Soldier-robot communication during a spatial-navigation task with an autonomous robot. Gesture and speech semantically based spatial-navigation commands leveraged existing lexicons for visual and verbal communication from the U.S Army field manual for visual signaling and a previously established Squad Level Vocabulary (SLV). Speech commands were recorded by a Lapel microphone and Microsoft Kinect, and classified by commercial off-the-shelf automatic speech recognition (ASR) software. Visual signals were captured and classified using a custom wireless gesture glove and software. Participants in the experiment commanded a robot to complete a simulated ISR mission in a scaled down urban scenario by delivering a sequence of gesture and speech commands, both individually and simultaneously, to the robot. Performance and reliability of gesture and speech hardware interfaces and recognition tools were analyzed and reported. Analysis of experimental results demonstrated the employed gesture technology has significant potential for enabling bidirectional Soldier-robot team dialogue based on the high classification accuracy and minimal training required to perform gesture commands.

  5. Intensive Treatment with Ultrasound Visual Feedback for Speech Sound Errors in Childhood Apraxia

    PubMed Central

    Preston, Jonathan L.; Leece, Megan C.; Maas, Edwin

    2016-01-01

    Ultrasound imaging is an adjunct to traditional speech therapy that has shown to be beneficial in the remediation of speech sound errors. Ultrasound biofeedback can be utilized during therapy to provide clients with additional knowledge about their tongue shapes when attempting to produce sounds that are erroneous. The additional feedback may assist children with childhood apraxia of speech (CAS) in stabilizing motor patterns, thereby facilitating more consistent and accurate productions of sounds and syllables. However, due to its specialized nature, ultrasound visual feedback is a technology that is not widely available to clients. Short-term intensive treatment programs are one option that can be utilized to expand access to ultrasound biofeedback. Schema-based motor learning theory suggests that short-term intensive treatment programs (massed practice) may assist children in acquiring more accurate motor patterns. In this case series, three participants ages 10–14 years diagnosed with CAS attended 16 h of speech therapy over a 2-week period to address residual speech sound errors. Two participants had distortions on rhotic sounds, while the third participant demonstrated lateralization of sibilant sounds. During therapy, cues were provided to assist participants in obtaining a tongue shape that facilitated a correct production of the erred sound. Additional practice without ultrasound was also included. Results suggested that all participants showed signs of acquisition of sounds in error. Generalization and retention results were mixed. One participant showed generalization and retention of sounds that were treated; one showed generalization but limited retention; and the third showed no evidence of generalization or retention. Individual characteristics that may facilitate generalization are discussed. Short-term intensive treatment programs using ultrasound biofeedback may result in the acquisition of more accurate motor patterns and improved articulation of sounds previously in error, with varying levels of generalization and retention. PMID:27625603

  6. Drinking behavior in nursery pigs: Determining the accuracy between an automatic water meter versus human observers

    USDA-ARS?s Scientific Manuscript database

    Assimilating accurate behavioral events over a long period can be labor intensive and relatively expensive. If an automatic device could accurately record the duration and frequency for a given behavioral event, it would be a valuable alternative to the traditional use of human observers for behavio...

  7. Interactive segmentation of tongue contours in ultrasound video sequences using quality maps

    NASA Astrophysics Data System (ADS)

    Ghrenassia, Sarah; Ménard, Lucie; Laporte, Catherine

    2014-03-01

    Ultrasound (US) imaging is an effective and non-invasive way of studying the tongue motions involved in normal and pathological speech, and the results of US studies are of interest for the development of new strategies in speech therapy. State-of-the-art tongue shape analysis techniques based on US images depend on semi-automated tongue segmentation and tracking techniques. Recent work has mostly focused on improving the accuracy of the tracking techniques themselves. However, occasional errors remain inevitable, regardless of the technique used, and the tongue tracking process must thus be supervised by a speech scientist who will correct these errors manually or semi-automatically. This paper proposes an interactive framework to facilitate this process. In this framework, the user is guided towards potentially problematic portions of the US image sequence by a segmentation quality map that is based on the normalized energy of an active contour model and automatically produced during tracking. When a problematic segmentation is identified, corrections to the segmented contour can be made on one image and propagated both forward and backward in the problematic subsequence, thereby improving the user experience. The interactive tools were tested in combination with two different tracking algorithms. Preliminary results illustrate the potential of the proposed framework, suggesting that the proposed framework generally improves user interaction time, with little change in segmentation repeatability.

  8. Modeling the Perceptual Learning of Novel Dialect Features

    ERIC Educational Resources Information Center

    Tatman, Rachael

    2017-01-01

    All language use reflects the user's social identity in systematic ways. While humans can easily adapt to this sociolinguistic variation, automatic speech recognition (ASR) systems continue to struggle with it. This dissertation makes three main contributions. The first is to provide evidence that modern state-of-the-art commercial ASR systems…

  9. Morphosyntactic Neural Analysis for Generalized Lexical Normalization

    ERIC Educational Resources Information Center

    Leeman-Munk, Samuel Paul

    2016-01-01

    The phenomenal growth of social media, web forums, and online reviews has spurred a growing interest in automated analysis of user-generated text. At the same time, a proliferation of voice recordings and efforts to archive cultural heritage documents are fueling demand for effective automatic speech recognition (ASR) and optical character…

  10. Automatic Intention Recognition in Conversation Processing

    ERIC Educational Resources Information Center

    Holtgraves, Thomas

    2008-01-01

    A fundamental assumption of many theories of conversation is that comprehension of a speaker's utterance involves recognition of the speaker's intention in producing that remark. However, the nature of intention recognition is not clear. One approach is to conceptualize a speaker's intention in terms of speech acts [Searle, J. (1969). "Speech…

  11. Multilingual Videos for MOOCs and OER

    ERIC Educational Resources Information Center

    Valor Miró, Juan Daniel; Baquero-Arnal, Pau; Civera, Jorge; Turró, Carlos; Juan, Alfons

    2018-01-01

    Massive Open Online Courses (MOOCs) and Open Educational Resources (OER) are rapidly growing, but are not usually offered in multiple languages due to the lack of cost-effective solutions to translate the different objects comprising them and particularly videos. However, current state-of-the-art automatic speech recognition (ASR) and machine…

  12. A multi-views multi-learners approach towards dysarthric speech recognition using multi-nets artificial neural networks.

    PubMed

    Shahamiri, Seyed Reza; Salim, Siti Salwah Binti

    2014-09-01

    Automatic speech recognition (ASR) can be very helpful for speakers who suffer from dysarthria, a neurological disability that damages the control of motor speech articulators. Although a few attempts have been made to apply ASR technologies to sufferers of dysarthria, previous studies show that such ASR systems have not attained an adequate level of performance. In this study, a dysarthric multi-networks speech recognizer (DM-NSR) model is provided using a realization of the multi-views multi-learners approach called multi-nets artificial neural networks, which tolerates the variability of dysarthric speech. In particular, the DM-NSR model employs several ANNs (as learners) to approximate the likelihood of ASR vocabulary words and to deal with the complexity of dysarthric speech. The proposed DM-NSR approach was presented in both speaker-dependent and speaker-independent paradigms. In order to highlight the performance of the proposed model over legacy models, multi-views single-learner versions of the DM-NSR were also provided and their efficiencies were compared in detail. Moreover, a comparison between the prominent dysarthric ASR methods and the proposed one is provided. The results show that the DM-NSR improved the recognition rate by up to 24.67% and reduced the error rate by up to 8.63% relative to the reference model.
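
    The one-network-per-word arrangement described can be sketched as below with scikit-learn MLPs standing in for the paper's ANNs; the feature extraction, network sizes and data are placeholders, not the DM-NSR configuration.

    ```python
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    class MultiNetRecognizer:
        """One binary MLP per vocabulary word; the word whose network outputs
        the highest positive-class probability wins."""

        def __init__(self, vocabulary):
            self.vocabulary = vocabulary
            self.nets = {w: MLPClassifier(hidden_layer_sizes=(32,),
                                          max_iter=500, random_state=0)
                         for w in vocabulary}

        def fit(self, X, labels):
            for word, net in self.nets.items():
                net.fit(X, (np.asarray(labels) == word).astype(int))
            return self

        def predict(self, x):
            scores = {w: net.predict_proba(x.reshape(1, -1))[0, 1]
                      for w, net in self.nets.items()}
            return max(scores, key=scores.get)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 20))                   # placeholder acoustic features
    labels = rng.choice(['yes', 'no', 'stop'], size=200)
    model = MultiNetRecognizer(['yes', 'no', 'stop']).fit(X, labels)
    print(model.predict(X[0]))
    ```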

  13. Studies in automatic speech recognition and its application in aerospace

    NASA Astrophysics Data System (ADS)

    Taylor, Michael Robinson

    Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.

  14. Immediate integration of prosodic information from speech and visual information from pictures in the absence of focused attention: a mismatch negativity study.

    PubMed

    Li, X; Yang, Y; Ren, G

    2009-06-16

    Language is often perceived together with visual information. Recent experimental evidence indicates that, during spoken language comprehension, the brain can immediately integrate visual information with semantic or syntactic information from speech. Here we used the mismatch negativity to further investigate whether prosodic information from speech can be immediately integrated into a visual scene context, and especially the time course and automaticity of this integration process. Sixteen native Chinese speakers participated in the study. The materials consisted of Chinese spoken sentences paired with pictures. In the audiovisual situation, relative to the concomitant pictures, the spoken sentence was appropriately accented in the standard stimuli but inappropriately accented in the two kinds of deviant stimuli. In the purely auditory situation, the spoken sentences were presented without pictures. The deviants evoked mismatch responses in both the audiovisual and the purely auditory situations; the mismatch negativity in the purely auditory situation peaked at the same time as, but was weaker than, that evoked by the same deviant speech sounds in the audiovisual situation. This pattern of results suggests immediate integration of prosodic information from speech and visual information from pictures in the absence of focused attention.

  15. Auditory Perceptual Learning for Speech Perception Can be Enhanced by Audiovisual Training.

    PubMed

    Bernstein, Lynne E; Auer, Edward T; Eberhardt, Silvio P; Jiang, Jintao

    2013-01-01

    Speech perception under audiovisual (AV) conditions is well known to confer benefits to perception such as increased speed and accuracy. Here, we investigated how AV training might benefit or impede auditory perceptual learning of speech degraded by vocoding. In Experiments 1 and 3, participants learned paired associations between vocoded spoken nonsense words and nonsense pictures. In Experiment 1, paired-associates (PA) AV training of one group of participants was compared with audio-only (AO) training of another group. When tested under AO conditions, the AV-trained group was significantly more accurate than the AO-trained group. In addition, pre- and post-training AO forced-choice consonant identification with untrained nonsense words showed that AV-trained participants had learned significantly more than AO participants. The pattern of results pointed to their having learned at the level of the auditory phonetic features of the vocoded stimuli. Experiment 2, a no-training control with testing and re-testing on the AO consonant identification, showed that the controls were as accurate as the AO-trained participants in Experiment 1 but less accurate than the AV-trained participants. In Experiment 3, PA training alternated AV and AO conditions on a list-by-list basis within participants, and training was to criterion (92% correct). PA training with AO stimuli was reliably more effective than training with AV stimuli. We explain these discrepant results in terms of the so-called "reverse hierarchy theory" of perceptual learning and in terms of the diverse multisensory and unisensory processing resources available to speech perception. We propose that early AV speech integration can potentially impede auditory perceptual learning; but visual top-down access to relevant auditory features can promote auditory perceptual learning.

  16. Auditory Perceptual Learning for Speech Perception Can be Enhanced by Audiovisual Training

    PubMed Central

    Bernstein, Lynne E.; Auer, Edward T.; Eberhardt, Silvio P.; Jiang, Jintao

    2013-01-01

    Speech perception under audiovisual (AV) conditions is well known to confer benefits to perception such as increased speed and accuracy. Here, we investigated how AV training might benefit or impede auditory perceptual learning of speech degraded by vocoding. In Experiments 1 and 3, participants learned paired associations between vocoded spoken nonsense words and nonsense pictures. In Experiment 1, paired-associates (PA) AV training of one group of participants was compared with audio-only (AO) training of another group. When tested under AO conditions, the AV-trained group was significantly more accurate than the AO-trained group. In addition, pre- and post-training AO forced-choice consonant identification with untrained nonsense words showed that AV-trained participants had learned significantly more than AO participants. The pattern of results pointed to their having learned at the level of the auditory phonetic features of the vocoded stimuli. Experiment 2, a no-training control with testing and re-testing on the AO consonant identification, showed that the controls were as accurate as the AO-trained participants in Experiment 1 but less accurate than the AV-trained participants. In Experiment 3, PA training alternated AV and AO conditions on a list-by-list basis within participants, and training was to criterion (92% correct). PA training with AO stimuli was reliably more effective than training with AV stimuli. We explain these discrepant results in terms of the so-called “reverse hierarchy theory” of perceptual learning and in terms of the diverse multisensory and unisensory processing resources available to speech perception. We propose that early AV speech integration can potentially impede auditory perceptual learning; but visual top-down access to relevant auditory features can promote auditory perceptual learning. PMID:23515520

  17. Factors that influence the performance of experienced speech recognition users.

    PubMed

    Koester, Heidi Horstmann

    2006-01-01

    Performance on automatic speech recognition (ASR) systems for users with physical disabilities varies widely between individuals. The goal of this study was to discover some key factors that account for that variation. Using data from 23 experienced ASR users with physical disabilities, the effect of 20 different independent variables on recognition accuracy and text entry rate with ASR was measured using bivariate and multivariate analyses. The results show that use of appropriate correction strategies had the strongest influence on user performance with ASR. The amount of time the user spent on his or her computer, the user's manual typing speed, and the speed with which the ASR system recognized speech were all positively associated with better performance. The amount or perceived adequacy of ASR training did not have a significant impact on performance for this user group.

  18. Effects of a metronome on the filled pauses of fluent speakers.

    PubMed

    Christenfeld, N

    1996-12-01

    Filled pauses (the "ums" and "uhs" that litter spontaneous speech) seem to be a product of the speaker paying deliberate attention to the normally automatic act of talking. This is the same sort of explanation that has been offered for stuttering. In this paper we explore whether a manipulation that has long been known to decrease stuttering, synchronizing speech to the beats of a metronome, will then also decrease filled pauses. Two experiments indicate that a metronome has a dramatic effect on the production of filled pauses. This effect is not due to any simplification or slowing of the speech and supports the view that a metronome causes speakers to attend more to how they are talking and less to what they are saying. It also lends support to the connection between stutters and filled pauses.

  19. Neural Recruitment for the Production of Native and Novel Speech Sounds

    PubMed Central

    Moser, Dana; Fridriksson, Julius; Bonilha, Leonardo; Healy, Eric W.; Baylis, Gordon; Baker, Julie; Rorden, Chris

    2010-01-01

    Two primary areas of damage have been implicated in apraxia of speech (AOS) based on the time post-stroke: (1) the left inferior frontal gyrus (IFG) in acute patients, and (2) the left anterior insula (aIns) in chronic patients. While AOS is widely characterized as a disorder in motor speech planning, little is known about the specific contributions of each of these regions in speech. The purpose of this study was to investigate cortical activation during speech production with a specific focus on the aIns and the IFG in normal adults. While undergoing sparse fMRI, 30 normal adults completed a 30-minute speech-repetition task consisting of three-syllable nonwords that contained either (a) English (native) syllables or (b) Non-English (novel) syllables. When the novel syllable productions were compared to the native syllable productions, greater neural activation was observed in the aIns and IFG, particularly during the first 10 minutes of the task when novelty was the greatest. Although activation in the aIns remained high throughout the task for novel productions, greater activation was clearly demonstrated when the initial 10 minutes were compared to the final 10 minutes of the task. These results suggest increased activity within an extensive neural network, including the aIns and IFG, when the motor speech system is taxed, such as during the production of novel speech. We speculate that the amount of left aIns recruitment during speech production may be related to the internal construction of the motor speech unit such that the degree of novelty/automaticity would result in more or less demands respectively. The role of the IFG as a storehouse and integrative processor for previously acquired routines is also discussed. PMID:19385020

  20. Multi-sensory learning and learning to read.

    PubMed

    Blomert, Leo; Froyen, Dries

    2010-09-01

    The basis of literacy acquisition in alphabetic orthographies is the learning of the associations between the letters and the corresponding speech sounds. In spite of this primacy in learning to read, there is only scarce knowledge on how this audiovisual integration process works and which mechanisms are involved. Recent electrophysiological studies of letter-speech sound processing have revealed that normally developing readers take years to automate these associations and dyslexic readers hardly exhibit automation of these associations. It is argued that the reason for this effortful learning may reside in the nature of the audiovisual process that is recruited for the integration of in principle arbitrarily linked elements. It is shown that letter-speech sound integration does not resemble the processes involved in the integration of natural audiovisual objects such as audiovisual speech. The automatic symmetrical recruitment of the assumedly uni-sensory visual and auditory cortices in audiovisual speech integration does not occur for letter and speech sound integration. It is also argued that letter-speech sound integration only partly resembles the integration of arbitrarily linked unfamiliar audiovisual objects. Letter-sound integration and artificial audiovisual objects share the necessity of a narrow time window for integration to occur. However, they differ from these artificial objects, because they constitute an integration of partly familiar elements which acquire meaning through the learning of an orthography. Although letter-speech sound pairs share similarities with audiovisual speech processing as well as with unfamiliar, arbitrary objects, it seems that letter-speech sound pairs develop into unique audiovisual objects that furthermore have to be processed in a unique way in order to enable fluent reading and thus very likely recruit other neurobiological learning mechanisms than the ones involved in learning natural or arbitrary unfamiliar audiovisual associations. Copyright 2010 Elsevier B.V. All rights reserved.

  1. The shadow of a doubt? Evidence for perceptuo-motor linkage during auditory and audiovisual close-shadowing

    PubMed Central

    Scarbel, Lucie; Beautemps, Denis; Schwartz, Jean-Luc; Sato, Marc

    2014-01-01

    One classical argument in favor of a functional role of the motor system in speech perception comes from the close-shadowing task, in which a subject has to identify and repeat as quickly as possible an auditory speech stimulus. The fact that close-shadowing can occur very rapidly and much faster than manual identification of the speech target is taken to suggest that perceptually induced speech representations are already shaped in a motor-compatible format. Another argument is provided by audiovisual interactions often interpreted as referring to a multisensory-motor framework. In this study, we attempted to combine these two paradigms by testing whether the visual modality could speed motor responses in a close-shadowing task. To this aim, both oral and manual responses were evaluated during the perception of auditory and audiovisual speech stimuli, clear or embedded in white noise. Overall, oral responses were faster than manual ones, but it also appeared that they were less accurate in noise, which suggests that motor representations evoked by the speech input could be rough at a first processing stage. In the presence of acoustic noise, the audiovisual modality led to both faster and more accurate responses than the auditory modality. No interaction was, however, observed between modality and response. Altogether, these results are interpreted within a two-stage sensory-motor framework, in which the auditory and visual streams are integrated together and with internally generated motor representations before a final decision becomes available. PMID:25009512

  2. Automated Electroglottographic Inflection Events Detection. A Pilot Study.

    PubMed

    Codino, Juliana; Torres, María Eugenia; Rubin, Adam; Jackson-Menaldi, Cristina

    2016-11-01

    Vocal-fold vibration can be analyzed in a noninvasive way by registering impedance changes within the glottis, through electroglottography. The morphology of the electroglottographic (EGG) signal is related to different vibratory patterns. In the literature, a characteristic knee in the descending portion of the signal has been reported. Some EGG signals do not exhibit this particular knee and have other types of events (inflection events) throughout the ascending and/or descending portion of the vibratory cycle. The goal of this work is to propose an automatic method to identify and classify these events. A computational algorithm was developed based on the mathematical properties of the EGG signal, which detects and reports events throughout the contact phase. Retrospective analysis of EGG signals obtained during routine voice evaluation of adult individuals with a variety of voice disorders was performed using the algorithm as well as human raters. Two judges, both experts in clinical voice analysis, and three general speech pathologists performed manual and visual evaluation of the sample set. The results obtained by the automatic method were compared with those of the human raters. Statistical analysis revealed a significant level of agreement. This automatic tool could allow professionals in the clinical setting to obtain an automatic quantitative and qualitative report of such events present in a voice sample, without having to manually analyze the whole EGG signal. In addition, it might provide the speech pathologist with more information that would complement the standard voice evaluation. It could also be a valuable tool in voice research. Copyright © 2016 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
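
    The paper's detection criteria are not reproduced here, but one generic way to flag such inflection events is to look for sign changes in the curvature of a smoothed EGG cycle. A minimal sketch on a synthetic waveform; the smoothing length and the synthetic "knee" are assumptions:

      import numpy as np

      def inflection_events(egg_cycle, smooth=5):
          """Return sample indices where the curvature of the EGG cycle changes sign."""
          kernel = np.ones(smooth) / smooth
          x = np.convolve(egg_cycle, kernel, mode="same")     # light smoothing
          d2 = np.gradient(np.gradient(x))                    # second derivative
          sign = np.sign(d2)
          # indices where curvature flips from + to - or vice versa
          return np.where(np.diff(sign) != 0)[0]

      # Synthetic EGG-like cycle with an extra bump ("knee") on the descending portion.
      t = np.linspace(0, 1, 400)
      cycle = np.sin(np.pi * t) + 0.15 * np.exp(-((t - 0.7) / 0.05) ** 2)
      events = inflection_events(cycle)
      print(f"{len(events)} inflection events, e.g. at samples {events[:5]}")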

  3. Automatic intraaortic balloon pump timing using an intrabeat dicrotic notch prediction algorithm.

    PubMed

    Schreuder, Jan J; Castiglioni, Alessandro; Donelli, Andrea; Maisano, Francesco; Jansen, Jos R C; Hanania, Ramzi; Hanlon, Pat; Bovelander, Jan; Alfieri, Ottavio

    2005-03-01

    The efficacy of intraaortic balloon counterpulsation (IABP) during arrhythmic episodes is questionable. A novel algorithm for intrabeat prediction of the dicrotic notch was used for real-time IABP inflation timing control. A windkessel model algorithm was used to calculate real-time aortic flow from aortic pressure. The dicrotic notch was predicted using a percentage of the calculated peak flow. Automatic inflation timing was set at the intrabeat-predicted dicrotic notch and was combined with automatic IAB deflation. Prophylactic IABP was applied in 27 patients with low ejection fraction (< 35%) undergoing cardiac surgery. Analysis of IABP at a 1:4 ratio revealed that IAB inflation occurred at a mean of 0.6 +/- 5 ms from the dicrotic notch. In all patients, accurate automatic timing at a 1:1 assist ratio was performed. Seventeen patients had episodes of severe arrhythmia; the novel IABP inflation algorithm accurately assisted 318 of 320 arrhythmic beats at a 1:1 ratio. The novel real-time intrabeat IABP inflation timing algorithm performed accurately in all patients during both regular rhythms and severe arrhythmia, allowing fully automatic intrabeat IABP timing.
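
    The inflation trigger can be pictured as follows: a two-element windkessel model converts the measured aortic pressure into an estimated flow, roughly Q(t) = C·dP/dt + P/R, and inflation is triggered within the beat once the estimated flow falls back to a preset fraction of its peak. The sketch below is a simplified offline illustration with arbitrary model constants, not the clinical algorithm:

      import numpy as np

      def predict_notch(pressure, fs, C=1.0, R=1.0, fraction=0.3):
          """Return the sample index where windkessel-estimated flow first falls
          below `fraction` of its peak after the peak (intrabeat notch prediction)."""
          dP = np.gradient(pressure) * fs            # dP/dt
          flow = C * dP + pressure / R               # two-element windkessel estimate
          peak = np.argmax(flow)
          below = np.where(flow[peak:] < fraction * flow[peak])[0]
          return peak + below[0] if below.size else None

      # Synthetic single-beat aortic pressure (systolic rise then diastolic decay).
      fs = 1000
      t = np.arange(0, 0.8, 1 / fs)
      pressure = 80 + 40 * np.exp(-((t - 0.15) / 0.08) ** 2) + 10 * np.exp(-t / 0.4)
      notch_idx = predict_notch(pressure, fs)
      print(f"predicted inflation trigger at t = {notch_idx / fs:.3f} s")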

  4. The Use of Artificial Neural Networks to Estimate Speech Intelligibility from Acoustic Variables: A Preliminary Analysis.

    ERIC Educational Resources Information Center

    Metz, Dale Evan; And Others

    1992-01-01

    A preliminary scheme for estimating the speech intelligibility of hearing-impaired speakers from acoustic parameters, using a computerized artificial neural network to process mathematically the acoustic input variables, is outlined. Tests with 60 hearing-impaired speakers found the scheme to be highly accurate in identifying speakers separated by…

  5. A Data-Processing System for Quantitative Analysis in Speech Production. CLCS Occasional Paper No. 17.

    ERIC Educational Resources Information Center

    Chasaide, Ailbhe Ni; Davis, Eugene

    The data processing system used at Trinity College's Centre for Language and Communication Studies (Ireland) enables computer-automated collection and analysis of phonetic data and has many advantages for research on speech production. The system allows accurate handling of large quantities of data, eliminates many of the limitations of manual…

  6. Assessing Disfluencies in School-Age Children Who Stutter: How Much Speech Is Enough?

    ERIC Educational Resources Information Center

    Gregg, Brent A.; Sawyer, Jean

    2015-01-01

    The question of what size speech sample is sufficient to accurately identify stuttering and its myriad characteristics is a valid one. Short samples have a risk of over- or underrepresenting disfluency types or characteristics. In recent years, there has been a trend toward using shorter samples because they are less time-consuming for…

  7. Dual Key Speech Encryption Algorithm Based Underdetermined BSS

    PubMed Central

    Zhao, Huan; Chen, Zuo; Zhang, Xixiang

    2014-01-01

    When the number of mixed signals is smaller than the number of source signals, underdetermined blind source separation (BSS) is a significantly difficult problem. Because speech communication involves large amounts of data and must run in real time, we exploit the intractability of the underdetermined BSS problem to present a dual-key speech encryption method. The original speech is mixed with dual key signals consisting of random key signals (a one-time pad) generated from a secret seed and chaotic signals generated by a chaotic system. In the decryption process, approximate calculation is used to recover the original speech signals. The proposed algorithm for speech signal encryption can resist traditional attacks against the encryption system, and owing to the approximate calculation, decryption becomes faster and more accurate. It is demonstrated that the proposed method provides a high level of security and can recover the original signals quickly and efficiently while maintaining excellent audio quality. PMID:24955430
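
    The encryption step can be pictured as underdetermined mixing: the speech frame is combined with a one-time-pad key and a chaotic key, and fewer mixtures than sources are transmitted, so an eavesdropper faces an underdetermined BSS problem while the legitimate receiver, knowing the keys and the mixing matrix, recovers the speech by a simple approximate calculation. A toy sketch; the mixing matrix, key generation and frame length are arbitrary assumptions:

      import numpy as np

      rng = np.random.default_rng(42)
      N = 1024
      t = np.arange(N)
      speech = np.sin(2 * np.pi * 5 * t / N)              # stand-in for a speech frame

      # Dual keys: a random one-time pad (from a secret seed) and a chaotic signal.
      pad = rng.standard_normal(N)
      chaos = np.empty(N)
      x = 0.37                                            # chaotic seed (secret)
      for i in range(N):
          x = 3.99 * x * (1 - x)                          # logistic map
          chaos[i] = x - 0.5

      sources = np.vstack([speech, pad, chaos])           # 3 sources
      A = np.array([[0.8, 0.5, 0.3],
                    [0.4, 0.9, 0.6]])                     # 2 x 3 mixing -> underdetermined
      cipher = A @ sources                                # transmitted mixtures

      # Receiver knows pad, chaos and A, so the "underdetermined" problem collapses:
      residual = cipher - A[:, 1:] @ np.vstack([pad, chaos])
      recovered = np.linalg.lstsq(A[:, :1], residual, rcond=None)[0].ravel()
      print("max reconstruction error:", np.max(np.abs(recovered - speech)))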

  8. A hardware experimental platform for neural circuits in the auditory cortex

    NASA Astrophysics Data System (ADS)

    Rodellar-Biarge, Victoria; García-Dominguez, Pablo; Ruiz-Rizaldos, Yago; Gómez-Vilda, Pedro

    2011-05-01

    Speech processing in the human brain is a very complex process that is far from fully understood, although much progress has been made recently. Neuromorphic speech processing is a new research direction within the bio-inspired systems approach that seeks solutions for the automatic treatment of specific problems (recognition, synthesis, segmentation, diarization, etc.) that cannot be adequately solved using classical algorithms. In this paper a neuromorphic speech processing architecture is presented. The systematic bottom-up synthesis of layered structures reproduces the dynamic feature detection of speech in terms of plausible neural circuits that work as interpretation centres located in the auditory cortex. The elementary model is based on Hebbian neuron-like units. For the computation of the architecture, a flexible framework is proposed in the Matlab®/Simulink®/HDL environment, which allows models to be built at different description styles, complexity and implementation levels. It provides a flexible platform for experimenting with the influence of the number of neurons and interconnections on the precision of the results and on performance. Experimentation with different architecture configurations may help both in better understanding how neural circuits may work in the brain and in understanding how speech processing can benefit from this knowledge.

  9. Computational validation of the motor contribution to speech perception.

    PubMed

    Badino, Leonardo; D'Ausilio, Alessandro; Fadiga, Luciano; Metta, Giorgio

    2014-07-01

    Action perception and recognition are core abilities fundamental for human social interaction. A parieto-frontal network (the mirror neuron system) matches visually presented biological motion information onto observers' motor representations. This process of matching the actions of others onto our own sensorimotor repertoire is thought to be important for action recognition, providing a non-mediated "motor perception" based on a bidirectional flow of information along the mirror parieto-frontal circuits. State-of-the-art machine learning strategies for hand action identification have shown better performances when sensorimotor data, as opposed to visual information only, are available during learning. As speech is a particular type of action (with acoustic targets), it is expected to activate a mirror neuron mechanism. Indeed, in speech perception, motor centers have been shown to be causally involved in the discrimination of speech sounds. In this paper, we review recent neurophysiological and machine learning-based studies showing (a) the specific contribution of the motor system to speech perception and (b) that automatic phone recognition is significantly improved when motor data are used during training of classifiers (as opposed to learning from purely auditory data). Copyright © 2014 Cognitive Science Society, Inc.

  10. Performance in noise: Impact of reduced speech intelligibility on Sailor performance in a Navy command and control environment.

    PubMed

    Keller, M David; Ziriax, John M; Barns, William; Sheffield, Benjamin; Brungart, Douglas; Thomas, Tony; Jaeger, Bobby; Yankaskas, Kurt

    2017-06-01

    Noise, hearing loss, and electronic signal distortion, which are common problems in military environments, can impair speech intelligibility and thereby jeopardize mission success. The current study investigated the impact that impaired communication has on operational performance in a command and control environment by parametrically degrading speech intelligibility in a simulated shipborne Combat Information Center. Experienced U.S. Navy personnel served as the study participants and were required to monitor information from multiple sources and respond appropriately to communications initiated by investigators playing the roles of other personnel involved in a realistic Naval scenario. In each block of the scenario, an adaptive intelligibility modification system employing automatic gain control was used to adjust the signal-to-noise ratio to achieve one of four speech intelligibility levels on a Modified Rhyme Test: No Loss, 80%, 60%, or 40%. Objective and subjective measures of operational performance suggested that performance systematically degraded with decreasing speech intelligibility, with the largest drop occurring between 80% and 60%. These results confirm the importance of noise reduction, good communication design, and effective hearing conservation programs to maximize the operational effectiveness of military personnel. Published by Elsevier B.V.

  11. Development of coffee maker service robot using speech and face recognition systems using POMDP

    NASA Astrophysics Data System (ADS)

    Budiharto, Widodo; Meiliana; Santoso Gunawan, Alexander Agung

    2016-07-01

    There have been many developments in intelligent service robots designed to interact with users naturally. This can be achieved by embedding speech and face recognition abilities for specific tasks into the robot. In this research, we propose an intelligent coffee-maker robot whose speech recognition is based on the Indonesian language and powered by a statistical dialogue system. This kind of robot can be used in an office, supermarket or restaurant. In our scenario, the robot recognizes the user's face and then accepts commands from the user to perform an action, specifically making a coffee. Based on our previous work, the accuracy of speech recognition is about 86% and of face recognition about 93% in laboratory experiments. The main problem here is to infer the user's intention about how sweet the coffee should be. The robot must conclude the user's intention through conversation given unreliable automatic speech recognition in a noisy environment. In this paper, this spoken dialogue problem is treated as a partially observable Markov decision process (POMDP). We describe how this formulation establishes a promising framework, supported by empirical results. Dialogue simulations are presented that demonstrate significant quantitative outcomes.
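
    The core of such a POMDP treatment is a belief state over the hidden user intention (here, the desired sweetness) that is updated after each noisy ASR observation, with the robot asking again until the belief is confident enough. A minimal belief-update sketch with an assumed confusion model; the states, probabilities and threshold below are illustrative, not taken from the paper:

      import numpy as np

      STATES = ["no_sugar", "less_sweet", "normal", "very_sweet"]   # hidden intentions

      # P(heard word | true intention): an assumed ASR confusion model for a noisy room.
      OBS_MODEL = np.array([[0.70, 0.15, 0.10, 0.05],
                            [0.15, 0.65, 0.15, 0.05],
                            [0.10, 0.15, 0.65, 0.10],
                            [0.05, 0.05, 0.10, 0.80]])

      def update_belief(belief, heard_idx):
          """Bayesian belief update b'(s) ~ P(o | s) * b(s) (static intention)."""
          posterior = OBS_MODEL[:, heard_idx] * belief
          return posterior / posterior.sum()

      belief = np.full(len(STATES), 1 / len(STATES))     # uniform prior
      for heard in [2, 2, 3]:                            # ASR hears "normal", "normal", "very_sweet"
          belief = update_belief(belief, heard)
          print(dict(zip(STATES, belief.round(3))))

      # A simple policy: ask again until the max belief exceeds a threshold, then act.
      action = "make_coffee" if belief.max() > 0.8 else "confirm_sweetness"
      print(action)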

  12. Automatic segmentation of time-lapse microscopy images depicting a live Dharma embryo.

    PubMed

    Zacharia, Eleni; Bondesson, Maria; Riu, Anne; Ducharme, Nicole A; Gustafsson, Jan-Åke; Kakadiaris, Ioannis A

    2011-01-01

    Biological inferences about the toxicity of chemicals reached during experiments on the zebrafish Dharma embryo can be greatly affected by the analysis of the time-lapse microscopy images depicting the embryo. Among the stages of image analysis, automatic and accurate segmentation of the Dharma embryo is the most crucial and challenging. In this paper, an accurate and automatic segmentation approach for the segmentation of the Dharma embryo data obtained by fluorescent time-lapse microscopy is proposed. Experiments performed in four stacks of 3D images over time have shown promising results.

  13. An acoustic feature-based similarity scoring system for speech rehabilitation assistance.

    PubMed

    Syauqy, Dahnial; Wu, Chao-Min; Setyawati, Onny

    2016-08-01

    The purpose of this study is to develop a tool to assist speech therapy and rehabilitation, focused on automatic scoring based on the comparison of the patient's speech with normal speech on several aspects including pitch, vowels, voiced-unvoiced segments, strident fricatives and sound intensity. The pitch estimation employed a cepstrum-based algorithm for its robustness; the vowel classification used a multilayer perceptron (MLP) to classify vowels from pitch and formants; and the strident fricative detection was based on the major spectral peak intensity, its location and the presence of pitch in the segment. In order to evaluate the performance of the system, this study analyzed eight patients' speech recordings (four males, four females; 4-58 years old), which had been recorded in a previous study in cooperation with Taipei Veterans General Hospital and Taoyuan General Hospital. The experimental results for the pitch algorithm showed that the cepstrum method had a gross pitch error of 5.3% over a total of 2086 frames. For the vowel classification algorithm, the MLP provided 93% accuracy for men, 87% for women and 84% for children. Overall, 156 of the tool's 192 grading results (81%) were consistent with the audio and visual observations made by four experienced respondents. Implication for Rehabilitation Difficulties in communication may limit a person's ability to transfer and exchange information. The fact that speech is one of the primary means of communication has motivated the need for speech diagnosis and rehabilitation. Advances in computer-assisted speech therapy (CAST) improve the quality and time efficiency of the diagnosis and treatment of speech disorders. The present study attempted to develop a tool to assist speech therapy and rehabilitation that provides a simple interface, letting the assessment be done even by patients themselves without particular knowledge of speech processing, while also providing deeper analysis of the speech that can be useful to the speech therapist.
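
    For reference, cepstrum-based pitch estimation takes the inverse FFT of the log magnitude spectrum of a frame and reads the pitch period off the dominant peak in the plausible quefrency range. A minimal sketch on a synthetic voiced frame; the frame length, window and pitch search range are assumptions:

      import numpy as np

      def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
          """Estimate F0 (Hz) of a voiced frame via the real cepstrum peak."""
          spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
          cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
          qmin, qmax = int(fs / fmax), int(fs / fmin)          # plausible pitch periods
          period = qmin + np.argmax(cepstrum[qmin:qmax])
          return fs / period

      # Synthetic voiced frame: 150 Hz harmonic series with decaying amplitudes.
      fs, f0, n = 16000, 150, 1024
      t = np.arange(n) / fs
      frame = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 8))
      print(f"estimated F0 ~ {cepstral_pitch(frame, fs):.1f} Hz")   # close to 150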

  14. Robust Speaker Authentication Based on Combined Speech and Voiceprint Recognition

    NASA Astrophysics Data System (ADS)

    Malcangi, Mario

    2009-08-01

    Personal authentication is becoming increasingly important in many applications that have to protect proprietary data. Passwords and personal identification numbers (PINs) prove not to be robust enough to ensure that unauthorized people do not use them. Biometric authentication technology may offer a secure, convenient, accurate solution but sometimes fails due to its intrinsically fuzzy nature. This research aims to demonstrate that combining two basic speech processing methods, voiceprint identification and speech recognition, can provide a very high degree of robustness, especially if fuzzy decision logic is used.

  15. The Timing and Effort of Lexical Access in Natural and Degraded Speech

    PubMed Central

    Wagner, Anita E.; Toffanin, Paolo; Başkent, Deniz

    2016-01-01

    Understanding speech is effortless in ideal situations, and although adverse conditions, such as caused by hearing impairment, often render it an effortful task, they do not necessarily suspend speech comprehension. A prime example of this is speech perception by cochlear implant users, whose hearing prostheses transmit speech as a significantly degraded signal. It is yet unknown how mechanisms of speech processing deal with such degraded signals, and whether they are affected by effortful processing of speech. This paper compares the automatic process of lexical competition between natural and degraded speech, and combines gaze fixations, which capture the course of lexical disambiguation, with pupillometry, which quantifies the mental effort involved in processing speech. Listeners’ ocular responses were recorded during disambiguation of lexical embeddings with matching and mismatching durational cues. Durational cues were selected due to their substantial role in listeners’ quick limitation of the number of lexical candidates for lexical access in natural speech. Results showed that lexical competition increased mental effort in processing natural stimuli in particular in presence of mismatching cues. Signal degradation reduced listeners’ ability to quickly integrate durational cues in lexical selection, and delayed and prolonged lexical competition. The effort of processing degraded speech was increased overall, and because it had its sources at the pre-lexical level this effect can be attributed to listening to degraded speech rather than to lexical disambiguation. In sum, the course of lexical competition was largely comparable for natural and degraded speech, but showed crucial shifts in timing, and different sources of increased mental effort. We argue that well-timed progress of information from sensory to pre-lexical and lexical stages of processing, which is the result of perceptual adaptation during speech development, is the reason why in ideal situations speech is perceived as an undemanding task. Degradation of the signal or the receiver channel can quickly bring this well-adjusted timing out of balance and lead to increase in mental effort. Incomplete and effortful processing at the early pre-lexical stages has its consequences on lexical processing as it adds uncertainty to the forming and revising of lexical hypotheses. PMID:27065901

  16. Speech and language support: How physicians can identify and treat speech and language delays in the office setting.

    PubMed

    Moharir, Madhavi; Barnett, Noel; Taras, Jillian; Cole, Martha; Ford-Jones, E Lee; Levin, Leo

    2014-01-01

    Failure to recognize and intervene early in speech and language delays can lead to multifaceted and potentially severe consequences for early child development and later literacy skills. While routine evaluations of speech and language during well-child visits are recommended, there is no standardized (office) approach to facilitate this. Furthermore, extensive wait times for speech and language pathology consultation represent valuable lost time for the child and family. Using speech and language expertise and paediatric collaboration, key content for an office-based tool was developed to support: early and accurate identification of speech and language delays, as well as of children at risk for literacy challenges; appropriate referral to speech and language services when required; and teaching, and thus empowering, parents to create rich and responsive language environments at home. Using this tool in combination with the Canadian Paediatric Society's Read, Speak, Sing and Grow Literacy Initiative, physicians will be better positioned to offer practical strategies to caregivers to enhance children's speech and language capabilities. The tool represents a strategy for evaluating speech and language delays: it depicts age-specific linguistic/phonetic milestones, suggests interventions, and offers a practical interim measure while the family is waiting for formal speech and language therapy consultation.

  17. Influence of phonetic context on the dysphonic event: contribution of new methodologies for the analysis of pathological voice.

    PubMed

    Revis, J; Galant, C; Fredouille, C; Ghio, A; Giovanni, A

    2012-01-01

    Although widely studied in terms of perception, acoustics or aerodynamics, dysphonia nevertheless remains a speech phenomenon, closely related to the phonetic composition of the message conveyed by the voice. In this paper, we present a series of three studies aimed at understanding the phonetic manifestation of dysphonia. Our first study proposes a new approach to the perceptual analysis of dysphonia (phonetic labeling), whose principle is to listen to and evaluate each phoneme in a sentence separately. This study confirms Laver's hypothesis that dysphonia is not a constant noise added to the speech signal but a discontinuous phenomenon, occurring on certain phonemes depending on the phonetic context. However, the burden of performing this task led us to turn to techniques from automatic speaker recognition (ASR) to automate the procedure. In collaboration with the LIA, we developed a system for automatic classification of dysphonia based on ASR techniques; this is the subject of our second study. The first results obtained with this system suggest that unvoiced consonants are the most informative in the task of automatic classification of dysphonia. This result is surprising, since it is often assumed that dysphonia affects only laryngeal vibration. We have begun looking for explanations of this phenomenon and present our hypotheses and experiments in the third study.

  18. Perception and performance in flight simulators: The contribution of vestibular, visual, and auditory information

    NASA Technical Reports Server (NTRS)

    1979-01-01

    The pilot's perception and performance in flight simulators is examined. The areas investigated include: vestibular stimulation, flight management and man cockpit information interfacing, and visual perception in flight simulation. The effects of higher levels of rotary acceleration on response time to constant acceleration, tracking performance, and thresholds for angular acceleration are examined. Areas of flight management examined are cockpit display of traffic information, work load, synthetic speech call outs during the landing phase of flight, perceptual factors in the use of a microwave landing system, automatic speech recognition, automation of aircraft operation, and total simulation of flight training.

  19. Subliminal speech priming.

    PubMed

    Kouider, Sid; Dupoux, Emmanuel

    2005-08-01

    We present a novel subliminal priming technique that operates in the auditory modality. Masking is achieved by hiding a spoken word within a stream of time-compressed speechlike sounds with similar spectral characteristics. Participants were unable to consciously identify the hidden words, yet reliable repetition priming was found. This effect was unaffected by a change in the speaker's voice and remained restricted to lexical processing. The results show that the speech modality, like the written modality, involves the automatic extraction of abstract word-form representations that do not include nonlinguistic details. In both cases, priming operates at the level of discrete and abstract lexical entries and is little influenced by overlap in form or semantics.

  20. The Roles of Suprasegmental Features in Predicting English Oral Proficiency with an Automated System

    ERIC Educational Resources Information Center

    Kang, Okim; Johnson, David

    2018-01-01

    Suprasegmental features have received growing attention in the field of oral assessment. In this article we describe a set of computer algorithms that automatically scores the oral proficiency of non-native speakers using unconstrained English speech. The algorithms employ machine learning and 11 suprasegmental measures divided into four groups…

  1. Spoken Grammar Practice and Feedback in an ASR-Based CALL System

    ERIC Educational Resources Information Center

    de Vries, Bart Penning; Cucchiarini, Catia; Bodnar, Stephen; Strik, Helmer; van Hout, Roeland

    2015-01-01

    Speaking practice is important for learners of a second language. Computer assisted language learning (CALL) systems can provide attractive opportunities for speaking practice when combined with automatic speech recognition (ASR) technology. In this paper, we present a CALL system that offers spoken practice of word order, an important aspect of…

  2. Effectiveness of Feedback for Enhancing English Pronunciation in an ASR-Based CALL System

    ERIC Educational Resources Information Center

    Wang, Y.-H.; Young, S. S.-C.

    2015-01-01

    This paper presents a study on implementing the ASR-based CALL (computer-assisted language learning based upon automatic speech recognition) system embedded with both formative and summative feedback approaches and using implicit and explicit strategies to enhance adult and young learners' English pronunciation. Two groups of learners including 18…

  3. The Influence of Anticipation of Word Misrecognition on the Likelihood of Stuttering

    ERIC Educational Resources Information Center

    Brocklehurst, Paul H.; Lickley, Robin J.; Corley, Martin

    2012-01-01

    This study investigates whether the experience of stuttering can result from the speaker's anticipation of his words being misrecognized. Twelve adults who stutter (AWS) repeated single words into what appeared to be an automatic speech-recognition system. Following each iteration of each word, participants provided a self-rating of whether they…

  4. Exploration of Acoustic Features for Automatic Vowel Discrimination in Spontaneous Speech

    ERIC Educational Resources Information Center

    Tyson, Na'im R.

    2012-01-01

    In an attempt to understand what acoustic/auditory feature sets motivated transcribers towards certain labeling decisions, I built machine learning models that were capable of discriminating between canonical and non-canonical vowels excised from the Buckeye Corpus. Specifically, I wanted to model when the dictionary form and the transcribed-form…

  5. Detecting buried explosive hazards with handheld GPR and deep learning

    NASA Astrophysics Data System (ADS)

    Besaw, Lance E.

    2016-05-01

    Buried explosive hazards (BEHs), including traditional landmines and homemade improvised explosives, have proven difficult to detect and defeat during and after conflicts around the world. Despite their various sizes, shapes and construction materials, ground penetrating radar (GPR) is an excellent phenomenology for detecting BEHs due to its ability to sense localized differences in electromagnetic properties. Handheld GPR detectors are common equipment for detecting BEHs because of their flexibility (in part due to the human operator) and effectiveness in cluttered environments. With modern digital electronics and positioning systems, handheld GPR sensors can sense and map variation in electromagnetic properties while searching for BEHs. Additionally, large-scale computers have demonstrated an insatiable appetite for ingesting massive datasets and extracting meaningful relationships. This is nowhere more evident than in the maturation of deep learning artificial neural networks (ANNs) for image and speech recognition, now commonplace in industry and academia. This confluence of sensing, computing and pattern recognition technologies offers great potential to develop automatic target recognition techniques to assist GPR operators searching for BEHs. In this work, deep learning ANNs are used to detect BEHs and discriminate them from harmless clutter. We apply these techniques to a multi-antenna handheld GPR with a centimeter-accurate positioning system that was used to collect data over prepared lanes containing a wide range of BEHs. This work demonstrates that deep learning ANNs can automatically extract meaningful information from complex GPR signatures, complementing existing GPR anomaly detection and classification techniques.

  6. Rapid and automatic speech-specific learning mechanism in human neocortex.

    PubMed

    Kimppa, Lilli; Kujala, Teija; Leminen, Alina; Vainio, Martti; Shtyrov, Yury

    2015-09-01

    A unique feature of the human communication system is our ability to rapidly acquire new words and build large vocabularies. However, its neurobiological foundations remain largely unknown. In an electrophysiological study optimally designed to probe this rapid formation of new word memory circuits, we employed acoustically controlled novel word-forms incorporating native and non-native speech sounds, while manipulating the subjects' attention to the input. We found a robust index of neurolexical memory-trace formation: a rapid enhancement of the brain's activation elicited by novel words during a short (~30 min) perceptual exposure, underpinned by fronto-temporal cortical networks, and, importantly, correlated with behavioural learning outcomes. Crucially, this neural memory trace build-up took place regardless of focused attention on the input or any pre-existing or learnt semantics. Furthermore, it was found only for stimuli with native-language phonology, but not for acoustically closely matching non-native words. These findings demonstrate a specialised cortical mechanism for rapid, automatic and phonology-dependent formation of neural word memory circuits. Copyright © 2015. Published by Elsevier Inc.

  7. Call recognition and individual identification of fish vocalizations based on automatic speech recognition: An example with the Lusitanian toadfish.

    PubMed

    Vieira, Manuel; Fonseca, Paulo J; Amorim, M Clara P; Teixeira, Carlos J C

    2015-12-01

    The study of acoustic communication in animals often requires not only the recognition of species-specific acoustic signals but also the identification of individual subjects, all in a complex acoustic background. Moreover, when very long recordings are to be analyzed, automatic recognition and identification processes are invaluable tools to extract the relevant biological information. A pattern recognition methodology based on hidden Markov models is presented, inspired by the successful results obtained with the most widely known and complex acoustic communication signal: human speech. This methodology was applied here for the first time to the detection and recognition of fish acoustic signals, specifically in a stream of round-the-clock recordings of Lusitanian toadfish (Halobatrachus didactylus) in their natural estuarine habitat. The results show that this methodology is able not only to detect the mating sounds (boatwhistles) but also to identify individual male toadfish, reaching an identification rate of ca. 95%. Moreover, this method also proved to be a powerful tool for assessing signal durations in large data sets. However, the system failed to recognize other sound types.

  8. The influence of tone inventory on ERP without focal attention: a cross-language study.

    PubMed

    Zheng, Hong-Ying; Peng, Gang; Chen, Jian-Yong; Zhang, Caicai; Minett, James W; Wang, William S-Y

    2014-01-01

    This study investigates the effect of tone inventories on brain activities underlying pitch without focal attention. We find that the electrophysiological responses to across-category stimuli are larger than those to within-category stimuli when the pitch contours are superimposed on nonspeech stimuli; however, there is no electrophysiological response difference associated with category status in speech stimuli. Moreover, this category effect in nonspeech stimuli is stronger for Cantonese speakers. Results of previous and present studies lead us to conclude that brain activities to the same native lexical tone contrasts are modulated by speakers' language experiences not only in active phonological processing but also in automatic feature detection without focal attention. In contrast to the condition with focal attention, where phonological processing is stronger for speech stimuli, the feature detection (pitch contours in this study) without focal attention as shaped by language background is superior in relatively regular stimuli, that is, the nonspeech stimuli. The results suggest that Cantonese listeners outperform Mandarin listeners in automatic detection of pitch features because of the denser Cantonese tone system.

  9. Second-language learning effects on automaticity of speech processing of Japanese phonetic contrasts: An MEG study.

    PubMed

    Hisagi, Miwako; Shafer, Valerie L; Miyagawa, Shigeru; Kotek, Hadas; Sugawara, Ayaka; Pantazis, Dimitrios

    2016-12-01

    We examined discrimination of a second-language (L2) vowel duration contrast in English learners of Japanese (JP) with different amounts of experience using the magnetoencephalography mismatch field (MMF) component. Twelve L2 learners were tested before and after a second semester of college-level JP; half attended a regular rate course and half an accelerated course with more hours per week. Results showed no significant change in MMF for either the regular or accelerated learning group from beginning to end of the course. We also compared these groups against nine L2 learners who had completed four semesters of college-level JP. These 4-semester learners did not significantly differ from 2-semester learners, in that only a difference in hemisphere activation (interacting with time) between the two groups approached significance. These findings suggest that targeted training of L2 phonology may be necessary to allow for changes in processing of L2 speech contrasts at an early, automatic level. Copyright © 2016 Elsevier B.V. All rights reserved.

  10. Intra-oral pressure-based voicing control of electrolaryngeal speech with intra-oral vibrator.

    PubMed

    Takahashi, Hirokazu; Nakao, Masayuki; Kikuchi, Yataro; Kaga, Kimitaka

    2008-07-01

    In normal speech, coordinated activity of the intrinsic laryngeal muscles suspends the glottal sound during the utterance of voiceless consonants, automatically realizing voicing control. In electrolaryngeal speech, however, the lack of voicing control is one cause of unclear voice, with voiceless consonants tending to be misheard as the corresponding voiced consonants. In the present work, we developed an intra-oral vibrator with an intra-oral pressure sensor that detects the utterance of voiceless phonemes during intra-oral electrolaryngeal speech, and demonstrated that intra-oral pressure-based voicing control can improve the intelligibility of the speech. The test voices were obtained from one electrolaryngeal speaker and one normal speaker. We first investigated, using speech analysis software, how voice onset time (VOT) and the first formant (F1) transition of the test consonant-vowel syllables contributed to voiceless/voiced contrasts, and developed an adequate voicing control strategy. We then compared the intelligibility of consonant-vowel syllables in intra-oral electrolaryngeal speech with and without online voicing control. An increase in intra-oral pressure, typically with a peak ranging from 10 to 50 gf/cm2, could reliably identify the utterance of voiceless consonants. The speech analysis and intelligibility test then demonstrated that a short VOT caused misidentification as voiced consonants due to a clear F1 transition. Finally, taking these results together, the online voicing control, which suspended the prosthetic tone while the intra-oral pressure exceeded 2.5 gf/cm2 and for the 35 milliseconds that followed, proved effective in improving the voiceless/voiced contrast.
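
    The final control rule reported here is essentially a thresholded hold-off: the prosthetic tone is suspended whenever the intra-oral pressure exceeds 2.5 gf/cm2 and for 35 ms afterwards. A direct sketch of that rule over a sampled pressure trace; the sampling rate and the trace itself are illustrative:

      def tone_enabled(pressure_samples, fs, threshold=2.5, holdoff_ms=35.0):
          """Return a per-sample list of booleans: True where the electrolarynx tone
          should sound, False while pressure exceeds the threshold and for 35 ms after."""
          holdoff = int(round(fs * holdoff_ms / 1000.0))
          suppress_until = -1
          enabled = []
          for i, p in enumerate(pressure_samples):
              if p > threshold:                    # voiceless consonant being articulated
                  suppress_until = i + holdoff     # keep tone off for 35 ms afterwards
              enabled.append(i > suppress_until)
          return enabled

      # 1 kHz pressure trace (gf/cm2): quiet, a /t/-like burst, then quiet again.
      fs = 1000
      trace = [0.5] * 100 + [12.0] * 20 + [0.5] * 100
      gate = tone_enabled(trace, fs)
      print("tone off for", gate.count(False), "ms around the burst")   # about 20 + 35 ms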

  11. Fricative Contrast and Coarticulation in Children With and Without Speech Sound Disorders

    PubMed Central

    Mailend, Marja-Liisa

    2017-01-01

    Purpose The purpose of this study was, first, to expand our understanding of typical speech development regarding segmental contrast and anticipatory coarticulation, and second, to explore the potential diagnostic utility of acoustic measures of fricative contrast and anticipatory coarticulation in children with speech sound disorders (SSD). Method In a cross-sectional design, 10 adults, 17 typically developing children, and 11 children with SSD repeated carrier phrases with novel words with fricatives (/s/, /ʃ/). Dependent measures were 2 ratios derived from spectral mean, obtained from perceptually accurate tokens. Group analyses compared adults and typically developing children; individual children with SSD were compared to their respective typically developing peers. Results Typically developing children demonstrated smaller fricative acoustic contrast than adults but similar coarticulatory patterns. Three children with SSD showed smaller fricative acoustic contrast than their typically developing peers, and 2 children showed abnormal coarticulation. The 2 children with abnormal coarticulation both had a clinical diagnosis of childhood apraxia of speech; no clear pattern was evident regarding SSD subtype for smaller fricative contrast. Conclusions Children have not reached adult-like speech motor control for fricative production by age 10 even when fricatives are perceptually accurate. Present findings also suggest that abnormal coarticulation but not reduced fricative contrast is SSD-subtype–specific. Supplemental Materials S1: https://doi.org/10.23641/asha.5103070. S2 and S3: https://doi.org/10.23641/asha.5106508 PMID:28654946

  12. Pathological speech signal analysis and classification using empirical mode decomposition.

    PubMed

    Kaleem, Muhammad; Ghoraani, Behnaz; Guergachi, Aziz; Krishnan, Sridhar

    2013-07-01

    Automated classification of normal and pathological speech signals can provide an objective and accurate mechanism for pathological speech diagnosis, and is an active area of research. A large part of this research is based on analysis of acoustic measures extracted from sustained vowels. However, sustained vowels do not reflect real-world attributes of voice as effectively as continuous speech, which can take into account important attributes of speech such as rapid voice onset and termination, changes in voice frequency and amplitude, and sudden discontinuities in speech. This paper presents a methodology based on empirical mode decomposition (EMD) for classification of continuous normal and pathological speech signals obtained from a well-known database. EMD is used to decompose randomly chosen portions of speech signals into intrinsic mode functions, which are then analyzed to extract meaningful temporal and spectral features, including true instantaneous features which can capture discriminative information in signals hidden at local time-scales. A total of six features are extracted, and a linear classifier is used with the feature vector to classify continuous speech portions obtained from a database consisting of 51 normal and 161 pathological speakers. A classification accuracy of 95.7 % is obtained, thus demonstrating the effectiveness of the methodology.
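
    Once the intrinsic mode functions of a speech segment are available (e.g. from an off-the-shelf EMD implementation), the classification stage reduces to computing a small feature vector per segment and feeding it to a linear classifier. The sketch below illustrates only that stage; the energy/entropy-style features and the synthetic pre-decomposed "IMFs" are stand-ins, not the six features of the paper:

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      def imf_features(imfs):
          """Per-segment feature vector from a list of IMFs (stand-in features)."""
          feats = []
          for imf in imfs:
              energy = np.sum(imf ** 2)
              p = imf ** 2 / (energy + 1e-12)
              entropy = -np.sum(p * np.log(p + 1e-12))         # spread of energy in time
              zero_crossings = np.sum(np.diff(np.sign(imf)) != 0)
              feats += [energy, entropy, zero_crossings]
          return np.array(feats)

      # Synthetic "normal" vs "pathological" segments, each pre-decomposed into 2 IMFs.
      rng = np.random.default_rng(1)
      def fake_imfs(jitter):
          t = np.linspace(0, 1, 512)
          return [np.sin(2 * np.pi * 10 * t) + jitter * rng.standard_normal(512),
                  0.3 * np.sin(2 * np.pi * 40 * t)]

      X = np.array([imf_features(fake_imfs(0.05)) for _ in range(30)] +
                   [imf_features(fake_imfs(0.60)) for _ in range(30)])
      y = np.array([0] * 30 + [1] * 30)
      clf = LogisticRegression(max_iter=1000).fit(X, y)
      print("training accuracy:", clf.score(X, y))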

  13. An evaluation of talker localization based on direction of arrival estimation and statistical sound source identification

    NASA Astrophysics Data System (ADS)

    Nishiura, Takanobu; Nakamura, Satoshi

    2002-11-01

    It is very important to capture distant-talking speech with high quality for a hands-free speech interface. A microphone array is an ideal candidate for this purpose. However, this approach requires localizing the target talker. Conventional talker localization algorithms in multiple-sound-source environments not only have difficulty localizing the multiple sound sources accurately, but also have difficulty localizing the target talker among the known sound source positions. To cope with these problems, we propose a new talker localization method consisting of two algorithms. One is a DOA (direction of arrival) estimation algorithm for multiple sound source localization based on the CSP (cross-power spectrum phase) coefficient addition method. The other is a statistical sound source identification algorithm based on a GMM (Gaussian mixture model) for localizing the target talker among the localized sound sources. In this paper, we particularly focus on talker localization performance based on the combination of these two algorithms with a microphone array. We conducted evaluation experiments in real noisy reverberant environments. As a result, we confirmed that multiple sound signals can be accurately classified as ''speech'' or ''non-speech'' by the proposed algorithm. [Work supported by ATR and MEXT of Japan.]
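
    The CSP (cross-power spectrum phase, also known as GCC-PHAT) step estimates the time difference of arrival between two microphones from the peak of the phase-only cross-correlation, which then maps to a direction of arrival. A minimal two-microphone sketch on synthetic signals; the microphone spacing, sample rate and injected delay are assumptions:

      import numpy as np

      def csp_tdoa(x1, x2, fs):
          """Time difference of arrival (s) between two channels via CSP/GCC-PHAT."""
          n = len(x1) + len(x2)
          X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
          cross = X1 * np.conj(X2)
          csp = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)   # phase-only correlation
          csp = np.concatenate((csp[-n // 2:], csp[:n // 2]))      # centre zero lag
          lag = np.argmax(csp) - n // 2
          return lag / fs

      fs, delay_samples = 16000, 8                  # true inter-microphone delay
      rng = np.random.default_rng(0)
      src = rng.standard_normal(4096)
      mic1 = np.concatenate((np.zeros(delay_samples), src[:-delay_samples]))  # farther mic
      mic2 = src + 0.05 * rng.standard_normal(len(src))
      tdoa = csp_tdoa(mic1, mic2, fs)
      d, c = 0.20, 343.0                            # mic spacing (m), speed of sound (m/s)
      theta = np.degrees(np.arcsin(np.clip(tdoa * c / d, -1, 1)))
      print(f"TDOA = {tdoa*1e3:.2f} ms, DOA = {theta:.1f} deg")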

  14. Sensory Intelligence for Extraction of an Abstract Auditory Rule: A Cross-Linguistic Study.

    PubMed

    Guo, Xiao-Tao; Wang, Xiao-Dong; Liang, Xiu-Yuan; Wang, Ming; Chen, Lin

    2018-02-21

    In a complex linguistic environment, while speech sounds can greatly vary, some shared features are often invariant. These invariant features constitute so-called abstract auditory rules. Our previous study has shown that with auditory sensory intelligence, the human brain can automatically extract the abstract auditory rules in the speech sound stream, presumably serving as the neural basis for speech comprehension. However, whether the sensory intelligence for extraction of abstract auditory rules in speech is inherent or experience-dependent remains unclear. To address this issue, we constructed a complex speech sound stream using auditory materials in Mandarin Chinese, in which syllables had a flat lexical tone but differed in other acoustic features to form an abstract auditory rule. This rule was occasionally and randomly violated by the syllables with the rising, dipping or falling tone. We found that both Chinese and foreign speakers detected the violations of the abstract auditory rule in the speech sound stream at a pre-attentive stage, as revealed by the whole-head recordings of mismatch negativity (MMN) in a passive paradigm. However, MMNs peaked earlier in Chinese speakers than in foreign speakers. Furthermore, Chinese speakers showed different MMN peak latencies for the three deviant types, which paralleled recognition points. These findings indicate that the sensory intelligence for extraction of abstract auditory rules in speech sounds is innate but shaped by language experience. Copyright © 2018 IBRO. Published by Elsevier Ltd. All rights reserved.

  15. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity

    NASA Astrophysics Data System (ADS)

    Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.

    2016-10-01

    Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
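
    A minimal version of the decoding stage might look like the sketch below: Viterbi search over frame-wise phoneme log-likelihoods combined with bigram transition log-probabilities. Array shapes, the language-model weight, and the use of plain numpy are assumptions; the LDA likelihood model and neural feature extraction are not shown.

```python
# Minimal Viterbi decoder over frame-wise phoneme log-likelihoods with
# bigram transition log-probabilities; shapes and LM weight are assumptions.
import numpy as np

def viterbi(loglik, log_trans, log_prior, lm_weight=1.0):
    """loglik: (T, P) frame log-likelihoods; log_trans: (P, P); log_prior: (P,)."""
    T, P = loglik.shape
    delta = log_prior + loglik[0]
    backptr = np.zeros((T, P), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + lm_weight * log_trans     # rows: previous phoneme
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(P)] + loglik[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):                           # trace back the best path
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]                                       # phoneme index per frame
```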

  16. Dynamic relation between working memory capacity and speech recognition in noise during the first 6 months of hearing aid use.

    PubMed

    Ng, Elaine H N; Classon, Elisabet; Larsby, Birgitta; Arlinger, Stig; Lunner, Thomas; Rudner, Mary; Rönnberg, Jerker

    2014-11-23

    The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced. © The Author(s) 2014.

  17. An Innovative Speech-Based User Interface for Smarthomes and IoT Solutions to Help People with Speech and Motor Disabilities.

    PubMed

    Malavasi, Massimiliano; Turri, Enrico; Atria, Jose Joaquin; Christensen, Heidi; Marxer, Ricard; Desideri, Lorenzo; Coy, Andre; Tamburini, Fabio; Green, Phil

    2017-01-01

    Making better use of the increasing functional capabilities of home automation systems and Internet of Things (IoT) devices to support the needs of users with disabilities is the subject of a research project currently conducted by Area Ausili (Assistive Technology Area), a department of Polo Tecnologico Regionale Corte Roncati of the Local Health Trust of Bologna (Italy), in collaboration with the AIAS Ausilioteca Assistive Technology (AT) Team. The main aim of the project is to develop experimental low-cost systems for environmental control through simplified and accessible user interfaces. Many of the activities focus on automatic speech recognition and are developed within the framework of the CloudCAST project. In this paper we report on the first technical achievements of the project and discuss possible future developments and applications within and outside CloudCAST.

  18. Automatic mechanisms for measuring subjective unit of discomfort.

    PubMed

    Hartanto, D W I; Kang, Ni; Brinkman, Willem-Paul; Kampmann, Isabel L; Morina, Nexhmedin; Emmelkamp, Paul G M; Neerincx, Mark A

    2012-01-01

    Current practice in Virtual Reality Exposure Therapy (VRET) is that therapists ask patients about their anxiety level by means of the Subjective Unit of Discomfort (SUD) scale. With the aim of developing a home-based VRET system, this measurement should ideally be done using speech technology. In a VRET system for social phobia with scripted avatar-patient dialogues, the timing of asking patients to give their SUD score becomes relevant. This study examined three timing mechanisms: (1) dialogue dependent (i.e. naturally in the flow of the dialogue); (2) speech dependent (i.e. when both patient and avatar are silent); and (3) context independent (i.e. randomly). Results of an experiment with non-patients (n = 24) showed a significant effect of the timing mechanisms on the perceived dialogue flow, user preference, reported presence and user dialogue replies. Overall, the dialogue-dependent timing mechanism appears superior, followed by the speech-dependent and context-independent mechanisms.

  19. Rapid tuning shifts in human auditory cortex enhance speech intelligibility

    PubMed Central

    Holdgraf, Christopher R.; de Heer, Wendy; Pasley, Brian; Rieger, Jochem; Crone, Nathan; Lin, Jack J.; Knight, Robert T.; Theunissen, Frédéric E.

    2016-01-01

    Experience shapes our perception of the world on a moment-to-moment basis. This robust perceptual effect of experience parallels a change in the neural representation of stimulus features, though the nature of this representation and its plasticity are not well-understood. Spectrotemporal receptive field (STRF) mapping describes the neural response to acoustic features, and has been used to study contextual effects on auditory receptive fields in animal models. We performed a STRF plasticity analysis on electrophysiological data from recordings obtained directly from the human auditory cortex. Here, we report rapid, automatic plasticity of the spectrotemporal response of recorded neural ensembles, driven by previous experience with acoustic and linguistic information, and with a neurophysiological effect in the sub-second range. This plasticity reflects increased sensitivity to spectrotemporal features, enhancing the extraction of more speech-like features from a degraded stimulus and providing the physiological basis for the observed ‘perceptual enhancement' in understanding speech. PMID:27996965

  20. Degraded neural and behavioral processing of speech sounds in a rat model of Rett syndrome

    PubMed Central

    Engineer, Crystal T.; Rahebi, Kimiya C.; Borland, Michael S.; Buell, Elizabeth P.; Centanni, Tracy M.; Fink, Melyssa K.; Im, Kwok W.; Wilson, Linda G.; Kilgard, Michael P.

    2015-01-01

    Individuals with Rett syndrome have greatly impaired speech and language abilities. Auditory brainstem responses to sounds are normal, but cortical responses are highly abnormal. In this study, we used the novel rat Mecp2 knockout model of Rett syndrome to document the neural and behavioral processing of speech sounds. We hypothesized that both speech discrimination ability and the neural response to speech sounds would be impaired in Mecp2 rats. We expected that extensive speech training would improve speech discrimination ability and the cortical response to speech sounds. Our results reveal that speech responses across all four auditory cortex fields of Mecp2 rats were hyperexcitable, responded slower, and were less able to follow rapidly presented sounds. While Mecp2 rats could accurately perform consonant and vowel discrimination tasks in quiet, they were significantly impaired at speech sound discrimination in background noise. Extensive speech training improved discrimination ability. Training shifted cortical responses in both Mecp2 and control rats to favor the onset of speech sounds. While training increased the response to low frequency sounds in control rats, the opposite occurred in Mecp2 rats. Although neural coding and plasticity are abnormal in the rat model of Rett syndrome, extensive therapy appears to be effective. These findings may help to explain some aspects of communication deficits in Rett syndrome and suggest that extensive rehabilitation therapy might prove beneficial. PMID:26321676

  1. Flexible, rapid and automatic neocortical word form acquisition mechanism in children as revealed by neuromagnetic brain response dynamics.

    PubMed

    Partanen, Eino; Leminen, Alina; de Paoli, Stine; Bundgaard, Anette; Kingo, Osman Skjold; Krøjgaard, Peter; Shtyrov, Yury

    2017-07-15

    Children learn new words and word forms with ease, often acquiring a new word after very few repetitions. Recent neurophysiological research on word form acquisition in adults indicates that novel words can be acquired within minutes of repetitive exposure to them, regardless of the individual's focused attention on the speech input. Although it is well-known that children surpass adults in language acquisition, the developmental aspects of such rapid and automatic neural acquisition mechanisms remain unexplored. To address this open question, we used magnetoencephalography (MEG) to scrutinise brain dynamics elicited by spoken words and word-like sounds in healthy monolingual (Danish) children throughout a 20-min repetitive passive exposure session. We found rapid neural dynamics manifested as an enhancement of early (~100ms) brain activity over the short exposure session, with distinct spatiotemporal patterns for different novel sounds. For novel Danish word forms, signs of such enhancement were seen in the left temporal regions only, suggesting reliance on pre-existing language circuits for acquisition of novel word forms with native phonology. In contrast, exposure both to novel word forms with non-native phonology and to novel non-speech sounds led to activity enhancement in both left and right hemispheres, suggesting that more wide-spread cortical networks contribute to the build-up of memory traces for non-native and non-speech sounds. Similar studies in adults have previously reported more sluggish (~15-25min, as opposed to 4min in the present study) or non-existent neural dynamics for non-native sound acquisition, which might be indicative of a higher degree of plasticity in the children's brain. Overall, the results indicate a rapid and highly plastic mechanism for a dynamic build-up of memory traces for novel acoustic information in the children's brain that operates automatically and recruits bilateral temporal cortical circuits. Copyright © 2017 Elsevier Inc. All rights reserved.

  2. Automatic conversational scene analysis in children with Asperger syndrome/high-functioning autism and typically developing peers.

    PubMed

    Tavano, Alessandro; Pesarin, Anna; Murino, Vittorio; Cristani, Marco

    2014-01-01

    Individuals with Asperger syndrome/High Functioning Autism fail to spontaneously attribute mental states to the self and others, a life-long phenotypic characteristic known as mindblindness. We hypothesized that mindblindness would affect the dynamics of conversational interaction. Using generative models, in particular Gaussian mixture models and observed influence models, conversations were coded as interacting Markov processes, operating on novel speech/silence patterns, termed Steady Conversational Periods (SCPs). SCPs assume that whenever an agent's process changes state (e.g., from silence to speech), it causes a general transition of the entire conversational process, forcing inter-actant synchronization. SCPs fed into observed influence models, which captured the conversational dynamics of children and adolescents with Asperger syndrome/High Functioning Autism, and age-matched typically developing participants. Analyzing the parameters of the models by means of discriminative classifiers, the dialogs of patients were successfully distinguished from those of control participants. We conclude that meaning-free speech/silence sequences, reflecting inter-actant synchronization, at least partially encode typical and atypical conversational dynamics. This suggests a direct influence of theory of mind abilities onto basic speech initiative behavior.
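
    The raw material for SCPs is a per-participant speech/silence labelling. A minimal, assumption-laden way to obtain such labels is to fit a two-component Gaussian mixture to frame log-energies and call the higher-energy component "speech", as sketched below; the observed influence models themselves are not reproduced.

```python
# Speech/silence labelling with a 2-component GMM on frame log-energies,
# the raw material for Steady Conversational Periods. Frame/hop sizes are
# illustrative; the observed influence model is not reproduced here.
import numpy as np
from sklearn.mixture import GaussianMixture

def speech_silence_labels(x, fs, frame_ms=25, hop_ms=10):
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    frames = np.lib.stride_tricks.sliding_window_view(x, frame)[::hop]
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-12).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(log_e)
    labels = gmm.predict(log_e)
    speech_comp = int(np.argmax(gmm.means_.ravel()))   # higher-energy component
    return (labels == speech_comp).astype(int)         # 1 = speech, 0 = silence
```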

  3. Deep-Learning-Enabled On-Demand Design of Chiral Metamaterials.

    PubMed

    Ma, Wei; Cheng, Feng; Liu, Yongmin

    2018-06-11

    Deep-learning framework has significantly impelled the development of modern machine learning technology by continuously pushing the limit of traditional recognition and processing of images, speech, and videos. In the meantime, it starts to penetrate other disciplines, such as biology, genetics, materials science, and physics. Here, we report a deep-learning-based model, comprising two bidirectional neural networks assembled by a partial stacking strategy, to automatically design and optimize three-dimensional chiral metamaterials with strong chiroptical responses at predesignated wavelengths. The model can help to discover the intricate, nonintuitive relationship between a metamaterial structure and its optical responses from a number of training examples, which circumvents the time-consuming, case-by-case numerical simulations in conventional metamaterial designs. This approach not only realizes the forward prediction of optical performance much more accurately and efficiently but also enables one to inversely retrieve designs from given requirements. Our results demonstrate that such a data-driven model can be applied as a very powerful tool in studying complicated light-matter interactions and accelerating the on-demand design of nanophotonic devices, systems, and architectures for real world applications.

  4. Sensory-motor relationships in speech production in post-lingually deaf cochlear-implanted adults and normal-hearing seniors: Evidence from phonetic convergence and speech imitation.

    PubMed

    Scarbel, Lucie; Beautemps, Denis; Schwartz, Jean-Luc; Sato, Marc

    2017-07-01

    Speech communication can be viewed as an interactive process involving a functional coupling between sensory and motor systems. One striking example comes from phonetic convergence, when speakers automatically tend to mimic their interlocutor's speech during communicative interaction. The goal of this study was to investigate sensory-motor linkage in speech production in postlingually deaf cochlear-implanted participants and normal-hearing elderly adults through phonetic convergence and imitation. To this aim, two vowel production tasks, with or without instruction to imitate an acoustic vowel, were proposed to three groups: young adults with normal hearing, elderly adults with normal hearing, and post-lingually deaf cochlear-implanted patients. The deviation of each participant's f0 from their own mean f0 was measured to evaluate the ability to converge to each acoustic target. Results showed that cochlear-implanted participants have the ability to converge to an acoustic target, both intentionally and unintentionally, albeit to a lower degree than young and elderly participants with normal hearing. By providing evidence for phonetic convergence and speech imitation, these results suggest that, as in young adults, perceptuo-motor relationships are efficient in elderly adults with normal hearing and that cochlear-implanted adults recovered significant perceptuo-motor abilities following cochlear implantation. Copyright © 2017 Elsevier Ltd. All rights reserved.

  5. Long Term Suboxone™ Emotional Reactivity As Measured by Automatic Detection in Speech

    PubMed Central

    Hill, Edward; Han, David; Dumouchel, Pierre; Dehak, Najim; Quatieri, Thomas; Moehs, Charles; Oscar-Berman, Marlene; Giordano, John; Simpatico, Thomas; Blum, Kenneth

    2013-01-01

    Addictions to illicit drugs are among the nation's most critical public health and societal problems. The current opioid prescription epidemic, the need for buprenorphine/naloxone (Suboxone®; SUBX) as an opioid maintenance substance, and its growing street diversion provided impetus to determine affective states ("true ground emotionality") in long-term SUBX patients. Toward the goal of effective monitoring, we utilized emotion detection in speech as a measure of "true" emotionality in 36 SUBX patients compared to 44 individuals from the general population (GP) and 33 members of Alcoholics Anonymous (AA). Other, less objective studies have investigated the emotional reactivity of heroin, methadone and opioid abstinent patients. These studies indicate that current opioid users have abnormal emotional experience, characterized by heightened response to unpleasant stimuli and blunted response to pleasant stimuli. However, this is, to our knowledge, the first study to evaluate "true ground" emotionality in patients on long-term buprenorphine/naloxone combination therapy (Suboxone™). We found that long-term SUBX patients showed a significantly flattened affect (p<0.01) and had less self-awareness of being happy, sad, and anxious compared with both the GP and AA groups. We caution against definitive interpretation of these seemingly important results until we compare the emotional reactivity of an opioid-abstinent control group using automatic detection in speech. These findings encourage continued research strategies in SUBX patients to target the specific brain regions responsible for relapse prevention of opioid addiction. PMID:23874860

  6. Quasi-closed phase forward-backward linear prediction analysis of speech for accurate formant detection and estimation.

    PubMed

    Gowda, Dhananjaya; Airaksinen, Manu; Alku, Paavo

    2017-09-01

    Recently, a quasi-closed phase (QCP) analysis of speech signals for accurate glottal inverse filtering was proposed. However, the QCP analysis which belongs to the family of temporally weighted linear prediction (WLP) methods uses the conventional forward type of sample prediction. This may not be the best choice especially in computing WLP models with a hard-limiting weighting function. A sample selective minimization of the prediction error in WLP reduces the effective number of samples available within a given window frame. To counter this problem, a modified quasi-closed phase forward-backward (QCP-FB) analysis is proposed, wherein each sample is predicted based on its past as well as future samples thereby utilizing the available number of samples more effectively. Formant detection and estimation experiments on synthetic vowels generated using a physical modeling approach as well as natural speech utterances show that the proposed QCP-FB method yields statistically significant improvements over the conventional linear prediction and QCP methods.
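
    For orientation, the sketch below shows conventional autocorrelation-based linear prediction with root-finding for formant candidates, i.e. the kind of baseline the QCP-FB analysis is compared against; it is not the weighted QCP-FB method itself, and the prediction order and candidate thresholds are illustrative.

```python
# Conventional autocorrelation-method LP formant estimation (a baseline,
# not the weighted QCP-FB analysis itself); order/thresholds are illustrative.
import numpy as np

def lpc_polynomial(x, order=12):
    """Solve the LP normal equations and return the prediction polynomial A(z)."""
    x = x * np.hamming(len(x))
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def formant_candidates(x, fs, order=12):
    roots = np.roots(lpc_polynomial(x, order))
    roots = roots[np.imag(roots) > 0]                   # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    bandwidths = -fs / np.pi * np.log(np.abs(roots))    # standard pole-bandwidth rule
    keep = (freqs > 90) & (bandwidths < 400)            # plausible formants only
    return np.sort(freqs[keep])
```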

  7. Speech and language support: How physicians can identify and treat speech and language delays in the office setting

    PubMed Central

    Moharir, Madhavi; Barnett, Noel; Taras, Jillian; Cole, Martha; Ford-Jones, E Lee; Levin, Leo

    2014-01-01

    Failure to recognize and intervene early in speech and language delays can lead to multifaceted and potentially severe consequences for early child development and later literacy skills. While routine evaluations of speech and language during well-child visits are recommended, there is no standardized (office) approach to facilitate this. Furthermore, extensive wait times for speech and language pathology consultation represent valuable lost time for the child and family. Using speech and language expertise, and paediatric collaboration, key content for an office-based tool was developed. The tool aimed to help physicians achieve three main goals: early and accurate identification of speech and language delays as well as children at risk for literacy challenges; appropriate referral to speech and language services when required; and teaching and, thus, empowering parents to create rich and responsive language environments at home. Using this tool, in combination with the Canadian Paediatric Society’s Read, Speak, Sing and Grow Literacy Initiative, physicians will be better positioned to offer practical strategies to caregivers to enhance children’s speech and language capabilities. The tool represents a strategy to evaluate speech and language delays. It depicts age-specific linguistic/phonetic milestones and suggests interventions. The tool represents a practical interim treatment while the family is waiting for formal speech and language therapy consultation. PMID:24627648

  8. The speech perception skills of children with and without speech sound disorder.

    PubMed

    Hearnshaw, Stephanie; Baker, Elise; Munro, Natalie

    To investigate whether Australian-English speaking children with and without speech sound disorder (SSD) differ in their overall speech perception accuracy. Additionally, to investigate differences in the perception of specific phonemes and the association between speech perception and speech production skills. Twenty-five Australian-English speaking children aged 48-60 months participated in this study. The SSD group included 12 children and the typically developing (TD) group included 13 children. Children completed routine speech and language assessments in addition to an experimental Australian-English lexical and phonetic judgement task based on Rvachew's Speech Assessment and Interactive Learning System (SAILS) program (Rvachew, 2009). This task included eight words across four word-initial phonemes-/k, ɹ, ʃ, s/. Children with SSD showed significantly poorer perceptual accuracy on the lexical and phonetic judgement task compared with TD peers. The phonemes /ɹ/ and /s/ were most frequently perceived in error across both groups. Additionally, the phoneme /ɹ/ was most commonly produced in error. There was also a positive correlation between overall speech perception and speech production scores. Children with SSD perceived speech less accurately than their typically developing peers. The findings suggest that an Australian-English variation of a lexical and phonetic judgement task similar to the SAILS program is promising and worthy of a larger scale study. Copyright © 2017 Elsevier Inc. All rights reserved.

  9. Detection of negative and positive audience behaviours by socially anxious subjects.

    PubMed

    Veljaca, K A; Rapee, R M

    1998-03-01

    Nineteen subjects high in social anxiety and 20 subjects low in social anxiety were asked to give a 5-min speech in front of three audience members. Audience members were trained to provide indicators of positive evaluation (e.g., smiles) and negative evaluation (e.g. frowns) at irregular intervals during the speech. Subjects were instructed to indicate, by depressing one of two buttons, when they detected either positive or negative behaviours. Results indicated that subjects high in social anxiety were both more accurate at, and had a more liberal criterion for, detecting negative audience behaviours while subjects low in social anxiety were more accurate at detecting positive audience behaviours.

  10. Developing a bilingual "persian cued speech" website for parents and professionals of children with hearing impairment.

    PubMed

    Movallali, Guita; Sajedi, Firoozeh

    2014-03-01

    The use of the internet as a source of information gathering, self-help and support is becoming increasingly recognized. Parents and professionals of children with hearing impairment have been shown to seek information about different communication approaches online. Cued Speech is a very new approach for Persian-speaking pupils. Our aim was to develop a useful website providing information about Persian Cued Speech to parents and professionals of children with hearing impairment. All Cued Speech websites from different countries that fell within the first ten pages of the Google and Yahoo search engines were assessed, and their main subjects and links were studied. Related information was also gathered from textbooks, articles, and other sources. Using a framework that combined several criteria for health-information websites, we developed the Persian Cued Speech website for three distinct audiences (parents, professionals and children). An accurate, complete, accessible and readable resource about Persian Cued Speech for parents and professionals is now available.

  11. Musical experience strengthens the neural representation of sounds important for communication in middle-aged adults

    PubMed Central

    Parbery-Clark, Alexandra; Anderson, Samira; Hittner, Emily; Kraus, Nina

    2012-01-01

    Older adults frequently complain that while they can hear a person talking, they cannot understand what is being said; this difficulty is exacerbated by background noise. Peripheral hearing loss cannot fully account for this age-related decline in speech-in-noise ability, as declines in central processing also contribute to this problem. Given that musicians have enhanced speech-in-noise perception, we aimed to define the effects of musical experience on subcortical responses to speech and speech-in-noise perception in middle-aged adults. Results reveal that musicians have enhanced neural encoding of speech in quiet and noisy settings. Enhancements include faster neural response timing, higher neural response consistency, more robust encoding of speech harmonics, and greater neural precision. Taken together, we suggest that musical experience provides perceptual benefits in an aging population by strengthening the underlying neural pathways necessary for the accurate representation of important temporal and spectral features of sound. PMID:23189051

  12. The a priori SDR Estimation Techniques with Reduced Speech Distortion for Acoustic Echo and Noise Suppression

    NASA Astrophysics Data System (ADS)

    Thoonsaengngam, Rattapol; Tangsangiumvisai, Nisachon

    This paper proposes an enhanced method for estimating the a priori Signal-to-Disturbance Ratio (SDR) to be employed in the Acoustic Echo and Noise Suppression (AENS) system for full-duplex hands-free communications. The proposed a priori SDR estimation technique is modified based upon the Two-Step Noise Reduction (TSNR) algorithm to suppress the background noise while preserving speech spectral components. In addition, a practical approach to determine accurately the Echo Spectrum Variance (ESV) is presented based upon the linear relationship assumption between the power spectrum of far-end speech and acoustic echo signals. The ESV estimation technique is then employed to alleviate the acoustic echo problem. The performance of the AENS system that employs these two proposed estimation techniques is evaluated through the Echo Attenuation (EA), Noise Attenuation (NA), and two speech distortion measures. Simulation results based upon real speech signals guarantee that our improved AENS system is able to mitigate efficiently the problem of acoustic echo and background noise, while preserving the speech quality and speech intelligibility.
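
    A simplified relative of the a priori SDR estimation described above is the classical decision-directed a priori SNR estimator with a Wiener gain, sketched below. It is not the authors' algorithm (there is no echo-variance term), and the smoothing constant and the externally supplied noise PSD are assumptions.

```python
# Classical decision-directed a priori SNR estimation with a Wiener gain;
# related in spirit to, but not identical to, the TSNR-based SDR estimator
# above. Smoothing factor and the supplied noise PSD are assumptions.
import numpy as np

def wiener_enhance(stft_frames, noise_psd, alpha=0.98):
    """stft_frames: (T, F) complex STFT of the noisy signal; noise_psd: (F,)."""
    enhanced = np.empty_like(stft_frames)
    prev_mag2 = np.zeros_like(noise_psd)                 # |enhanced|^2, previous frame
    for t, X in enumerate(stft_frames):
        post_snr = np.abs(X) ** 2 / (noise_psd + 1e-12)
        prio_snr = alpha * prev_mag2 / (noise_psd + 1e-12) + \
                   (1.0 - alpha) * np.maximum(post_snr - 1.0, 0.0)
        gain = prio_snr / (1.0 + prio_snr)               # Wiener gain
        enhanced[t] = gain * X
        prev_mag2 = np.abs(enhanced[t]) ** 2
    return enhanced
```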

  13. Cognitive Conflict in a Syllable Identification Task Causes Transient Activation of Speech Perception Area

    ERIC Educational Resources Information Center

    Saetrevik, Bjorn; Specht, Karsten

    2012-01-01

    It has previously been shown that task performance and frontal cortical activation increase after cognitive conflict. This has been argued to support a model of attention where the level of conflict automatically adjusts the amount of cognitive control applied. Conceivably, conflict could also modulate lower-level processing pathways, which would…

  14. Developing Swedish Spelling Exercises on the ICALL Platform Lärka

    ERIC Educational Resources Information Center

    Pijetlovic, Dijana; Volodina, Elena

    2013-01-01

    In this project we developed web services on the ICALL platform Lärka for automatic generation of Swedish spelling exercises using Text-To-Speech (TTS) technology which allows L2 learners to train their spelling and listening individually at home. The spelling exercises contain five different linguistic levels, whereby the language learner has the…

  15. The Game Embedded CALL System to Facilitate English Vocabulary Acquisition and Pronunciation

    ERIC Educational Resources Information Center

    Young, Shelley Shwu-Ching; Wang, Yi-Hsuan

    2014-01-01

    The aim of this study is to make a new attempt to explore the potential of integrating game strategies with automatic speech recognition technologies to provide learners with individual opportunities for English pronunciation learning. The study developed the Game Embedded CALL (GeCALL) system with two activities for on-line speaking practice. For…

  16. Frequency of Use Leads to Automaticity of Production: Evidence from Repair in Conversation

    ERIC Educational Resources Information Center

    Kapatsinski, Vsevolod

    2010-01-01

    In spontaneous speech, speakers sometimes replace a word they have just produced or started producing by another word. The present study reports that in these replacement repairs, low-frequency replaced words are more likely to be interrupted prior to completion than high-frequency words, providing support to the hypothesis that the production of…

  17. Arabic Language Modeling with Stem-Derived Morphemes for Automatic Speech Recognition

    ERIC Educational Resources Information Center

    Heintz, Ilana

    2010-01-01

    The goal of this dissertation is to introduce a method for deriving morphemes from Arabic words using stem patterns, a feature of Arabic morphology. The motivations are three-fold: modeling with morphemes rather than words should help address the out-of-vocabulary problem; working with stem patterns should prove to be a cross-dialectally valid…

  18. The Processing and Interpretation of Verb Phrase Ellipsis Constructions by Children at Normal and Slowed Speech Rates

    ERIC Educational Resources Information Center

    Callahan, Sarah M.; Walenski, Matthew; Love, Tracy

    2012-01-01

    Purpose: To examine children's comprehension of verb phrase (VP) ellipsis constructions in light of their automatic, online structural processing abilities and conscious, metalinguistic reflective skill. Method: Forty-two children ages 5 through 12 years listened to VP ellipsis constructions involving the strict/sloppy ambiguity (e.g., "The…

  19. Listening Difficulty Detection to Foster Second Language Listening with a Partial and Synchronized Caption System

    ERIC Educational Resources Information Center

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya

    2017-01-01

    This study proposes a method to detect problematic speech segments automatically for second language (L2) listeners, considering both lexical and acoustic aspects. It introduces a tool, Partial and Synchronized Caption (PSC), which provides assistance for language learners and fosters L2 listening skills. PSC presents purposively selected words…

  20. Histogram equalization with Bayesian estimation for noise robust speech recognition.

    PubMed

    Suh, Youngjoo; Kim, Hoirin

    2018-02-01

    The histogram equalization approach is an efficient feature normalization technique for noise-robust automatic speech recognition. However, it suffers from performance degradation when some fundamental conditions are not satisfied in the test environment. To remedy these limitations of the original histogram equalization methods, a class-based histogram equalization approach has been proposed. Although this approach showed substantial performance improvement in noisy environments, it still suffers from performance degradation due to overfitting when test data are insufficient. To address this issue, the proposed histogram equalization technique employs the Bayesian estimation method in estimating the test cumulative distribution function. It was reported in a previous study conducted on the Aurora-4 task that the proposed approach provided substantial performance gains in speech recognition systems based on Gaussian mixture model-hidden Markov model acoustic modeling. In this work, the proposed approach was examined in speech recognition systems with deep neural network-hidden Markov models (DNN-HMM), the current mainstream speech recognition approach, where it also showed meaningful performance improvement over the conventional maximum likelihood estimation-based method. The fusion of the proposed features with the mel-frequency cepstral coefficients provided additional performance gains in DNN-HMM systems, which otherwise suffer from performance degradation in the clean test condition.
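
    The underlying idea of histogram equalization can be sketched in a few lines: map each feature dimension's empirical CDF onto a reference Gaussian CDF. The snippet below shows this baseline form only; the class-based and Bayesian-estimated variants evaluated in the paper are not reproduced.

```python
# Baseline per-dimension histogram equalization: map each feature's empirical
# CDF onto a standard-normal reference. The class-based / Bayesian variants
# discussed above build on this idea and are not reproduced here.
import numpy as np
from scipy.stats import norm

def histogram_equalize(features):
    """features: (T, D) array of e.g. MFCCs; returns HEQ-normalized features."""
    T, D = features.shape
    out = np.empty((T, D))
    for d in range(D):
        ranks = np.argsort(np.argsort(features[:, d]))    # 0 .. T-1
        empirical_cdf = (ranks + 0.5) / T                 # keep strictly inside (0, 1)
        out[:, d] = norm.ppf(empirical_cdf)               # reference Gaussian quantiles
    return out
```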

  1. Comparison of Classification Methods for Detecting Emotion from Mandarin Speech

    NASA Astrophysics Data System (ADS)

    Pao, Tsang-Long; Chen, Yu-Te; Yeh, Jun-Heng

    It is said that technology comes out from humanity. What is humanity? The very definition of humanity is emotion. Emotion is the basis for all human expression and the underlying theme behind everything that is done, said, thought or imagined. Making computers being able to perceive and respond to human emotion, the human-computer interaction will be more natural. Several classifiers are adopted for automatically assigning an emotion category, such as anger, happiness or sadness, to a speech utterance. These classifiers were designed independently and tested on various emotional speech corpora, making it difficult to compare and evaluate their performance. In this paper, we first compared several popular classification methods and evaluated their performance by applying them to a Mandarin speech corpus consisting of five basic emotions, including anger, happiness, boredom, sadness and neutral. The extracted feature streams contain MFCC, LPCC, and LPC. The experimental results show that the proposed WD-MKNN classifier achieves an accuracy of 81.4% for the 5-class emotion recognition and outperforms other classification techniques, including KNN, MKNN, DW-KNN, LDA, QDA, GMM, HMM, SVM, and BPNN. Then, to verify the advantage of the proposed method, we compared these classifiers by applying them to another Mandarin expressive speech corpus consisting of two emotions. The experimental results still show that the proposed WD-MKNN outperforms others.
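
    As a point of reference for the classifier comparison, the sketch below implements the plain KNN baseline on utterance-level MFCC statistics using librosa and scikit-learn. The WD-MKNN weighting proposed in the paper is not reproduced, and the feature and classifier settings are illustrative assumptions.

```python
# Plain KNN baseline on utterance-level MFCC statistics (librosa + scikit-learn).
# The WD-MKNN weighting from the paper is not reproduced; settings are illustrative.
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def utterance_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # utterance-level stats

# paths and labels (anger/happiness/boredom/sadness/neutral) are user-supplied:
# X = np.stack([utterance_features(p) for p in paths])
# clf = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
```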

  2. Oral reading fluency analysis in patients with Alzheimer disease and asymptomatic control subjects.

    PubMed

    Martínez-Sánchez, F; Meilán, J J G; García-Sevilla, J; Carro, J; Arana, J M

    2013-01-01

    Many studies highlight that an impaired ability to communicate is one of the key clinical features of Alzheimer disease (AD). To study temporal organisation of speech in an oral reading task in patients with AD and in matched healthy controls using a semi-automatic method, and evaluate that method's ability to discriminate between the 2 groups. A test with an oral reading task was administered to 70 subjects, comprising 35 AD patients and 35 controls. Before speech samples were recorded, participants completed a battery of neuropsychological tests. There were no differences between groups with regard to age, sex, or educational level. All of the study variables showed impairment in the AD group. According to the results, AD patients' oral reading was marked by reduced speech and articulation rates, low effectiveness of phonation time, and increases in the number and proportion of pauses. Signal processing algorithms applied to reading fluency recordings were shown to be capable of differentiating between AD patients and controls with an accuracy of 80% (specificity 74.2%, sensitivity 77.1%) based on speech rate. Analysis of oral reading fluency may be useful as a tool for the objective study and quantification of speech deficits in AD. Copyright © 2012 Sociedad Española de Neurología. Published by Elsevier Espana. All rights reserved.
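
    Measures such as speech rate, articulation rate, and pause counts can be approximated with a simple energy-based segmentation, as in the sketch below. The thresholds, frame sizes, and the requirement to supply the syllable count of the read text are assumptions; this is not the authors' signal-processing pipeline.

```python
# Energy-based pause detection plus simple fluency measures; thresholds and
# frame sizes are illustrative, and the syllable count must be supplied.
import numpy as np

def fluency_measures(x, fs, n_syllables, frame_ms=30, pause_db=-35, min_pause_s=0.25):
    frame = int(fs * frame_ms / 1000)
    n = len(x) // frame
    energy_db = 10 * np.log10(np.mean(x[:n * frame].reshape(n, frame) ** 2, axis=1) + 1e-12)
    silent = energy_db < energy_db.max() + pause_db        # threshold relative to peak
    pauses, run = [], 0
    for is_silent in silent:                               # collect silent runs
        if is_silent:
            run += 1
        else:
            if run:
                pauses.append(run * frame / fs)
            run = 0
    if run:
        pauses.append(run * frame / fs)
    pauses = [p for p in pauses if p >= min_pause_s]       # keep pauses >= 250 ms
    total_time = len(x) / fs
    phonation_time = total_time - sum(pauses)
    return {
        "speech_rate": n_syllables / total_time,           # syllables/s incl. pauses
        "articulation_rate": n_syllables / phonation_time, # syllables/s excl. pauses
        "n_pauses": len(pauses),
        "pause_proportion": sum(pauses) / total_time,
    }
```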

  3. The influence of ambient speech on adult speech productions through unintentional imitation.

    PubMed

    Delvaux, Véronique; Soquet, Alain

    2007-01-01

    This paper deals with the influence of ambient speech on individual speech productions. A methodological framework is defined to gather the experimental data necessary to feed computer models simulating self-organisation in phonological systems. Two experiments were carried out. Experiment 1 was run on French native speakers from two regiolects of Belgium: two from Liège and two from Brussels. When exposed to the way of speaking of the other regiolect via loudspeakers, the speakers of one regiolect produced vowels that were significantly different from their typical realisations, and significantly closer to the way of speaking specific of the other regiolect. Experiment 2 achieved a replication of the results for 8 Mons speakers hearing a Liège speaker. A significant part of the imitative effect remained up to 10 min after the end of the exposure to the other regiolect productions. As a whole, the results suggest that: (i) imitation occurs automatically and unintentionally, (ii) the modified realisations leave a memory trace, in which case the mechanism may be better defined as 'mimesis' than as 'imitation'. The potential effects of multiple imitative speech interactions on sound change are discussed in this paper, as well as the implications for a general theory of phonetic implementation and phonetic representation.

  4. Speech enhancement based on modified phase-opponency detectors

    NASA Astrophysics Data System (ADS)

    Deshmukh, Om D.; Espy-Wilson, Carol Y.

    2005-09-01

    A speech enhancement algorithm based on a neural model was presented by Deshmukh et al. [149th meeting of the Acoustical Society of America, 2005]. The algorithm consists of a bank of Modified Phase Opponency (MPO) filter pairs tuned to different center frequencies, and it is able to enhance salient spectral features in speech signals even at low signal-to-noise ratios. However, the algorithm introduces musical noise and sometimes misses a spectral peak that is close in frequency to a stronger spectral peak. A refinement of the MPO filter design was recently made that takes advantage of the falling spectrum of the speech signal in sonorant regions. The modified set of filters leads to better separation of the noise and speech signals and more accurate enhancement of spectral peaks, as well as a significant reduction in musical noise. Continuity algorithms based on the properties of speech signals are used to further reduce the musical-noise effect. The efficiency of the proposed method in enhancing the speech signal when the level of the background noise is fluctuating will be demonstrated, and the performance of the improved speech enhancement method will be compared with various spectral subtraction-based methods. [Work supported by NSF BCS0236707.]

  5. Development of a Low-Cost, Noninvasive, Portable Visual Speech Recognition Program.

    PubMed

    Kohlberg, Gavriel D; Gal, Ya'akov Kobi; Lalwani, Anil K

    2016-09-01

    Loss of speech following tracheostomy and laryngectomy severely limits communication to simple gestures and facial expressions that are largely ineffective. To facilitate communication in these patients, we seek to develop a low-cost, noninvasive, portable, and simple visual speech recognition program (VSRP) to convert articulatory facial movements into speech. A Microsoft Kinect-based VSRP was developed to capture spatial coordinates of lip movements and translate them into speech. The articulatory speech movements associated with 12 sentences were used to train an artificial neural network classifier. The accuracy of the classifier was then evaluated on a separate, previously unseen set of articulatory speech movements. The VSRP was successfully implemented and tested in 5 subjects. It achieved an accuracy rate of 77.2% (65.0%-87.6% for the 5 speakers) on a 12-sentence data set. The mean time to classify an individual sentence was 2.03 milliseconds (1.91-2.16). We have demonstrated the feasibility of a low-cost, noninvasive, portable VSRP based on Kinect to accurately predict speech from articulation movements in clinically trivial time. This VSRP could be used as a novel communication device for aphonic patients. © The Author(s) 2016.
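
    A minimal stand-in for the recognition step is sketched below: resample each lip-landmark trajectory to a fixed length, flatten it, and train a small multilayer perceptron to predict the sentence identity. The landmark format, resampling length, and network size are assumptions, and the Kinect capture side is not shown.

```python
# Sentence classification over flattened lip-landmark trajectories, with an
# MLP standing in for the (unspecified) network used in the study. Landmark
# format, resampling length, and network size are assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

def flatten_trajectory(landmarks, n_frames=40):
    """landmarks: (T, n_points, 3) lip coordinates; resample to a fixed length."""
    idx = np.linspace(0, len(landmarks) - 1, n_frames).round().astype(int)
    return landmarks[idx].reshape(-1)

# trajectories and sentence_ids are user-supplied:
# X = np.stack([flatten_trajectory(t) for t in trajectories])
# clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000).fit(X, sentence_ids)
```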

  6. Content-based TV sports video retrieval using multimodal analysis

    NASA Astrophysics Data System (ADS)

    Yu, Yiqing; Liu, Huayong; Wang, Hongbin; Zhou, Dongru

    2003-09-01

    In this paper, we propose content-based video retrieval, i.e., retrieval based on the semantic content of the video. Because video data is composed of multimodal information streams such as visual, auditory and textual streams, we describe a strategy that uses multimodal analysis to automatically parse sports video. The paper first defines the basic structure of a sports video database system, and then introduces a new approach that integrates visual stream analysis, speech recognition, speech signal processing and text extraction to realize video retrieval. The experimental results for TV sports video of football games indicate that the multimodal analysis is effective for video retrieval, allowing users to quickly browse tree-like video clips or to enter keywords within a predefined domain.

  7. Neural-scaled entropy predicts the effects of nonlinear frequency compression on speech perception

    PubMed Central

    Rallapalli, Varsha H.; Alexander, Joshua M.

    2015-01-01

    The Neural-Scaled Entropy (NSE) model quantifies information in the speech signal that has been altered beyond simple gain adjustments by sensorineural hearing loss (SNHL) and various signal processing. An extension of Cochlear-Scaled Entropy (CSE) [Stilp, Kiefte, Alexander, and Kluender (2010). J. Acoust. Soc. Am. 128(4), 2112–2126], NSE quantifies information as the change in 1-ms neural firing patterns across frequency. To evaluate the model, data from a study that examined nonlinear frequency compression (NFC) in listeners with SNHL were used because NFC can recode the same input information in multiple ways in the output, resulting in different outcomes for different speech classes. Overall, predictions were more accurate for NSE than CSE. The NSE model accurately described the observed degradation in recognition, and lack thereof, for consonants in a vowel-consonant-vowel context that had been processed in different ways by NFC. While NSE accurately predicted recognition of vowel stimuli processed with NFC, it underestimated them relative to a low-pass control condition without NFC. In addition, without modifications, it could not predict the observed improvement in recognition for word final /s/ and /z/. Findings suggest that model modifications that include information from slower modulations might improve predictions across a wider variety of conditions. PMID:26627780

  8. Temporal processing of speech in a time-feature space

    NASA Astrophysics Data System (ADS)

    Avendano, Carlos

    1997-09-01

    The performance of speech communication systems often degrades under realistic environmental conditions. Adverse environmental factors include additive noise sources, room reverberation, and transmission channel distortions. This work studies the processing of speech in the temporal-feature or modulation spectrum domain, aiming for alleviation of the effects of such disturbances. Speech reflects the geometry of the vocal organs, and the linguistically dominant component is in the shape of the vocal tract. At any given point in time, the shape of the vocal tract is reflected in the short-time spectral envelope of the speech signal. The rate of change of the vocal tract shape appears to be important for the identification of linguistic components. This rate of change, or the rate of change of the short-time spectral envelope can be described by the modulation spectrum, i.e. the spectrum of the time trajectories described by the short-time spectral envelope. For a wide range of frequency bands, the modulation spectrum of speech exhibits a maximum at about 4 Hz, the average syllabic rate. Disturbances often have modulation frequency components outside the speech range, and could in principle be attenuated without significantly affecting the range with relevant linguistic information. Early efforts for exploiting the modulation spectrum domain (temporal processing), such as the dynamic cepstrum or the RASTA processing, used ad hoc designed processing and appear to be suboptimal. As a major contribution, in this dissertation we aim for a systematic data-driven design of temporal processing. First we analytically derive and discuss some properties and merits of temporal processing for speech signals. We attempt to formalize the concept and provide a theoretical background which has been lacking in the field. In the experimental part we apply temporal processing to a number of problems including adaptive noise reduction in cellular telephone environments, reduction of reverberation for speech enhancement, and improvements on automatic recognition of speech degraded by linear distortions and reverberation.
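
    The modulation spectrum discussed here can be approximated by taking the spectrum of each frequency band's log-envelope over time, as in the sketch below; for clean speech the band-averaged result typically peaks near 4 Hz. Frame sizes and the averaging across bands are illustrative choices.

```python
# Modulation spectrum sketch: spectrum of each band's log envelope over time,
# averaged across bands. Frame sizes and the averaging are illustrative choices.
import numpy as np
from scipy.signal import stft

def modulation_spectrum(x, fs, hop_s=0.010):
    nperseg = int(0.025 * fs)
    _, _, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=nperseg - int(hop_s * fs))
    log_env = np.log(np.abs(Z) + 1e-12)                   # (bands, frames)
    log_env -= log_env.mean(axis=1, keepdims=True)        # remove each band's DC
    mod = np.abs(np.fft.rfft(log_env, axis=1))            # spectrum of each trajectory
    mod_freqs = np.fft.rfftfreq(log_env.shape[1], d=hop_s)
    return mod_freqs, mod.mean(axis=0)                    # peak typically near ~4 Hz
```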

  9. Experimental comparison between speech transmission index, rapid speech transmission index, and speech intelligibility index.

    PubMed

    Larm, Petra; Hongisto, Valtteri

    2006-02-01

    During the acoustical design of, e.g., auditoria or open-plan offices, it is important to know how speech can be perceived in various parts of the room. Different objective methods have been developed to measure and predict speech intelligibility, and these have been extensively used in various spaces. In this study, two such methods were compared, the speech transmission index (STI) and the speech intelligibility index (SII). Also the simplification of the STI, the room acoustics speech transmission index (RASTI), was considered. These quantities are all based on determining an apparent speech-to-noise ratio on selected frequency bands and summing them using a specific weighting. For comparison, some data were needed on the possible differences of these methods resulting from the calculation scheme and also measuring equipment. Their prediction accuracy was also of interest. Measurements were made in a laboratory having adjustable noise level and absorption, and in a real auditorium. It was found that the measurement equipment, especially the selection of the loudspeaker, can greatly affect the accuracy of the results. The prediction accuracy of the RASTI was found acceptable, if the input values for the prediction are accurately known, even though the studied space was not ideally diffuse.
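
    All three indices rest on the same idea of clamping per-band apparent SNRs and combining them with band-importance weights. The toy sketch below illustrates that idea only; the uniform default weights and the simple [-15, +15] dB mapping are placeholders, not the standardized STI or SII procedures.

```python
# Toy band-weighted index in the spirit of SII/STI: clamp per-band apparent
# SNRs to [-15, +15] dB, rescale to [0, 1], and combine with importance
# weights. Uniform weights are placeholders, not the standardized values.
import numpy as np

def toy_intelligibility_index(speech_levels_db, noise_levels_db, weights=None):
    snr = np.asarray(speech_levels_db, float) - np.asarray(noise_levels_db, float)
    audibility = (np.clip(snr, -15.0, 15.0) + 15.0) / 30.0   # 0 = masked, 1 = audible
    w = np.ones_like(audibility) if weights is None else np.asarray(weights, float)
    return float(np.sum(w * audibility) / np.sum(w))
```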

  10. Development of the Russian matrix sentence test.

    PubMed

    Warzybok, Anna; Zokoll, Melanie; Wardenga, Nina; Ozimek, Edward; Boboshko, Maria; Kollmeier, Birger

    2015-01-01

    To develop the Russian matrix sentence test for speech intelligibility measurements in noise. Test development included recordings, optimization of speech material, and evaluation to investigate the equivalency of the test lists and training. For each of the 500 test items, the speech intelligibility function, speech reception threshold (SRT: signal-to-noise ratio, SNR, that provides 50% speech intelligibility), and slope was obtained. The speech material was homogenized by applying level corrections. In evaluation measurements, speech intelligibility was measured at two fixed SNRs to compare list-specific intelligibility functions. To investigate the training effect and establish reference data, speech intelligibility was measured adaptively. Overall, 77 normal-hearing native Russian listeners. The optimization procedure decreased the spread in SRTs across words from 2.8 to 0.6 dB. Evaluation measurements confirmed that the 16 test lists were equivalent, with a mean SRT of -9.5 ± 0.2 dB and a slope of 13.8 ± 1.6%/dB. The reference SRT, -8.8 ± 0.8 dB for the open-set and -9.4 ± 0.8 dB for the closed-set format, increased slightly for noise levels above 75 dB SPL. The Russian matrix sentence test is suitable for accurate and reliable speech intelligibility measurements in noise.

  11. Fundamental frequency discrimination and speech perception in noise in cochlear implant simulationsa)

    PubMed Central

    Carroll, Jeff; Zeng, Fan-Gang

    2007-01-01

    Increasing the number of channels at low frequencies improves discrimination of fundamental frequency (F0) in cochlear implants [Geurts and Wouters 2004]. We conducted three experiments to test whether improved F0 discrimination can be translated into increased speech intelligibility in noise in a cochlear implant simulation. The first experiment measured F0 discrimination and speech intelligibility in quiet as a function of channel density over different frequency regions. The results from this experiment showed a tradeoff in performance between F0 discrimination and speech intelligibility with a limited number of channels. The second experiment tested whether improved F0 discrimination and optimizing this tradeoff could improve speech performance with a competing talker. However, improved F0 discrimination did not improve speech intelligibility in noise. The third experiment identified the critical number of channels needed at low frequencies to improve speech intelligibility in noise. The result showed that, while 16 channels below 500 Hz were needed to observe any improvement in speech intelligibility in noise, even 32 channels did not achieve normal performance. Theoretically, these results suggest that without accurate spectral coding, F0 discrimination and speech perception in noise are two independent processes. Practically, the present results illustrate the need to increase the number of independent channels in cochlear implants. PMID:17604581

  12. Application of a Novel Semi-Automatic Technique for Determining the Bilateral Symmetry Plane of the Facial Skeleton of Normal Adult Males.

    PubMed

    Roumeliotis, Grayson; Willing, Ryan; Neuert, Mark; Ahluwalia, Romy; Jenkyn, Thomas; Yazdani, Arjang

    2015-09-01

    The accurate assessment of symmetry in the craniofacial skeleton is important for cosmetic and reconstructive craniofacial surgery. Although there have been several published attempts to develop an accurate system for determining the correct plane of symmetry, all are inaccurate and time consuming. Here, the authors applied a novel semi-automatic method for the calculation of craniofacial symmetry, based on principal component analysis and iterative corrective point computation, to a large sample of normal adult male facial computerized tomography scans obtained clinically (n = 32). The authors hypothesized that this method would generate planes of symmetry that would result in less error when one side of the face was compared to the other than a symmetry plane generated using a plane defined by cephalometric landmarks. When a three-dimensional model of one side of the face was reflected across the semi-automatic plane of symmetry there was less error than when reflected across the cephalometric plane. The semi-automatic plane was also more accurate when the locations of bilateral cephalometric landmarks (eg, frontozygomatic sutures) were compared across the face. The authors conclude that this method allows for accurate and fast measurements of craniofacial symmetry. This has important implications for studying the development of the facial skeleton, and clinical application for reconstruction.
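
    The PCA step of such a method can be sketched as follows: compute the principal axes of the 3-D point cloud and keep the axis whose mirror reflection best maps the cloud onto itself as the symmetry-plane normal. This is only an initial-plane estimate under stated assumptions; the iterative corrective point computation of the published method is not reproduced.

```python
# PCA-based initial symmetry-plane estimate for a (subsampled) 3-D point
# cloud; the iterative corrective-point refinement is not reproduced.
import numpy as np

def reflect(points, centroid, normal):
    d = (points - centroid) @ normal
    return points - 2.0 * d[:, None] * normal          # mirror across the plane

def pca_symmetry_plane(points):
    """Pick the principal axis whose mirror image best maps the cloud onto itself."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    best_normal, best_err = vt[0], np.inf
    for normal in vt:                                  # each principal axis is a candidate
        mirrored = reflect(points, centroid, normal)
        # crude symmetry score: mean nearest-neighbour distance (O(N^2) memory,
        # so use a subsampled cloud)
        dists = np.linalg.norm(mirrored[:, None, :] - points[None, :, :], axis=2)
        err = dists.min(axis=1).mean()
        if err < best_err:
            best_normal, best_err = normal, err
    return centroid, best_normal                       # plane: (p - centroid) . n = 0
```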

  13. An automatic and accurate method of full heart segmentation from CT image based on linear gradient model

    NASA Astrophysics Data System (ADS)

    Yang, Zili

    2017-07-01

    Heart segmentation is an important auxiliary method in the diagnosis of many heart diseases, such as coronary heart disease and atrial fibrillation, and in the planning of tumor radiotherapy. Most existing methods for full heart segmentation treat the heart as a whole and cannot accurately extract the bottom of the heart. In this paper, we propose a new method based on a linear gradient model to segment the whole heart from CT images automatically and accurately. Twelve cases were used to test this method; accurate segmentation results were achieved and confirmed by clinical experts. The results can provide reliable clinical support.

  14. Semi Automatic Ontology Instantiation in the domain of Risk Management

    NASA Astrophysics Data System (ADS)

    Makki, Jawad; Alquier, Anne-Marie; Prince, Violaine

    One of the challenging tasks in the context of Ontological Engineering is to automatically or semi-automatically support the process of Ontology Learning and Ontology Population from semi-structured documents (texts). In this paper we describe a semi-automatic Ontology Instantiation method from natural language text in the domain of Risk Management. The method is composed of three steps: 1) annotation with part-of-speech tags, 2) semantic relation instance extraction, and 3) the ontology instantiation process. It is based on combined NLP techniques, with human intervention between steps 2 and 3 for control and validation. Since it relies heavily on linguistic rather than domain knowledge, it is not domain dependent, which is a good feature for portability between the different fields of risk management application. The proposed methodology uses the ontology of the PRIMA project (supported by the European Community) as a generic domain ontology and populates it from an available corpus. A first validation of the approach is done through an experiment with Chemical Fact Sheets from the Environmental Protection Agency.

  15. Four-Channel Biosignal Analysis and Feature Extraction for Automatic Emotion Recognition

    NASA Astrophysics Data System (ADS)

    Kim, Jonghwa; André, Elisabeth

    This paper investigates the potential of physiological signals as a reliable channel for automatic recognition of a user's emotional state. For emotion recognition, little attention has been paid so far to physiological signals compared with audio-visual emotion channels such as facial expression or speech. All essential stages of an automatic recognition system using biosignals are discussed, from recording the physiological dataset to feature-based multiclass classification. Four-channel biosensors are used to measure electromyogram, electrocardiogram, skin conductivity and respiration changes. A wide range of physiological features from various analysis domains, including time/frequency, entropy, geometric analysis, subband spectra, and multiscale entropy, is proposed in order to find the best emotion-relevant features and to correlate them with emotional states. The best extracted features are specified in detail and their effectiveness is demonstrated by the emotion recognition results.
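
    A heavily reduced version of the feature-extraction stage is sketched below: a handful of time- and frequency-domain statistics per channel, concatenated across the four biosensor channels. The richer feature families listed above (entropy, subband spectra, multiscale entropy) are not reproduced, and the channel ordering is an assumption.

```python
# Heavily reduced per-channel feature sketch (mean, std, RMS, dominant
# frequency); the richer feature families described above are not reproduced.
import numpy as np

def channel_features(x, fs):
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    return np.array([
        x.mean(),
        x.std(),
        np.sqrt(np.mean(x ** 2)),        # RMS
        freqs[np.argmax(spectrum)],      # dominant frequency
    ])

# one trial = EMG, ECG, skin conductance, respiration (channel order assumed):
# feats = np.concatenate([channel_features(sig, fs) for sig, fs in zip(channels, rates)])
```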

  16. What iconic gesture fragments reveal about gesture-speech integration: when synchrony is lost, memory can help.

    PubMed

    Obermeier, Christian; Holle, Henning; Gunter, Thomas C

    2011-07-01

    The present series of experiments explores several issues related to gesture-speech integration and synchrony during sentence processing. To be able to more precisely manipulate gesture-speech synchrony, we used gesture fragments instead of complete gestures, thereby avoiding the usual long temporal overlap of gestures with their coexpressive speech. In a pretest, the minimal duration of an iconic gesture fragment needed to disambiguate a homonym (i.e., disambiguation point) was therefore identified. In three subsequent ERP experiments, we then investigated whether the gesture information available at the disambiguation point has immediate as well as delayed consequences on the processing of a temporarily ambiguous spoken sentence, and whether these gesture-speech integration processes are susceptible to temporal synchrony. Experiment 1, which used asynchronous stimuli as well as an explicit task, showed clear N400 effects at the homonym as well as at the target word presented further downstream, suggesting that asynchrony does not prevent integration under explicit task conditions. No such effects were found when asynchronous stimuli were presented using a more shallow task (Experiment 2). Finally, when gesture fragment and homonym were synchronous, similar results as in Experiment 1 were found, even under shallow task conditions (Experiment 3). We conclude that when iconic gesture fragments and speech are in synchrony, their interaction is more or less automatic. When they are not, more controlled, active memory processes are necessary to be able to combine the gesture fragment and speech context in such a way that the homonym is disambiguated correctly.

  17. A physiologically-inspired model reproducing the speech intelligibility benefit in cochlear implant listeners with residual acoustic hearing.

    PubMed

    Zamaninezhad, Ladan; Hohmann, Volker; Büchner, Andreas; Schädler, Marc René; Jürgens, Tim

    2017-02-01

    This study introduces a speech intelligibility model for cochlear implant users with ipsilateral preserved acoustic hearing that aims at simulating the observed speech-in-noise intelligibility benefit when receiving simultaneous electric and acoustic stimulation (EA-benefit). The model simulates the auditory nerve spiking in response to electric and/or acoustic stimulation. The temporally and spatially integrated spiking patterns were used as the final internal representation of noisy speech. Speech reception thresholds (SRTs) in stationary noise were predicted for a sentence test using an automatic speech recognition framework. The model was employed to systematically investigate the effect of three physiologically relevant model factors on simulated SRTs: (1) the spatial spread of the electric field, which co-varies with the number of electrically stimulated auditory nerves, (2) the "internal" noise simulating the deprivation of the auditory system, and (3) the upper frequency limit of acoustic hearing. The model results show that the simulated SRTs increase monotonically with increasing spatial spread for fixed internal noise, and also increase with increasing internal noise strength for a fixed spatial spread. The predicted EA-benefit does not follow such a systematic trend and depends on the specific combination of the model parameters. Beyond 300 Hz, the upper frequency limit of preserved acoustic hearing has little influence on the speech intelligibility of EA-listeners in stationary noise. The EA-benefits predicted by the proposed model are within the range of EA-benefits shown by 18 out of 21 actual cochlear implant listeners with preserved acoustic hearing. Copyright © 2016 Elsevier B.V. All rights reserved.

  18. Distributed neural signatures of natural audiovisual speech and music in the human auditory cortex.

    PubMed

    Salmi, Juha; Koistinen, Olli-Pekka; Glerean, Enrico; Jylänki, Pasi; Vehtari, Aki; Jääskeläinen, Iiro P; Mäkelä, Sasu; Nummenmaa, Lauri; Nummi-Kuisma, Katarina; Nummi, Ilari; Sams, Mikko

    2017-08-15

    During a conversation or when listening to music, auditory and visual information are combined automatically into audiovisual objects. However, it is still poorly understood how specific types of visual information shape neural processing of sounds in lifelike stimulus environments. Here we applied multi-voxel pattern analysis to investigate how naturally matching visual input modulates supratemporal cortex activity during processing of naturalistic acoustic speech, singing and instrumental music. Bayesian logistic regression classifiers with sparsity-promoting priors were trained to predict whether the stimulus was audiovisual or auditory, and whether it contained piano playing, speech, or singing. The predictive performance of the classifiers was tested by leaving out one participant at a time for testing and training the model on the remaining 15 participants. The signature patterns associated with unimodal auditory stimuli encompassed distributed locations mostly in the middle and superior temporal gyrus (STG/MTG). A pattern regression analysis, based on a continuous acoustic model, revealed that activity in some of these MTG and STG areas was associated with acoustic features present in speech and music stimuli. Concurrent visual stimulation modulated activity in bilateral MTG (speech), the lateral aspect of the right anterior STG (singing), and bilateral parietal opercular cortex (piano). Our results suggest that specific supratemporal brain areas are involved in processing complex natural speech, singing, and piano playing, and other brain areas located in anterior (facial speech) and posterior (music-related hand actions) supratemporal cortex are influenced by related visual information. Those anterior and posterior supratemporal areas have been linked to stimulus identification and sensory-motor integration, respectively. Copyright © 2017 Elsevier Inc. All rights reserved.

  19. TRECVID: the utility of a content-based video retrieval evaluation

    NASA Astrophysics Data System (ADS)

    Hauptmann, Alexander G.

    2006-01-01

    TRECVID, an annual retrieval evaluation benchmark organized by NIST, encourages research in information retrieval from digital video. TRECVID benchmarking covers both interactive and manual searching by end users, as well as the benchmarking of some supporting technologies including shot boundary detection, extraction of semantic features, and the automatic segmentation of TV news broadcasts. Evaluations done in the context of the TRECVID benchmarks show that, generally, speech transcripts and annotations provide the single most important clue for successful retrieval. However, automatically finding the individual images is still a tremendous and unsolved challenge. The evaluations repeatedly found that none of the multimedia analysis and retrieval techniques provide a significant benefit over retrieval using only textual information such as automatic speech recognition transcripts or closed captions. In interactive systems, we do find significant differences among the top systems, indicating that interfaces can make a huge difference for effective video/image search. For interactive tasks, efficient interfaces require few key clicks but display large numbers of images for visual inspection by the user. Text search generally finds the right context region in the video, but selecting specific relevant images requires good interfaces for easily browsing the storyboard pictures. In general, TRECVID has motivated the video retrieval community to be honest about what we don't know how to do well (sometimes through painful failures), and has focused the community on the actual task of video retrieval, as opposed to flashy demos based on technological capabilities.

  20. Speech recognition training for enhancing written language generation by a traumatic brain injury survivor.

    PubMed

    Manasse, N J; Hux, K; Rankin-Erickson, J L

    2000-11-01

    Impairments in motor functioning, language processing, and cognitive status may impact the written language performance of traumatic brain injury (TBI) survivors. One strategy to minimize the impact of these impairments is to use a speech recognition system. The purpose of this study was to explore the effect of mild dysarthria and mild cognitive-communication deficits secondary to TBI on a 19-year-old survivor's mastery and use of such a system, specifically Dragon NaturallySpeaking. Data included the percentage of the participant's words accurately perceived by the system over time, the participant's accuracy over time in using commands for navigation and error correction, and quantitative and qualitative changes in the participant's written texts generated with and without the use of the speech recognition system. Results showed that Dragon NaturallySpeaking was approximately 80% accurate in perceiving words spoken by the participant, and the participant quickly and easily mastered all navigation and error correction commands presented. Quantitatively, the participant produced a greater amount of text using traditional word processing and a standard keyboard than using the speech recognition system. Minimal qualitative differences appeared between writing samples. Discussion of factors that may have contributed to the obtained results and that may affect the generalization of the findings to other TBI survivors is provided.

  1. Robust fundamental frequency estimation in sustained vowels: Detailed algorithmic comparisons and information fusion with adaptive Kalman filtering

    PubMed Central

    Tsanas, Athanasios; Zañartu, Matías; Little, Max A.; Fox, Cynthia; Ramig, Lorraine O.; Clifford, Gari D.

    2014-01-01

    There has been consistent interest among speech signal processing researchers in the accurate estimation of the fundamental frequency (F0) of speech signals. This study examines ten F0 estimation algorithms (some well-established and some proposed more recently) to determine which of these algorithms is, on average, better able to estimate F0 in the sustained vowel /a/. Moreover, a robust method for adaptively weighting the estimates of individual F0 estimation algorithms based on quality and performance measures is proposed, using an adaptive Kalman filter (KF) framework. The accuracy of the algorithms is validated using (a) a database of 117 synthetic realistic phonations obtained using a sophisticated physiological model of speech production and (b) a database of 65 recordings of human phonations where the glottal cycles are calculated from electroglottograph signals. On average, the sawtooth waveform inspired pitch estimator and the nearly defect-free algorithms provided the best individual F0 estimates, and the proposed KF approach resulted in a ∼16% improvement in accuracy over the best single F0 estimation algorithm. These findings may be useful in speech signal processing applications where sustained vowels are used to assess vocal quality, when very accurate F0 estimation is required. PMID:24815269
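
    The sketch below illustrates the general idea of fusing several per-frame F0 estimates with a scalar Kalman filter whose measurement noise grows when the algorithms disagree; it is a minimal illustration under simplified assumptions, not the authors' adaptive KF formulation, and the three simulated "algorithms" are placeholders.

```python
# Minimal fusion sketch (not the authors' exact formulation): each frame's
# measurement is the mean of the individual F0 estimates, and the measurement
# noise shrinks when the algorithms agree.
import numpy as np

def fuse_f0(estimates, process_var=4.0):
    """estimates: array of shape (n_frames, n_algorithms) of F0 values in Hz."""
    estimates = np.asarray(estimates, dtype=float)
    x = estimates[0].mean()          # initial fused F0 state
    p = 100.0                        # initial state variance
    fused = []
    for frame in estimates:
        z = frame.mean()             # measurement: average across algorithms
        r = frame.var() + 1.0        # measurement noise: disagreement + floor
        p += process_var             # predict: F0 assumed locally constant
        k = p / (p + r)              # Kalman gain
        x += k * (z - x)             # update state with fused measurement
        p *= (1.0 - k)
        fused.append(x)
    return np.array(fused)

# Three hypothetical algorithms tracking a sustained vowel around 120 Hz.
rng = np.random.default_rng(0)
est = 120 + rng.normal(0, [1.0, 3.0, 8.0], size=(50, 3))
print(fuse_f0(est)[:5])
```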

  2. Temporal Sensitivity Measured Shortly After Cochlear Implantation Predicts 6-Month Speech Recognition Outcome.

    PubMed

    Erb, Julia; Ludwig, Alexandra Annemarie; Kunke, Dunja; Fuchs, Michael; Obleser, Jonas

    2018-04-24

    Psychoacoustic tests assessed shortly after cochlear implantation are useful predictors of the rehabilitative speech outcome. While largely independent, both spectral and temporal resolution tests are important to provide an accurate prediction of speech recognition. However, rapid tests of temporal sensitivity are currently lacking. Here, we propose a simple amplitude modulation rate discrimination (AMRD) paradigm that is validated by predicting future speech recognition in adult cochlear implant (CI) patients. In 34 newly implanted patients, we used an adaptive AMRD paradigm, where broadband noise was modulated at the speech-relevant rate of ~4 Hz. In a longitudinal study, speech recognition in quiet was assessed using the closed-set Freiburger number test shortly after cochlear implantation (t0) as well as the open-set Freiburger monosyllabic word test 6 months later (t6). Both AMRD thresholds at t0 (r = -0.51) and speech recognition scores at t0 (r = 0.56) predicted speech recognition scores at t6. However, AMRD and speech recognition at t0 were uncorrelated, suggesting that those measures capture partially distinct perceptual abilities. A multiple regression model predicting 6-month speech recognition outcome with deafness duration and speech recognition at t0 improved from adjusted R² = 0.30 to adjusted R² = 0.44 when AMRD threshold was added as a predictor. These findings identify AMRD thresholds as a reliable, nonredundant predictor above and beyond established speech tests for CI outcome. This AMRD test could potentially be developed into a rapid clinical temporal-resolution test to be integrated into the postoperative test battery to improve the reliability of speech outcome prognosis.
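
    The adjusted R² comparison reported above can be reproduced in form (not in substance) with synthetic data; the sketch below fits a baseline regression and a regression with an added AMRD predictor and prints both adjusted R² values. All variables and effect sizes are invented for illustration.

```python
# Illustrative only (synthetic data, not the study's): comparing adjusted R^2
# of a baseline model (deafness duration + t0 speech score) with a model that
# adds the AMRD threshold as a third predictor.
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS by least squares and return adjusted R^2."""
    X1 = np.column_stack([np.ones(len(y)), X])      # add intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    n, p = X1.shape
    return 1 - (ss_res / (n - p)) / (ss_tot / (n - 1))

rng = np.random.default_rng(1)
n = 34
duration = rng.uniform(1, 30, n)      # years of deafness (hypothetical)
t0_score = rng.uniform(20, 90, n)     # % correct shortly after implantation
amrd = rng.uniform(1, 10, n)          # AMRD threshold (hypothetical units)
t6_score = 0.4 * t0_score - 0.5 * duration - 2.0 * amrd + rng.normal(0, 8, n)

print("baseline :", adjusted_r2(np.column_stack([duration, t0_score]), t6_score))
print("with AMRD:", adjusted_r2(np.column_stack([duration, t0_score, amrd]), t6_score))
```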

  3. The effect of simultaneous text on the recall of noise-degraded speech.

    PubMed

    Grossman, Irina; Rajan, Ramesh

    2017-05-01

    Written and spoken language utilize the same processing system, enabling text to modulate speech processing. We investigated how simultaneously presented text affected speech recall in babble noise using a retrospective recall task. Participants were presented with text-speech sentence pairs in multitalker babble noise and then prompted to recall what they heard or what they read. In Experiment 1, sentence pairs were either congruent or incongruent and they were presented in silence or at 1 of 4 noise levels. Audio and Visual control groups were also tested with sentences presented in only 1 modality. Congruent text facilitated accurate recall of degraded speech; incongruent text had no effect. Text and speech were seldom confused for each other. Consideration of the effects of language background found that monolingual English speakers outperformed early multilinguals at recalling degraded speech; however, the effects of text on speech processing were analogous. Experiment 2 considered whether the benefit provided by matching text was maintained when the congruency of the text and speech became more ambiguous because of the addition of partially mismatching text-speech sentence pairs that differed only on their final keyword and because of the use of low signal-to-noise ratios. The experiment focused on monolingual English speakers; the results showed that even though participants commonly confused text for speech during incongruent text-speech pairings, these confusions could not fully account for the benefit provided by matching text. Thus, we uniquely demonstrate that congruent text benefits the recall of noise-degraded speech. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  4. Development of an Automatic Differentiation Version of the FPX Rotor Code

    NASA Technical Reports Server (NTRS)

    Hu, Hong

    1996-01-01

    The ADIFOR 2.0 automatic differentiator is applied to the FPX rotor code along with the grid generator GRGN3. The FPX is an eXtended Full-Potential CFD code for rotor calculations. The automatic differentiation version of the code is obtained, which provides both non-geometry and geometry sensitivity derivatives. The sensitivity derivatives via automatic differentiation are presented and compared with divided-difference-generated derivatives. The study shows that the automatic differentiation method gives accurate derivative values in an efficient manner.
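
    ADIFOR transforms Fortran source and cannot be shown compactly here; the sketch below instead illustrates the underlying idea with a tiny forward-mode automatic differentiation class (dual numbers) applied to a toy function, and compares the AD derivative with a central divided-difference estimate, mirroring the comparison made in the study.

```python
# Tiny forward-mode AD sketch (dual numbers) versus a divided-difference
# estimate; the toy function is a stand-in for a flow quantity depending on
# one design variable, not the FPX code itself.
import math

class Dual:
    """Number carrying a value and its derivative (forward-mode AD)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__
    def sin(self):
        return Dual(math.sin(self.val), math.cos(self.val) * self.dot)

def f(x):
    # Toy function: sin(x^2 + 3x).
    return (x * x + 3.0 * x).sin() if isinstance(x, Dual) else math.sin(x * x + 3.0 * x)

x0 = 0.7
ad = f(Dual(x0, 1.0)).dot                     # exact derivative via AD
h = 1e-5
dd = (f(x0 + h) - f(x0 - h)) / (2 * h)        # central divided difference
print(ad, dd)
```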

  5. Intonation and dialog context as constraints for speech recognition.

    PubMed

    Taylor, P; King, S; Isard, S; Wright, H

    1998-01-01

    This paper describes a way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Maptask corpus, a corpus of spontaneous task-oriented dialog speech. This corpus has been tagged according to a dialog analysis scheme that assigns each utterance to one of 12 "move types," such as "acknowledge," "query-yes/no" or "instruct." Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the "correct" move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops. Of course when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type we use an intonation model combined with a dialog model that puts constraints on possible sequences of move types, as well as the speech recognizer likelihoods for the different move-specific models. In the full recognition system, the combination of automatic move type recognition with the move specific language models reduces the overall word error rate by a small but significant amount when compared with a baseline system that does not take intonation or dialog acts into account. Interestingly, the word error improvement is restricted to "initiating" move types, where word recognition is important. In "response" move types, where the important information is conveyed by the move type itself--for example, positive versus negative response--there is no word error improvement, but recognition of the response types themselves is good. The paper discusses the intonation model, the language models, and the dialog model in detail and describes the architecture in which they are combined.
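
    A minimal sketch of the core idea, move-specific bigram language models combined with a dialog-model prior over move types, is given below. The training sentences, move types, and prior probabilities are invented placeholders; the real system also integrates intonation features and recognizer likelihoods, which are omitted here.

```python
# One add-one-smoothed bigram language model per dialog move type, combined
# with a prior over move types to rescore a recognition hypothesis.
import math
from collections import defaultdict

def train_bigram(sentences):
    """Build an add-one-smoothed bigram log-probability function."""
    uni, bi, vocab = defaultdict(int), defaultdict(int), set()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            uni[a] += 1
            bi[(a, b)] += 1
    V = len(vocab)
    def logprob(sent):
        toks = ["<s>"] + sent + ["</s>"]
        return sum(math.log((bi[(a, b)] + 1) / (uni[a] + V))
                   for a, b in zip(toks, toks[1:]))
    return logprob

# Hypothetical training data per move type.
lms = {
    "acknowledge": train_bigram([["okay"], ["right", "okay"], ["uh", "huh"]]),
    "instruct": train_bigram([["go", "left"], ["go", "straight", "ahead"]]),
}
dialog_prior = {"acknowledge": 0.6, "instruct": 0.4}   # from the dialog model

hyp = ["okay"]
scores = {m: lm(hyp) + math.log(dialog_prior[m]) for m, lm in lms.items()}
print(max(scores, key=scores.get), scores)
```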

  6. Integrating Speech and Iconic Gestures in a Stroop-Like Task: Evidence for Automatic Processing

    ERIC Educational Resources Information Center

    Kelly, Spencer D.; Creigh, Peter; Bartolotti, James

    2010-01-01

    Previous research has demonstrated a link between language and action in the brain. The present study investigates the strength of this neural relationship by focusing on a potential interface between the two systems: cospeech iconic gesture. Participants performed a Stroop-like task in which they watched videos of a man and a woman speaking and…

  7. Automatic Activation of Phonological Templates for Native but Not Nonnative Phonemes: An Investigation of the Temporal Dynamics of Mu Activation

    ERIC Educational Resources Information Center

    Santos-Oliveira, Daniela Cristina

    2017-01-01

    Models of speech perception suggest a dorsal stream connecting the temporal and inferior parietal lobe with the inferior frontal gyrus. This stream is thought to involve an auditory motor loop that translates acoustic information into motor/articulatory commands and is further influenced by decision making processes that involve maintenance of…

  8. Towards the Development of a Comprehensive Pedagogical Framework for Pronunciation Training Based on Adapted Automatic Speech Recognition Systems

    ERIC Educational Resources Information Center

    Ali, Saandia

    2016-01-01

    This paper reports on the early stages of a locally funded research and development project taking place at Rennes 2 university. It aims at developing a comprehensive pedagogical framework for pronunciation training for adult learners of English. This framework will combine a direct approach to pronunciation training (face-to-face teaching) with…

  9. Automatic speech recognition in air-ground data link

    NASA Technical Reports Server (NTRS)

    Armstrong, Herbert B.

    1989-01-01

    In the present air traffic system, information presented to the transport aircraft cockpit crew may originate from a variety of sources and may be presented to the crew in visual or aural form, either through cockpit instrument displays or, most often, through voice communication. Voice radio communications are the most error-prone method for air-ground data link. Voice messages can be misstated or misunderstood, and radio frequency congestion can delay or obscure important messages. To prevent proliferation of displays, a multiplexed data link display can be designed to present information from multiple data link sources on a shared cockpit display unit (CDU), a multi-function display (MFD), or some future combination of flight management and data link information. An aural data link which incorporates an automatic speech recognition (ASR) system for crew response offers several advantages over visual displays. The possibility of applying ASR to the air-ground data link was investigated. The first step was to review current efforts in ASR applications in the cockpit and in air traffic control and to evaluate their possible data link application. Next, a series of preliminary research questions is to be developed for possible future collaboration.

  10. Measures of voiced frication for automatic classification

    NASA Astrophysics Data System (ADS)

    Jackson, Philip J. B.; Jesus, Luis M. T.; Shadle, Christine H.; Pincas, Jonathan

    2004-05-01

    As an approach to understanding the characteristics of the acoustic sources in voiced fricatives, it seems apt to draw on knowledge of vowels and voiceless fricatives, which have been relatively well studied. However, the presence of both phonation and frication in these mixed-source sounds offers the possibility of mutual interaction effects, with variations across place of articulation. This paper examines the acoustic and articulatory consequences of these interactions and explores automatic techniques for finding parametric and statistical descriptions of these phenomena. A reliable and consistent set of such acoustic cues could be used for phonetic classification or speech recognition. Following work on devoicing of European Portuguese voiced fricatives [Jesus and Shadle, in Mamede et al. (eds.) (Springer-Verlag, Berlin, 2003), pp. 1-8] and the modulating effect of voicing on frication [Jackson and Shadle, J. Acoust. Soc. Am. 108, 1421-1434 (2000)], the present study focuses on three types of information: (i) sequences and durations of acoustic events in VC transitions, (ii) temporal, spectral and modulation measures from the periodic and aperiodic components of the acoustic signal, and (iii) voicing activity derived from simultaneous EGG data. Analyses of interactions observed in British/American English and European Portuguese speech corpora will be compared, and the principal findings discussed.

  11. Automated Vocal Analysis of Children with Hearing Loss and Their Typical and Atypical Peers

    PubMed Central

    VanDam, Mark; Oller, D. Kimbrough; Ambrose, Sophie E.; Gray, Sharmistha; Richards, Jeffrey A.; Xu, Dongxin; Gilkerson, Jill; Silbert, Noah H.; Moeller, Mary Pat

    2014-01-01

    Objectives This study investigated automatic assessment of vocal development in children with hearing loss as compared with children who are typically developing, have language delays, or have autism spectrum disorder. Statistical models are examined for classification performance and for predicting age within the four groups of children. Design The vocal analysis system analyzed over 1900 whole-day, naturalistic acoustic recordings from 273 toddlers and preschoolers comprising children who were typically developing, hard of hearing, language delayed, or autistic. Results Samples from children who were hard of hearing patterned more similarly to those of typically developing children than to the language-delayed or autistic samples. The statistical models were able to classify children from the four groups examined and estimate developmental age based on automated vocal analysis. Conclusions This work shows a broad similarity between children with hearing loss and typically developing children, although children with hearing loss show some delay in their production of speech. Automatic acoustic analysis can now be used to quantitatively compare vocal development in children with and without speech-related disorders. The work may serve to better distinguish among various developmental disorders and ultimately contribute to improved intervention. PMID:25587667

  12. Closed circuit TV system automatically guides welding arc

    NASA Technical Reports Server (NTRS)

    Stephans, D. L.; Wall, W. A., Jr.

    1968-01-01

    Closed circuit television /CCTV/ system automatically guides a welding torch to position the welding arc accurately along weld seams. Digital counting and logic techniques incorporated in the control circuitry ensure performance reliability.

  13. A Causal Inference Model Explains Perception of the McGurk Effect and Other Incongruent Audiovisual Speech.

    PubMed

    Magnotti, John F; Beauchamp, Michael S

    2017-02-01

    Audiovisual speech integration combines information from auditory speech (talker's voice) and visual speech (talker's mouth movements) to improve perceptual accuracy. However, if the auditory and visual speech emanate from different talkers, integration decreases accuracy. Therefore, a key step in audiovisual speech perception is deciding whether auditory and visual speech have the same source, a process known as causal inference. A well-known illusion, the McGurk Effect, consists of incongruent audiovisual syllables, such as auditory "ba" + visual "ga" (AbaVga), that are integrated to produce a fused percept ("da"). This illusion raises two fundamental questions: first, given the incongruence between the auditory and visual syllables in the McGurk stimulus, why are they integrated; and second, why does the McGurk effect not occur for other, very similar syllables (e.g., AgaVba). We describe a simplified model of causal inference in multisensory speech perception (CIMS) that predicts the perception of arbitrary combinations of auditory and visual speech. We applied this model to behavioral data collected from 60 subjects perceiving both McGurk and non-McGurk incongruent speech stimuli. The CIMS model successfully predicted both the audiovisual integration observed for McGurk stimuli and the lack of integration observed for non-McGurk stimuli. An identical model without causal inference failed to accurately predict perception for either form of incongruent speech. The CIMS model uses causal inference to provide a computational framework for studying how the brain performs one of its most important tasks, integrating auditory and visual speech cues to allow us to communicate with others.
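
    The sketch below shows generic Bayesian causal inference for two Gaussian cues on a one-dimensional syllable feature axis, in the spirit of (but not identical to) the CIMS model: it computes the posterior probability of a common cause and model-averages between a fused and an auditory-only estimate. Noise and prior parameters are arbitrary illustrative values.

```python
# Generic causal-inference sketch for two noisy cues; not the authors' exact
# CIMS model. sa, sv are cue noise SDs and sp is the prior SD over syllables.
import numpy as np
from scipy.stats import multivariate_normal

def posterior_common_cause(xa, xv, sa=1.0, sv=1.0, sp=5.0, p_common=0.5):
    """Posterior probability that auditory cue xa and visual cue xv share one cause."""
    # Common cause: a single latent syllable s ~ N(0, sp^2) generates both cues.
    cov_c = np.array([[sa**2 + sp**2, sp**2],
                      [sp**2, sv**2 + sp**2]])
    like_c = multivariate_normal.pdf([xa, xv], mean=[0.0, 0.0], cov=cov_c)
    # Independent causes: each cue is generated by its own latent syllable.
    cov_i = np.diag([sa**2 + sp**2, sv**2 + sp**2])
    like_i = multivariate_normal.pdf([xa, xv], mean=[0.0, 0.0], cov=cov_i)
    return like_c * p_common / (like_c * p_common + like_i * (1 - p_common))

def perceived_syllable(xa, xv, sa=1.0, sv=1.0):
    """Model-average percept: fuse the cues when a common cause is likely, else keep audio."""
    pc = posterior_common_cause(xa, xv, sa, sv)
    fused = (xa / sa**2 + xv / sv**2) / (1 / sa**2 + 1 / sv**2)  # reliability-weighted
    return pc * fused + (1 - pc) * xa

print(perceived_syllable(xa=0.5, xv=1.5))   # similar cues -> mostly integrated
print(perceived_syllable(xa=0.5, xv=9.0))   # very different cues -> little integration
```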

  14. Automatic Conversational Scene Analysis in Children with Asperger Syndrome/High-Functioning Autism and Typically Developing Peers

    PubMed Central

    Tavano, Alessandro; Pesarin, Anna; Murino, Vittorio; Cristani, Marco

    2014-01-01

    Individuals with Asperger syndrome/High Functioning Autism fail to spontaneously attribute mental states to the self and others, a life-long phenotypic characteristic known as mindblindness. We hypothesized that mindblindness would affect the dynamics of conversational interaction. Using generative models, in particular Gaussian mixture models and observed influence models, conversations were coded as interacting Markov processes, operating on novel speech/silence patterns, termed Steady Conversational Periods (SCPs). SCPs assume that whenever an agent's process changes state (e.g., from silence to speech), it causes a general transition of the entire conversational process, forcing inter-actant synchronization. SCPs fed into observed influence models, which captured the conversational dynamics of children and adolescents with Asperger syndrome/High Functioning Autism, and age-matched typically developing participants. Analyzing the parameters of the models by means of discriminative classifiers, the dialogs of patients were successfully distinguished from those of control participants. We conclude that meaning-free speech/silence sequences, reflecting inter-actant synchronization, at least partially encode typical and atypical conversational dynamics. This suggests a direct influence of theory of mind abilities on basic speech initiative behavior. PMID:24489674

  15. Building Languages

    MedlinePlus

    Topics under Building Language include American Sign Language (ASL), Conceptually Accurate Signed English (CASE), Cued Speech, Finger Spelling, and Listening/Auditory Training.

  16. Predicting fundamental frequency from mel-frequency cepstral coefficients to enable speech reconstruction.

    PubMed

    Shao, Xu; Milner, Ben

    2005-08-01

    This work proposes a method to reconstruct an acoustic speech signal solely from a stream of mel-frequency cepstral coefficients (MFCCs) as may be encountered in a distributed speech recognition (DSR) system. Previous methods for speech reconstruction have required, in addition to the MFCC vectors, fundamental frequency and voicing components. In this work the voicing classification and fundamental frequency are predicted from the MFCC vectors themselves using two maximum a posteriori (MAP) methods. The first method enables fundamental frequency prediction by modeling the joint density of MFCCs and fundamental frequency using a single Gaussian mixture model (GMM). The second scheme uses a set of hidden Markov models (HMMs) to link together a set of state-dependent GMMs, which enables a more localized modeling of the joint density of MFCCs and fundamental frequency. Experimental results on speaker-independent male and female speech show that accurate voicing classification and fundamental frequency prediction is attained when compared to hand-corrected reference fundamental frequency measurements. The use of the predicted fundamental frequency and voicing for speech reconstruction is shown to give very similar speech quality to that obtained using the reference fundamental frequency and voicing.
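
    The first (single-GMM) scheme can be sketched as joint-density regression: fit a GMM over stacked MFCC and log-F0 vectors, then predict log-F0 for a new MFCC vector as the posterior-weighted conditional mean. The code below uses synthetic features in place of real MFCC/F0 data, and the dimensionality and mixture size are illustrative.

```python
# Joint-density GMM regression sketch: predict log-F0 from an MFCC vector as
# the responsibility-weighted conditional mean of each Gaussian component.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n, d = 2000, 4                                  # frames, MFCC dimensions
mfcc = rng.normal(size=(n, d))
logf0 = 4.8 + 0.3 * mfcc[:, 0] - 0.2 * mfcc[:, 1] + rng.normal(0, 0.05, n)
joint = np.column_stack([mfcc, logf0])          # z = [x (MFCC), y (log F0)]

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0)
gmm.fit(joint)

def predict_logf0(x):
    """E[y | x] under the fitted joint GMM."""
    post, cond_means = [], []
    for w, mu, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        mu_x, mu_y = mu[:d], mu[d]
        Sxx, Sxy = cov[:d, :d], cov[:d, d]
        post.append(w * multivariate_normal.pdf(x, mean=mu_x, cov=Sxx))
        cond_means.append(mu_y + Sxy @ np.linalg.solve(Sxx, x - mu_x))
    post = np.array(post) / np.sum(post)
    return float(post @ np.array(cond_means))

x_test = mfcc[0]
print(predict_logf0(x_test), logf0[0])
```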

  17. Monaural room acoustic parameters from music and speech.

    PubMed

    Kendrick, Paul; Cox, Trevor J; Li, Francis F; Zhang, Yonggang; Chambers, Jonathon A

    2008-07-01

    This paper compares two methods for extracting room acoustic parameters from reverberated speech and music. An approach which uses statistical machine learning, previously developed for speech, is extended to work with music. For speech, reverberation time estimations are within a perceptual difference limen of the true value. For music, virtually all early decay time estimations are within a difference limen of the true value. The estimation accuracy is not good enough in other cases due to differences between the simulated data set used to develop the empirical model and real rooms. The second method carries out a maximum likelihood estimation on decay phases at the end of notes or speech utterances. This paper extends the method to estimate parameters relating to the balance of early and late energies in the impulse response. For reverberation time and speech, the method provides estimations which are within the perceptual difference limen of the true value. For other parameters such as clarity, the estimations are not sufficiently accurate due to the natural reverberance of the excitation signals. Speech is a better test signal than music because of the greater periods of silence in the signal, although music is needed for low frequency measurement.
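
    A much simplified stand-in for the decay-phase idea (linear regression on the log energy envelope of a free decay, rather than the paper's maximum likelihood estimator) is sketched below on a synthetic exponential decay with a known reverberation time.

```python
# Simplified decay-phase sketch: fit a line to the log energy envelope at the
# end of a note or utterance and extrapolate to a T60-style value.
import numpy as np

def rt_from_decay(decay, fs, frame=0.01):
    """Estimate reverberation time (s) from a free-decay segment."""
    hop = max(1, int(frame * fs))
    # Short-time energy in dB.
    energy = np.array([np.mean(decay[i:i + hop] ** 2) + 1e-12
                       for i in range(0, len(decay) - hop, hop)])
    e_db = 10 * np.log10(energy)
    t = np.arange(len(e_db)) * frame
    slope, _ = np.polyfit(t, e_db, 1)            # dB per second (negative)
    return -60.0 / slope                          # time to decay by 60 dB

# Synthetic exponential decay with T60 = 0.8 s, excited by white noise.
fs, t60 = 16000, 0.8
t = np.arange(int(0.5 * fs)) / fs
decay = np.exp(-3 * np.log(10) * t / t60) * np.random.default_rng(0).normal(0, 1, len(t))
print(rt_from_decay(decay, fs))
```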

  18. Visual-auditory integration during speech imitation in autism.

    PubMed

    Williams, Justin H G; Massaro, Dominic W; Peel, Natalie J; Bosseler, Alexis; Suddendorf, Thomas

    2004-01-01

    Children with autistic spectrum disorder (ASD) may have poor audio-visual integration, possibly reflecting dysfunctional 'mirror neuron' systems which have been hypothesised to be at the core of the condition. In the present study, a computer program, utilizing speech synthesizer software and a 'virtual' head (Baldi), delivered speech stimuli for identification in auditory, visual or bimodal conditions. Children with ASD were poorer than controls at recognizing stimuli in the unimodal conditions, but once performance on this measure was controlled for, no group difference was found in the bimodal condition. A group of participants with ASD were also trained to develop their speech-reading ability. Training improved visual accuracy and this also improved the children's ability to utilize visual information in their processing of speech. Overall results were compared to predictions from mathematical models based on integration and non-integration, and were most consistent with the integration model. We conclude that, whilst they are less accurate in recognizing stimuli in the unimodal condition, children with ASD show normal integration of visual and auditory speech stimuli. Given that training in recognition of visual speech was effective, children with ASD may benefit from multi-modal approaches in imitative therapy and language training.

  19. Acoustical conditions for speech communication in active elementary school classrooms

    NASA Astrophysics Data System (ADS)

    Sato, Hiroshi; Bradley, John

    2005-04-01

    Detailed acoustical measurements were made in 34 active elementary school classrooms with typical rectangular room shape in schools near Ottawa, Canada. There was an average of 21 students per classroom. The measurements were made to obtain accurate indications of the acoustical quality of conditions for speech communication during actual teaching activities. Mean speech and noise levels were determined from the distribution of recorded sound levels and the average speech-to-noise ratio was 11 dBA. Measured mid-frequency reverberation times (RT) during the same occupied conditions varied from 0.3 to 0.6 s, and were a little less than for the unoccupied rooms. RT values were not related to noise levels. Octave band speech and noise levels, useful-to-detrimental ratios, and Speech Transmission Index values were also determined. Key results included: (1) The average vocal effort of teachers corresponded to louder than Pearsons Raised voice level; (2) teachers increase their voice level to overcome ambient noise; (3) effective speech levels can be enhanced by up to 5 dB by early reflection energy; and (4) student activity is seen to be the dominant noise source, increasing average noise levels by up to 10 dBA during teaching activities. [Work supported by CLLRnet.]

  20. The War Film: Historical Perspective or Simple Entertainment

    DTIC Science & Technology

    2001-06-01

    published in 1978, the effect of the image repair campaign was not evident. Dr. Suid is an extremely helpful source of DOD assistance and filmmaker ... single speech at any one particular venue. The filmmakers merged several snippets of Patton speeches into dialogue to provide dramatic effect and a ... filmmakers could support it and that the historically accurate scene was still a good story. If the story faltered, dramatic effect was inserted to ...

  1. An Adaptive Approach to a 2.4 kb/s LPC Speech Coding System.

    DTIC Science & Technology

    1985-07-01

    ... laryngeal cancer). Spectral estimation is at the foundation of speech analysis for all these goals, and accurate AR model estimation in noise is ...

  2. EMG-based speech recognition using hidden markov models with global control variables.

    PubMed

    Lee, Ki-Seung

    2008-03-01

    It is well known that a strong relationship exists between human voices and the movement of articulatory facial muscles. In this paper, we utilize this knowledge to implement an automatic speech recognition scheme which uses solely surface electromyogram (EMG) signals. The sequence of EMG signals for each word is modelled by a hidden Markov model (HMM) framework. The main objective of the work involves building a model for state observation density when multichannel observation sequences are given. The proposed model reflects the dependencies between each of the EMG signals, which are described by introducing a global control variable. We also develop an efficient model training method, based on a maximum likelihood criterion. In a preliminary study, 60 isolated words were used as recognition variables. EMG signals were acquired from three articulatory facial muscles. The findings indicate that such a system may have the capacity to recognize speech signals with an accuracy of up to 87.07%, which is superior to the independent probabilistic model.
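
    A generic isolated-word HMM classifier over multichannel feature frames can be sketched with hmmlearn's stock Gaussian observation model; the paper's global-control-variable observation density is not implemented here, and the EMG features are synthetic placeholders.

```python
# Isolated-word classification sketch: one HMM per word, classify a test
# sequence by the highest log-likelihood. Data are synthetic stand-ins for
# multichannel EMG feature frames.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)

def fake_word_samples(offset, n_samples=20, n_frames=30, n_channels=3):
    """Synthetic EMG feature sequences for one word class."""
    return [offset + rng.normal(size=(n_frames, n_channels)) for _ in range(n_samples)]

train = {"yes": fake_word_samples(0.0), "no": fake_word_samples(1.5)}

models = {}
for word, seqs in train.items():
    X = np.vstack(seqs)
    lengths = [len(s) for s in seqs]
    m = GaussianHMM(n_components=4, covariance_type="diag", n_iter=20, random_state=0)
    m.fit(X, lengths)                      # one HMM per word
    models[word] = m

test_seq = 1.5 + rng.normal(size=(30, 3))  # an unseen "no"-like utterance
scores = {w: m.score(test_seq) for w, m in models.items()}
print(max(scores, key=scores.get), scores)
```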

  3. An efficient robust sound classification algorithm for hearing aids.

    PubMed

    Nordqvist, Peter; Leijon, Arne

    2004-06-01

    An efficient robust sound classification algorithm based on hidden Markov models is presented. The system would enable a hearing aid to automatically change its behavior for differing listening environments according to the user's preferences. This work attempts to distinguish between three listening environment categories: speech in traffic noise, speech in babble, and clean speech, regardless of the signal-to-noise ratio. The classifier uses only the modulation characteristics of the signal. The classifier ignores the absolute sound pressure level and the absolute spectrum shape, resulting in an algorithm that is robust against irrelevant acoustic variations. The measured classification hit rate was 96.7%-99.5% when the classifier was tested with sounds representing one of the three environment categories included in the classifier. False-alarm rates were 0.2%-1.7% in these tests. The algorithm is robust and efficient, requiring few instructions and little memory. It is fully possible to implement the classifier in a DSP-based hearing instrument.

  4. Research in speech communication.

    PubMed

    Flanagan, J

    1995-10-24

    Advances in digital speech processing are now supporting application and deployment of a variety of speech technologies for human/machine communication. In fact, new businesses are rapidly forming about these technologies. But these capabilities are of little use unless society can afford them. Happily, explosive advances in microelectronics over the past two decades have assured affordable access to this sophistication as well as to the underlying computing technology. The research challenges in speech processing remain in the traditionally identified areas of recognition, synthesis, and coding. These three areas have typically been addressed individually, often with significant isolation among the efforts. But they are all facets of the same fundamental issue--how to represent and quantify the information in the speech signal. This implies deeper understanding of the physics of speech production, the constraints that the conventions of language impose, and the mechanism for information processing in the auditory system. In ongoing research, therefore, we seek more accurate models of speech generation, better computational formulations of language, and realistic perceptual guides for speech processing--along with ways to coalesce the fundamental issues of recognition, synthesis, and coding. Successful solution will yield the long-sought dictation machine, high-quality synthesis from text, and the ultimate in low bit-rate transmission of speech. It will also open the door to language-translating telephony, where the synthetic foreign translation can be in the voice of the originating talker.

  5. Speech Perception Deficits in Mandarin-Speaking School-Aged Children with Poor Reading Comprehension

    PubMed Central

    Liu, Huei-Mei; Tsao, Feng-Ming

    2017-01-01

    Previous studies have shown that children learning alphabetic writing systems who have language impairment or dyslexia exhibit speech perception deficits. However, whether such deficits exist in children learning logographic writing systems who have poor reading comprehension remains uncertain. To further explore this issue, the present study examined speech perception deficits in Mandarin-speaking children with poor reading comprehension. Two self-designed tasks, a consonant categorical perception task and a lexical tone discrimination task, were used to compare speech perception performance in children (n = 31, age range = 7;4–10;2) with poor reading comprehension and an age-matched typically developing group (n = 31, age range = 7;7–9;10). Results showed that the children with poor reading comprehension were less accurate in consonant and lexical tone discrimination tasks and perceived speech contrasts less categorically than the matched group. The correlations between speech perception skills (i.e., consonant and lexical tone discrimination sensitivities and slope of consonant identification curve) and individuals’ oral language and reading comprehension were stronger than the correlations between speech perception ability and word recognition ability. In conclusion, the results revealed that Mandarin-speaking children with poor reading comprehension exhibit less-categorized speech perception, suggesting that imprecise speech perception, especially lexical tone perception, is essential to account for reading learning difficulties in Mandarin-speaking children. PMID:29312031

  6. Prosody's Contribution to Fluency: An Examination of the Theory of Automatic Information Processing

    ERIC Educational Resources Information Center

    Schrauben, Julie E.

    2010-01-01

    LaBerge and Samuels' (1974) theory of automatic information processing in reading offers a model that explains how and where the processing of information occurs and the degree to which processing of information occurs. These processes are dependent upon two criteria: accurate word decoding and automatic word recognition. However, LaBerge and…

  7. Early Detection of Severe Apnoea through Voice Analysis and Automatic Speaker Recognition Techniques

    NASA Astrophysics Data System (ADS)

    Fernández, Ruben; Blanco, Jose Luis; Díaz, David; Hernández, Luis A.; López, Eduardo; Alcázar, José

    This study is part of an on-going collaborative effort between the medical and the signal processing communities to promote research on applying voice analysis and Automatic Speaker Recognition techniques (ASR) for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based diagnosis could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we present and discuss the possibilities of using generative Gaussian Mixture Models (GMMs), generally used in ASR systems, to model distinctive apnoea voice characteristics (i.e. abnormal nasalization). Finally, we present experimental findings regarding the discriminative power of speaker recognition techniques applied to severe apnoea detection. We have achieved an 81.25 % correct classification rate, which is very promising and underpins the interest in this line of inquiry.
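
    A minimal two-class GMM sketch in the spirit of this approach is shown below: one GMM per class is fitted on placeholder feature frames, and a test speaker is assigned to the class whose model gives the higher average log-likelihood. A real system would use voice features (e.g., MFCC-type vectors) from the apnoea speech database.

```python
# Two-class GMM classification sketch with synthetic placeholder features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(500, 12))     # placeholder feature frames
apnoea = rng.normal(0.7, 1.2, size=(500, 12))

gmm_h = GaussianMixture(n_components=8, random_state=0).fit(healthy)
gmm_a = GaussianMixture(n_components=8, random_state=0).fit(apnoea)

def classify(frames):
    """Return 'apnoea' if the apnoea GMM explains the frames better."""
    return "apnoea" if gmm_a.score(frames) > gmm_h.score(frames) else "healthy"

test_speaker = rng.normal(0.7, 1.2, size=(200, 12))
print(classify(test_speaker))
```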

  8. Natural speech algorithm applied to baseline interview data can predict which patients will respond to psilocybin for treatment-resistant depression.

    PubMed

    Carrillo, Facundo; Sigman, Mariano; Fernández Slezak, Diego; Ashton, Philip; Fitzgerald, Lily; Stroud, Jack; Nutt, David J; Carhart-Harris, Robin L

    2018-04-01

    Natural speech analytics has seen some improvements over recent years, and this has opened a window for objective and quantitative diagnosis in psychiatry. Here, we used a machine learning algorithm applied to natural speech to ask whether language properties measured before psilocybin treatment for treatment-resistant depression can predict for which patients it will be effective and for which it will not. A baseline autobiographical memory interview was conducted and transcribed. Patients with treatment-resistant depression received 2 doses of psilocybin, 10 mg and 25 mg, 7 days apart. Psychological support was provided before, during and after all dosing sessions. Quantitative speech measures were applied to the interview data from 17 patients and 18 untreated age-matched healthy control subjects. A machine learning algorithm was used to classify between controls and patients and predict treatment response. Speech analytics and machine learning successfully differentiated depressed patients from healthy controls and identified treatment responders from non-responders with a significant accuracy of 85% (75% precision). Automatic natural language analysis was used to predict effective response to treatment with psilocybin, suggesting that these tools offer a highly cost-effective facility for screening individuals for treatment suitability and sensitivity. The sample size was small and replication is required to strengthen inferences on these results. Copyright © 2018 Elsevier B.V. All rights reserved.

  9. Processing of prosodic changes in natural speech stimuli in school-age children.

    PubMed

    Lindström, R; Lepistö, T; Makkonen, T; Kujala, T

    2012-12-01

    Speech prosody conveys information about important aspects of communication: the meaning of the sentence and the emotional state or intention of the speaker. The present study addressed processing of emotional prosodic changes in natural speech stimuli in school-age children (mean age 10 years) by recording the electroencephalogram, facial electromyography, and behavioral responses. The stimulus was a semantically neutral Finnish word uttered with four different emotional connotations: neutral, commanding, sad, and scornful. In the behavioral sound-discrimination task the reaction times were fastest for the commanding stimulus and longest for the scornful stimulus, and faster for the neutral than for the sad stimulus. EEG and EMG responses were measured during a non-attentive oddball paradigm. Prosodic changes elicited a negative-going, fronto-centrally distributed neural response peaking at about 500 ms from the onset of the stimulus, followed by a fronto-central positive deflection, peaking at about 740 ms. For the commanding stimulus, a rapid negative deflection peaking at about 290 ms from stimulus onset was also elicited. No reliable stimulus-type-specific rapid facial reactions were found. The results show that prosodic changes in natural speech stimuli activate pre-attentive neural change-detection mechanisms in school-age children. However, the results do not support the suggestion of automaticity of emotion-specific facial muscle responses to non-attended emotional speech stimuli in children. Copyright © 2012 Elsevier B.V. All rights reserved.

  10. Ultrasound applicability in Speech Language Pathology and Audiology.

    PubMed

    Barberena, Luciana da Silva; Brasil, Brunah de Castro; Melo, Roberta Michelon; Mezzomo, Carolina Lisbôa; Mota, Helena Bolli; Keske-Soares, Márcia

    2014-01-01

    To present recent studies that used ultrasound in the fields of Speech Language Pathology and Audiology, which evidence possibilities for the applicability of this technique in different subareas. A bibliographic search was carried out in the PubMed database, using the keywords "ultrasonic," "speech," "phonetics," "Speech, Language and Hearing Sciences," "voice," "deglutition," and "myofunctional therapy," comprising some areas of Speech Language Pathology and Audiology Sciences. The keywords "ultrasound," "ultrasonography," "swallow," "orofacial myofunctional therapy," and "orofacial myology" were also used in the search. Studies in humans from the past 5 years were selected. In the preselection, duplicated studies, articles not fully available, and those that did not present a direct relation between ultrasound and Speech Language Pathology and Audiology Sciences were discarded. The data were analyzed descriptively and classified by subareas of Speech Language Pathology and Audiology Sciences. The following items were considered: purposes, participants, procedures, and results. We selected 12 articles for the ultrasound versus speech/phonetics subarea, 5 for ultrasound versus voice, 1 for ultrasound versus muscles of mastication, and 10 for ultrasound versus swallow. Studies relating "ultrasound" and "Speech Language Pathology and Audiology Sciences" in the past 5 years were not found. Different studies on the use of ultrasound in Speech Language Pathology and Audiology Sciences were found. Each of them, according to its purpose, confirms new possibilities of the use of this instrument in the several subareas, aiming at a more accurate diagnosis and new evaluative and therapeutic possibilities.

  11. Behavioral and Neural Discrimination of Speech Sounds After Moderate or Intense Noise Exposure in Rats

    PubMed Central

    Reed, Amanda C.; Centanni, Tracy M.; Borland, Michael S.; Matney, Chanel J.; Engineer, Crystal T.; Kilgard, Michael P.

    2015-01-01

    Objectives Hearing loss is a commonly experienced disability in a variety of populations including veterans and the elderly and can often cause significant impairment in the ability to understand spoken language. In this study, we tested the hypothesis that neural and behavioral responses to speech will be differentially impaired in an animal model after two forms of hearing loss. Design Sixteen female Sprague–Dawley rats were exposed to one of two types of broadband noise which was either moderate or intense. In nine of these rats, auditory cortex recordings were taken 4 weeks after noise exposure (NE). The other seven were pretrained on a speech sound discrimination task prior to NE and were then tested on the same task after hearing loss. Results Following intense NE, rats had few neural responses to speech stimuli. These rats were able to detect speech sounds but were no longer able to discriminate between speech sounds. Following moderate NE, rats had reorganized cortical maps and altered neural responses to speech stimuli but were still able to accurately discriminate between similar speech sounds during behavioral testing. Conclusions These results suggest that rats are able to adjust to the neural changes after moderate NE and discriminate speech sounds, but they are not able to recover behavioral abilities after intense NE. Animal models could help clarify the adaptive and pathological neural changes that contribute to speech processing in hearing-impaired populations and could be used to test potential behavioral and pharmacological therapies. PMID:25072238

  12. The effect of paired pitch, rhythm, and speech on working memory as measured by sequential digit recall.

    PubMed

    Silverman, Michael J

    2007-01-01

    Educational and therapeutic objectives are often paired with music to facilitate the recall of information. The purpose of this study was to isolate and determine the effect of paired pitch, rhythm, and speech on undergraduates' memory as measured by sequential digit recall performance. Participants (N = 120) listened to 4 completely counterbalanced treatment conditions each consisting of 9 randomized monosyllabic digits paired with speech, pitch, rhythm, and the combination of pitch and rhythm. No statistically significant learning or order effects were found across the 4 trials. A 3-way repeated-measures ANOVA indicated a statistically significant difference in digit recall performance across treatment conditions, positions, groups, and treatment by position. No other comparisons resulted in statistically significant differences. Participants were able to recall digits from the rhythm condition most accurately while recalling digits from the speech and pitch only conditions the least accurately. Consistent with previous research, the music major participants scored significantly higher than non-music major participants and the main effect associated with serial position indicated that recall performance was best during primacy and recency positions. Analyses indicated an interaction between serial position and treatment condition, also a result consistent with previous research. The results of this study suggest that pairing information with rhythm can facilitate recall but pairing information with pitch or the combination of pitch and rhythm may not enhance recall more than speech when participants listen to an unfamiliar musical selection only once. Implications for practice in therapy and education are made as well as suggestions for future research.

  13. Autonomic Nervous System Responses During Perception of Masked Speech may Reflect Constructs other than Subjective Listening Effort

    PubMed Central

    Francis, Alexander L.; MacPherson, Megan K.; Chandrasekaran, Bharath; Alvar, Ann M.

    2016-01-01

    Typically, understanding speech seems effortless and automatic. However, a variety of factors may, independently or interactively, make listening more effortful. Physiological measures may help to distinguish between the application of different cognitive mechanisms whose operation is perceived as effortful. In the present study, physiological and behavioral measures associated with task demand were collected along with behavioral measures of performance while participants listened to and repeated sentences. The goal was to measure psychophysiological reactivity associated with three degraded listening conditions, each of which differed in terms of the source of the difficulty (distortion, energetic masking, and informational masking), and therefore were expected to engage different cognitive mechanisms. These conditions were chosen to be matched for overall performance (keywords correct), and were compared to listening to unmasked speech produced by a natural voice. The three degraded conditions were: (1) unmasked speech produced by a computer speech synthesizer, (2) speech produced by a natural voice and masked by speech-shaped noise, and (3) speech produced by a natural voice and masked by two-talker babble. Masked conditions were both presented at a -8 dB signal-to-noise ratio (SNR), a level shown in previous research to result in comparable levels of performance for these stimuli and maskers. Performance was measured in terms of the proportion of key words identified correctly, and task demand or effort was quantified subjectively by self-report. Measures of psychophysiological reactivity included electrodermal (skin conductance) response frequency and amplitude, blood pulse amplitude and pulse rate. Results suggest that the two masked conditions evoked stronger psychophysiological reactivity than did the two unmasked conditions even when behavioral measures of listening performance and listeners’ subjective perception of task demand were comparable across the three degraded conditions. PMID:26973564

  14. Synthesized speech rate and pitch effects on intelligibility of warning messages for pilots

    NASA Technical Reports Server (NTRS)

    Simpson, C. A.; Marchionda-Frost, K.

    1984-01-01

    In civilian and military operations, a future threat-warning system with a voice display could warn pilots of other traffic, obstacles in the flight path, and/or terrain during low-altitude helicopter flights. The present study was conducted to learn whether speech rate and voice pitch of phoneme-synthesized speech affects pilot accuracy and response time to typical threat-warning messages. Helicopter pilots engaged in an attention-demanding flying task and listened for voice threat warnings presented in a background of simulated helicopter cockpit noise. Performance was measured by flying-task performance, threat-warning intelligibility, and response time. Pilot ratings were elicited for the different voice pitches and speech rates. Significant effects were obtained only for response time and for pilot ratings, both as a function of speech rate. For the few cases when pilots forgot to respond to a voice message, they remembered 90 percent of the messages accurately when queried for their response 8 to 10 sec later.

  15. Improving Understanding of Emotional Speech Acoustic Content

    NASA Astrophysics Data System (ADS)

    Tinnemore, Anna

    Children with cochlear implants show deficits in identifying emotional intent of utterances without facial or body language cues. A known limitation to cochlear implants is the inability to accurately portray the fundamental frequency contour of speech which carries the majority of information needed to identify emotional intent. Without reliable access to the fundamental frequency, other methods of identifying vocal emotion, if identifiable, could be used to guide therapies for training children with cochlear implants to better identify vocal emotion. The current study analyzed recordings of adults speaking neutral sentences with a set array of emotions in a child-directed and adult-directed manner. The goal was to identify acoustic cues that contribute to emotion identification that may be enhanced in child-directed speech, but are also present in adult-directed speech. Results of this study showed that there were significant differences in the variation of the fundamental frequency, the variation of intensity, and the rate of speech among emotions and between intended audiences.

  16. Current trends in small vocabulary speech recognition for equipment control

    NASA Astrophysics Data System (ADS)

    Doukas, Nikolaos; Bardis, Nikolaos G.

    2017-09-01

    Speech recognition systems allow human-machine communication to acquire an intuitive nature that approaches the simplicity of inter-human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit from the use of robust voice-operated control components, as they would facilitate the interaction with their users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
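
    The state machine approach mentioned at the end can be illustrated with a toy controller in which a small vocabulary of recognized commands drives state transitions and out-of-context commands are rejected; the command names and states below are purely illustrative.

```python
# Toy voice-command state machine: valid (state, command) pairs trigger a
# transition, anything else is rejected.
class EquipmentController:
    TRANSITIONS = {
        ("standby", "power on"): "idle",
        ("idle", "start"): "running",
        ("running", "stop"): "idle",
        ("idle", "power off"): "standby",
    }

    def __init__(self):
        self.state = "standby"

    def handle(self, recognized_command):
        """Apply a recognized command if it is valid in the current state."""
        key = (self.state, recognized_command)
        if key in self.TRANSITIONS:
            self.state = self.TRANSITIONS[key]
            return f"OK -> {self.state}"
        return f"rejected '{recognized_command}' in state '{self.state}'"

ctrl = EquipmentController()
for cmd in ["start", "power on", "start", "stop"]:
    print(ctrl.handle(cmd))
```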

  17. The development of visual speech perception in Mandarin Chinese-speaking children.

    PubMed

    Chen, Liang; Lei, Jianghua

    2017-01-01

    The present study aimed to investigate the development of visual speech perception in Chinese-speaking children. Children aged 7, 13 and 16 were asked to visually identify both consonant and vowel sounds in Chinese as quickly and accurately as possible. Results revealed (1) an increase in accuracy of visual speech perception between ages 7 and 13 after which the accuracy rate either stagnates or drops; and (2) a U-shaped development pattern in speed of perception with peak performance in 13-year olds. Results also showed that across all age groups, the overall levels of accuracy rose, whereas the response times fell for simplex finals, complex finals and initials. These findings suggest that (1) visual speech perception in Chinese is a developmental process that is acquired over time and is still fine-tuned well into late adolescence; (2) factors other than cross-linguistic differences in phonological complexity and degrees of reliance on visual information are involved in development of visual speech perception.

  18. Estimating spatial travel times using automatic vehicle identification data

    DOT National Transportation Integrated Search

    2001-01-01

    Prepared ca. 2001. The paper describes an algorithm that was developed for estimating reliable and accurate average roadway link travel times using Automatic Vehicle Identification (AVI) data. The algorithm presented is unique in two aspects. First, ...
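
    Since the record is truncated, the following is only a generic sketch of link (spatial) travel-time estimation from AVI matches at two readers, not the algorithm the report describes. The field names and the median-based outlier filter are assumptions.

```python
# Sketch: estimating an average link travel time from Automatic Vehicle
# Identification (AVI) matches between an upstream and a downstream reader.
# Field names and the simple median-based outlier filter are assumptions;
# the report's actual algorithm is not reproduced here.
import statistics

def link_travel_time(upstream, downstream, tolerance=2.0):
    """upstream/downstream: dicts mapping tag_id -> detection time (seconds)."""
    times = [downstream[tag] - upstream[tag]
             for tag in upstream.keys() & downstream.keys()
             if downstream[tag] > upstream[tag]]
    if not times:
        return None
    med = statistics.median(times)
    # Discard matches far from the median (e.g., vehicles that stopped en route).
    kept = [t for t in times if abs(t - med) <= tolerance * med]
    return sum(kept) / len(kept)

up = {"A1": 0.0, "B2": 5.0, "C3": 9.0}
down = {"A1": 62.0, "B2": 70.0, "C3": 400.0}   # C3 is an outlier
print(f"estimated link travel time: {link_travel_time(up, down):.1f} s")
```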

  19. Automatic lumbar vertebrae detection based on feature fusion deep learning for partial occluded C-arm X-ray images.

    PubMed

    Yang Li; Wei Liang; Yinlong Zhang; Haibo An; Jindong Tan

    2016-08-01

    Automatic and accurate lumbar vertebrae detection is an essential step of image-guided minimally invasive spine surgery (IG-MISS). However, traditional methods still require human intervention due to the similarity of vertebrae, abnormal pathological conditions and uncertain imaging angles. In this paper, we present a novel convolutional neural network (CNN) model to automatically detect lumbar vertebrae in C-arm X-ray images. Training data are augmented by DRR, and automatic segmentation of the ROI reduces the computational complexity. Furthermore, a feature fusion deep learning (FFDL) model is introduced to combine two types of features of lumbar vertebrae X-ray images, using a Sobel kernel and a Gabor kernel to obtain the contour and texture of the lumbar vertebrae, respectively. Comprehensive qualitative and quantitative experiments demonstrate that our proposed model performs more accurately in abnormal cases with pathologies and surgical implants and in multi-angle views.
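
    A minimal sketch of the two handcrafted channels mentioned above (contour via a Sobel kernel, texture via a Gabor kernel) as they might be computed for a single image before fusion. The kernel parameters and the channel stacking are assumptions, and the CNN itself is omitted.

```python
# Sketch: the two handcrafted channels mentioned above -- contour via a Sobel
# kernel and texture via a Gabor kernel -- computed for one image before being
# fed to a fusion network. Kernel parameters and the channel stacking are
# illustrative assumptions; the paper's CNN architecture is not reproduced.
import numpy as np
from scipy.ndimage import sobel, convolve

def gabor_kernel(size=21, sigma=4.0, theta=0.0, lambd=10.0, gamma=0.5):
    """Real part of a Gabor filter, built directly from its definition."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(x_t**2 + gamma**2 * y_t**2) / (2 * sigma**2)) * \
           np.cos(2 * np.pi * x_t / lambd)

def contour_and_texture(image: np.ndarray) -> np.ndarray:
    """Return a 3-channel array: raw image, Sobel magnitude, Gabor response."""
    gx, gy = sobel(image, axis=1), sobel(image, axis=0)
    contour = np.hypot(gx, gy)
    texture = convolve(image, gabor_kernel(), mode="reflect")
    return np.stack([image, contour, texture], axis=0)

# Example on a random stand-in for a C-arm X-ray slice:
img = np.random.rand(128, 128).astype(np.float32)
print(contour_and_texture(img).shape)   # (3, 128, 128)
```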

  20. Summer Institute in Biomedical Engineering, 1973

    NASA Technical Reports Server (NTRS)

    Deloatch, E. M.; Coble, A. J.

    1974-01-01

    Bioengineering of medical equipment is detailed. Equipment described includes: an environmental control system for a surgical suite; surface potential mapping for an electrode system; the use of speech-modulated-white-noise to differentiate hearers and feelers among the profoundly deaf; the design of an automatic weight scale for an isolette; and an internal tibial torsion correction study. Graphs and charts are included with design specifications of this equipment.

  1. Design Automation for Streaming Systems

    DTIC Science & Technology

    2005-12-16

    which are FIFO buffered channels. We develop a process network model for streaming systems (TDFPN) and a hardware description language with built in...and may include an automatic address generator. A complete synthesis system would provide separate segment operator implementations for every...Acoustics, Speech, and Signal Processing (ICASSP '89), pages 988–991, 1989. [Luk et al., 1997] Wayne Luk, Nabeel Shirazi, and Peter Y. K. Cheung

  2. Foreign Language Analysis and Recognition (FLARE) Progress

    DTIC Science & Technology

    2015-02-01

    Copies may be obtained from the Defense Technical Information Center (DTIC) (http://www.dtic.mil). AFRL-RH-WP-TR-2015-0007 HAS BEEN REVIEWED AND IS... retrieval (IR). Subject terms: automatic speech recognition (ASR), information retrieval (IR). ...to the Haystack Multilingual Multimedia Information Extraction and Retrieval (MMIER) system that was initially developed under a prior work unit

  3. Identifying Deceptive Speech Across Cultures

    DTIC Science & Technology

    2016-06-25

    Interspeech 2016. G. An, S. I. Levitan, R. Levitan, A. Rosenberg, M. Levine, J. Hirschberg, "Automatically Classifying Self-Rated Personality Scores from...correlations of deception ability with personality factors (extraversion, conscientiousness). Using acoustic-prosodic features, gender, ethnicity and

  4. Automation methodologies and large-scale validation for GW: Towards high-throughput GW calculations

    NASA Astrophysics Data System (ADS)

    van Setten, M. J.; Giantomassi, M.; Gonze, X.; Rignanese, G.-M.; Hautier, G.

    2017-10-01

    The search for new materials based on computational screening relies on methods that accurately predict, in an automatic manner, total energies, atomic-scale geometries, and other fundamental characteristics of materials. Many technologically important material properties stem directly from the electronic structure of a material, but the usual workhorse for total energies, density-functional theory, is plagued by fundamental shortcomings and by errors from approximate exchange-correlation functionals in its prediction of the electronic structure. In contrast, the GW method is currently the state-of-the-art ab initio approach for accurate electronic structure. It is mostly used to perturbatively correct density-functional theory results, but it is computationally demanding and requires expert knowledge to give accurate results. Accordingly, it is not presently used in high-throughput screening: fully automated algorithms for setting up the calculations and determining convergence are lacking. In this paper, we develop such a method and, as a first application, use it to validate the accuracy of G0W0 with the PBE starting point and the Godby-Needs plasmon-pole model (G0W0-GN@PBE) on a set of about 80 solids. The results of the automatic convergence study provide valuable insights. Indeed, we find correlations between computational parameters that can be used to further improve the automation of GW calculations. Moreover, we find that the correlation between the PBE and G0W0-GN@PBE gaps is much stronger than that between the GW and experimental gaps. However, the G0W0-GN@PBE gaps still describe the experimental gaps more accurately than a linear model based on the PBE gaps. With this paper, we hence show that GW can be made automatic and is more accurate than an empirical correction of the PBE gap, but that, for accurate predictive results across a broad class of materials, an improved starting point or some type of self-consistency is necessary.
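
    A minimal sketch of the kind of automatic convergence loop such a workflow implies: increase a convergence parameter until the quantity of interest changes by less than a tolerance. The compute_gap callable, step sizes, and tolerance are placeholders, not the authors' algorithm.

```python
# Sketch: the kind of automatic convergence loop such a workflow implies --
# increase a convergence parameter (e.g., number of bands or a cutoff) until
# the quantity of interest changes by less than a tolerance. `compute_gap`
# is a placeholder for an actual GW calculation; values and steps are made up.
def converge(compute_gap, start, step, tol_ev=0.01, max_iter=10):
    """Return (parameter, gap) once successive gaps differ by less than tol_ev."""
    param, previous = start, compute_gap(start)
    for _ in range(max_iter):
        param += step
        current = compute_gap(param)
        if abs(current - previous) < tol_ev:
            return param, current
        previous = current
    raise RuntimeError("not converged within max_iter")

# Toy stand-in: a gap that approaches 1.8 eV as the parameter grows.
toy_gap = lambda nbands: 1.8 - 5.0 / nbands
print(converge(toy_gap, start=50, step=50))
```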

  5. An accurate method of extracting fat droplets in liver images for quantitative evaluation

    NASA Astrophysics Data System (ADS)

    Ishikawa, Masahiro; Kobayashi, Naoki; Komagata, Hideki; Shinoda, Kazuma; Yamaguchi, Masahiro; Abe, Tokiya; Hashiguchi, Akinori; Sakamoto, Michiie

    2015-03-01

    The steatosis in liver pathological tissue images is a promising indicator of nonalcoholic fatty liver disease (NAFLD) and of the possible risk of hepatocellular carcinoma (HCC). The resulting values are also important for ensuring automatic and accurate classification of HCC images, because the presence of many fat droplets is likely to introduce errors when quantifying the morphological features used in that process. In this study we propose a method that can automatically detect and exclude regions with many fat droplets by using feature values of color, shape, and the arrangement of cell nuclei. We implement the method and confirm that it can accurately detect fat droplets and quantify the fat droplet ratio of actual images. This investigation also clarifies the effective characteristics that contribute to accurate detection.

  6. A Window into the Intoxicated Mind? Speech as an Index of Psychoactive Drug Effects

    PubMed Central

    Bedi, Gillinder; Cecchi, Guillermo A; Slezak, Diego F; Carrillo, Facundo; Sigman, Mariano; de Wit, Harriet

    2014-01-01

    Abused drugs can profoundly alter mental states in ways that may motivate drug use. These effects are usually assessed with self-report, an approach that is vulnerable to biases. Analyzing speech during intoxication may present a more direct, objective measure, offering a unique ‘window' into the mind. Here, we employed computational analyses of speech semantic and topological structure after ±3,4-methylenedioxymethamphetamine (MDMA; ‘ecstasy') and methamphetamine in 13 ecstasy users. In 4 sessions, participants completed a 10-min speech task after MDMA (0.75 and 1.5 mg/kg), methamphetamine (20 mg), or placebo. Latent Semantic Analyses identified the semantic proximity between speech content and concepts relevant to drug effects. Graph-based analyses identified topological speech characteristics. Group-level drug effects on semantic distances and topology were assessed. Machine-learning analyses (with leave-one-out cross-validation) assessed whether speech characteristics could predict drug condition in the individual subject. Speech after MDMA (1.5 mg/kg) had greater semantic proximity than placebo to the concepts friend, support, intimacy, and rapport. Speech on MDMA (0.75 mg/kg) had greater proximity to empathy than placebo. Conversely, speech on methamphetamine was further from compassion than placebo. Classifiers discriminated between MDMA (1.5 mg/kg) and placebo with 88% accuracy, and MDMA (1.5 mg/kg) and methamphetamine with 84% accuracy. For the two MDMA doses, the classifier performed at chance. These data suggest that automated semantic speech analyses can capture subtle alterations in mental state, accurately discriminating between drugs. The findings also illustrate the potential for automated speech-based approaches to characterize clinically relevant alterations to mental state, including those occurring in psychiatric illness. PMID:24694926
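
    A minimal sketch of the two analysis ingredients described: LSA-style semantic proximity between a transcript and concept terms, and leave-one-out cross-validated classification of condition from speech features. The tiny corpus, random feature matrix, and linear-SVM classifier are illustrative assumptions, not the study's pipeline.

```python
# Sketch: the two analysis ingredients described above -- LSA-style semantic
# proximity between a transcript and concept terms, and leave-one-out
# cross-validated classification of condition from speech features. The tiny
# corpus, feature matrix, and linear-SVM classifier are illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# --- Semantic proximity via LSA ---------------------------------------------
corpus = [
    "my friend gave me support and we felt very close",   # stand-in transcript
    "the weather report predicted heavy rain tomorrow",   # unrelated transcript
    "friend support intimacy rapport empathy",            # concept terms
]
tfidf = TfidfVectorizer().fit_transform(corpus)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print("proximity to concepts:", cosine_similarity(lsa[:2], lsa[2:3]).ravel())

# --- Leave-one-out classification of drug condition --------------------------
X = np.random.default_rng(0).random((13, 5))   # 13 subjects x 5 speech features
y = np.arange(13) % 2                          # 0 = placebo, 1 = drug (made up)
acc = cross_val_score(SVC(kernel="linear"), X, y, cv=LeaveOneOut()).mean()
print(f"leave-one-out accuracy on random features: {acc:.2f}")
```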

  7. A window into the intoxicated mind? Speech as an index of psychoactive drug effects.

    PubMed

    Bedi, Gillinder; Cecchi, Guillermo A; Slezak, Diego F; Carrillo, Facundo; Sigman, Mariano; de Wit, Harriet

    2014-09-01

    Abused drugs can profoundly alter mental states in ways that may motivate drug use. These effects are usually assessed with self-report, an approach that is vulnerable to biases. Analyzing speech during intoxication may present a more direct, objective measure, offering a unique 'window' into the mind. Here, we employed computational analyses of speech semantic and topological structure after ±3,4-methylenedioxymethamphetamine (MDMA; 'ecstasy') and methamphetamine in 13 ecstasy users. In 4 sessions, participants completed a 10-min speech task after MDMA (0.75 and 1.5 mg/kg), methamphetamine (20 mg), or placebo. Latent Semantic Analyses identified the semantic proximity between speech content and concepts relevant to drug effects. Graph-based analyses identified topological speech characteristics. Group-level drug effects on semantic distances and topology were assessed. Machine-learning analyses (with leave-one-out cross-validation) assessed whether speech characteristics could predict drug condition in the individual subject. Speech after MDMA (1.5 mg/kg) had greater semantic proximity than placebo to the concepts friend, support, intimacy, and rapport. Speech on MDMA (0.75 mg/kg) had greater proximity to empathy than placebo. Conversely, speech on methamphetamine was further from compassion than placebo. Classifiers discriminated between MDMA (1.5 mg/kg) and placebo with 88% accuracy, and MDMA (1.5 mg/kg) and methamphetamine with 84% accuracy. For the two MDMA doses, the classifier performed at chance. These data suggest that automated semantic speech analyses can capture subtle alterations in mental state, accurately discriminating between drugs. The findings also illustrate the potential for automated speech-based approaches to characterize clinically relevant alterations to mental state, including those occurring in psychiatric illness.

  8. Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN

    PubMed Central

    Zhu, Lianzhang; Chen, Leiming; Zhao, Dehai

    2017-01-01

    Accurate emotion recognition from speech is important for applications like smart health care, smart entertainment, and other smart services. High accuracy emotion recognition from Chinese speech is challenging due to the complexities of the Chinese language. In this paper, we explore how to improve the accuracy of speech emotion recognition, including speech signal feature extraction and emotion classification methods. Five types of features are extracted from a speech sample: mel frequency cepstrum coefficient (MFCC), pitch, formant, short-term zero-crossing rate and short-term energy. By comparing statistical features with deep features extracted by a Deep Belief Network (DBN), we attempt to find the best features to identify the emotion status for speech. We propose a novel classification method that combines DBN and SVM (support vector machine) instead of using only one of them. In addition, a conjugate gradient method is applied to train DBN in order to speed up the training process. Gender-dependent experiments are conducted using an emotional speech database created by the Chinese Academy of Sciences. The results show that DBN features can reflect emotion status better than artificial features, and our new classification approach achieves an accuracy of 95.8%, which is higher than using either DBN or SVM separately. Results also show that DBN can work very well for small training databases if it is properly designed. PMID:28737705
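
    A minimal sketch of extracting several of the handcrafted features listed (MFCCs, pitch, short-term zero-crossing rate, short-term energy) and training an SVM on them, assuming librosa and scikit-learn. Formant extraction and the DBN stage are omitted, and the synthetic utterances and labels are made up.

```python
# Sketch: extracting several of the handcrafted features listed above (MFCCs,
# pitch, short-term zero-crossing rate, short-term energy) and training an SVM
# on them. Formants and the DBN stage are omitted; the synthetic "utterances"
# and emotion labels below are made up purely for illustration.
import numpy as np
import librosa
from sklearn.svm import SVC

def features(y: np.ndarray, sr: int) -> np.ndarray:
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    f0, voiced, _ = librosa.pyin(y, fmin=65, fmax=600, sr=sr)
    pitch = np.nanmean(f0) if np.any(voiced) else 0.0
    zcr = librosa.feature.zero_crossing_rate(y).mean()
    energy = librosa.feature.rms(y=y).mean()
    return np.concatenate([mfcc, [pitch, zcr, energy]])

def synth(freq, sr=16000, dur=1.0):
    t = np.arange(int(sr * dur)) / sr
    return 0.5 * np.sin(2 * np.pi * freq * t)

utterances = [synth(f) for f in (120, 180, 240, 300)]  # stand-ins for recordings
labels = ["angry", "happy", "sad", "neutral"]          # hypothetical labels
X = np.vstack([features(y, 16000) for y in utterances])

clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:1]))
```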

  9. Research in speech communication.

    PubMed Central

    Flanagan, J

    1995-01-01

    Advances in digital speech processing are now supporting application and deployment of a variety of speech technologies for human/machine communication. In fact, new businesses are rapidly forming about these technologies. But these capabilities are of little use unless society can afford them. Happily, explosive advances in microelectronics over the past two decades have assured affordable access to this sophistication as well as to the underlying computing technology. The research challenges in speech processing remain in the traditionally identified areas of recognition, synthesis, and coding. These three areas have typically been addressed individually, often with significant isolation among the efforts. But they are all facets of the same fundamental issue--how to represent and quantify the information in the speech signal. This implies deeper understanding of the physics of speech production, the constraints that the conventions of language impose, and the mechanism for information processing in the auditory system. In ongoing research, therefore, we seek more accurate models of speech generation, better computational formulations of language, and realistic perceptual guides for speech processing--along with ways to coalesce the fundamental issues of recognition, synthesis, and coding. Successful solution will yield the long-sought dictation machine, high-quality synthesis from text, and the ultimate in low bit-rate transmission of speech. It will also open the door to language-translating telephony, where the synthetic foreign translation can be in the voice of the originating talker. PMID:7479806

  10. Talker identification across source mechanisms: experiments with laryngeal and electrolarynx speech.

    PubMed

    Perrachione, Tyler K; Stepp, Cara E; Hillman, Robert E; Wong, Patrick C M

    2014-10-01

    The purpose of this study was to determine listeners' ability to learn talker identity from speech produced with an electrolarynx, explore source and filter differentiation in talker identification, and describe acoustic-phonetic changes associated with electrolarynx use. Healthy adult control listeners learned to identify talkers from speech recordings produced using talkers' normal laryngeal vocal source or an electrolarynx. Listeners' abilities to identify talkers from the trained vocal source (Experiment 1) and generalize this knowledge to the untrained source (Experiment 2) were assessed. Acoustic-phonetic measurements of spectral differences between source mechanisms were performed. Additional listeners attempted to match recordings from different source mechanisms to a single talker (Experiment 3). Listeners successfully learned talker identity from electrolarynx speech but less accurately than from laryngeal speech. Listeners were unable to generalize talker identity to the untrained source mechanism. Electrolarynx use resulted in vowels with higher F1 frequencies compared with laryngeal speech. Listeners matched recordings from different sources to a single talker better than chance. Electrolarynx speech, although lacking individual differences in voice quality, nevertheless conveys sufficient indexical information related to the vocal filter and articulation for listeners to identify individual talkers. Psychologically, perception of talker identity arises from a "gestalt" of the vocal source and filter.

  11. Teachers' perceptions of students with speech sound disorders: a quantitative and qualitative analysis.

    PubMed

    Overby, Megan; Carrell, Thomas; Bernthal, John

    2007-10-01

    This study examined 2nd-grade teachers' perceptions of the academic, social, and behavioral competence of students with speech sound disorders (SSDs). Forty-eight 2nd-grade teachers listened to 2 groups of sentences differing by intelligibility and pitch but spoken by a single 2nd grader. For each sentence group, teachers rated the speaker's academic, social, and behavioral competence using an adapted version of the Teacher Rating Scale of the Self-Perception Profile for Children (S. Harter, 1985) and completed 3 open-ended questions. The matched-guise design controlled for confounding speaker and stimuli variables that were inherent in prior studies. Statistically significant differences in teachers' expectations of children's academic, social, and behavioral performances were found between moderately intelligible and normal intelligibility speech. Teachers associated moderately intelligible low-pitched speech with more behavior problems than moderately intelligible high-pitched speech or either pitch with normal intelligibility. One third of the teachers reported that they could not accurately predict a child's school performance based on the child's speech skills, one third of the teachers causally related school difficulty to SSD, and one third of the teachers made no comment. Intelligibility and speaker pitch appear to be speech variables that influence teachers' perceptions of children's school performance.

  12. Talker identification across source mechanisms: Experiments with laryngeal and electrolarynx speech

    PubMed Central

    Perrachione, Tyler K.; Stepp, Cara E.; Hillman, Robert E.; Wong, Patrick C.M.

    2015-01-01

    Purpose To determine listeners' ability to learn talker identity from speech produced with an electrolarynx, explore source and filter differentiation in talker identification, and describe acoustic-phonetic changes associated with electrolarynx use. Method Healthy adult control listeners learned to identify talkers from speech recordings produced using talkers' normal laryngeal vocal source or an electrolarynx. Listeners' abilities to identify talkers from the trained vocal source (Experiment 1) and generalize this knowledge to the untrained source (Experiment 2) were assessed. Acoustic-phonetic measurements of spectral differences between source mechanisms were performed. Additional listeners attempted to match recordings from different source mechanisms to a single talker (Experiment 3). Results Listeners successfully learned talker identity from electrolarynx speech, but less accurately than from laryngeal speech. Listeners were unable to generalize talker identity to the untrained source mechanism. Electrolarynx use resulted in vowels with higher F1 frequencies compared to laryngeal speech. Listeners matched recordings from different sources to a single talker better than chance. Conclusions Electrolarynx speech, though lacking individual differences in voice quality, nevertheless conveys sufficient indexical information related to the vocal filter and articulation for listeners to identify individual talkers. Psychologically, perception of talker identity arises from a “gestalt” of the vocal source and filter. PMID:24801962

  13. Modeling Co-evolution of Speech and Biology.

    PubMed

    de Boer, Bart

    2016-04-01

    Two computer simulations are investigated that model interaction of cultural evolution of language and biological evolution of adaptations to language. Both are agent-based models in which a population of agents imitates each other using realistic vowels. The agents evolve under selective pressure for good imitation. In one model, the evolution of the vocal tract is modeled; in the other, a cognitive mechanism for perceiving speech accurately is modeled. In both cases, biological adaptations to using and learning speech evolve, even though the system of speech sounds itself changes at a more rapid time scale than biological evolution. However, the fact that the available acoustic space is used maximally (a self-organized result of cultural evolution) is constant, and therefore biological evolution does have a stable target. This work shows that when cultural and biological traits are continuous, their co-evolution may lead to cognitive adaptations that are strong enough to detect empirically. Copyright © 2016 Cognitive Science Society, Inc.

  14. Discrimination of brief speech sounds is impaired in rats with auditory cortex lesions

    PubMed Central

    Porter, Benjamin A.; Rosenthal, Tara R.; Ranasinghe, Kamalini G.; Kilgard, Michael P.

    2011-01-01

    Auditory cortex (AC) lesions impair complex sound discrimination. However, a recent study demonstrated spared performance on an acoustic startle response test of speech discrimination following AC lesions (Floody et al., 2010). The current study reports the effects of AC lesions on two operant speech discrimination tasks. AC lesions caused a modest and quickly recovered impairment in the ability of rats to discriminate consonant-vowel-consonant speech sounds. This result seems to suggest that AC does not play a role in speech discrimination. However, the speech sounds used in both studies differed in many acoustic dimensions, and an adaptive change in discrimination strategy could allow the rats to rely on an acoustic difference that does not require an intact AC to discriminate. Based on our earlier observation that the first 40 ms of the spatiotemporal activity patterns elicited by speech sounds best correlate with behavioral discriminations of these sounds (Engineer et al., 2008), we predicted that eliminating additional cues by truncating speech sounds to the first 40 ms would render the stimuli indistinguishable to a rat with AC lesions. Although the initial discrimination of truncated sounds took longer to learn, the final performance paralleled that of rats using full-length consonant-vowel-consonant sounds. After 20 days of testing, half of the rats using speech onsets received bilateral AC lesions. Lesions severely impaired speech-onset discrimination for at least one month post-lesion. These results support the hypothesis that the auditory cortex is required to accurately discriminate the subtle differences between similar consonant and vowel sounds. PMID:21167211
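
    A minimal sketch of the onset-truncation manipulation described, keeping only the first 40 ms of a speech token. The sample rate and the short offset ramp are assumptions.

```python
# Sketch: the onset-truncation manipulation described above -- keeping only
# the first 40 ms of each speech sound. The sample rate and the short offset
# ramp (to avoid a click at the cut) are illustrative assumptions.
import numpy as np

def truncate_to_onset(signal: np.ndarray, sr: int, dur_ms: float = 40.0,
                      ramp_ms: float = 5.0) -> np.ndarray:
    n = int(sr * dur_ms / 1000.0)
    out = signal[:n].astype(float)
    ramp = int(sr * ramp_ms / 1000.0)
    out[-ramp:] *= np.linspace(1.0, 0.0, ramp)   # fade out the cut edge
    return out

sr = 44100
cvc = np.random.randn(int(0.5 * sr))             # stand-in for a 500 ms CVC token
print(truncate_to_onset(cvc, sr).shape)          # (1764,) samples = 40 ms
```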

  15. Automatic transducer switching provides accurate wide range measurement of pressure differential

    NASA Technical Reports Server (NTRS)

    Yoder, S. K.

    1967-01-01

    Automatic pressure transducer switching network sequentially selects any one of a number of limited-range transducers as gas pressure rises or falls, extending the range of measurement and lessening the chances of damage due to high pressure.
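
    A minimal sketch of the switching logic described: select the most sensitive transducer whose full-scale range still covers the current pressure, leaving a safety margin. The ranges and margin are assumptions.

```python
# Sketch: the switching idea described above -- pick the most sensitive
# (narrowest-range) transducer whose full-scale range still covers the current
# pressure, with a safety margin so no unit is driven near its limit. Ranges
# and the margin are illustrative assumptions.
TRANSDUCERS = [          # (name, full-scale range in psi), narrowest first
    ("T1", 10.0),
    ("T2", 100.0),
    ("T3", 1000.0),
]

def select_transducer(pressure_psi: float, margin: float = 0.9):
    """Return the first transducer whose usable range covers the pressure."""
    for name, full_scale in TRANSDUCERS:
        if pressure_psi <= margin * full_scale:
            return name
    raise ValueError("pressure exceeds all transducer ranges")

for p in (4.2, 55.0, 640.0):
    print(f"{p:7.1f} psi -> {select_transducer(p)}")
```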

  16. Impaired Feedforward Control and Enhanced Feedback Control of Speech in Patients with Cerebellar Degeneration

    PubMed Central

    Agnew, Zarinah; Nagarajan, Srikantan; Houde, John; Ivry, Richard B.

    2017-01-01

    The cerebellum has been hypothesized to form a crucial part of the speech motor control network. Evidence for this comes from patients with cerebellar damage, who exhibit a variety of speech deficits, as well as imaging studies showing cerebellar activation during speech production in healthy individuals. To date, the precise role of the cerebellum in speech motor control remains unclear, as it has been implicated in both anticipatory (feedforward) and reactive (feedback) control. Here, we assess both anticipatory and reactive aspects of speech motor control, comparing the performance of patients with cerebellar degeneration and matched controls. Experiment 1 tested feedforward control by examining speech adaptation across trials in response to a consistent perturbation of auditory feedback. Experiment 2 tested feedback control, examining online corrections in response to inconsistent perturbations of auditory feedback. Both male and female patients and controls were tested. The patients were impaired in adapting their feedforward control system relative to controls, exhibiting an attenuated anticipatory response to the perturbation. In contrast, the patients produced even larger compensatory responses than controls, suggesting an increased reliance on sensory feedback to guide speech articulation in this population. Together, these results suggest that the cerebellum is crucial for maintaining accurate feedforward control of speech, but relatively uninvolved in feedback control. SIGNIFICANCE STATEMENT Speech motor control is a complex activity that is thought to rely on both predictive, feedforward control as well as reactive, feedback control. While the cerebellum has been shown to be part of the speech motor control network, its functional contribution to feedback and feedforward control remains controversial. Here, we use real-time auditory perturbations of speech to show that patients with cerebellar degeneration are impaired in adapting feedforward control of speech but retain the ability to make online feedback corrections; indeed, the patients show an increased sensitivity to feedback. These results indicate that the cerebellum forms a crucial part of the feedforward control system for speech but is not essential for online, feedback control. PMID:28842410

  17. Speech perception and phonological short-term memory capacity in language impairment: preliminary evidence from adolescents with specific language impairment (SLI) and autism spectrum disorders (ASD).

    PubMed

    Loucas, Tom; Riches, Nick Greatorex; Charman, Tony; Pickles, Andrew; Simonoff, Emily; Chandler, Susie; Baird, Gillian

    2010-01-01

    The cognitive bases of language impairment in specific language impairment (SLI) and autism spectrum disorders (ASD) were investigated in a novel non-word comparison task which manipulated phonological short-term memory (PSTM) and speech perception, both implicated in poor non-word repetition. This study aimed to investigate the contributions of PSTM and speech perception in non-word processing and whether individuals with SLI and ASD plus language impairment (ALI) show similar or different patterns of deficit in these cognitive processes. Three groups of adolescents (aged 14-17 years), 14 with SLI, 16 with ALI, and 17 age and non-verbal IQ matched typically developing (TD) controls, made speeded discriminations between non-word pairs. Stimuli varied in PSTM load (two- or four-syllables) and speech perception load (mismatches on a word-initial or word-medial segment). Reaction times showed effects of both non-word length and mismatch position and these factors interacted: four-syllable and word-initial mismatch stimuli resulted in the slowest decisions. Individuals with language impairment showed the same pattern of performance as those with typical development in the reaction time data. A marginal interaction between group and item length was driven by the SLI and ALI groups being less accurate with long items than short ones, a difference not found in the TD group. Non-word discrimination suggests that there are similarities and differences between adolescents with SLI and ALI and their TD peers. Reaction times appear to be affected by increasing PSTM and speech perception loads in a similar way. However, there was some, albeit weaker, evidence that adolescents with SLI and ALI are less accurate than TD individuals, with both showing an effect of PSTM load. This may indicate, at some level, the processing substrate supporting both PSTM and speech perception is intact in adolescents with SLI and ALI, but also in both there may be impaired access to PSTM resources.

  18. Real-time classification of auditory sentences using evoked cortical activity in humans

    NASA Astrophysics Data System (ADS)

    Moses, David A.; Leonard, Matthew K.; Chang, Edward F.

    2018-06-01

    Objective. Recent research has characterized the anatomical and functional basis of speech perception in the human auditory cortex. These advances have made it possible to decode speech information from activity in brain regions like the superior temporal gyrus, but no published work has demonstrated this ability in real-time, which is necessary for neuroprosthetic brain-computer interfaces. Approach. Here, we introduce a real-time neural speech recognition (rtNSR) software package, which was used to classify spoken input from high-resolution electrocorticography signals in real-time. We tested the system with two human subjects implanted with electrode arrays over the lateral brain surface. Subjects listened to multiple repetitions of ten sentences, and rtNSR classified what was heard in real-time from neural activity patterns using direct sentence-level and HMM-based phoneme-level classification schemes. Main results. We observed single-trial sentence classification accuracies of 90% or higher for each subject with less than 7 minutes of training data, demonstrating the ability of rtNSR to use cortical recordings to perform accurate real-time speech decoding in a limited vocabulary setting. Significance. Further development and testing of the package with different speech paradigms could influence the design of future speech neuroprosthetic applications.
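
    A minimal sketch of the direct sentence-level classification idea: average the neural activity evoked by training repetitions of each sentence into a template and label a new trial by its correlation with each template. The data shapes, synthetic signals, and correlation metric are assumptions; the HMM-based phoneme-level scheme is not reproduced.

```python
# Sketch: the direct sentence-level classification idea -- average the neural
# activity evoked by training repetitions of each sentence into a template,
# then label a new trial by its correlation with each template. Data shapes,
# the synthetic signals, and the correlation metric are assumptions; the
# HMM-based phoneme-level scheme is not reproduced here.
import numpy as np

def build_templates(trials: np.ndarray, labels: np.ndarray) -> dict:
    """trials: (n_trials, n_electrodes, n_samples); labels: sentence indices."""
    return {s: trials[labels == s].mean(axis=0) for s in np.unique(labels)}

def classify(trial: np.ndarray, templates: dict) -> int:
    scores = {s: np.corrcoef(trial.ravel(), t.ravel())[0, 1]
              for s, t in templates.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 8)              # 10 sentences x 8 repetitions
patterns = rng.standard_normal((10, 64, 200))     # one "evoked" pattern each
trials = 0.5 * patterns[labels] + rng.standard_normal((80, 64, 200))

templates = build_templates(trials, labels)
predicted = classify(trials[3], templates)
print(f"predicted sentence: {predicted}, true: {labels[3]}")
```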

  19. Cross-modal associations in synaesthesia: Vowel colours in the ear of the beholder.

    PubMed

    Moos, Anja; Smith, Rachel; Miller, Sam R; Simmons, David R

    2014-01-01

    Human speech conveys many forms of information, but for some exceptional individuals (synaesthetes), listening to speech sounds can automatically induce visual percepts such as colours. In this experiment, grapheme-colour synaesthetes and controls were asked to assign colours, or shades of grey, to different vowel sounds. We then investigated whether the acoustic content of these vowel sounds influenced participants' colour and grey-shade choices. We found that both colour and grey-shade associations varied systematically with vowel changes. The colour effect was significant for both participant groups, but significantly stronger and more consistent for synaesthetes. Because not all vowel sounds that we used are "translatable" into graphemes, we conclude that acoustic-phonetic influences co-exist with established graphemic influences in the cross-modal correspondences of both synaesthetes and non-synaesthetes.

  20. What’s Wrong With Automatic Speech Recognition (ASR) and How Can We Fix It?

    DTIC Science & Technology

    2013-03-01

    Jordan Cohen, International Computer Science Institute, 1947 Center Street, Suite 600, Berkeley, CA 94704. March 2013, Final Report. Cleared for public release by the 88th Air Base Wing Public Affairs Office; 711th Human Performance Wing, Air Force Research Laboratory.
