Sample records for automatic speech recognizer

  1. Automatic Title Generation for Spoken Broadcast News

    DTIC Science & Technology

    2001-01-01

    degrades much less with speech-recognized transcripts. Meanwhile, even though KNN does not perform as well as TF.IDF and NBL in terms of the F1 metric, it...test corpus of 1006 broadcast news documents, comparing the results over manual transcription to the results over automatically recognized speech. We...use both F1 and the average number of correct title words in the correct order as metrics. Overall, the results show that title generation for speech

  2. Development of A Two-Stage Procedure for the Automatic Recognition of Dysfluencies in the Speech of Children Who Stutter: I. Psychometric Procedures Appropriate for Selection of Training Material for Lexical Dysfluency Classifiers

    PubMed Central

    Howell, Peter; Sackin, Stevie; Glenn, Kazan

    2007-01-01

    This program of work is intended to develop automatic recognition procedures to locate and assess stuttered dysfluencies. This and the following article together develop and test recognizers for repetitions and prolongations. The automatic recognizers classify the speech in two stages: in the first, the speech is segmented, and in the second, the segments are categorized. The units that are segmented are words. Here, assessments by human judges on the speech of 12 children who stutter, obtained using a corresponding procedure, are described. The accuracy of word-boundary placement across judges, the categorization of words as fluent, repetition, or prolongation, and the durations of the different fluency categories are reported. These measures allow reliable instances of repetitions and prolongations to be selected for training and assessing the recognizers in the subsequent paper. PMID:9328878

  3. Second Language Learners and Speech Act Comprehension

    ERIC Educational Resources Information Center

    Holtgraves, Thomas

    2007-01-01

    Recognizing the specific speech act (Searle, 1969) that a speaker performs with an utterance is a fundamental feature of pragmatic competence. Past research has demonstrated that native speakers of English automatically recognize speech acts when they comprehend utterances (Holtgraves & Ashley, 2001). The present research examined whether this…

  4. Bridging automatic speech recognition and psycholinguistics: Extending Shortlist to an end-to-end model of human speech recognition (L)

    NASA Astrophysics Data System (ADS)

    Scharenborg, Odette; ten Bosch, Louis; Boves, Lou; Norris, Dennis

    2003-12-01

    This letter evaluates potential benefits of combining human speech recognition (HSR) and automatic speech recognition by building a joint model of an automatic phone recognizer (APR) and a computational model of HSR, viz., Shortlist [Norris, Cognition 52, 189-234 (1994)]. Experiments based on "real-life" speech highlight critical limitations posed by some of the simplifying assumptions made in models of human speech recognition. These limitations could be overcome by avoiding hard phone decisions at the output side of the APR, and by using a match between the input and the internal lexicon that flexibly copes with deviations from canonical phonemic representations.
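
    The letter's key suggestion, avoiding hard phone decisions at the APR output, can be illustrated with a toy sketch (not the authors' implementation): contrast taking the 1-best phone per frame with passing the full per-frame posterior distribution to a naive lexicon matcher. The uniform alignment and all names below are illustrative assumptions.

    ```python
    import numpy as np

    def hard_decision(posteriors):
        """posteriors: (T x n_phones) per-frame phone posteriors.
        Taking the argmax discards information about runner-up phones."""
        return np.argmax(posteriors, axis=1)

    def soft_match_score(posteriors, phone_ids):
        """Score a canonical pronunciation (list of phone indices) against
        the posteriors using a deliberately naive uniform alignment; the
        full distribution is kept instead of a hard 1-best decision."""
        T = posteriors.shape[0]
        frames = np.linspace(0, T - 1, num=len(phone_ids)).astype(int)
        return float(np.sum(np.log(posteriors[frames, phone_ids] + 1e-12)))
    ```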

  5. Automatic lip reading by using multimodal visual features

    NASA Astrophysics Data System (ADS)

    Takahashi, Shohei; Ohya, Jun

    2013-12-01

    Speech recognition has been researched for a long time, but it does not work well in noisy places such as cars or trains. In addition, people who are deaf or hard of hearing cannot benefit from audio-based speech recognition. To recognize speech automatically, visual information is also important: people understand speech not only from audio information, but also from visual information such as temporal changes in lip shape. A vision-based speech recognition method could work well in noisy places and could also be useful for people with hearing disabilities. In this paper, we propose an automatic lip-reading method for recognizing speech using multimodal visual information, without any audio information. First, an Active Shape Model (ASM) is used to detect and track the face and lips in a video sequence. Second, shape, optical flow, and spatial frequency features are extracted from the lip region detected by the ASM. Next, the extracted multimodal features are ordered chronologically and a Support Vector Machine is trained to learn and classify the spoken words. Experiments on classifying several words show promising results for the proposed method.
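
    As a rough illustration of the classification stage described above (not the authors' code), the sketch below assumes per-frame lip features have already been extracted, e.g., by an ASM tracker, orders them chronologically into fixed-length vectors, and trains a Support Vector Machine on them.

    ```python
    import numpy as np
    from sklearn.svm import SVC

    def make_word_vector(frame_features, n_frames=30):
        """Order per-frame feature vectors chronologically and flatten,
        padding/truncating to a fixed length so the SVM sees equal-size
        inputs. frame_features: (frames x feature_dim) array."""
        out = np.zeros((n_frames, frame_features.shape[1]))
        t = min(n_frames, frame_features.shape[0])
        out[:t] = frame_features[:t]
        return out.ravel()

    def train_word_classifier(clips, labels):
        """clips: list of (frames x feature_dim) arrays, one per word."""
        X = np.stack([make_word_vector(c) for c in clips])
        clf = SVC(kernel="rbf", C=1.0)   # RBF kernel as a common default
        clf.fit(X, labels)
        return clf
    ```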

  6. How should a speech recognizer work?

    PubMed

    Scharenborg, Odette; Norris, Dennis; ten Bosch, Louis; McQueen, James M

    2005-11-12

    Although researchers studying human speech recognition (HSR) and automatic speech recognition (ASR) share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of communication follows largely from the fact that research in these related fields has focused on the mechanics of how speech can be recognized. In Marr's (1982) terms, emphasis has been on the algorithmic and implementational levels rather than on the computational level. In this article, we provide a computational-level analysis of the task of speech recognition, which reveals the close parallels between research concerned with HSR and ASR. We illustrate this relation by presenting a new computational model of human spoken-word recognition, built using techniques from the field of ASR, that, in contrast to existing models of HSR, recognizes words from real speech input.

  7. On the recognition of emotional vocal expressions: motivations for a holistic approach.

    PubMed

    Esposito, Anna; Esposito, Antonietta M

    2012-10-01

    Human beings seem to be able to recognize emotions from speech very well and information communication technology aims to implement machines and agents that can do the same. However, to be able to automatically recognize affective states from speech signals, it is necessary to solve two main technological problems. The former concerns the identification of effective and efficient processing algorithms capable of capturing emotional acoustic features from speech sentences. The latter focuses on finding computational models able to classify, with an approximation as good as human listeners, a given set of emotional states. This paper will survey these topics and provide some insights for a holistic approach to the automatic analysis, recognition and synthesis of affective states.

  8. Automatic Speech Recognition from Neural Signals: A Focused Review.

    PubMed

    Herff, Christian; Schultz, Tanja

    2016-01-01

    Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, concern about disturbing bystanders, or an inability to produce speech (e.g., patients suffering from locked-in syndrome). For these reasons it would be highly desirable not to speak but simply to imagine saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution, but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data, with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques applied to neural signals, we discuss the Brain-to-text system.

  9. Automatic speech recognition and training for severely dysarthric users of assistive technology: the STARDUST project.

    PubMed

    Parker, Mark; Cunningham, Stuart; Enderby, Pam; Hawley, Mark; Green, Phil

    2006-01-01

    The STARDUST project developed robust computer speech recognizers for use by eight people with severe dysarthria and concomitant physical disability to access assistive technologies. Independent computer speech recognizers trained with normal speech are of limited functional use to those with severe dysarthria, due to limited and inconsistent proximity to "normal" articulatory patterns. Severe dysarthric output may also be characterized by a small set of distinguishable phonetic tokens, making the acoustic differentiation of target words difficult. Speaker-dependent computer speech recognition using Hidden Markov Models was achieved by identifying robust phonetic elements within the individual speaker's output patterns. A new system of speech training using computer-generated visual and auditory feedback reduced the inconsistent production of key phonetic tokens over time.

  10. The Mechanical Recognition of Speech: Prospects for Use in the Teaching of Languages.

    ERIC Educational Resources Information Center

    Pulliam, Robert

    1970-01-01

    This paper begins with a brief account of the development of automatic speech recognition (ASR) and then proceeds to an examination of ASR systems typical of the kind now in operation. It is stressed that such systems, although highly developed, do not recognize speech in the same sense as the human being does, and that they cannot deal with a…

  11. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review

    NASA Astrophysics Data System (ADS)

    Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH

    2017-09-01

    This paper reviews state-of-the-art automatic speech recognition (ASR) based approaches for speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from a speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR-based solutions are increasingly being researched for speech and language therapy. ASR is a technology that converts human speech into transcript text by matching it against the system's library. This is particularly useful in speech rehabilitation therapies, as it provides accurate, real-time evaluation of speech input from an individual with a speech disorder. ASR-based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback on their mistakes. However, the accuracy of ASR depends on many factors, such as phoneme recognition, speech continuity, speaker and environmental differences, as well as the depth of our knowledge of human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.

  12. Random Deep Belief Networks for Recognizing Emotions from Speech Signals.

    PubMed

    Wen, Guihua; Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang

    2017-01-01

    Human emotions can now be recognized from speech signals using machine learning methods; however, these methods are challenged by low recognition accuracy in real applications due to a lack of rich representation ability. Deep belief networks (DBN) can automatically discover the multiple levels of representation in speech signals. To make full use of this advantage, this paper presents an ensemble of random deep belief networks (RDBN) for speech emotion recognition. It first extracts the low-level features of the input speech signal and then uses them to construct many random subspaces. Each random subspace is provided to a DBN to yield higher-level features, which serve as the input to a classifier that outputs an emotion label. All output emotion labels are then fused through majority voting to decide the final emotion label for the input speech signal. Experimental results on benchmark speech emotion databases show that RDBN achieves better accuracy than the compared methods for speech emotion recognition.

  13. Random Deep Belief Networks for Recognizing Emotions from Speech Signals

    PubMed Central

    Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang

    2017-01-01

    Human emotions can now be recognized from speech signals using machine learning methods; however, these methods are challenged by low recognition accuracy in real applications due to a lack of rich representation ability. Deep belief networks (DBN) can automatically discover the multiple levels of representation in speech signals. To make full use of this advantage, this paper presents an ensemble of random deep belief networks (RDBN) for speech emotion recognition. It first extracts the low-level features of the input speech signal and then uses them to construct many random subspaces. Each random subspace is provided to a DBN to yield higher-level features, which serve as the input to a classifier that outputs an emotion label. All output emotion labels are then fused through majority voting to decide the final emotion label for the input speech signal. Experimental results on benchmark speech emotion databases show that RDBN achieves better accuracy than the compared methods for speech emotion recognition. PMID:28356908
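
    The ensemble recipe shared by the two records above (random feature subspaces feeding member classifiers, fused by majority voting) can be sketched as follows. A generic scikit-learn classifier stands in for the paper's DBN members, and all parameter values are illustrative.

    ```python
    import numpy as np
    from collections import Counter
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)

    def train_subspace_ensemble(X, y, n_members=10, subspace_dim=40):
        """X: (samples x features); subspace_dim must be <= n features."""
        members = []
        for _ in range(n_members):
            idx = rng.choice(X.shape[1], size=subspace_dim, replace=False)
            clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)
            clf.fit(X[:, idx], y)        # each member sees its own subspace
            members.append((idx, clf))
        return members

    def predict_majority(members, x):
        votes = [clf.predict(x[idx].reshape(1, -1))[0] for idx, clf in members]
        return Counter(votes).most_common(1)[0][0]   # majority vote
    ```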

  14. Learning L2 Pronunciation with a Mobile Speech Recognizer: French /y/

    ERIC Educational Resources Information Center

    Liakin, Denis; Cardoso, Walcir; Liakina, Natallia

    2015-01-01

    This study investigates the acquisition of the L2 French vowel /y/ in a mobile-assisted learning environment, via the use of automatic speech recognition (ASR). Particularly, it addresses the question of whether ASR-based pronunciation instruction using a mobile device can improve the production and perception of French /y/. Forty-two elementary…

  15. Hidden Markov models in automatic speech recognition

    NASA Astrophysics Data System (ADS)

    Wrzoskowicz, Adam

    1993-11-01

    This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMMs designed to operate in a noisy environment. The author also describes a language for human-robot communications, defined as a complex multilevel network built from an HMM model of speech sounds geared towards Polish inflections. Mandatory lexical and syntactic rules were also added to the system for its communications vocabulary.
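
    The core recognition step underlying such HMM-based systems is Viterbi decoding. Below is a generic, self-contained sketch for a discrete-observation HMM; it is not the IBPT PAS system described above, and the parameters pi, A, and B are assumed to have been trained elsewhere (e.g., by Baum-Welch).

    ```python
    import numpy as np

    def viterbi(obs, pi, A, B):
        """obs: observation indices; pi: initial probs (S,);
        A: transition probs (S x S); B: emission probs (S x V).
        Returns the most likely state sequence, computed in the log
        domain for numerical stability."""
        S, T = len(pi), len(obs)
        logd = np.full((T, S), -np.inf)          # best log-prob per state
        back = np.zeros((T, S), dtype=int)       # backpointers
        logd[0] = np.log(pi) + np.log(B[:, obs[0]])
        for t in range(1, T):
            for s in range(S):
                cand = logd[t - 1] + np.log(A[:, s])
                back[t, s] = np.argmax(cand)
                logd[t, s] = cand[back[t, s]] + np.log(B[s, obs[t]])
        path = [int(np.argmax(logd[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]
    ```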

  16. Accelerometer-based automatic voice onset detection in speech mapping with navigated repetitive transcranial magnetic stimulation.

    PubMed

    Vitikainen, Anne-Mari; Mäkelä, Elina; Lioumis, Pantelis; Jousmäki, Veikko; Mäkelä, Jyrki P

    2015-09-30

    The use of navigated repetitive transcranial magnetic stimulation (rTMS) in mapping of speech-related brain areas has recently been shown to be useful in the preoperative workflow of epilepsy and tumor patients. However, substantial inter- and intraobserver variability and non-optimal replicability of the rTMS results have been reported, and a need for further development of the methodology is recognized. In TMS motor cortex mappings the evoked responses can be quantitatively monitored by electromyographic recordings; however, no such easily available setup exists for speech mappings. We present an accelerometer-based setup for detection of vocalization-related larynx vibrations, combined with an automatic routine for voice onset detection, for rTMS speech mapping employing a naming task. The results produced by the automatic routine were compared with manually reviewed video recordings. The new method was applied in routine navigated rTMS speech mapping for 12 consecutive patients during preoperative workup for epilepsy or tumor surgery. The automatic routine correctly detected 96% of the voice onsets, resulting in 96% sensitivity and 71% specificity. The majority (63%) of the misdetections were related to visible throat movements, extra voices before the response, or delayed naming of the previous stimuli. No-response errors were correctly detected in 88% of events. The proposed setup for automatic detection of voice onsets provides quantitative additional data for analysis of rTMS-induced speech response modifications. The objectively defined speech response latencies increase the repeatability, reliability and stratification of the rTMS results.
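
    For reference, the detection figures reported above follow the standard definitions of sensitivity and specificity; the counts in the small example below are illustrative, not the study's raw data.

    ```python
    def sensitivity(tp, fn):
        """True positive rate: detected onsets / all true onsets."""
        return tp / (tp + fn)

    def specificity(tn, fp):
        """True negative rate: correct rejections / all non-onsets."""
        return tn / (tn + fp)

    # e.g., 96 of 100 true onsets detected, 71 of 100 non-onsets rejected
    print(sensitivity(96, 4), specificity(71, 29))   # 0.96 0.71
    ```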

  17. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system.

    PubMed

    Zekveld, Adriana A; Kramer, Sophia E; Kessens, Judith M; Vlaming, Marcel S M G; Houtgast, Tammo

    2009-04-01

    The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system, improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that speech comprehension in noise by young listeners with normal hearing improves when presenting partly incorrect, automatically generated subtitles. The current study focused on the effects of age, hearing loss, visual working memory capacity, and linguistic skills on the benefit obtained from automatically generated subtitles during listening to speech in noise. In order to investigate the effects of age and hearing loss, three groups of participants were included: 22 young persons with normal hearing (YNH, mean age = 21 years), 22 middle-aged adults with normal hearing (MA-NH, mean age = 55 years) and 30 middle-aged adults with hearing impairment (MA-HI, mean age = 57 years). The benefit from automatic subtitling was measured by Speech Reception Threshold (SRT) tests (Plomp & Mimpen, 1979). Both unimodal auditory and bimodal audiovisual SRT tests were performed. In the audiovisual tests, the subtitles were presented simultaneously with the speech, whereas in the auditory test, only speech was presented. The difference between the auditory and audiovisual SRT was defined as the audiovisual benefit. Participants additionally rated the listening effort. We examined the influences of ASR accuracy level and text delay on the audiovisual benefit and the listening effort using a repeated measures General Linear Model analysis. In a correlation analysis, we evaluated the relationships between age, auditory SRT, visual working memory capacity and the audiovisual benefit and listening effort. The automatically generated subtitles improved speech comprehension in noise for all ASR accuracies and delays covered by the current study. Higher ASR accuracy levels resulted in more benefit obtained from the subtitles. Speech comprehension improved even for relatively low ASR accuracy levels; for example, participants obtained about 2 dB SNR audiovisual benefit for ASR accuracies around 74%. Delaying the presentation of the text reduced the benefit and increased the listening effort. Participants with relatively low unimodal speech comprehension obtained greater benefit from the subtitles than participants with better unimodal speech comprehension. We observed an age-related decline in the working-memory capacity of the listeners with normal hearing. A higher age and a lower working memory capacity were associated with increased effort required to use the subtitles to improve speech comprehension. Participants were able to use partly incorrect and delayed subtitles to increase their comprehension of speech in noise, regardless of age and hearing loss. This supports the further development and evaluation of an assistive listening system that displays automatically recognized speech to aid speech comprehension by listeners with hearing impairment.
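
    The audiovisual benefit used in this study is simply the unimodal auditory SRT minus the audiovisual SRT: a lower SRT is better, so a positive difference means the subtitles helped. The minimal sketch below uses made-up values.

    ```python
    def audiovisual_benefit(srt_auditory_db, srt_audiovisual_db):
        """Both SRTs in dB SNR; a positive return value means benefit."""
        return srt_auditory_db - srt_audiovisual_db

    print(audiovisual_benefit(-2.0, -4.0))   # e.g., 2 dB SNR benefit
    ```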

  18. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    NASA Technical Reports Server (NTRS)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

    A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers fast data/text entry and a small, lightweight overall design, and it frees the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaption, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks when facing constraints in computational resources.

  19. Speech recognition systems on the Cell Broadband Engine

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, Y; Jones, H; Vaidya, S

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time, a channel density that is orders of magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  20. Recognizing Whispered Speech Produced by an Individual with Surgically Reconstructed Larynx Using Articulatory Movement Data

    PubMed Central

    Cao, Beiming; Kim, Myungjong; Mau, Ted; Wang, Jun

    2017-01-01

    Individuals with an impaired larynx (vocal folds) have problems controlling their glottal vibration, producing whispered speech with extreme hoarseness. Standard automatic speech recognition using only acoustic cues is typically ineffective for whispered speech because the corresponding spectral characteristics are distorted. Articulatory cues such as tongue and lip motion may help in recognizing whispered speech, since articulatory motion patterns are generally not affected. In this paper, we investigated whispered speech recognition for patients with a reconstructed larynx using articulatory movement data. A data set with both acoustic and articulatory motion data was collected from a patient with a surgically reconstructed larynx using an electromagnetic articulograph. Two speech recognition systems, Gaussian mixture model-hidden Markov model (GMM-HMM) and deep neural network-HMM (DNN-HMM), were used in the experiments. Experimental results showed that adding either tongue or lip motion data to acoustic features such as mel-frequency cepstral coefficients (MFCC) significantly reduced the phone error rates of both speech recognition systems. Adding both tongue and lip data achieved the best performance. PMID:29423453
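
    The feature-fusion step described here amounts to concatenating time-aligned acoustic and articulatory feature streams frame by frame before recognizer training. The sketch below assumes the articulograph streams have already been resampled to the acoustic frame rate; the names are illustrative, not the authors' code.

    ```python
    import numpy as np

    def fuse_features(mfcc, tongue=None, lips=None):
        """mfcc: (T x 13) array; tongue/lips: (T x k) arrays or None.
        Returns the frame-wise concatenation of all provided streams."""
        streams = [mfcc] + [s for s in (tongue, lips) if s is not None]
        T = min(s.shape[0] for s in streams)     # guard against off-by-one
        return np.hstack([s[:T] for s in streams])
    ```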

  21. Automatic Speech Acquisition and Recognition for Spacesuit Audio Systems

    NASA Technical Reports Server (NTRS)

    Ye, Sherry

    2015-01-01

    NASA has a widely recognized but unmet need for novel human-machine interface technologies that can facilitate communication during astronaut extravehicular activities (EVAs), when loud noises and strong reverberations inside spacesuits make communication challenging. WeVoice, Inc., has developed a multichannel signal-processing method for speech acquisition in noisy and reverberant environments that enables automatic speech recognition (ASR) technology inside spacesuits. The technology reduces noise by exploiting differences between the statistical nature of signals (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, ASR accuracy can be improved to the level at which crewmembers will find the speech interface useful. System components and features include beam forming/multichannel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, and ASR decoding. Arithmetic complexity models were developed and will help designers of real-time ASR systems select proper tasks when confronted with constraints in computational resources. In Phase I of the project, WeVoice validated the technology. The company further refined the technology in Phase II and developed a prototype for testing and use by suited astronauts.

  22. A voice-input voice-output communication aid for people with severe speech impairment.

    PubMed

    Hawley, Mark S; Cunningham, Stuart P; Green, Phil D; Enderby, Pam; Palmer, Rebecca; Sehgal, Siddharth; O'Neill, Peter

    2013-01-01

    A new form of augmentative and alternative communication (AAC) device for people with severe speech impairment, the voice-input voice-output communication aid (VIVOCA), is described. The VIVOCA recognizes the disordered speech of the user and builds messages, which are converted into synthetic speech. System development was carried out employing user-centered design and development methods, which identified and refined key requirements for the device. A novel methodology for building small-vocabulary, speaker-dependent automatic speech recognizers with reduced amounts of training data was applied. Experiments showed that this method is successful in generating good recognition performance (mean accuracy 96%) on highly disordered speech, even when recognition perplexity is increased. The selected message-building technique traded off various factors, including speed of message construction and range of available message outputs. The VIVOCA was evaluated in a field trial by individuals with moderate to severe dysarthria, which confirmed that they can use the device to produce intelligible speech output from disordered speech input. The trial highlighted some issues that limit the performance and usability of the device in real usage situations, with mean recognition accuracy of 67% in these circumstances. These limitations will be addressed in future work.

  23. Landmark-Based Speech Recognition: Report of the 2004 Johns Hopkins Summer Workshop.

    PubMed

    Hasegawa-Johnson, Mark; Baker, James; Borys, Sarah; Chen, Ken; Coogan, Emily; Greenberg, Steven; Juneja, Amit; Kirchhoff, Katrin; Livescu, Karen; Mohan, Srividya; Muller, Jennifer; Sonmez, Kemal; Wang, Tianyu

    2005-01-01

    Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.

  24. Specific acoustic models for spontaneous and dictated style in indonesian speech recognition

    NASA Astrophysics Data System (ADS)

    Vista, C. B.; Satriawan, C. H.; Lestari, D. P.; Widyantoro, D. H.

    2018-03-01

    The performance of an automatic speech recognition system is affected by differences in speech style between the data the model was originally trained on and the incoming speech to be recognized. In this paper, the usage of GMM-HMM acoustic models for specific speech styles is investigated. We develop two systems for the experiments; the first employs a speech style classifier to predict the speech style of incoming speech, either spontaneous or dictated, then decodes this speech using an acoustic model specifically trained for that style. The second system uses both acoustic models to recognize incoming speech and decides upon a final result by calculating a confidence score for each decoding. Results show that training specific acoustic models for spontaneous and dictated speech styles confers a slight recognition advantage compared to a baseline model trained on a mixture of spontaneous and dictated training data. In addition, the speech style classifier approach of the first system produced slightly more accurate results than the confidence scoring employed in the second system.

  25. Tone classification of syllable-segmented Thai speech based on multilayer perceptron

    NASA Astrophysics Data System (ADS)

    Satravaha, Nuttavudh; Klinkhachorn, Powsiri; Lass, Norman

    2002-05-01

    Thai is a monosyllabic tonal language that uses tone to convey lexical information about the meaning of a syllable. Thus, to completely recognize a spoken Thai syllable, a speech recognition system not only has to recognize a base syllable but also must correctly identify a tone. Hence, tone classification of Thai speech is an essential part of a Thai speech recognition system. Thai has five distinctive tones ("mid," "low," "falling," "high," and "rising") and each tone is represented by a single fundamental frequency (F0) pattern. However, several factors, including tonal coarticulation, stress, intonation, and speaker variability, affect the F0 pattern of a syllable in continuous Thai speech. In this study, an efficient method for tone classification of syllable-segmented Thai speech, which incorporates the effects of tonal coarticulation, stress, and intonation, as well as a method to perform automatic syllable segmentation, were developed. Acoustic parameters were used as the main discriminating parameters. The F0 contour of a segmented syllable was normalized by using a z-score transformation before being presented to a tone classifier. The proposed system was evaluated on 920 test utterances spoken by 8 speakers. A recognition rate of 91.36% was achieved by the proposed system.
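
    The z-score normalization of the F0 contour mentioned above can be sketched as follows; treating frames with F0 == 0 as unvoiced is an illustrative convention, not necessarily the paper's.

    ```python
    import numpy as np

    def zscore_f0(f0_contour):
        """z-score transform an F0 contour, leaving unvoiced frames at 0,
        so per-speaker mean pitch and range differences are removed."""
        f0 = np.asarray(f0_contour, dtype=float)
        voiced = f0 > 0
        mu = f0[voiced].mean()
        sigma = f0[voiced].std()
        sigma = sigma if sigma > 0 else 1.0      # guard flat contours
        return np.where(voiced, (f0 - mu) / sigma, 0.0)
    ```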

  26. Intonation and dialog context as constraints for speech recognition.

    PubMed

    Taylor, P; King, S; Isard, S; Wright, H

    1998-01-01

    This paper describes a way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Maptask corpus, a corpus of spontaneous task-oriented dialog speech. This corpus has been tagged according to a dialog analysis scheme that assigns each utterance to one of 12 "move types," such as "acknowledge," "query-yes/no" or "instruct." Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the "correct" move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops. Of course when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type we use an intonation model combined with a dialog model that puts constraints on possible sequences of move types, as well as the speech recognizer likelihoods for the different move-specific models. In the full recognition system, the combination of automatic move type recognition with the move specific language models reduces the overall word error rate by a small but significant amount when compared with a baseline system that does not take intonation or dialog acts into account. Interestingly, the word error improvement is restricted to "initiating" move types, where word recognition is important. In "response" move types, where the important information is conveyed by the move type itself--for example, positive versus negative response--there is no word error improvement, but recognition of the response types themselves is good. The paper discusses the intonation model, the language models, and the dialog model in detail and describes the architecture in which they are combined.
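
    The move-specific language model idea can be sketched in miniature: score the recognized word sequence under one bigram LM per move type, combine it with a prior over move types (standing in here for the dialog and intonation models), and pick the best. Smoothing is omitted, and the floor probability for unseen bigrams is a crude illustrative choice.

    ```python
    import math

    def bigram_logprob(words, lm):
        """lm: dict mapping (prev_word, word) -> probability."""
        lp, prev = 0.0, "<s>"
        for w in words + ["</s>"]:
            lp += math.log(lm.get((prev, w), 1e-6))   # floor unseen bigrams
            prev = w
        return lp

    def best_move_type(words, move_lms, move_prior):
        """move_lms: dict move -> bigram LM; move_prior: dict move -> p > 0."""
        scores = {m: bigram_logprob(words, lm) + math.log(move_prior[m])
                  for m, lm in move_lms.items()}
        return max(scores, key=scores.get)
    ```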

  27. Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.

    PubMed

    Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger

    2011-01-01

    The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system, with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated whether the most common ASR features contain all the information required to recognize speech in noisy environments, by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate of the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.

  28. Noise-robust speech recognition through auditory feature detection and spike sequence decoding.

    PubMed

    Schafer, Phillip B; Jin, Dezhe Z

    2014-03-01

    Speech recognition in noisy conditions is a major challenge for computer systems, but the human brain performs it routinely and accurately. Automatic speech recognition (ASR) systems that are inspired by neuroscience can potentially bridge the performance gap between humans and machines. We present a system for noise-robust isolated word recognition that works by decoding sequences of spikes from a population of simulated auditory feature-detecting neurons. Each neuron is trained to respond selectively to a brief spectrotemporal pattern, or feature, drawn from the simulated auditory nerve response to speech. The neural population conveys the time-dependent structure of a sound by its sequence of spikes. We compare two methods for decoding the spike sequences--one using a hidden Markov model-based recognizer, the other using a novel template-based recognition scheme. In the latter case, words are recognized by comparing their spike sequences to template sequences obtained from clean training data, using a similarity measure based on the length of the longest common sub-sequence. Using isolated spoken digits from the AURORA-2 database, we show that our combined system outperforms a state-of-the-art robust speech recognizer at low signal-to-noise ratios. Both the spike-based encoding scheme and the template-based decoding offer gains in noise robustness over traditional speech recognition methods. Our system highlights potential advantages of spike-based acoustic coding and provides a biologically motivated framework for robust ASR development.
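
    The template-based decoding described above hinges on the length of the longest common subsequence (LCS) between spike sequences. A minimal sketch follows, assuming sequences are represented as lists of neuron IDs in firing order; the dynamic program is the standard O(mn) one.

    ```python
    def lcs_length(a, b):
        """Length of the longest common subsequence of sequences a and b."""
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n]

    def recognize(test_seq, templates):
        """templates: dict word -> list of template spike sequences.
        Picks the word whose best template is most similar to the input."""
        return max(templates,
                   key=lambda w: max(lcs_length(test_seq, t)
                                     for t in templates[w]))
    ```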

  29. The Automation System Censor Speech for the Indonesian Rude Swear Words Based on Support Vector Machine and Pitch Analysis

    NASA Astrophysics Data System (ADS)

    Endah, S. N.; Nugraheni, D. M. K.; Adhy, S.; Sutikno

    2017-04-01

    According to Law No. 32 of 2002 and Indonesian Broadcasting Commission Regulations No. 02/P/KPI/12/2009 & No. 03/P/KPI/12/2009, broadcast programs must not scold with harsh words, or harass, insult, or demean minorities and marginalized groups. However, there are no suitable tools to censor those words automatically. Therefore, research to develop intelligent software that censors such words automatically is needed. To conduct censoring, the system must be able to recognize the words in question. This research proposes classifying speech into two classes using a Support Vector Machine (SVM): the first class is a set of rude words and the second class is a set of proper words. Speech pitch values are used as the SVM input in developing the system for Indonesian rude swear words. The results of the experiment show that SVM performs well for this system.

  30. Speech input system for meat inspection and pathological coding used thereby

    NASA Astrophysics Data System (ADS)

    Abe, Shozo

    Meat inspection is one of the exclusive and important jobs of veterinarians, though it is not well known in general. Since the inspection must be conducted skillfully during a series of continuous operations in a slaughterhouse, the development of automatic inspection systems has long been sought. We employed a hands-free speech input system to record the inspection data, because inspectors have to use both hands to handle the internal organs of cattle and check their health condition by the naked eye. The data collected by the inspectors are transferred to a speech recognizer and then stored as controllable data for each animal inspected. Control of terms, such as the pathological conditions to be input, and their coding are also important in this speech input system, and practical examples are shown.

  31. Restoring the missing features of the corrupted speech using linear interpolation methods

    NASA Astrophysics Data System (ADS)

    Rassem, Taha H.; Makbol, Nasrin M.; Hasan, Ali Muttaleb; Zaki, Siti Syazni Mohd; Girija, P. N.

    2017-10-01

    One of the main challenges in automatic speech recognition (ASR) is noise. The performance of an ASR system degrades significantly if the speech is corrupted by noise. In the spectrogram representation of a speech signal, deleting low signal-to-noise ratio (SNR) elements yields an incomplete spectrogram. In this case, the speech recognizer should modify the spectrogram to restore the missing elements before performing recognition. This can be done using different spectrogram reconstruction methods. In this paper, the geometrical spectrogram reconstruction methods suggested by previous researchers are implemented as a toolbox. In these geometrical reconstruction methods, linear interpolation along time or along frequency is used to predict the missing elements between adjacent observed elements in the spectrogram. Moreover, a new linear interpolation method using time and frequency together is presented. The CMU Sphinx III software is used in the experiments to test the performance of the linear interpolation reconstruction methods. The experiments are conducted under different conditions, such as different window lengths and different utterance lengths. A speech corpus consisting of 20 male and 20 female speakers, each with two different utterances, is used in the experiments. As a result, 80% recognition accuracy is achieved at a 25% SNR.
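
    A minimal sketch of linear interpolation over missing spectrogram cells follows. It assumes missing cells are marked NaN after low-SNR deletion, and the combined time-frequency variant shown (averaging the two directional estimates) is one simple reading of the combined method, not necessarily the paper's exact formulation.

    ```python
    import numpy as np

    def interpolate_along_time(spec):
        """spec: 2-D array (freq_bins x frames), NaN marks missing cells.
        Fills gaps per frequency bin from the nearest observed frames."""
        out = spec.copy()
        frames = np.arange(spec.shape[1])
        for f in range(spec.shape[0]):
            row = out[f]
            known = ~np.isnan(row)
            if known.sum() >= 2:          # need two anchors to interpolate
                row[~known] = np.interp(frames[~known],
                                        frames[known], row[known])
        return out

    def interpolate_time_freq(spec):
        """Average time-wise and frequency-wise estimates; nanmean keeps
        whichever direction produced a value where the other could not."""
        est_t = interpolate_along_time(spec)
        est_f = interpolate_along_time(spec.T).T
        return np.nanmean(np.stack([est_t, est_f]), axis=0)
    ```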

  32. Effects and modeling of phonetic and acoustic confusions in accented speech.

    PubMed

    Fung, Pascale; Liu, Yi

    2005-11-01

    Accented speech recognition is more challenging than standard speech recognition due to the effects of phonetic and acoustic confusions. Phonetic confusion in accented speech occurs when an expected phone is pronounced as a different one, which leads to erroneous recognition. Acoustic confusion occurs when the pronounced phone is found to lie acoustically between two baseform models and can be equally recognized as either one. We propose that it is necessary to analyze and model these confusions separately in order to improve accented speech recognition without degrading standard speech recognition. Since low phonetic confusion units in accented speech do not give rise to automatic speech recognition errors, we focus on analyzing and reducing phonetic and acoustic confusability under high phonetic confusion conditions. We propose using a likelihood ratio test to measure phonetic confusion, and an asymmetric acoustic distance to measure acoustic confusion. Only accent-specific phonetic units with low acoustic confusion are used in an augmented pronunciation dictionary, while phonetic units with high acoustic confusion are reconstructed using decision tree merging. Experimental results show that our approach is effective and superior to methods modeling phonetic confusion or acoustic confusion alone in accented speech, with a significant 5.7% absolute WER reduction, without degrading standard speech recognition.
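
    A hedged sketch of the likelihood-ratio measure of phonetic confusion follows: compare the likelihood of the accented speech frames under the expected (baseform) phone model against the best competing phone model. The score() interface and all names are assumptions for illustration, not the authors' implementation.

    ```python
    def log_likelihood_ratio(frames, expected_model, competing_models):
        """Each model is assumed to expose score(frames) -> total
        log-likelihood (as, e.g., fitted GMMs commonly do). A strongly
        negative ratio flags a high-confusion phonetic unit."""
        ll_expected = expected_model.score(frames)
        ll_best_other = max(m.score(frames) for m in competing_models)
        return ll_expected - ll_best_other
    ```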

  33. Difficulties in Automatic Speech Recognition of Dysarthric Speakers and Implications for Speech-Based Applications Used by the Elderly: A Literature Review

    ERIC Educational Resources Information Center

    Young, Victoria; Mihailidis, Alex

    2010-01-01

    Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone, especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…

  34. Text as a Supplement to Speech in Young and Older Adults

    PubMed Central

    Krull, Vidya; Humes, Larry E.

    2015-01-01

    Objective The purpose of this experiment was to quantify the contribution of visual text to auditory speech recognition in background noise. Specifically, we tested the hypothesis that partially accurate visual text from an automatic speech recognizer could be used successfully to supplement speech understanding in difficult listening conditions in older adults, with normal or impaired hearing. Our working hypotheses were based on what is known regarding audiovisual speech perception in the elderly from speechreading literature. We hypothesized that: 1) combining auditory and visual text information will result in improved recognition accuracy compared to auditory or visual text information alone; 2) benefit from supplementing speech with visual text (auditory and visual enhancement) in young adults will be greater than that in older adults; and 3) individual differences in performance on perceptual measures would be associated with cognitive abilities. Design Fifteen young adults with normal hearing, fifteen older adults with normal hearing, and fifteen older adults with hearing loss participated in this study. All participants completed sentence recognition tasks in auditory-only, text-only, and combined auditory-text conditions. The auditory sentence stimuli were spectrally shaped to restore audibility for the older participants with impaired hearing. All participants also completed various cognitive measures, including measures of working memory, processing speed, verbal comprehension, perceptual and cognitive speed, processing efficiency, inhibition, and the ability to form wholes from parts. Group effects were examined for each of the perceptual and cognitive measures. Audiovisual benefit was calculated relative to performance on auditory-only and visual-text only conditions. Finally, the relationship between perceptual measures and other independent measures were examined using principal-component factor analyses, followed by regression analyses. Results Both young and older adults performed similarly on nine out of ten perceptual measures (auditory, visual, and combined measures). Combining degraded speech with partially correct text from an automatic speech recognizer improved the understanding of speech in both young and older adults, relative to both auditory- and text-only performance. In all subjects, cognition emerged as a key predictor for a general speech-text integration ability. Conclusions These results suggest that neither age nor hearing loss affected the ability of subjects to benefit from text when used to support speech, after ensuring audibility through spectral shaping. These results also suggest that the benefit obtained by supplementing auditory input with partially accurate text is modulated by cognitive ability, specifically lexical and verbal skills. PMID:26458131

  35. Analysis of human scream and its impact on text-independent speaker verification.

    PubMed

    Hansen, John H L; Nandwana, Mahesh Kumar; Shokouhi, Navid

    2017-04-01

    Scream is defined as sustained, high-energy vocalizations that lack phonological structure. Lack of phonological structure is how scream is identified from other forms of loud vocalization, such as "yell." This study investigates the acoustic aspects of screams and addresses those that are known to prevent standard speaker identification systems from recognizing the identity of screaming speakers. It is well established that speaker variability due to changes in vocal effort and Lombard effect contribute to degraded performance in automatic speech systems (i.e., speech recognition, speaker identification, diarization, etc.). However, previous research in the general area of speaker variability has concentrated on human speech production, whereas less is known about non-speech vocalizations. The UT-NonSpeech corpus is developed here to investigate speaker verification from scream samples. This study considers a detailed analysis in terms of fundamental frequency, spectral peak shift, frame energy distribution, and spectral tilt. It is shown that traditional speaker recognition based on the Gaussian mixture models-universal background model framework is unreliable when evaluated with screams.

  36. Human phoneme recognition depending on speech-intrinsic variability.

    PubMed

    Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger

    2010-11-01

    The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).

  37. Speech systems research at Texas Instruments

    NASA Technical Reports Server (NTRS)

    Doddington, George R.

    1977-01-01

    An assessment of automatic speech processing technology is presented. Fundamental problems in the development and the deployment of automatic speech processing systems are defined and a technology forecast for speech systems is presented.

  38. A technology prototype system for rating therapist empathy from audio recordings in addiction counseling.

    PubMed

    Xiao, Bo; Huang, Chewei; Imel, Zac E; Atkins, David C; Georgiou, Panayiotis; Narayanan, Shrikanth S

    2016-04-01

    Scaling up psychotherapy services such as for addiction counseling is a critical societal need. One challenge is ensuring quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy, a key therapy quality index, from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimated therapy-session level empathy codes using utterance level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality assurance and therapist training.

  39. A technology prototype system for rating therapist empathy from audio recordings in addiction counseling

    PubMed Central

    Xiao, Bo; Huang, Chewei; Imel, Zac E.; Atkins, David C.; Georgiou, Panayiotis; Narayanan, Shrikanth S.

    2016-01-01

    Scaling up psychotherapy services such as for addiction counseling is a critical societal need. One challenge is ensuring quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy, a key therapy quality index, from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimated therapy-session level empathy codes using utterance level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality assurance and therapist training. PMID:28286867

  40. Using Automatic Speech Recognition to Dictate Mathematical Expressions: The Development of the "TalkMaths" Application at Kingston University

    ERIC Educational Resources Information Center

    Wigmore, Angela; Hunter, Gordon; Pflugel, Eckhard; Denholm-Price, James; Binelli, Vincent

    2009-01-01

    Speech technology--especially automatic speech recognition--has now advanced to a level where it can be of great benefit both to able-bodied people and those with various disabilities. In this paper we describe an application "TalkMaths" which, using the output from a commonly-used conventional automatic speech recognition system,…

  41. Factors that influence the performance of experienced speech recognition users.

    PubMed

    Koester, Heidi Horstmann

    2006-01-01

    Performance on automatic speech recognition (ASR) systems for users with physical disabilities varies widely between individuals. The goal of this study was to discover some key factors that account for that variation. Using data from 23 experienced ASR users with physical disabilities, the effect of 20 different independent variables on recognition accuracy and text entry rate with ASR was measured using bivariate and multivariate analyses. The results show that use of appropriate correction strategies had the strongest influence on user performance with ASR. The amount of time the user spent on his or her computer, the user's manual typing speed, and the speed with which the ASR system recognized speech were all positively associated with better performance. The amount or perceived adequacy of ASR training did not have a significant impact on performance for this user group.

  2. The development of the Athens Emotional States Inventory (AESI): collection, validation and automatic processing of emotionally loaded sentences.

    PubMed

    Chaspari, Theodora; Soldatos, Constantin; Maragos, Petros

    2015-01-01

    This work concerns the development of ecologically valid procedures for collecting reliable and unbiased emotional data towards computer interfaces with social and affective intelligence targeting patients with mental disorders. To this end, the Athens Emotional States Inventory (AESI) proposes the design, recording and validation of an audiovisual database for five emotional states: anger, fear, joy, sadness and neutral. The items of the AESI consist of sentences each having content indicative of the corresponding emotion. Emotional content was assessed through a survey of 40 young participants with a questionnaire following the Latin square design. The emotional sentences that were correctly identified by 85% of the participants were recorded in a soundproof room with microphones and cameras. A preliminary validation of the AESI is performed through automatic emotion recognition experiments from speech. The resulting database contains 696 utterances recorded in the Greek language by 20 native speakers and has a total duration of approximately 28 min. Speech classification results yield accuracy up to 75.15% for automatically recognizing the emotions in the AESI. These results indicate the usefulness of our approach for collecting emotional data with reliable content, balanced across classes and with reduced environmental variability.

  3. Brain-to-text: decoding spoken phrases from phone representations in the brain.

    PubMed

    Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja

    2015-01-01

    It has long been speculated whether communication between humans and machines based on natural speech-related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech.

  4. Brain-to-text: decoding spoken phrases from phone representations in the brain

    PubMed Central

    Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja

    2015-01-01

    It has long been speculated whether communication between humans and machines based on natural speech-related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech. PMID:26124702
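
    Because Brain-To-Text borrows standard ASR machinery, its decoding step can be illustrated generically: a Viterbi search over phone states given per-frame log-likelihoods. The sketch below uses random stand-ins for likelihoods derived from ECoG features and a flat transition model; it is not the authors' implementation.

        import numpy as np

        def viterbi(log_obs, log_trans, log_init):
            # Most likely state path; log_obs is (T, S) frame log-likelihoods
            T, S = log_obs.shape
            delta = log_init + log_obs[0]
            back = np.zeros((T, S), dtype=int)
            for t in range(1, T):
                scores = delta[:, None] + log_trans      # scores[i, j]: i -> j
                back[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + log_obs[t]
            path = [int(delta.argmax())]
            for t in range(T - 1, 0, -1):
                path.append(int(back[t, path[-1]]))
            return path[::-1]

        # Toy run: 3 phone states, 6 frames of made-up log-likelihoods
        rng = np.random.default_rng(0)
        log_obs = np.log(rng.dirichlet(np.ones(3), size=6))
        log_trans = np.log(np.full((3, 3), 1 / 3))
        log_init = np.log(np.full(3, 1 / 3))
        print(viterbi(log_obs, log_trans, log_init))

    In the full system, the per-frame scores would come from models of ECoG features for each phone, and a language model would constrain the resulting word sequence.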

  5. Automatic feedback to promote safe walking and speech loudness control in persons with multiple disabilities: two single-case studies.

    PubMed

    Lancioni, Giulio E; Singh, Nirbhay N; O'Reilly, Mark F; Green, Vanessa A; Alberti, Gloria; Boccasini, Adele; Smaldone, Angela; Oliva, Doretta; Bosco, Andrea

    2014-08-01

    These two single-case studies assessed automatic feedback technologies for promoting safe travel and speech loudness control, respectively, in two men with multiple disabilities. In Study I, the technology involved a microprocessor, two photocells, and a verbal feedback device. The man received verbal alerting/feedback when the photocells spotted an obstacle in front of him. In Study II, the technology involved a sound-detecting unit connected to a throat microphone and an airborne microphone, and to a vibration device. Vibration occurred when the man's speech loudness exceeded a preset level. The man in Study I succeeded in using the automatic feedback as a substitute for caregivers' alerting/feedback for safe travel. The man in Study II used the automatic feedback to successfully reduce his speech loudness. Automatic feedback can be highly effective in helping persons with multiple disabilities improve their travel and speech performance.

  6. Integrating hidden Markov model and PRAAT: a toolbox for robust automatic speech transcription

    NASA Astrophysics Data System (ADS)

    Kabir, A.; Barker, J.; Giurgiu, M.

    2010-09-01

    An automatic time-aligned phone transcription toolbox for English speech corpora has been developed. The toolbox is particularly useful for generating robust automatic transcriptions, and it can produce phone-level transcriptions using speaker-independent as well as speaker-dependent models without manual intervention. The system is based on the standard Hidden Markov Model (HMM) approach and was successfully tested on a large audiovisual speech corpus, the GRID corpus. One of the most powerful features of the toolbox is its increased flexibility in speech processing: the speech community can import automatic transcriptions generated by the HMM Toolkit (HTK) into a popular transcription software package, PRAAT, and vice versa. The toolbox has been evaluated through statistical analysis on GRID data, which shows that the automatic transcription deviates from manual transcription by an average of 20 ms.
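
    The HTK-to-PRAAT direction of such a toolbox can be sketched directly: HTK label files contain one 'start end label' triple per line with times in 100 ns units, while PRAAT reads interval tiers from plain-text TextGrid files. This is a minimal illustrative converter, not the published toolbox.

        HTK_UNIT = 1e-7  # HTK label times are expressed in 100 ns units

        def htk_lab_to_textgrid(lab_lines, tier_name="phones"):
            # Convert HTK-style 'start end label' lines to a TextGrid string
            ivals = [(float(s) * HTK_UNIT, float(e) * HTK_UNIT, lab)
                     for s, e, lab in (line.split()[:3] for line in lab_lines)]
            xmax = ivals[-1][1]
            out = ['File type = "ooTextFile"', 'Object class = "TextGrid"', "",
                   "xmin = 0", f"xmax = {xmax}", "tiers? <exists>", "size = 1",
                   "item []:", "    item [1]:", '        class = "IntervalTier"',
                   f'        name = "{tier_name}"', "        xmin = 0",
                   f"        xmax = {xmax}",
                   f"        intervals: size = {len(ivals)}"]
            for i, (x0, x1, lab) in enumerate(ivals, 1):
                out += [f"        intervals [{i}]:", f"            xmin = {x0}",
                        f"            xmax = {x1}", f'            text = "{lab}"']
            return "\n".join(out)

        print(htk_lab_to_textgrid(["0 2500000 sil", "2500000 4200000 b"]))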

  7. Automatic detection of articulation disorders in children with cleft lip and palate.

    PubMed

    Maier, Andreas; Hönig, Florian; Bocklet, Tobias; Nöth, Elmar; Stelzle, Florian; Nkenke, Emeka; Schuster, Maria

    2009-11-01

    Speech of children with cleft lip and palate (CLP) is sometimes still disordered even after adequate surgical and nonsurgical therapies. Such speech shows complex articulation disorders, which are usually assessed perceptually, consuming time and manpower. Hence, there is a need for an easy-to-apply and reliable automatic method. To create a reference for an automatic system, speech data of 58 children with CLP were assessed perceptually by experienced speech therapists for characteristic phonetic disorders at the phoneme level. The first part of the article aims to detect such characteristics by a semiautomatic procedure, and the second to evaluate a fully automatic, and thus simple-to-apply, procedure. The methods are based on a combination of speech processing algorithms. The semiautomatic method achieves moderate to good agreement (kappa approximately 0.6) for the detection of all phonetic disorders. On a speaker level, significant correlations between the perceptual evaluation and the automatic system of 0.89 are obtained. The fully automatic system yields a correlation on the speaker level of 0.81 to the perceptual evaluation. This correlation is in the range of the inter-rater correlation of the listeners. The automatic speech evaluation is able to detect phonetic disorders at an expert's level without any additional human postprocessing.

  8. Military applications of automatic speech recognition and future requirements

    NASA Technical Reports Server (NTRS)

    Beek, Bruno; Cupples, Edward J.

    1977-01-01

    An updated summary of the state-of-the-art of automatic speech recognition and its relevance to military applications is provided. A number of potential systems for military applications are under development. These include: (1) digital narrowband communication systems; (2) automatic speech verification; (3) on-line cartographic processing unit; (4) word recognition for militarized tactical data system; and (5) voice recognition and synthesis for aircraft cockpit.

  9. Selected Topics from LVCSR Research for Asian Languages at Tokyo Tech

    NASA Astrophysics Data System (ADS)

    Furui, Sadaoki

    This paper presents our recent work in regard to building Large Vocabulary Continuous Speech Recognition (LVCSR) systems for the Thai, Indonesian, and Chinese languages. For Thai, since there is no word boundary in the written form, we have proposed a new method for automatically creating word-like units from a text corpus, and applied topic and speaking style adaptation to the language model to recognize spoken-style utterances. For Indonesian, we have applied proper noun-specific adaptation to acoustic modeling, and rule-based English-to-Indonesian phoneme mapping to solve the problem of large variation in proper noun and English word pronunciation in a spoken-query information retrieval system. In spoken Chinese, long organization names are frequently abbreviated, and abbreviated utterances cannot be recognized if the abbreviations are not included in the dictionary. We have proposed a new method for automatically generating Chinese abbreviations, and by expanding the vocabulary using the generated abbreviations, we have significantly improved the performance of spoken query-based search.

  10. Automatic speech recognition technology development at ITT Defense Communications Division

    NASA Technical Reports Server (NTRS)

    White, George M.

    1977-01-01

    An assessment of the applications of automatic speech recognition to defense communication systems is presented. Future research efforts include investigations into the following areas: (1) dynamic programming; (2) recognition of speech degraded by noise; (3) speaker independent recognition; (4) large vocabulary recognition; (5) word spotting and continuous speech recognition; and (6) isolated word recognition.

  11. Automatic Method of Pause Measurement for Normal and Dysarthric Speech

    ERIC Educational Resources Information Center

    Rosen, Kristin; Murdoch, Bruce; Folker, Joanne; Vogel, Adam; Cahill, Louise; Delatycki, Martin; Corben, Louise

    2010-01-01

    This study proposes an automatic method for the detection of pauses and identification of pause types in conversational speech for the purpose of measuring the effects of Friedreich's Ataxia (FRDA) on speech. Speech samples of [approximately] 3 minutes were recorded from 13 speakers with FRDA and 18 healthy controls. Pauses were measured from the…

  12. Call recognition and individual identification of fish vocalizations based on automatic speech recognition: An example with the Lusitanian toadfish.

    PubMed

    Vieira, Manuel; Fonseca, Paulo J; Amorim, M Clara P; Teixeira, Carlos J C

    2015-12-01

    The study of acoustic communication in animals often requires not only the recognition of species-specific acoustic signals but also the identification of individual subjects, all in a complex acoustic background. Moreover, when very long recordings are to be analyzed, automatic recognition and identification processes are invaluable tools to extract the relevant biological information. A pattern recognition methodology based on hidden Markov models is presented, inspired by successful results obtained with the most widely known and complex acoustic communication signal: human speech. This methodology was applied here for the first time to the detection and recognition of fish acoustic signals, specifically in a stream of round-the-clock recordings of Lusitanian toadfish (Halobatrachus didactylus) in their natural estuarine habitat. The results show that this methodology is able not only to detect the mating sounds (boatwhistles) but also to identify individual male toadfish, reaching an identification rate of ca. 95%. Moreover, this method also proved to be a powerful tool to assess signal durations in large data sets. However, the system failed to recognize other sound types.
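
    The HMM recipe carried over from speech can be sketched with the hmmlearn library (our choice; the paper's toolchain is not stated): fit one Gaussian HMM per call type on feature sequences and classify a new sequence by maximum log-likelihood. All features below are synthetic stand-ins.

        import numpy as np
        from hmmlearn import hmm  # assumed third-party dependency

        rng = np.random.default_rng(1)

        def toy_sequences(offset, n=5, T=40, d=3):
            # Stand-in acoustic feature sequences (e.g., cepstra) for one class
            return [offset + rng.normal(size=(T, d)) for _ in range(n)]

        train = {"boatwhistle": toy_sequences(0.0), "grunt": toy_sequences(2.0)}
        models = {}
        for label, seqs in train.items():
            m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
            m.fit(np.vstack(seqs), [len(s) for s in seqs])
            models[label] = m

        def classify(seq):
            # Pick the call type whose HMM assigns the highest log-likelihood
            return max(models, key=lambda lab: models[lab].score(seq))

        print(classify(toy_sequences(0.0, n=1)[0]))  # expected: 'boatwhistle'

    Individual identification works the same way, with one model per male instead of one per call type.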

  13. Development of coffee maker service robot using speech and face recognition systems using POMDP

    NASA Astrophysics Data System (ADS)

    Budiharto, Widodo; Meiliana; Santoso Gunawan, Alexander Agung

    2016-07-01

    Many intelligent service robots have been developed to interact with users naturally. This can be done by embedding speech and face recognition abilities for specific tasks into the robot. In this research, we propose an intelligent coffee-maker robot whose speech recognition is based on the Indonesian language and powered by a statistical dialogue system. This kind of robot can be used in the office, supermarket or restaurant. In our scenario, the robot recognizes the user's face and then accepts commands from the user to perform an action, specifically making a coffee. Based on our previous work, the accuracy of speech recognition is about 86% and of face recognition about 93% in laboratory experiments. The main problem is to infer the user's intention regarding how sweet the coffee should be. The robot must conclude the user's intention through conversation under unreliable automatic speech recognition in a noisy environment. In this paper, this spoken dialogue problem is treated as a partially observable Markov decision process (POMDP). We describe how this formulation establishes a promising framework, and present dialogue simulations that demonstrate significant quantitative outcomes.
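
    The heart of the POMDP treatment is a belief over the hidden user intention that is updated after every noisy recognition, with the policy asking again until the belief is confident. A minimal sketch, with an invented three-level sweetness state space and a made-up ASR confusion model:

        import numpy as np

        # Hidden states: the user's intended sweetness; observations: ASR output
        states = ["no_sugar", "one_sugar", "two_sugars"]
        belief = np.full(len(states), 1 / len(states))   # uniform prior

        # P(heard | intended): illustrative ASR confusion probabilities
        obs_model = {
            "none": np.array([0.7, 0.2, 0.1]),
            "one":  np.array([0.2, 0.6, 0.2]),
            "two":  np.array([0.1, 0.2, 0.7]),
        }

        def update(belief, heard):
            # Bayes rule, b'(s) proportional to P(o|s) * b(s); intention static
            b = obs_model[heard] * belief
            return b / b.sum()

        for heard in ["one", "one", "two"]:   # noisy recognitions in a dialogue
            belief = update(belief, heard)

        # A simple policy: act when confident, otherwise ask a clarifying question
        if belief.max() > 0.8:
            print("make coffee:", states[int(belief.argmax())])
        else:
            print("ask again; belief =", belief.round(3))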

  14. Automatic detection of Parkinson's disease in running speech spoken in three different languages.

    PubMed

    Orozco-Arroyave, J R; Hönig, F; Arias-Londoño, J D; Vargas-Bonilla, J F; Daqrouq, K; Skodda, S; Rusz, J; Nöth, E

    2016-01-01

    The aim of this study is the analysis of continuous speech signals of people with Parkinson's disease (PD), considering recordings in different languages (Spanish, German, and Czech). A method for the characterization of the speech signals, based on the automatic segmentation of utterances into voiced and unvoiced frames, is addressed here. The energy content of the unvoiced sounds is modeled using 12 Mel-frequency cepstral coefficients and 25 bands scaled according to the Bark scale. Four speech tasks comprising isolated words, rapid repetition of the syllables /pa/-/ta/-/ka/, sentences, and read texts are evaluated. The method proves to be more accurate than classical approaches in the automatic classification of speech of people with PD and healthy controls. The accuracies range from 85% to 99% depending on the language and the speech task. Cross-language experiments are also performed, confirming the robustness and generalization capability of the method, with accuracies ranging from 60% to 99%. This work represents a step forward in the development of computer-aided tools for the automatic assessment of dysarthric speech signals in multiple languages.
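
    The unvoiced-frame feature extraction can be approximated as follows with librosa (our library choice; the thresholds, the energy/zero-crossing heuristic, and the synthetic signal are illustrative, and the paper's Bark-band energies are not reproduced):

        import numpy as np
        import librosa

        sr = 16000
        t = np.arange(sr) / sr
        # Synthetic stand-in: a 'voiced' tone followed by 'fricative-like' noise
        y = np.concatenate([0.5 * np.sin(2 * np.pi * 150 * t),
                            0.1 * np.random.default_rng(0).normal(size=sr)])

        hop = 256
        rms = librosa.feature.rms(y=y, hop_length=hop)[0]
        zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop)[0]
        # Heuristic: unvoiced frames are audible but rich in zero crossings
        unvoiced = (rms > 0.05 * rms.max()) & (zcr > np.median(zcr))

        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, hop_length=hop)
        n = min(mfcc.shape[1], unvoiced.shape[0])
        unvoiced_mfcc = mfcc[:, :n][:, unvoiced[:n]]

        # Per-recording statistics over unvoiced frames feed the PD/HC classifier
        features = np.concatenate([unvoiced_mfcc.mean(axis=1),
                                   unvoiced_mfcc.std(axis=1)])
        print(features.shape)  # (24,)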

  15. Automatic Speech Recognition Predicts Speech Intelligibility and Comprehension for Listeners with Simulated Age-Related Hearing Loss

    ERIC Educational Resources Information Center

    Fontan, Lionel; Ferrané, Isabelle; Farinas, Jérôme; Pinquier, Julien; Tardieu, Julien; Magnen, Cynthia; Gaillard, Pascal; Aumont, Xavier; Füllgrabe, Christian

    2017-01-01

    Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist…

  16. Speech Processing and Recognition (SPaRe)

    DTIC Science & Technology

    2011-01-01

    results in the areas of automatic speech recognition (ASR), speech processing, machine translation (MT), natural language processing (NLP), and...Processing (NLP), Information Retrieval (IR)...Figure 9, the IOC was only expected to provide document submission and search; automatic speech recognition (ASR) for English, Spanish, Arabic, and

  17. A multi-views multi-learners approach towards dysarthric speech recognition using multi-nets artificial neural networks.

    PubMed

    Shahamiri, Seyed Reza; Salim, Siti Salwah Binti

    2014-09-01

    Automatic speech recognition (ASR) can be very helpful for speakers who suffer from dysarthria, a neurological disability that damages the control of motor speech articulators. Although a few attempts have been made to apply ASR technologies to sufferers of dysarthria, previous studies show that such ASR systems have not attained an adequate level of performance. In this study, a dysarthric multi-networks speech recognizer (DM-NSR) model is provided using a realization of the multi-views multi-learners approach called multi-nets artificial neural networks, which tolerates the variability of dysarthric speech. In particular, the DM-NSR model employs several ANNs (as learners) to approximate the likelihood of ASR vocabulary words and to deal with the complexity of dysarthric speech. The proposed DM-NSR approach was presented in both speaker-dependent and speaker-independent paradigms. In order to highlight the performance of the proposed model over legacy models, multi-views single-learner variants of the DM-NSR were also provided and their efficiencies were compared in detail. Moreover, a comparison among the prominent dysarthric ASR methods and the proposed one is provided. The results show that the DM-NSR improved the recognition rate by up to 24.67% and reduced the error rate by up to 8.63% over the reference model.

  18. EMG-based speech recognition using hidden markov models with global control variables.

    PubMed

    Lee, Ki-Seung

    2008-03-01

    It is well known that a strong relationship exists between human voices and the movement of articulatory facial muscles. In this paper, we utilize this knowledge to implement an automatic speech recognition scheme which uses solely surface electromyogram (EMG) signals. The sequence of EMG signals for each word is modelled by a hidden Markov model (HMM) framework. The main objective of the work involves building a model for state observation density when multichannel observation sequences are given. The proposed model reflects the dependencies between each of the EMG signals, which are described by introducing a global control variable. We also develop an efficient model training method, based on a maximum likelihood criterion. In a preliminary study, 60 isolated words were used as recognition variables. EMG signals were acquired from three articulatory facial muscles. The findings indicate that such a system may have the capacity to recognize speech signals with an accuracy of up to 87.07%, which is superior to the independent probabilistic model.

  19. Evaluation of speech recognizers for use in advanced combat helicopter crew station research and development

    NASA Technical Reports Server (NTRS)

    Simpson, Carol A.

    1990-01-01

    The U.S. Army Crew Station Research and Development Facility uses vintage 1984 speech recognizers. An evaluation was performed of newer off-the-shelf speech recognition devices to determine whether the performance and capabilities of newer technology are substantially better than those of the Army's current speech recognizers. The Phonetic Discrimination (PD-100) Test was used to compare recognizer performance in two ambient noise conditions: quiet office and helicopter noise. Test tokens were spoken by males and females in both isolated-word and connected-word mode. Better overall recognition accuracy was obtained from the newer recognizers. Recognizer capabilities needed to support the development of human factors design requirements for speech command systems in advanced combat helicopters are listed.

  20. Developing and Evaluating an Oral Skills Training Website Supported by Automatic Speech Recognition Technology

    ERIC Educational Resources Information Center

    Chen, Howard Hao-Jan

    2011-01-01

    Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…

  1. Leveraging Automatic Speech Recognition Errors to Detect Challenging Speech Segments in TED Talks

    ERIC Educational Resources Information Center

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya

    2016-01-01

    This study investigates the use of Automatic Speech Recognition (ASR) systems to epitomize second language (L2) listeners' problems in perception of TED talks. ASR-generated transcripts of videos often involve recognition errors, which may indicate difficult segments for L2 listeners. This paper aims to discover the root-causes of the ASR errors…

  2. Attention Is Required for Knowledge-Based Sequential Grouping: Insights from the Integration of Syllables into Words.

    PubMed

    Ding, Nai; Pan, Xunyi; Luo, Cheng; Su, Naifei; Zhang, Wen; Zhang, Jianfeng

    2018-01-31

    How the brain groups sequential sensory events into chunks is a fundamental question in cognitive neuroscience. This study investigates whether top-down attention or specific tasks are required for the brain to apply lexical knowledge to group syllables into words. Neural responses tracking the syllabic and word rhythms of a rhythmic speech sequence were concurrently monitored using electroencephalography (EEG). The participants performed different tasks, attending to either the rhythmic speech sequence or a distractor, which was another speech stream or a nonlinguistic auditory/visual stimulus. Attention to speech, but not a lexical-meaning-related task, was required for reliable neural tracking of words, even when the distractor was a nonlinguistic stimulus presented cross-modally. Neural tracking of syllables, however, was reliably observed in all tested conditions. These results strongly suggest that neural encoding of individual auditory events (i.e., syllables) is automatic, while knowledge-based construction of temporal chunks (i.e., words) crucially relies on top-down attention. SIGNIFICANCE STATEMENT Why we cannot understand speech when not paying attention is an old question in psychology and cognitive neuroscience. Speech processing is a complex process that involves multiple stages, e.g., hearing and analyzing the speech sound, recognizing words, and combining words into phrases and sentences. The current study investigates which speech-processing stage is blocked when we do not listen carefully. We show that the brain can reliably encode syllables, basic units of speech sounds, even when we do not pay attention. Nevertheless, when distracted, the brain cannot group syllables into multisyllabic words, which are basic units for speech meaning. Therefore, the process of converting speech sound into meaning crucially relies on attention. Copyright © 2018 the authors.

  3. Action Unit Models of Facial Expression of Emotion in the Presence of Speech

    PubMed Central

    Shah, Miraj; Cooper, David G.; Cao, Houwei; Gur, Ruben C.; Nenkova, Ani; Verma, Ragini

    2014-01-01

    Automatic recognition of emotion using facial expressions in the presence of speech poses a unique challenge because talking reveals clues for the affective state of the speaker but distorts the canonical expression of emotion on the face. We introduce a corpus of acted emotion expression where speech is either present (talking) or absent (silent). The corpus is uniquely suited for analysis of the interplay between the two conditions. We use a multimodal decision level fusion classifier to combine models of emotion from talking and silent faces as well as from audio to recognize five basic emotions: anger, disgust, fear, happy and sad. Our results strongly indicate that emotion prediction in the presence of speech from action unit facial features is less accurate when the person is talking. Modeling talking and silent expressions separately and fusing the two models greatly improves accuracy of prediction in the talking setting. The advantages are most pronounced when silent and talking face models are fused with predictions from audio features. In this multi-modal prediction both the combination of modalities and the separate models of talking and silent facial expression of emotion contribute to the improvement. PMID:25525561

  4. Relationships among Rapid Digit Naming, Phonological Processing, Motor Automaticity, and Speech Perception in Poor, Average, and Good Readers and Spellers

    ERIC Educational Resources Information Center

    Savage, Robert S.; Frederickson, Norah; Goodwin, Roz; Patni, Ulla; Smith, Nicola; Tuersley, Louise

    2005-01-01

    In this article, we explore the relationship between rapid automatized naming (RAN) and other cognitive processes among below-average, average, and above-average readers and spellers. Nonsense word reading, phonological awareness, RAN, automaticity of balance, speech perception, and verbal short-term and working memory were measured. Factor…

  5. Assessing Children's Home Language Environments Using Automatic Speech Recognition Technology

    ERIC Educational Resources Information Center

    Greenwood, Charles R.; Thiemann-Bourque, Kathy; Walker, Dale; Buzhardt, Jay; Gilkerson, Jill

    2011-01-01

    The purpose of this research was to replicate and extend some of the findings of Hart and Risley using automatic speech processing instead of human transcription of language samples. The long-term goal of this work is to make the current approach to speech processing possible by researchers and clinicians working on a daily basis with families and…

  6. An Exploration of the Potential of Automatic Speech Recognition to Assist and Enable Receptive Communication in Higher Education

    ERIC Educational Resources Information Center

    Wald, Mike

    2006-01-01

    The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents students and staff to provide captioning of speech online or in classrooms for deaf or hard of hearing students and assist blind, visually impaired or dyslexic learners to read and search…

  7. Dynamic relation between working memory capacity and speech recognition in noise during the first 6 months of hearing aid use.

    PubMed

    Ng, Elaine H N; Classon, Elisabet; Larsby, Birgitta; Arlinger, Stig; Lunner, Thomas; Rudner, Mary; Rönnberg, Jerker

    2014-11-23

    The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced. © The Author(s) 2014.

  8. Multilevel Analysis in Analyzing Speech Data

    ERIC Educational Resources Information Center

    Guddattu, Vasudeva; Krishna, Y.

    2011-01-01

    The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…

  9. Digital signal processing algorithms for automatic voice recognition

    NASA Technical Reports Server (NTRS)

    Botros, Nazeih M.

    1987-01-01

    Current digital signal analysis algorithms implemented in automatic voice recognition are investigated. Automatic voice recognition means the capability of a computer to recognize and interact with verbal commands. The focus is on digital signal analysis rather than linguistic analysis of the speech signal. Several digital signal processing algorithms are available for voice recognition, including Linear Predictive Coding (LPC), short-time Fourier analysis, and cepstrum analysis. Among these algorithms, LPC is the most widely used: it has a short execution time and does not require large memory storage. However, it has several limitations due to the assumptions used to develop it. The other two algorithms are frequency-domain algorithms that make fewer assumptions, but they are not widely implemented or investigated. With the recent advances in digital technology, namely signal processors, these two frequency-domain algorithms may be investigated in order to implement them in voice recognition. This research is concerned with real-time, microprocessor-based recognition algorithms.
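
    As a worked illustration of the most widely used of these algorithms, the following computes LPC coefficients for a single frame via the autocorrelation method and the Levinson-Durbin recursion. The frame, order, and sampling rate are typical textbook choices, not taken from this report.

        import numpy as np

        def lpc(signal, order):
            # Autocorrelation method + Levinson-Durbin recursion
            n = len(signal)
            r = np.array([signal[:n - k] @ signal[k:] for k in range(order + 1)])
            a = np.zeros(order + 1)
            a[0], err = 1.0, r[0]
            for i in range(1, order + 1):
                k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err  # reflection coeff.
                a[1:i] = a[1:i] + k * a[i - 1:0:-1]
                a[i] = k
                err *= 1.0 - k * k                          # prediction error
            return a, err

        # Toy 'voiced frame': a damped sine sampled at 8 kHz
        sr, f0 = 8000, 200.0
        t = np.arange(256) / sr
        frame = np.sin(2 * np.pi * f0 * t) * np.exp(-30 * t)
        coefs, residual = lpc(frame, order=10)
        print(coefs.round(3), residual)

    The short execution time noted above follows from this recursion costing on the order of the square of the model order per frame, rather than a full matrix inversion.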

  10. How Does Reading Performance Modulate the Impact of Orthographic Knowledge on Speech Processing? A Comparison of Normal Readers and Dyslexic Adults

    ERIC Educational Resources Information Center

    Pattamadilok, Chotiga; Nelis, Aubéline; Kolinsky, Régine

    2014-01-01

    Studies on proficient readers showed that speech processing is affected by knowledge of the orthographic code. Yet, the automaticity of the orthographic influence depends on task demand. Here, we addressed this automaticity issue in normal and dyslexic adult readers by comparing the orthographic effects obtained in two speech processing tasks that…

  11. Female voice communications in high level aircraft cockpit noises--part II: vocoder and automatic speech recognition systems.

    PubMed

    Nixon, C; Anderson, T; Morris, L; McCavitt, A; McKinley, R; Yeager, D; McDaniel, M

    1998-11-01

    The intelligibility of female and male speech is equivalent under most ordinary living conditions. However, due to small differences between their acoustic speech signals, called speech spectra, one can be more or less intelligible than the other in certain situations such as high levels of noise. Anecdotal information, supported by some empirical observations, suggests that some of the high intensity noise spectra of military aircraft cockpits may degrade the intelligibility of female speech more than that of male speech. In an applied research study, the intelligibility of female and male speech was measured in several high level aircraft cockpit noise conditions experienced in military aviation. In Part I, (Nixon CW, et al. Aviat Space Environ Med 1998; 69:675-83) female speech intelligibility measured in the spectra and levels of aircraft cockpit noises and with noise-canceling microphones was lower than that of the male speech in all conditions. However, the differences were small and only those at some of the highest noise levels were significant. Although speech intelligibility of both genders was acceptable during normal cruise noises, improvements are required in most of the highest levels of noise created during maximum aircraft operating conditions. These results are discussed in a Part I technical report. This Part II report examines the intelligibility in the same aircraft cockpit noises of vocoded female and male speech and the accuracy with which female and male speech in some of the cockpit noises were understood by automatic speech recognition systems. The intelligibility of vocoded female speech was generally the same as that of vocoded male speech. No significant differences were measured between the recognition accuracy of male and female speech by the automatic speech recognition systems. The intelligibility of female and male speech was equivalent for these conditions.

  12. Objective voice and speech analysis of persons with chronic hoarseness by prosodic analysis of speech samples.

    PubMed

    Haderlein, Tino; Döllinger, Michael; Matoušek, Václav; Nöth, Elmar

    2016-10-01

    Automatic voice assessment is often performed using sustained vowels. In contrast, speech analysis of read-out texts can be applied to voice and speech assessment. Automatic speech recognition and prosodic analysis were used to find regression formulae between automatic and perceptual assessment of four voice and four speech criteria. The regression was trained with 21 men and 62 women (average age 49.2 years) and tested with another set of 24 men and 49 women (48.3 years), all suffering from chronic hoarseness. They read the text 'Der Nordwind und die Sonne' ('The North Wind and the Sun'). Five voice and speech therapists evaluated the data on 5-point Likert scales. Ten prosodic and recognition accuracy measures (features) were identified which describe all the examined criteria. Inter-rater correlation within the expert group was between r = 0.63 for the criterion 'match of breath and sense units' and r = 0.87 for the overall voice quality. Human-machine correlation was between r = 0.40 for the match of breath and sense units and r = 0.82 for intelligibility. The perceptual ratings of different criteria were highly correlated with each other. Likewise, the feature sets modeling the criteria were very similar. The automatic method is suitable for assessing chronic hoarseness in general and for subgroups of functional and organic dysphonia. In its current version, it is almost as reliable as a randomly picked rater from a group of voice and speech therapists.

  13. Do 6-Month-Olds Understand That Speech Can Communicate?

    ERIC Educational Resources Information Center

    Vouloumanos, Athena; Martin, Alia; Onishi, Kristine H.

    2014-01-01

    Adults and 12-month-old infants recognize that even unfamiliar speech can communicate information between third parties, suggesting that they can separate the communicative function of speech from its lexical content. But do infants recognize that speech can communicate due to their experience understanding and producing language, or do they…

  14. Recognizing intentions in infant-directed speech: evidence for universals.

    PubMed

    Bryant, Gregory A; Barrett, H Clark

    2007-08-01

    In all languages studied to date, distinct prosodic contours characterize different intention categories of infant-directed (ID) speech. This vocal behavior likely exists universally as a species-typical trait, but little research has examined whether listeners can accurately recognize intentions in ID speech using only vocal cues, without access to semantic information. We recorded native-English-speaking mothers producing four intention categories of utterances (prohibition, approval, comfort, and attention) as both ID and adult-directed (AD) speech, and we then presented the utterances to Shuar adults (South American hunter-horticulturalists). Shuar subjects were able to reliably distinguish ID from AD speech and were able to reliably recognize the intention categories in both types of speech, although performance was significantly better with ID speech. This is the first demonstration that adult listeners in an indigenous, nonindustrialized, and nonliterate culture can accurately infer intentions from both ID speech and AD speech in a language they do not speak.

  15. A Speech Recognition-based Solution for the Automatic Detection of Mild Cognitive Impairment from Spontaneous Speech

    PubMed Central

    Tóth, László; Hoffmann, Ildikó; Gosztolya, Gábor; Vincze, Veronika; Szatlóczki, Gréta; Bánréti, Zoltán; Pákáski, Magdolna; Kálmán, János

    2018-01-01

    Background: Even today the reliable diagnosis of the prodromal stages of Alzheimer's disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method based on the analysis of spontaneous speech production while performing a memory task. In the future, this can form the basis of an Internet-based interactive screening software for the recognition of MCI. Methods: Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. We provoked spontaneous speech by asking the patients to recall the content of 2 short black and white films (one direct, one delayed), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. Results: The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process, that is, using the ASR-based features in combination with machine learning, was able to separate the two classes with an F1-score of 78.8%. Conclusion: The temporal analysis of spontaneous speech can be exploited in implementing a new, automatic detection-based tool for screening MCI for the community. PMID:29165085

  16. A Speech Recognition-based Solution for the Automatic Detection of Mild Cognitive Impairment from Spontaneous Speech.

    PubMed

    Toth, Laszlo; Hoffmann, Ildiko; Gosztolya, Gabor; Vincze, Veronika; Szatloczki, Greta; Banreti, Zoltan; Pakaski, Magdolna; Kalman, Janos

    2018-01-01

    Even today the reliable diagnosis of the prodromal stages of Alzheimer's disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method based on the analysis of spontaneous speech production while performing a memory task. In the future, this can form the basis of an Internet-based interactive screening software for the recognition of MCI. Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. We provoked spontaneous speech by asking the patients to recall the content of 2 short black and white films (one direct, one delayed), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process, that is, using the ASR-based features in combination with machine learning, was able to separate the two classes with an F1-score of 78.8%. The temporal analysis of spontaneous speech can be exploited in implementing a new, automatic detection-based tool for screening MCI for the community. Copyright © Bentham Science Publishers.
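
    Pause-based parameters such as those above can be approximated with a simple energy-based silence detector. A rough sketch; the frame sizes, thresholds, 200 ms pause cutoff, and the definition of the hesitation ratio are illustrative assumptions, not the study's settings.

        import numpy as np

        def pause_features(y, sr, frame=0.025, hop=0.010, thresh_db=-35.0):
            # Frame the signal and mark low-energy frames as silent
            fl, hl = int(frame * sr), int(hop * sr)
            frames = np.lib.stride_tricks.sliding_window_view(y, fl)[::hl]
            rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-10
            silent = 20 * np.log10(rms / rms.max()) < thresh_db

            # Group consecutive silent frames into pauses of at least 200 ms
            pauses, run = [], 0
            for s in silent:
                if s:
                    run += 1
                elif run:
                    pauses.append(run * hl / sr)
                    run = 0
            if run:
                pauses.append(run * hl / sr)
            pauses = [p for p in pauses if p >= 0.2]

            dur = len(y) / sr
            return {"n_pauses": len(pauses),
                    "total_pause_s": round(sum(pauses), 2),
                    "hesitation_ratio": round(sum(pauses) / dur, 3)}

        # Toy recording: 1 s 'speech', a 400 ms pause, then 1 s 'speech'
        sr = 16000
        rng = np.random.default_rng(0)
        burst = 0.3 * rng.normal(size=sr)
        y = np.concatenate([burst, np.zeros(int(0.4 * sr)), burst])
        print(pause_features(y, sr))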

  17. Versatile simulation testbed for rotorcraft speech I/O system design

    NASA Technical Reports Server (NTRS)

    Simpson, Carol A.

    1986-01-01

    A versatile simulation testbed for the design of a rotorcraft speech I/O system is described in detail. The testbed will be used to evaluate alternative implementations of synthesized speech displays and speech recognition controls for the next generation of Army helicopters including the LHX. The message delivery logic is discussed as well as the message structure, the speech recognizer command structure and features, feedback from the recognizer, and random access to controls via speech command.

  18. On the context dependency of implicit self-esteem in social anxiety disorder.

    PubMed

    Hiller, Thomas S; Steffens, Melanie C; Ritter, Viktoria; Stangier, Ulrich

    2017-12-01

    Cognitive models assume that negative self-evaluations are automatically activated in individuals with Social Anxiety Disorder (SAD) during social situations, increasing their individual level of anxiety. This study examined automatic self-evaluations (i.e., implicit self-esteem) and state anxiety in a group of individuals with SAD (n = 45) and a non-clinical comparison group (NC; n = 46). Participants were randomly assigned to either a speech condition with social threat induction (giving an impromptu speech) or to a no-speech condition without social threat induction. We measured implicit self-esteem with an Implicit Association Test (IAT). Implicit self-esteem differed significantly between SAD and NC groups under the speech condition but not under the no-speech condition. The SAD group showed lower implicit self-esteem than the NC group under the speech condition. State anxiety was significantly higher under the speech condition than under the no-speech condition in the SAD group but not in the NC group. Mediation analyses supported the idea that for the SAD group, the effect of experimental condition on state anxiety was mediated by implicit self-esteem. The causal relation between implicit self-esteem and state anxiety could not be determined. The findings corroborate hypotheses derived from cognitive models of SAD: Automatic self-evaluations were negatively biased in individuals with SAD facing social threat and showed an inverse relationship to levels of state anxiety. However, automatic self-evaluations in individuals with SAD can be unbiased (similar to NC) in situations without social threat. Copyright © 2017 Elsevier Ltd. All rights reserved.

  19. An investigation of articulatory setting using real-time magnetic resonance imaging

    PubMed Central

    Ramanarayanan, Vikram; Goldstein, Louis; Byrd, Dani; Narayanan, Shrikanth S.

    2013-01-01

    This paper presents an automatic procedure to analyze articulatory setting in speech production using real-time magnetic resonance imaging of the moving human vocal tract. The procedure extracts frames corresponding to inter-speech pauses, speech-ready intervals and absolute rest intervals from magnetic resonance imaging sequences of read and spontaneous speech elicited from five healthy speakers of American English and uses automatically extracted image features to quantify vocal tract posture during these intervals. Statistical analyses show significant differences between vocal tract postures adopted during inter-speech pauses and those at absolute rest before speech; the latter also exhibits a greater variability in the adopted postures. In addition, the articulatory settings adopted during inter-speech pauses in read and spontaneous speech are distinct. The results suggest that adopted vocal tract postures differ on average during rest positions, ready positions and inter-speech pauses, and might, in that order, involve an increasing degree of active control by the cognitive speech planning mechanism. PMID:23862826

  20. Assessment of voice, speech, and related quality of life in advanced head and neck cancer patients 10-years+ after chemoradiotherapy.

    PubMed

    Kraaijenga, S A C; Oskam, I M; van Son, R J J H; Hamming-Vrieze, O; Hilgers, F J M; van den Brekel, M W M; van der Molen, L

    2016-04-01

    Assessment of long-term objective and subjective voice, speech, articulation, and quality of life in patients with head and neck cancer (HNC) treated with concurrent chemoradiotherapy (CRT) for advanced, stage IV disease. Twenty-two disease-free survivors, treated with cisplatin-based CRT for inoperable HNC (1999-2004), were evaluated at 10 years post-treatment. A standard Dutch text was recorded. Perceptual analysis of voice, speech, and articulation was conducted by two expert listeners (SLPs). An experimental expert system based on automatic speech recognition was also used. Patients' perception of voice and speech and related quality of life was assessed with the Voice Handicap Index (VHI) and Speech Handicap Index (SHI) questionnaires. At a median follow-up of 11 years, perceptual evaluation showed abnormal scores in up to 64% of cases, depending on the outcome parameter analyzed. Automatic assessment of voice and speech parameters correlated moderately to strongly with perceptual outcome scores. Patient-reported problems with voice (VHI>15) and speech (SHI>6) in daily life were present in 68% and 77% of patients, respectively. Patients treated with IMRT showed significantly less impairment compared to those treated with conventional radiotherapy. More than 10 years after organ-preservation treatment, voice and speech problems are common in this patient cohort, as assessed with perceptual evaluation, automatic speech recognition, and validated structured questionnaires. There were fewer complaints in patients treated with IMRT than with conventional radiotherapy. Copyright © 2016 Elsevier Ltd. All rights reserved.

  1. Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

    NASA Astrophysics Data System (ADS)

    Maskeliunas, Rytis; Rudzionis, Vytautas

    2011-06-01

    In recent years various commercial speech recognizers have become available. These recognizers make it possible to develop applications incorporating various speech recognition techniques easily and quickly. Such commercial recognizers are typically targeted at widely spoken languages with large market potential; however, it may be possible to adapt them for use in environments where less widely spoken languages are used. Since most commercial recognition engines are closed systems, the single avenue for adaptation is to establish ways of selecting proper phonetic transcriptions mapping between the two languages. This paper deals with methods for finding phonetic transcriptions that allow Lithuanian voice commands to be recognized using English speech engines. The experimental evaluation showed that it is possible to find phonetic transcriptions that enable the recognition of Lithuanian voice commands with an accuracy of over 90%.
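
    One way to realize the transcription selection is to map each source phoneme to several candidate target phones and register the cross-product as alternative pronunciations of a single command. The phoneme inventory and mapping below are invented toys for illustration, not the paper's validated transcriptions.

        from itertools import product

        # Hypothetical Lithuanian-to-English phone alternatives
        PHONE_MAP = {
            "t": ["T"],
            "e": ["EH", "EY"],
            "sh": ["SH"],
            "i": ["IY", "IH"],
            "a": ["AA", "AH"],
        }

        def candidate_transcriptions(src_phonemes, limit=16):
            # Enumerate alternative pronunciations for one voice command
            options = [PHONE_MAP[p] for p in src_phonemes]
            return [" ".join(c) for c in product(*options)][:limit]

        # A toy command split into source-language phonemes
        for alt in candidate_transcriptions(["t", "e", "sh", "i", "a"]):
            print(alt)

    The best-performing subset of such candidates can then be kept by measuring recognition accuracy on held-out recordings of each command.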

  2. Speech fluency profile on different tasks for individuals with Parkinson's disease.

    PubMed

    Juste, Fabiola Staróbole; Andrade, Claudia Regina Furquim de

    2017-07-20

    To characterize the speech fluency profile of patients with Parkinson's disease. Study participants were 40 individuals of both genders aged 40 to 80 years, divided into 2 groups: Research Group - RG (20 individuals with a diagnosis of Parkinson's disease) and Control Group - CG (20 individuals with no communication or neurological disorders). For all of the participants, three speech samples involving different tasks were collected: monologue, individual reading, and automatic speech. The RG presented a significantly larger number of speech disruptions, both stuttering-like and typical dysfluencies, and a higher percentage of speech discontinuity in the monologue and individual reading tasks compared with the CG. Both groups presented a reduced number of speech disruptions (stuttering-like and typical dysfluencies) in the automatic speech task, and the groups presented similar performance in this task. Regarding speech rate, individuals in the RG produced fewer words and syllables per minute compared with those in the CG in all speech tasks. Participants in the RG presented altered parameters of speech fluency compared with those in the CG; however, this change in fluency cannot be considered a stuttering disorder.

  3. Automatic speech recognition in air traffic control

    NASA Technical Reports Server (NTRS)

    Karlsson, Joakim

    1990-01-01

    Automatic Speech Recognition (ASR) technology and its application to the Air Traffic Control system are described. The advantages of applying ASR to Air Traffic Control, as well as criteria for choosing a suitable ASR system are presented. Results from previous research and directions for future work at the Flight Transportation Laboratory are outlined.

  4. Automatic Speech Recognition: Reliability and Pedagogical Implications for Teaching Pronunciation

    ERIC Educational Resources Information Center

    Kim, In-Seok

    2006-01-01

    This study examines the reliability of automatic speech recognition (ASR) software used to teach English pronunciation, focusing on one particular piece of software, "FluSpeak," as a typical example. Thirty-six Korean English as a Foreign Language (EFL) college students participated in an experiment in which they listened to 15 sentences…

  5. Automatic Speech Recognition Technology as an Effective Means for Teaching Pronunciation

    ERIC Educational Resources Information Center

    Elimat, Amal Khalil; AbuSeileek, Ali Farhan

    2014-01-01

    This study aimed to explore the effect of using automatic speech recognition technology (ASR) on the third grade EFL students' performance in pronunciation, whether teaching pronunciation through ASR is better than regular instruction, and the most effective teaching technique (individual work, pair work, or group work) in teaching pronunciation…

  6. Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.

    ERIC Educational Resources Information Center

    Makhoul, John I.

    The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. A fixed vocabulary of 55 syllables is used in the recognition system which contains 11 stops and fricatives and five tense vowels. The results of an experiment on speaker…

  7. Investigating Prompt Difficulty in an Automatically Scored Speaking Performance Assessment

    ERIC Educational Resources Information Center

    Cox, Troy L.

    2013-01-01

    Speaking assessments for second language learners have traditionally been expensive to administer because of the cost of rating the speech samples. To reduce the cost, many researchers are investigating the potential of using automatic speech recognition (ASR) as a means to score examinee responses to open-ended prompts. This study examined the…

  8. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    ERIC Educational Resources Information Center

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  9. The performance of an automatic acoustic-based program classifier compared to hearing aid users' manual selection of listening programs.

    PubMed

    Searchfield, Grant D; Linford, Tania; Kobayashi, Kei; Crowhen, David; Latzel, Matthias

    2018-03-01

    To compare preference for and performance of manually selected programmes to an automatic sound classifier, the Phonak AutoSense OS. A single-blind repeated measures study. Participants were fitted with Phonak Virto V90 ITE aids; preferences for different listening programmes were compared across four different sound scenarios (speech in: quiet, noise, loud noise and a car). Following a 4-week trial, preferences were reassessed and the user's preferred programme was compared to the automatic classifier for sound quality and hearing in noise (HINT test) using a 12-loudspeaker array. Twenty-five participants with symmetrical moderate-severe sensorineural hearing loss. Participants' preferences for manual programmes varied considerably between and within sessions. A HINT Speech Reception Threshold (SRT) advantage was observed for the automatic classifier over participants' manual selections for speech in quiet, loud noise and car noise. Sound quality ratings were similar for both manual and automatic selections. The use of a sound classifier is a viable alternative to manual programme selection.

  10. Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques

    NASA Astrophysics Data System (ADS)

    Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.

    2009-12-01

    This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.
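
    The GMM detection idea can be sketched with scikit-learn (our library choice): fit one mixture per class on per-frame acoustic features and compare average log-likelihoods for a test speaker. The features, dimensions, and component count below are synthetic and illustrative.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        healthy = rng.normal(0.0, 1.0, size=(200, 13))  # stand-in spectral features
        apnoea = rng.normal(0.8, 1.2, size=(200, 13))

        gmm_h = GaussianMixture(n_components=4, random_state=0).fit(healthy)
        gmm_a = GaussianMixture(n_components=4, random_state=0).fit(apnoea)

        def detect(frames):
            # .score() is the mean per-frame log-likelihood under each model
            llr = gmm_a.score(frames) - gmm_h.score(frames)
            return ("apnoea" if llr > 0 else "healthy"), round(llr, 3)

        print(detect(rng.normal(0.8, 1.2, size=(50, 13))))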

  11. A Mis-recognized Medical Vocabulary Correction System for Speech-based Electronic Medical Record

    PubMed Central

    Seo, Hwa Jeong; Kim, Ju Han; Sakabe, Nagamasa

    2002-01-01

    Speech recognition as an input tool for electronic medical records (EMR) enables efficient data entry at the point of care. However, the recognition accuracy for medical vocabulary is much poorer than that for doctor-patient dialogue. We developed a mis-recognized medical vocabulary correction system based on syllable-by-syllable comparison of speech text against a medical vocabulary database. Using specialty medical vocabulary, the algorithm detects and corrects mis-recognized medical terms in narrative text. Our preliminary evaluation showed 94% accuracy in correcting mis-recognized medical vocabulary.
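
    A minimal sketch of the syllable-by-syllable matching idea follows, assuming a toy syllabifier and a hypothetical vocabulary list; the authors' actual syllabification rules and database are not described in this record.

    ```python
    # Toy detection/correction of mis-recognized medical terms by comparing
    # syllable sequences; the syllabifier and vocabulary are placeholders.
    from difflib import SequenceMatcher

    MEDICAL_VOCAB = ["hypertension", "hyperlipidemia", "tachycardia"]  # hypothetical

    def syllables(word):
        # Placeholder syllabifier; a real system would use language-specific rules.
        return [word[i:i + 2] for i in range(0, len(word), 2)]

    def best_match(token, vocab, threshold=0.75):
        """Return the vocabulary term whose syllable sequence best matches token."""
        best, best_ratio = None, threshold
        for term in vocab:
            ratio = SequenceMatcher(None, syllables(token), syllables(term)).ratio()
            if ratio > best_ratio:
                best, best_ratio = term, ratio
        return best

    def correct(text):
        out = []
        for token in text.split():
            match = best_match(token.lower(), MEDICAL_VOCAB)
            out.append(match if match else token)
        return " ".join(out)

    print(correct("patient has hipertension"))  # -> "patient has hypertension"
    ```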

  12. Inter-speaker speech variability assessment using statistical deformable models from 3.0 tesla magnetic resonance images.

    PubMed

    Vasconcelos, Maria J M; Ventura, Sandra M R; Freitas, Diamantino R S; Tavares, João Manuel R S

    2012-03-01

    The morphological and dynamic characterisation of the vocal tract during speech production has been gaining greater attention owing to the latest improvements in magnetic resonance (MR) imaging, namely the use of higher magnetic fields such as 3.0 Tesla. In this work, the automatic study of the vocal tract from 3.0 Tesla MR images was assessed through the application of statistical deformable models. The primary goal was the analysis of the shape of the vocal tract during the articulation of European Portuguese sounds, followed by the evaluation of the automatic segmentation results, i.e. identification of the vocal tract in new MR images. As far as speech production is concerned, this is the first attempt to automatically characterise and reconstruct the vocal tract shape from 3.0 Tesla MR images by using deformable models, in particular active shape and appearance models. The results clearly demonstrate the adequacy and advantage of these deformable models for automatically analysing 3.0 Tesla MR images in order to extract the vocal tract shape and assess the articulatory movements involved. Such capabilities are needed, for example, for a better understanding of speech production, particularly in patients suffering from articulatory disorders, and for building enhanced speech synthesizer models.

  13. I Hear You Eat and Speak: Automatic Recognition of Eating Condition and Food Type, Use-Cases, and Impact on ASR Performance

    PubMed Central

    Hantke, Simone; Weninger, Felix; Kurle, Richard; Ringeval, Fabien; Batliner, Anton; Mousa, Amr El-Desoky; Schuller, Björn

    2016-01-01

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We first demonstrate that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification based both on brute-forced low-level acoustic features and on higher-level features related to intelligibility, obtained from an automatic speech recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, reaching up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food as well as not eating. Early fusion of the intelligibility-related features with the brute-forced acoustic feature set improves performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to a 56.2% coefficient of determination. PMID:27176486
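
    The leave-one-speaker-out SVM evaluation named above can be sketched as follows; the feature matrix, label coding, and kernel choice are placeholders rather than the authors' actual setup.

    ```python
    # Sketch of leave-one-speaker-out eating-condition classification with an
    # SVM, reporting unweighted average recall; all data arrays are synthetic.
    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut
    from sklearn.svm import SVC
    from sklearn.metrics import recall_score

    X = np.random.rand(300, 40)               # acoustic feature vectors (placeholder)
    y = np.random.randint(0, 7, 300)          # 6 foods + "not eating" (placeholder)
    speakers = np.random.randint(0, 30, 300)  # speaker id per utterance

    recalls = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
        clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        # Unweighted average recall, the measure reported in the abstract.
        recalls.append(recall_score(y[test_idx], pred, labels=np.arange(7),
                                    average="macro", zero_division=0))
    print(f"UAR: {np.mean(recalls):.3f}")
    ```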

  14. Speech Intelligibility Predicted from Neural Entrainment of the Speech Envelope.

    PubMed

    Vanthornhout, Jonas; Decruy, Lien; Wouters, Jan; Simon, Jonathan Z; Francart, Tom

    2018-04-01

    Speech intelligibility is currently measured by scoring how well a person can identify a speech signal. The results of such behavioral measures reflect neural processing of the speech signal, but are also influenced by language processing, motivation, and memory. Very often, electrophysiological measures of hearing give insight into the neural processing of sound. However, in most methods, non-speech stimuli are used, making it hard to relate the results to behavioral measures of speech intelligibility. The use of natural running speech as a stimulus in electrophysiological measures of hearing is a paradigm shift which allows us to bridge the gap between behavioral and electrophysiological measures. Here, by decoding the speech envelope from the electroencephalogram and correlating it with the stimulus envelope, we demonstrate an electrophysiological measure of the neural processing of running speech. We show that behaviorally measured speech intelligibility is strongly correlated with our electrophysiological measure. Our results pave the way towards an objective and automatic way of assessing the neural processing of speech presented through auditory prostheses, reducing confounds such as attention and cognitive capabilities. We anticipate that our electrophysiological measure will allow better differential diagnosis of the auditory system and will enable the development of closed-loop auditory prostheses that automatically adapt to individual users.
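
    The decoding approach described here (reconstructing the stimulus envelope from the EEG and correlating it with the actual envelope) can be sketched as a regularized linear backward model; the lag range, regularization constant, and array shapes below are illustrative assumptions.

    ```python
    # Sketch of a backward model: reconstruct the speech envelope from
    # time-lagged EEG with ridge regression, then correlate with the truth.
    import numpy as np

    def lagged(eeg, max_lag):
        """Stack time-lagged copies of each EEG channel: (samples x channels*lags)."""
        cols = [np.roll(eeg, lag, axis=0) for lag in range(max_lag)]
        return np.hstack(cols)[max_lag:]          # drop wrapped-around samples

    def train_decoder(eeg, envelope, max_lag=16, lam=1e3):
        X = lagged(eeg, max_lag)
        y = envelope[max_lag:]
        # Ridge solution: w = (X^T X + lam I)^-1 X^T y
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    def decoding_score(eeg, envelope, w, max_lag=16):
        rec = lagged(eeg, max_lag) @ w
        return np.corrcoef(rec, envelope[max_lag:])[0, 1]  # Pearson correlation

    # Hypothetical usage: eeg is (samples x channels), envelope is (samples,)
    # w = train_decoder(train_eeg, train_env); r = decoding_score(test_eeg, test_env, w)
    ```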

  15. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    NASA Astrophysics Data System (ADS)

    Heracleous, Panikos; Kaino, Tomomi; Saruwatari, Hiroshi; Shikano, Kiyohiro

    2006-12-01

    We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise, and they might be used in special systems (speech recognition, speech transformation, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved, for a 20 k dictation task, a [InlineEquation not available: see fulltext.] word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone, with very promising results.

  16. The Effect of Automatic Speech Recognition Eyespeak Software on Iraqi Students' English Pronunciation: A Pilot Study

    ERIC Educational Resources Information Center

    Sidgi, Lina Fathi Sidig; Shaari, Ahmad Jelani

    2017-01-01

    Technology such as computer-assisted language learning (CALL) is used in teaching and learning in foreign language classrooms, where it is most needed. One promising emerging technology that supports language learning is automatic speech recognition (ASR). Integrating such technology, especially in the instruction of pronunciation…

  17. Using Automatic Speech Recognition Technology with Elicited Oral Response Testing

    ERIC Educational Resources Information Center

    Cox, Troy L.; Davies, Randall S.

    2012-01-01

    This study examined the use of automatic speech recognition (ASR) scored elicited oral response (EOR) tests to assess the speaking ability of English language learners. It also examined the relationship between ASR-scored EOR and other language proficiency measures and the ability of the ASR to rate speakers without bias to gender or native…

  18. Automatic evaluation of hypernasality based on a cleft palate speech database.

    PubMed

    He, Ling; Zhang, Jing; Liu, Qi; Yin, Heng; Lech, Margaret; Huang, Yunzhi

    2015-05-01

    Hypernasality is one of the most typical characteristics of cleft palate (CP) speech. The outcome of hypernasality grading determines the necessity of follow-up surgery. Currently, the evaluation of CP speech is carried out by experienced speech therapists; however, the result strongly depends on their clinical experience and subjective judgment. This work proposes an automatic evaluation system for hypernasality grading in CP speech. The database tested in this work was collected by the Hospital of Stomatology, Sichuan University, which has the largest number of CP patients in China. Based on the production process of hypernasality, source sound pulse and vocal tract filter features are presented. These features include pitch, the first and second energy-amplified frequency bands, cepstrum-based features, MFCCs, and short-time sub-band energy features. These features, combined with a KNN classifier, are applied to automatically classify four grades of hypernasality: normal, mild, moderate, and severe. The experimental results show that the proposed system achieves good performance: the classification rate across the four hypernasality grades reaches up to 80.4%. The sensitivity of the proposed features to gender is also discussed.
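
    A minimal sketch of the four-grade KNN classification described above, with a placeholder feature matrix standing in for the pitch, spectral, cepstral, and sub-band energy features; the neighbour count and cross-validation protocol are assumptions.

    ```python
    # Sketch of four-grade hypernasality classification with k-NN;
    # the per-utterance feature vectors and labels are synthetic placeholders.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    GRADES = ["normal", "mild", "moderate", "severe"]
    X = np.random.rand(200, 30)        # per-utterance feature vectors (placeholder)
    y = np.random.randint(0, 4, 200)   # grade labels (placeholder)

    knn = KNeighborsClassifier(n_neighbors=5)
    scores = cross_val_score(knn, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"mean accuracy: {scores.mean():.3f}")
    ```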

  19. Feeling backwards? How temporal order in speech affects the time course of vocal emotion recognition

    PubMed Central

    Rigoulot, Simon; Wassiliwizky, Eugen; Pell, Marc D.

    2013-01-01

    Recent studies suggest that the time course for recognizing vocal expressions of basic emotion in speech varies significantly by emotion type, implying that listeners uncover acoustic evidence about emotions at different rates in speech (e.g., fear is recognized most quickly whereas happiness and disgust are recognized relatively slowly; Pell and Kotz, 2011). To investigate whether vocal emotion recognition is largely dictated by the amount of time listeners are exposed to speech or the position of critical emotional cues in the utterance, 40 English participants judged the meaning of emotionally-inflected pseudo-utterances presented in a gating paradigm, where utterances were gated as a function of their syllable structure in segments of increasing duration from the end of the utterance (i.e., gated syllable-by-syllable from the offset rather than the onset of the stimulus). Accuracy for detecting six target emotions in each gate condition and the mean identification point for each emotion in milliseconds were analyzed and compared to results from Pell and Kotz (2011). We again found significant emotion-specific differences in the time needed to accurately recognize emotions from speech prosody, and new evidence that utterance-final syllables tended to facilitate listeners' accuracy in many conditions when compared to utterance-initial syllables. The time needed to recognize fear, anger, sadness, and neutral from speech cues was not influenced by how utterances were gated, although happiness and disgust were recognized significantly faster when listeners heard the end of utterances first. Our data provide new clues about the relative time course for recognizing vocally-expressed emotions within the 400–1200 ms time window, while highlighting that emotion recognition from prosody can be shaped by the temporal properties of speech. PMID:23805115

  20. Primary Progressive Speech Abulia.

    PubMed

    Milano, Nicholas J; Heilman, Kenneth M

    2015-01-01

    Primary progressive aphasia (PPA) is a neurodegenerative disorder characterized by progressive language impairment. The three variants of PPA are the nonfluent/agrammatic, semantic, and logopenic types. The goal of this report is to describe two patients with a loss of speech initiation that was associated with bilateral medial frontal atrophy. Two patients with progressive speech deficits were evaluated; their examinations revealed a paucity of spontaneous speech, but their naming, repetition, reading, and writing were all normal. The patients had no evidence of agrammatism or apraxia of speech but did have impaired speech fluency. In addition to impaired production of propositional spontaneous speech, these patients had impaired production of automatic speech (e.g., reciting the Lord's Prayer) and singing. Structural brain imaging revealed bilateral medial frontal atrophy in both patients. These patients' language deficits are consistent with a PPA, but in the pattern of a dynamic aphasia. Whereas the signs and symptoms of dynamic aphasia have been previously described, to our knowledge these are the first cases associated with predominantly bilateral medial frontal atrophy that impaired both propositional and automatic speech. Thus, this profile may represent a new variant of PPA.

  1. Probabilistic Phonotactics as a Cue for Recognizing Spoken Cantonese Words in Speech

    ERIC Educational Resources Information Center

    Yip, Michael C. W.

    2017-01-01

    Previous experimental psycholinguistic studies suggested that probabilistic phonotactic information may hint at the locations of word boundaries in continuous speech, offering an interesting solution to the empirical question of how we recognize and segment individual spoken words in speech. We investigated this issue by using…

  2. Use of Computer Speech Technologies To Enhance Learning.

    ERIC Educational Resources Information Center

    Ferrell, Joe

    1999-01-01

    Discusses the design of an innovative learning system that uses new technologies for the man-machine interface, incorporating a combination of Automatic Speech Recognition (ASR) and Text To Speech (TTS) synthesis. Highlights include using speech technologies to mimic the attributes of the ideal tutor and design features. (AEF)

  3. Visual speech influences speech perception immediately but not automatically.

    PubMed

    Mitterer, Holger; Reinisch, Eva

    2017-02-01

    Two experiments examined the time course of the use of auditory and visual speech cues to spoken word recognition using an eye-tracking paradigm. Results of the first experiment showed that the use of visual speech cues from lipreading is reduced if concurrently presented pictures require a division of attentional resources. This reduction was evident even when listeners' eye gaze was on the speaker rather than the (static) pictures. Experiment 2 used a deictic hand gesture to foster attention to the speaker. At the same time, the visual processing load was reduced by keeping the visual display constant over a fixed number of successive trials. Under these conditions, the visual speech cues from lipreading were used. Moreover, the eye-tracking data indicated that visual information was used immediately and even earlier than auditory information. In combination, these data indicate that visual speech cues are not used automatically, but if they are used, they are used immediately.

  4. Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers

    NASA Astrophysics Data System (ADS)

    Caballero Morales, Santiago Omar; Cox, Stephen J.

    2009-12-01

    Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.
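
    The "metamodel" idea of incorporating a phoneme confusion matrix can be illustrated with a toy noisy-channel decoder; the phone set, confusion probabilities, lexicon, and one-to-one alignment below are simplifying assumptions, not the authors' implementation (which uses metamodels and WFST cascades with a language model).

    ```python
    # Toy noisy-channel word decoding with a phoneme confusion matrix:
    # pick the lexicon word most likely to have produced the recognized phones.
    import numpy as np

    PHONES = ["p", "b", "t", "d"]
    IDX = {p: i for i, p in enumerate(PHONES)}
    # CONFUSION[i][j] = P(recognizer outputs PHONES[j] | speaker intended PHONES[i])
    CONFUSION = np.array([[0.6, 0.3, 0.1, 0.0],
                          [0.2, 0.7, 0.0, 0.1],
                          [0.1, 0.0, 0.6, 0.3],
                          [0.0, 0.1, 0.2, 0.7]])

    LEXICON = {"pat": ["p", "t"], "bad": ["b", "d"]}  # toy pronunciations

    def word_likelihood(recognized, intended):
        if len(recognized) != len(intended):      # assume one-to-one alignment
            return 0.0
        probs = [CONFUSION[IDX[i], IDX[r]] for i, r in zip(intended, recognized)]
        return float(np.prod(probs))

    def decode(recognized):
        return max(LEXICON, key=lambda w: word_likelihood(recognized, LEXICON[w]))

    print(decode(["b", "t"]))  # "pat": 0.3*0.6=0.18 beats "bad": 0.7*0.2=0.14
    ```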

  5. Automatic speech recognition research at NASA-Ames Research Center

    NASA Technical Reports Server (NTRS)

    Coler, Clayton R.; Plummer, Robert P.; Huff, Edward M.; Hitchcock, Myron H.

    1977-01-01

    A trainable acoustic pattern recognizer manufactured by Scope Electronics is presented. The voice command system (VCS) encodes speech by sampling 16 bandpass filters with center frequencies in the range from 200 to 5000 Hz. Variations in speaking rate are compensated for by a compression algorithm that subdivides each utterance into eight subintervals in such a way that the amount of spectral change within each subinterval is the same. The recorded filter values within each subinterval are then reduced to a 15-bit representation, giving a 120-bit encoding for each utterance. The VCS incorporates a simple recognition algorithm that utilizes five training samples of each word in a vocabulary of up to 24 words. Recognition rates of approximately 85 percent correct for untrained speakers and 94 percent correct for trained speakers were not considered adequate for flight-system use. Therefore, the built-in recognition algorithm was disabled, and the VCS was modified to transmit the 120-bit encodings to an external computer for recognition.
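
    The equal-spectral-change time normalization described above can be sketched directly; the filter-bank input and frame counts are placeholders.

    ```python
    # Sketch of the time-normalization idea above: split an utterance into
    # eight subintervals containing equal amounts of cumulative spectral change.
    import numpy as np

    def equal_change_boundaries(filterbank, n_segments=8):
        """filterbank: (frames x bands) energies; returns boundary frame indices."""
        change = np.abs(np.diff(filterbank, axis=0)).sum(axis=1)  # per-frame change
        cum = np.concatenate([[0.0], np.cumsum(change)])
        targets = np.linspace(0.0, cum[-1], n_segments + 1)
        return np.searchsorted(cum, targets)

    # Hypothetical usage with a (100 frames x 16 bands) filter-bank output:
    fb = np.random.rand(100, 16)
    print(equal_change_boundaries(fb))  # 9 boundaries delimiting 8 subintervals
    ```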

  6. Automatic speech recognition using a predictive echo state network classifier.

    PubMed

    Skowronski, Mark D; Harris, John G

    2007-04-01

    We have combined an echo state network (ESN) with a competitive state machine framework to create a classification engine called the predictive ESN classifier. We derive the expressions for training the predictive ESN classifier and show that, in noisy speech classification experiments, the model was significantly more noise robust than a hidden Markov model, by 8±1 dB signal-to-noise ratio. The simple training algorithm and noise robustness of the predictive ESN classifier make it an attractive classification engine for automatic speech recognition.
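
    A minimal sketch of a predictive ESN classifier in the spirit of this record: one next-frame linear readout per class, with classification by lowest prediction error. Reservoir size, spectral radius, regularization, and data are illustrative assumptions, not the authors' derivation.

    ```python
    # Sketch of a predictive ESN classifier: per-class readouts predict the
    # next feature frame; a test utterance gets the class with least error.
    import numpy as np

    rng = np.random.default_rng(0)
    N_RES, N_IN = 100, 13                        # reservoir units, feature dimension

    W_in = rng.uniform(-0.5, 0.5, (N_RES, N_IN))
    W = rng.uniform(-0.5, 0.5, (N_RES, N_RES))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius below 1

    def reservoir_states(frames):
        """Run frames (T x N_IN) through the reservoir from a zero state."""
        x, states = np.zeros(N_RES), []
        for u in frames:
            x = np.tanh(W_in @ u + W @ x)
            states.append(x)
        return np.array(states)

    def train_readout(utterances, lam=1e-2):
        """Ridge-regress next-frame targets on reservoir states, per class."""
        X = np.vstack([reservoir_states(u)[:-1] for u in utterances])
        Y = np.vstack([u[1:] for u in utterances])
        return np.linalg.solve(X.T @ X + lam * np.eye(N_RES), X.T @ Y)

    def classify(frames, readouts):
        S = reservoir_states(frames)
        errors = {c: np.mean((S[:-1] @ R - frames[1:]) ** 2)
                  for c, R in readouts.items()}
        return min(errors, key=errors.get)

    # Hypothetical usage: readouts = {c: train_readout(train_data[c]) for c in classes}
    ```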

  7. Automatic concept extraction from spoken medical reports.

    PubMed

    Happe, André; Pouliquen, Bruno; Burgun, Anita; Cuggia, Marc; Le Beux, Pierre

    2003-07-01

    The objective of this project is to investigate methods whereby a combination of speech recognition and automated indexing methods substitute for current transcription and indexing practices. We based our study on existing speech recognition software programs and on NOMINDEX, a tool that extracts MeSH concepts from medical text in natural language and that is mainly based on a French medical lexicon and on the UMLS. For each document, the process consists of three steps: (1) dictation and digital audio recording, (2) speech recognition, (3) automatic indexing. The evaluation consisted of a comparison between the set of concepts extracted by NOMINDEX after the speech recognition phase and the set of keywords manually extracted from the initial document. The method was evaluated on a set of 28 patient discharge summaries extracted from the MENELAS corpus in French, corresponding to in-patients admitted for coronarography. The overall precision was 73% and the overall recall was 90%. Indexing errors were mainly due to word sense ambiguity and abbreviations. A specific issue was the fact that the standard French translation of MeSH terms lacks diacritics. A preliminary evaluation of speech recognition tools showed that the rate of accurate recognition was higher than 98%. Only 3% of the indexing errors were generated by inadequate speech recognition. We discuss several areas to focus on to improve this prototype. However, the very low rate of indexing errors due to speech recognition errors highlights the potential benefits of combining speech recognition techniques and automatic indexing.
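
    The reported evaluation reduces to set-based precision and recall of extracted concepts against manual keywords, as in this toy example (the concept sets are made up):

    ```python
    # Precision/recall of automatically extracted concepts against manually
    # assigned keywords; the two sets below are hypothetical examples.
    extracted = {"coronarography", "myocardial infarction", "hypertension"}
    manual = {"coronarography", "myocardial infarction", "diabetes"}

    tp = len(extracted & manual)          # concepts found in both sets
    precision = tp / len(extracted)       # 2/3 ~ 0.67
    recall = tp / len(manual)             # 2/3 ~ 0.67
    print(f"precision={precision:.2f} recall={recall:.2f}")
    ```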

  8. The Use of an Autonomous Pedagogical Agent and Automatic Speech Recognition for Teaching Sight Words to Students with Autism Spectrum Disorder

    ERIC Educational Resources Information Center

    Saadatzi, Mohammad Nasser; Pennington, Robert C.; Welch, Karla C.; Graham, James H.; Scott, Renee E.

    2017-01-01

    In the current study, we examined the effects of an instructional package comprised of an autonomous pedagogical agent, automatic speech recognition, and constant time delay during the instruction of reading sight words aloud to young adults with autism spectrum disorder. We used a concurrent multiple baseline across participants design to…

  9. Real-time speech gisting for ATC applications

    NASA Astrophysics Data System (ADS)

    Dunkelberger, Kirk A.

    1995-06-01

    Command and control within the ATC environment remains primarily voice-based. Hence, automatic real-time, speaker-independent, continuous speech recognition (CSR) has many obvious applications and implied benefits to the ATC community: automated target tagging, aircraft compliance monitoring, controller training, automatic alarm disabling, display management, and many others. However, while current state-of-the-art CSR systems provide upwards of 98% word accuracy in laboratory environments, recent low-intrusion experiments in ATCT environments demonstrated less than 70% word accuracy in spite of significant investments in recognizer tuning. Acoustic channel irregularities and controller/pilot grammar varieties impact current CSR algorithms at their weakest points. It will be shown herein, however, that real-time context- and environment-sensitive gisting can provide key command phrase recognition rates of greater than 95% using the same low-intrusion approach. The combination of real-time inexact syntactic pattern recognition techniques and a tight integration of CSR, gisting, and ATC database accessor system components is the key to these high phrase recognition rates. A system concept for real-time gisting in the ATC context is presented herein. After establishing an application context, the discussion presents a minimal CSR technology context and then focuses on the gisting mechanism, desirable interfaces into the ATCT database environment, and data and control flow within the prototype system. Results of recent tests for a subset of the functionality are presented together with suggestions for further research.

  10. The Suitability of Cloud-Based Speech Recognition Engines for Language Learning

    ERIC Educational Resources Information Center

    Daniels, Paul; Iwago, Koji

    2017-01-01

    As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with call software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines--Apple's…

  11. Crew Communication as a Factor in Aviation Accidents

    NASA Technical Reports Server (NTRS)

    Goguen, J.; Linde, C.; Murphy, M.

    1986-01-01

    The crew communication process is analyzed. Planning and explanation are shown to be well-structured discourse types, described by formal rules. These formal rules are integrated with those describing the other most important discourse type within the cockpit: the command-and-control speech act chain. The latter is described as a sequence of speech acts for making requests (including orders and suggestions), for making reports, for supporting or challenging statements, and for acknowledging previous speech acts. Mitigation level, a linguistic indication of indirectness and tentativeness in speech, was an important variable in several hypotheses, i.e., the speech of subordinates is more mitigated than the speech of superiors, the speech of all crewmembers is less mitigated when they know that they are in either a problem or emergency situation, and mitigation is a factor in failures of crewmembers to initiate discussion of new topics or have suggestions ratified by the captain. Test results also show that planning and explanation are more frequently performed by captains than by other crewmembers, are done more during crew-recognized problems, and are done less during crew-recognized emergencies.

  12. Voice technology and BBN

    NASA Technical Reports Server (NTRS)

    Wolf, Jared J.

    1977-01-01

    The following research areas are discussed: (1) speech signal processing; (2) automatic speech recognition; (3) continuous speech understanding; (4) speaker recognition; (5) speech compression; (6) subjective and objective evaluation of speech communication systems; (7) measurement of the intelligibility and quality of speech when degraded by noise or other masking stimuli; (8) speech synthesis; (9) instructional aids for second-language learning and for training of the deaf; and (10) investigation of speech correlates of psychological stress. Experimental psychology, control systems, and human factors engineering, which are often relevant to the proper design and operation of speech systems, are also described.

  13. Presentation video retrieval using automatically recovered slide and spoken text

    NASA Astrophysics Data System (ADS)

    Cooper, Matthew

    2013-03-01

    Video is becoming a prevalent medium for e-learning. Lecture videos contain text information in both the presentation slides and lecturer's speech. This paper examines the relative utility of automatically recovered text from these sources for lecture video retrieval. To extract the visual information, we automatically detect slides within the videos and apply optical character recognition to obtain their text. Automatic speech recognition is used similarly to extract spoken text from the recorded audio. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Results reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Experiments demonstrate that automatically extracted slide text enables higher precision video retrieval than automatically recovered spoken text.

  14. Incorporating Speech Recognition into a Natural User Interface

    NASA Technical Reports Server (NTRS)

    Chapa, Nicholas

    2017-01-01

    The Augmented/Virtual Reality (AVR) Lab has been working to study the applicability of recent virtual and augmented reality hardware and software to KSC operations. This includes the Oculus Rift, HTC Vive, Microsoft HoloLens, and Unity game engine. My project in this lab is to integrate voice recognition and voice commands into an easy-to-modify system that can be added to an existing portion of a Natural User Interface (NUI). A NUI is an intuitive and simple-to-use interface incorporating visual, touch, and speech recognition. The inclusion of speech recognition capability will allow users to perform actions or make inquiries using only their voice. The simplicity of needing only to speak to control an on-screen object or enact some digital action means that any user can quickly become accustomed to using this system. Multiple programs were tested for use in a speech command and recognition system. Sphinx4 translates speech to text using a Hidden Markov Model (HMM) based language model, an acoustic model, and a word dictionary, running on Java. PocketSphinx has similar functionality to Sphinx4 but runs on C. However, neither of these programs was ideal, as building a Java or C wrapper slowed performance. The most suitable speech recognition system tested was the Unity Engine Grammar Recognizer. A Context Free Grammar (CFG) structure is written in an XML file to specify the structure of phrases and words that will be recognized by the Unity Grammar Recognizer. Using Speech Recognition Grammar Specification (SRGS) 1.0 makes modifying the recognized combinations of words and phrases very simple and quick. With SRGS 1.0, semantic information can also be added to the XML file, which allows for even more control over how spoken words and phrases are interpreted by Unity. Additionally, using a CFG with SRGS 1.0 produces Finite State Machine (FSM) functionality, limiting the potential for incorrectly heard words or phrases. The purpose of my project was to investigate options for a speech recognition system. To that end, I attempted to integrate Sphinx4 into a user interface. Sphinx4 had great accuracy and is the only free program able to perform offline speech dictation. However, it had a limited dictionary of words that could be recognized, single-syllable words were almost impossible for it to hear, and since it ran on Java it could not be integrated into the Unity-based NUI. PocketSphinx ran much faster than Sphinx4, which would have made it ideal as a plugin to the Unity NUI; unfortunately, creating a C# wrapper for the C code made the program unusable with Unity because the wrapper slowed code execution and class files became unreachable. The Unity Grammar Recognizer proved the most suitable speech recognition interface: it is flexible in recognizing multiple variations of the same command, and it is the most accurate program tested at recognizing speech, because it uses an XML grammar to specify speech structure instead of relying solely on a dictionary and language model. The Unity Grammar Recognizer will be used with the NUI for these reasons, as well as because it is written for C#, which further simplifies the incorporation.

  15. Twelve-month-old infants recognize that speech can communicate unobservable intentions.

    PubMed

    Vouloumanos, Athena; Onishi, Kristine H; Pogue, Amanda

    2012-08-07

    Much of our knowledge is acquired not from direct experience but through the speech of others. Speech allows rapid and efficient transfer of information that is otherwise not directly observable. Do infants recognize that speech, even if unfamiliar, can communicate about an important aspect of the world that cannot be directly observed: a person's intentions? Twelve-month-olds saw a person (the Communicator) attempt but fail to achieve a target action (stacking a ring on a funnel). The Communicator subsequently directed either speech or a nonspeech vocalization to another person (the Recipient) who had not observed the attempts. The Recipient either successfully stacked the ring (Intended outcome), attempted but failed to stack the ring (Observable outcome), or performed a different stacking action (Related outcome). Infants recognized that speech could communicate about unobservable intentions, looking longer at Observable and Related outcomes than the Intended outcome when the Communicator used speech. However, when the Communicator used nonspeech, infants looked equally at the three outcomes. Thus, for 12-month-olds, speech can transfer information about unobservable aspects of the world such as internal mental states, which provides preverbal infants with a tool for acquiring information beyond their immediate experience.

  16. Twelve-month-old infants recognize that speech can communicate unobservable intentions

    PubMed Central

    Vouloumanos, Athena; Onishi, Kristine H.; Pogue, Amanda

    2012-01-01

    Much of our knowledge is acquired not from direct experience but through the speech of others. Speech allows rapid and efficient transfer of information that is otherwise not directly observable. Do infants recognize that speech, even if unfamiliar, can communicate about an important aspect of the world that cannot be directly observed: a person’s intentions? Twelve-month-olds saw a person (the Communicator) attempt but fail to achieve a target action (stacking a ring on a funnel). The Communicator subsequently directed either speech or a nonspeech vocalization to another person (the Recipient) who had not observed the attempts. The Recipient either successfully stacked the ring (Intended outcome), attempted but failed to stack the ring (Observable outcome), or performed a different stacking action (Related outcome). Infants recognized that speech could communicate about unobservable intentions, looking longer at Observable and Related outcomes than the Intended outcome when the Communicator used speech. However, when the Communicator used nonspeech, infants looked equally at the three outcomes. Thus, for 12-month-olds, speech can transfer information about unobservable aspects of the world such as internal mental states, which provides preverbal infants with a tool for acquiring information beyond their immediate experience. PMID:22826217

  17. User Evaluation of a Communication System That Automatically Generates Captions to Improve Telephone Communication

    PubMed Central

    Zekveld, Adriana A.; Kramer, Sophia E.; Kessens, Judith M.; Vlaming, Marcel S. M. G.; Houtgast, Tammo

    2009-01-01

    This study examined the subjective benefit obtained from automatically generated captions during telephone-speech comprehension in the presence of babble noise. Short stories were presented by telephone either with or without captions that were generated offline by an automatic speech recognition (ASR) system. To simulate online ASR, the word accuracy (WA) level of the captions was 60% or 70%, and the text was presented with a delay relative to the speech. After each test, the hearing-impaired participants (n = 20) completed the NASA Task Load Index and several rating scales evaluating the support from the captions. Participants indicated that using the erroneous text in speech comprehension was difficult, and the reported task load did not differ between the audio + text and audio-only conditions. In a follow-up experiment (n = 10), the perceived benefit of presenting captions increased when the WA level was raised to 80% and 90% and the text delay was eliminated. However, in general, the task load did not decrease when captions were presented. These results suggest that the extra effort required to process the text could have been compensated for by less effort required to comprehend the speech. Future research should aim at reducing the complexity of the task to increase the willingness of hearing-impaired persons to use an assistive communication system automatically providing captions. The current results underline the need for obtaining both objective and subjective measures of benefit when evaluating assistive communication systems. PMID:19126551

  18. The Contribution of Brainstem and Cerebellar Pathways to Auditory Recognition

    PubMed Central

    McLachlan, Neil M.; Wilson, Sarah J.

    2017-01-01

    The cerebellum has been known to play an important role in motor functions for many years. More recently its role has been expanded to include a range of cognitive and sensory-motor processes, and substantial neuroimaging and clinical evidence now points to cerebellar involvement in most auditory processing tasks. In particular, an increase in the size of the cerebellum over recent human evolution has been attributed in part to the development of speech. Despite this, the auditory cognition literature has largely overlooked afferent auditory connections to the cerebellum that have been implicated in acoustically conditioned reflexes in animals, and could subserve speech and other auditory processing in humans. This review expands our understanding of auditory processing by incorporating cerebellar pathways into the anatomy and functions of the human auditory system. We reason that plasticity in the cerebellar pathways underpins implicit learning of spectrotemporal information necessary for sound and speech recognition. Once learnt, this information automatically recognizes incoming auditory signals and predicts likely subsequent information based on previous experience. Since sound recognition processes involving the brainstem and cerebellum initiate early in auditory processing, learnt information stored in cerebellar memory templates could then support a range of auditory processing functions such as streaming, habituation, the integration of auditory feature information such as pitch, and the recognition of vocal communications. PMID:28373850

  19. Use of speech-to-text technology for documentation by healthcare providers.

    PubMed

    Ajami, Sima

    2016-01-01

    Medical records are a critical component of a patient's treatment. However, documentation of patient-related information is considered a secondary activity in the provision of healthcare services, often leading to incomplete medical records and patient data of low quality. Advances in information technology (IT) in the health system and registration of information in electronic health records (EHR) using speech-to-text conversion software have facilitated service delivery. This narrative review is a literature search conducted with the help of libraries, books, conference proceedings, the databases of Science Direct, PubMed, Proquest, Springer, and SID (Scientific Information Database), and search engines such as Yahoo and Google. I used the following keywords and their combinations: speech recognition, automatic report documentation, voice to text software, healthcare, information, and voice recognition. Due to lack of knowledge of other languages, I searched all texts in English or Persian with no time limits. Of a total of 70 articles, only 42 were selected. Speech-to-text conversion technology offers opportunities to improve the documentation process of medical records, reduce the cost and time of recording information, enhance the quality of documentation, improve the quality of services provided to patients, and support healthcare providers in legal matters. Healthcare providers should recognize the impact of this technology on service delivery.

  20. Two Methods of Automatic Evaluation of Speech Signal Enhancement Recorded in the Open-Air MRI Environment

    NASA Astrophysics Data System (ADS)

    Přibil, Jiří; Přibilová, Anna; Frollo, Ivan

    2017-12-01

    The paper focuses on two methods of evaluating the success of speech signal enhancement for recordings made in an open-air magnetic resonance imager during phonation for 3D human vocal tract modeling. The first approach enables a comparison based on statistical analysis by ANOVA and hypothesis tests. The second method is based on classification by Gaussian mixture models (GMM). The performed experiments confirm that the proposed ANOVA and GMM classifiers for automatic evaluation of speech quality are functional and produce results fully comparable with the standard evaluation based on the listening test method.
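
    The first evaluation approach (statistical analysis by ANOVA) can be illustrated with a one-way ANOVA on a hypothetical per-recording quality measure; the feature values and SciPy-based test are assumptions, not the authors' exact pipeline.

    ```python
    # One-way ANOVA testing whether a spectral quality measure differs between
    # original and enhanced recordings; the feature values are hypothetical.
    import numpy as np
    from scipy.stats import f_oneway

    quality_original = np.array([2.1, 2.4, 1.9, 2.2, 2.0])  # e.g., noise measure
    quality_enhanced = np.array([1.2, 1.0, 1.3, 1.1, 0.9])

    F, p = f_oneway(quality_original, quality_enhanced)
    print(f"F={F:.2f}, p={p:.4f}")  # a small p suggests the enhancement changed it
    ```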

  1. How Accurately Can the Google Web Speech API Recognize and Transcribe Japanese L2 English Learners' Oral Production?

    ERIC Educational Resources Information Center

    Ashwell, Tim; Elam, Jesse R.

    2017-01-01

    The ultimate aim of our research project was to use the Google Web Speech API to automate scoring of elicited imitation (EI) tests. However, in order to achieve this goal, we had to take a number of preparatory steps. We needed to assess how accurate this speech recognition tool is in recognizing native speakers' production of the test items; we…

  2. Microscopic prediction of speech recognition for listeners with normal hearing in noise using an auditory model.

    PubMed

    Jürgens, Tim; Brand, Thomas

    2009-11-01

    This study compares the phoneme recognition performance in speech-shaped noise of a microscopic model for speech recognition with the performance of normal-hearing listeners. "Microscopic" is defined in terms of this model twofold. First, the speech recognition rate is predicted on a phoneme-by-phoneme basis. Second, microscopic modeling means that the signal waveforms to be recognized are processed by mimicking elementary parts of human auditory processing. The model is based on an approach by Holube and Kollmeier [J. Acoust. Soc. Am. 100, 1703-1716 (1996)] and consists of a psychoacoustically and physiologically motivated preprocessing and a simple dynamic-time-warping speech recognizer. The model is evaluated by presenting nonsense speech in a closed-set paradigm. Averaged phoneme recognition rates, specific phoneme recognition rates, and phoneme confusions are analyzed. The influence of different perceptual distance measures and of the model's a priori knowledge is investigated. The results show that human performance can be predicted by this model using an optimal detector, i.e., identical speech waveforms for both training of the recognizer and testing. The best model performance is yielded by distance measures which focus mainly on small perceptual distances and neglect outliers.
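
    The "simple dynamic-time-warping speech recognizer" component can be sketched as the classic DTW distance between feature-frame sequences; the Euclidean local cost and template-matching usage are standard assumptions, not details taken from this record.

    ```python
    # Classic DTW distance between two feature sequences (frames x dims);
    # a template recognizer picks the training template with minimal distance.
    import numpy as np

    def dtw_distance(a, b):
        """DTW with Euclidean local cost between feature frames."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Hypothetical usage with a dict of labeled templates (label -> frames):
    # label = min(templates, key=lambda t: dtw_distance(test_frames, templates[t]))
    ```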

  3. Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems.

    PubMed

    Greene, Beth G; Logan, John S; Pisoni, David B

    1986-03-01

    We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered.

  4. Performance of Automated Speech Scoring on Different Low- to Medium-Entropy Item Types for Low-Proficiency English Learners. Research Report. ETS RR-17-12

    ERIC Educational Resources Information Center

    Loukina, Anastassia; Zechner, Klaus; Yoon, Su-Youn; Zhang, Mo; Tao, Jidong; Wang, Xinhao; Lee, Chong Min; Mulholland, Matthew

    2017-01-01

    This report presents an overview of the "SpeechRater" automated scoring engine model building and evaluation process for several item types, with a focus on a low-English-proficiency test-taker population. We discuss each stage of speech scoring, including automatic speech recognition, filtering models for nonscorable responses, and…

  5. Contributions of speech science to the technology of man-machine voice interactions

    NASA Technical Reports Server (NTRS)

    Lea, Wayne A.

    1977-01-01

    Research in speech understanding was reviewed. Plans which include prosodics research, phonological rules for speech understanding systems, and continued interdisciplinary phonetics research are discussed. Improved acoustic phonetic analysis capabilities in speech recognizers are suggested.

  6. Segmentation and Characterization of Chewing Bouts by Monitoring Temporalis Muscle Using Smart Glasses With Piezoelectric Sensor.

    PubMed

    Farooq, Muhammad; Sazonov, Edward

    2017-11-01

    Several methods have been proposed for automatic and objective monitoring of food intake, but their performance suffers in the presence of speech and motion artifacts. This paper presents a novel sensor system and algorithms for detection and characterization of chewing bouts from a piezoelectric strain sensor placed on the temporalis muscle. The proposed data acquisition device was incorporated into the temple of eyeglasses. The system was tested by ten participants in two-part experiments, one under controlled laboratory conditions and the other in unrestricted free-living. The proposed food intake recognition method first performs an energy-based segmentation to isolate candidate chewing segments (instead of using epochs of fixed duration, as commonly reported in the research literature), followed by classification of the segments with linear support vector machine models. At the participant level (combining data from both laboratory and free-living experiments), with ten-fold leave-one-out cross-validation, chewing was recognized with an average F-score of 96.28%, and the resultant area under the curve was 0.97, which are higher than any previously reported results. A multivariate regression model was used to estimate chew counts from segments classified as chewing, with an average mean absolute error of 3.83% at the participant level. These results suggest that the proposed system is able to identify chewing segments in the presence of speech and motion artifacts, as well as automatically and accurately quantify chewing behavior, both under controlled laboratory conditions and in unrestricted free-living.
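
    The energy-based candidate segmentation step can be sketched as follows; window length, threshold factor, and the raw signal are assumptions, not the paper's tuned values.

    ```python
    # Sketch of energy-based candidate segmentation: short-time energy of the
    # strain-sensor signal is thresholded to isolate high-activity regions.
    import numpy as np

    def candidate_segments(signal, win=250, thresh_factor=1.5):
        """Return (start, end) sample indices of high-energy regions."""
        n_win = len(signal) // win
        energy = np.array([np.sum(signal[i*win:(i+1)*win] ** 2)
                           for i in range(n_win)])
        active = energy > thresh_factor * energy.mean()
        segments, start = [], None
        for i, flag in enumerate(active):
            if flag and start is None:
                start = i * win
            elif not flag and start is not None:
                segments.append((start, i * win))
                start = None
        if start is not None:
            segments.append((start, n_win * win))
        return segments

    # Each candidate would then be classified as chewing / not chewing by a
    # linear SVM, with chew counts estimated by regression on chewing segments.
    ```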

  7. Automatic translation among spoken languages

    NASA Technical Reports Server (NTRS)

    Walter, Sharon M.; Costigan, Kelly

    1994-01-01

    The Machine Aided Voice Translation (MAVT) system was developed in response to the shortage of experienced military field interrogators with both foreign language proficiency and interrogation skills. Combining speech recognition, machine translation, and speech generation technologies, the MAVT accepts an interrogator's spoken English question and translates it into spoken Spanish. The spoken Spanish response of the potential informant can then be translated into spoken English. Potential military and civilian applications for automatic spoken language translation technology are discussed in this paper.

  8. Speech Recognition Software for Language Learning: Toward an Evaluation of Validity and Student Perceptions

    ERIC Educational Resources Information Center

    Cordier, Deborah

    2009-01-01

    A renewed focus on foreign language (FL) learning and speech for communication has resulted in computer-assisted language learning (CALL) software developed with Automatic Speech Recognition (ASR). ASR features for FL pronunciation (Lafford, 2004) are functional components of CALL designs used for FL teaching and learning. The ASR features…

  9. The Path to Language: Toward Bilingual Education for Deaf Children.

    ERIC Educational Resources Information Center

    Bouvet, Danielle

    Discussion of speech instruction in bilingual education for deaf children refutes the assumption that speech is acquired automatically by hearing children and examines a program in which deaf children are taught alongside hearing children. The first part looks at how speech functions and how children acquire it: including the nature of the…

  10. Cognitive Load in Voice Therapy Carry-Over Exercises

    ERIC Educational Resources Information Center

    Iwarsson, Jenny; Morris, David Jackson; Balling, Laura Winther

    2017-01-01

    Purpose: The cognitive load generated by online speech production may vary with the nature of the speech task. This article examines 3 speech tasks used in voice therapy carry-over exercises, in which a patient is required to adopt and automatize new voice behaviors, ultimately in daily spontaneous communication. Method: Twelve subjects produced…

  11. Speech Recognition Using Multiple Features and Multiple Recognizers

    DTIC Science & Technology

    1991-12-03

    2.1 Introduction … 2.2 Human Speech Communication Process … How to Setup ASRT … How to Use Interactive Menus … recognize a word from an acoustic signal. The human ear and brain perform this type of recognition with incredible speed and precision. Even though

  12. Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications.

    PubMed

    VanDam, Mark; Silbert, Noah H

    2016-01-01

    Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 classification judgements of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP, such as systems using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output.

  13. Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications

    PubMed Central

    2016-01-01

    Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 classification judgements of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP, such as systems using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output. PMID:27529813

  14. Massively-Parallel Architectures for Automatic Recognition of Visual Speech Signals

    DTIC Science & Technology

    1988-10-12

    Massively-Parallel Architectures for Automatic Recognition of Visual Speech Signals … Terrence J. … characteristics of speech from the visual speech signals. Neural networks have been trained on a database of vowels. The raw images of faces, aligned and preprocessed, were used as input to these networks, which were trained to estimate the corresponding envelope of the

  15. Speech Recognition as a Support Service for Deaf and Hard of Hearing Students: Adaptation and Evaluation. Final Report to Spencer Foundation.

    ERIC Educational Resources Information Center

    Stinson, Michael; Elliot, Lisa; McKee, Barbara; Coyne, Gina

    This report discusses a project that adapted new automatic speech recognition (ASR) technology to provide real-time speech-to-text transcription as a support service for students who are deaf and hard of hearing (D/HH). In this system, as the teacher speaks, a hearing intermediary, or captionist, dictates into the speech recognition system in a…

  16. Objective Prediction of Hearing Aid Benefit Across Listener Groups Using Machine Learning: Speech Recognition Performance With Binaural Noise-Reduction Algorithms.

    PubMed

    Schädler, Marc R; Warzybok, Anna; Kollmeier, Birger

    2018-01-01

    The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than -20 dB could not be predicted.
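
    One building block of the FADE approach, estimating an SRT from simulated recognition results, can be sketched as interpolating the SNR at 50% correct; the recognition rates below are invented for illustration.

    ```python
    # Estimate the speech reception threshold (SRT) as the SNR at which
    # simulated recognition reaches 50% correct, by linear interpolation.
    import numpy as np

    snrs = np.array([-15.0, -10.0, -5.0, 0.0, 5.0])   # dB, hypothetical
    rates = np.array([0.05, 0.20, 0.55, 0.85, 0.97])  # word recognition rates

    def srt(snrs, rates, target=0.5):
        """Interpolate the SNR where the recognition rate crosses the target."""
        return float(np.interp(target, rates, snrs))  # rates must be increasing

    print(f"SRT ~ {srt(snrs, rates):.1f} dB SNR")     # ~ -5.7 dB for these rates
    ```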

  17. Objective Prediction of Hearing Aid Benefit Across Listener Groups Using Machine Learning: Speech Recognition Performance With Binaural Noise-Reduction Algorithms

    PubMed Central

    Schädler, Marc R.; Warzybok, Anna; Kollmeier, Birger

    2018-01-01

    The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than −20 dB could not be predicted. PMID:29692200

  18. Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems

    PubMed Central

    GREENE, BETH G.; LOGAN, JOHN S.; PISONI, DAVID B.

    2012-01-01

    We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered. PMID:23225916

  19. Improving the Child's Speech. Second Edition.

    ERIC Educational Resources Information Center

    Anderson, Virgil A.; Newby, Hayes A.

    This is primarily a book for the classroom teacher, although it will also prove useful to parents, child-guidance workers, physicians, and others who are concerned with the development and training of children with speech handicaps. Contents include "Speech Improvement as an Educational Problem," "Recognizing Speech Disabilities," "The Development…

  20. Parents and Speech Therapist Perception of Parental Involvement in Kailila Therapy Center, Jakarta, Indonesia

    ERIC Educational Resources Information Center

    Jane, Griselda; Tunjungsari, Harini

    2015-01-01

    Parental involvement in a speech therapy has not been prioritized in most therapy centers in Indonesia. One of the therapy centers that has recognized the importance of parental involvement is Kailila Speech Therapy Center. In Kailila speech therapy center, parental involvement in children's speech therapy is an obligation that has been…

  1. Intelligent interfaces for expert systems

    NASA Technical Reports Server (NTRS)

    Villarreal, James A.; Wang, Lui

    1988-01-01

    Vital to the success of an expert system is an interface to the user which performs intelligently. A generic intelligent interface is being developed for expert systems. This intelligent interface was developed around the in-house developed Expert System for the Flight Analysis System (ESFAS). The Flight Analysis System (FAS) is comprised of 84 configuration-controlled FORTRAN subroutines that are used in the preflight analysis of the space shuttle. In order to use FAS proficiently, a person must be knowledgeable in flight mechanics and the procedures involved in deploying a particular payload, and must have an overall understanding of the FAS. ESFAS, still in its developmental stage, is taking much of this knowledge into account. The generic intelligent interface involves the integration of a speech recognizer and synthesizer, a preparser, and a natural language parser with ESFAS. The speech recognizer being used is capable of recognizing 1000 words of connected speech. The natural language parser is a commercial software package which uses caseframe instantiation in processing the streams of words from the speech recognizer or the keyboard. The system's configuration is described along with its capabilities and drawbacks.

  2. Intra- and Inter-database Study for Arabic, English, and German Databases: Do Conventional Speech Features Detect Voice Pathology?

    PubMed

    Ali, Zulfiqar; Alsulaiman, Mansour; Muhammad, Ghulam; Elamvazuthi, Irraivan; Al-Nasheri, Ahmed; Mesallam, Tamer A; Farahat, Mohamed; Malki, Khalid H

    2017-05-01

    A large population around the world has voice complications. Various approaches for subjective and objective evaluations have been suggested in the literature. The subjective approach strongly depends on the experience and area of expertise of a clinician, and human error cannot be neglected. On the other hand, the objective or automatic approach is noninvasive. Automatically developed systems can provide complementary information that may be helpful for a clinician in the early screening of a voice disorder. At the same time, automatic systems can be deployed in remote areas, where a general practitioner can use them and may refer the patient to a specialist to avoid complications that may be life threatening. Many automatic systems for disorder detection have been developed by applying different types of conventional speech features, such as linear prediction coefficients, linear prediction cepstral coefficients, and Mel-frequency cepstral coefficients (MFCCs). This study aims to ascertain whether conventional speech features detect voice pathology reliably, and whether they can be correlated with voice quality. To investigate this, an automatic detection system based on MFCCs was developed, and three different voice disorder databases were used. The experimental results suggest that the accuracy of the MFCC-based system varies from database to database. The detection rate ranges from 72% to 95% for the intra-database experiments and from 47% to 82% for the inter-database experiments. The results suggest that conventional speech features are not correlated with voice quality and hence are not reliable for pathology detection. Copyright © 2017 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
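
    A rough sketch of such an MFCC-based detector follows, assuming librosa and scikit-learn; the file list `paths` and label array `labels` are hypothetical:

      import numpy as np
      import librosa
      from sklearn.svm import SVC

      def mfcc_vector(path, n_mfcc=13):
          # Summarize a recording as the per-coefficient mean and std of its MFCCs
          y, sr = librosa.load(path, sr=None)
          m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
          return np.concatenate([m.mean(axis=1), m.std(axis=1)])

      # X = np.stack([mfcc_vector(p) for p in paths])   # paths: audio files (hypothetical)
      # clf = SVC(kernel="rbf").fit(X, labels)          # labels: 0 = normal, 1 = disordered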

  3. Reference-free automatic quality assessment of tracheoesophageal speech.

    PubMed

    Huang, Andy; Falk, Tiago H; Chan, Wai-Yip; Parsa, Vijay; Doyle, Philip

    2009-01-01

    Evaluation of the quality of tracheoesophageal (TE) speech using machines instead of human experts can enhance the voice rehabilitation process for patients who have undergone total laryngectomy and voice restoration. Towards the goal of devising a reference-free TE speech quality estimation algorithm, we investigate the efficacy of speech signal features that are used in standard telephone-speech quality assessment algorithms, in conjunction with a recently introduced speech modulation spectrum measure. Tests performed on two TE speech databases demonstrate that the modulation spectral measure and a subset of features in the standard ITU-T P.563 algorithm estimate TE speech quality with better correlation (up to 0.9) than previously proposed features.
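
    For orientation, a crude single-band version of a speech modulation spectrum can be computed as below; the published measure is band-based and more elaborate, so this is only a sketch, assuming scipy:

      import numpy as np
      from scipy.signal import hilbert, stft

      def modulation_spectrum(x, fs):
          # Temporal envelope via the Hilbert transform, then the spectrum
          # of the envelope's slow fluctuations (the modulation spectrum).
          env = np.abs(hilbert(x))
          env -= env.mean()
          f, _, Z = stft(env, fs=fs, nperseg=int(0.256 * fs))
          return f, np.abs(Z).mean(axis=1)  # average modulation magnitude per rate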

  4. Recognizing speech in a novel accent: the motor theory of speech perception reframed.

    PubMed

    Moulin-Frier, Clément; Arbib, Michael A

    2013-08-01

    The motor theory of speech perception holds that we perceive the speech of another in terms of a motor representation of that speech. However, when we have learned to recognize a foreign accent, it seems plausible that recognition of a word rarely involves reconstruction of the speech gestures of the speaker rather than the listener. To better assess the motor theory and this observation, we proceed in three stages. Part 1 places the motor theory of speech perception in a larger framework based on our earlier models of the adaptive formation of mirror neurons for grasping, and for viewing extensions of that mirror system as part of a larger system for neuro-linguistic processing, augmented by the present consideration of recognizing speech in a novel accent. Part 2 then offers a novel computational model of how a listener comes to understand the speech of someone speaking the listener's native language with a foreign accent. The core tenet of the model is that the listener uses hypotheses about the word the speaker is currently uttering to update probabilities linking the sound produced by the speaker to phonemes in the native language repertoire of the listener. This, on average, improves the recognition of later words. The model is neutral regarding the nature of the representations it uses (motor vs. auditory). It serves as a reference point for the discussion in Part 3, which proposes a dual-stream neuro-linguistic architecture to revisit claims for and against the motor theory of speech perception and the relevance of mirror neurons, and extracts some implications for the reframing of the motor theory.
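
    The core updating step can be caricatured with discrete inventories: each time a word hypothesis assigns phoneme labels to the speaker's sound tokens, the sound-to-phoneme mapping is nudged toward those labels. A toy numpy sketch under assumed inventory sizes, not the authors' implementation:

      import numpy as np

      S, Q = 50, 40                     # assumed sizes: sound tokens, phonemes
      p = np.full((S, Q), 1.0 / Q)      # p[s] = P(phoneme | sound s), initially uniform

      def update(p, sounds, phonemes, lr=0.1):
          # Move each row toward the phoneme the hypothesized word predicts;
          # the convex combination keeps every row a proper distribution.
          for s, q in zip(sounds, phonemes):
              target = np.zeros(p.shape[1])
              target[q] = 1.0
              p[s] = (1 - lr) * p[s] + lr * target
          return p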

  5. [The endpoint detection of cough signal in continuous speech].

    PubMed

    Yang, Guoqing; Mo, Hongqiang; Li, Wen; Lian, Lianfang; Zheng, Zeguang

    2010-06-01

    The endpoint detection of cough signals in continuous speech was researched in order to improve the efficiency and accuracy of manual recognition or computer-based automatic recognition. First, the short-time zero crossing rate (ZCR) is used to identify suspicious coughs, and a threshold for the short-time energy is derived from the acoustic characteristics of coughs. Then, the short-time energy is combined with the short-time ZCR to implement the endpoint detection of coughs in continuous speech. To evaluate the method, the true number of coughs in each recording was first identified by two experienced doctors using a graphical user interface (GUI). Second, the recordings were analyzed by the automatic endpoint detection program under MATLAB 7.0. Finally, the comparison between these two results showed that the error rate of undetected coughs is 2.18%, and that 98.13% of noise, silence, and speech was removed. The method of setting the short-time energy threshold is robust. The endpoint detection program can remove most speech and noise, thus maintaining a low error rate.
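
    A bare-bones sketch of the two-feature idea is given below; the frame sizes and thresholds are illustrative, not the paper's values:

      import numpy as np

      def short_time_energy_zcr(x, frame=400, hop=160):
          # Frame the signal and compute per-frame energy and zero crossing rate
          n = 1 + (len(x) - frame) // hop
          energy, zcr = np.empty(n), np.empty(n)
          for i in range(n):
              w = x[i * hop : i * hop + frame].astype(float)
              energy[i] = np.sum(w ** 2)
              zcr[i] = np.mean(np.abs(np.diff(np.sign(w)))) / 2
          return energy, zcr

      def cough_candidate_frames(x, zcr_min=0.25, energy_q=0.8):
          # High ZCR flags suspicious frames; a data-driven energy threshold
          # then confirms them, mirroring the two-stage combination above.
          energy, zcr = short_time_energy_zcr(x)
          return (zcr > zcr_min) & (energy > np.quantile(energy, energy_q))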

  6. [Repetitive phenomena in the spontaneous speech of aphasic patients: perseveration, stereotypy, echolalia, automatism and recurring utterance].

    PubMed

    Wallesch, C W; Brunner, R J; Seemüller, E

    1983-12-01

    Repetitive phenomena in spontaneous speech were investigated in 30 patients with chronic infarctions of the left hemisphere which included Broca's and/or Wernicke's area and/or the basal ganglia. Perseverations, stereotypies, and echolalias occurred with all types of brain lesions; automatisms and recurring utterances occurred only in those patients whose infarctions involved Wernicke's area and the basal ganglia. These patients also showed more echolalic responses. The results are discussed in view of the role of the basal ganglia as motor program generators.

  7. "Rate My Therapist": Automated Detection of Empathy in Drug and Alcohol Counseling via Speech and Language Processing.

    PubMed

    Xiao, Bo; Imel, Zac E; Georgiou, Panayiotis G; Atkins, David C; Narayanan, Shrikanth S

    2015-01-01

    The technology for evaluating patient-provider interactions in psychotherapy, observational coding, has not changed in 70 years. It is labor-intensive, error prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks, including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of the computationally derived empathy ratings was evaluated against human ratings for each provider. Computationally derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and an F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracy, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies.
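
    The text-based predictive model can be approximated with a standard bag-of-words pipeline; a sketch with scikit-learn, where `transcripts` and `empathy_codes` are hypothetical training arrays:

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      # transcripts: list of session transcripts (ASR or human);
      # empathy_codes: 0/1 observer ratings (low/high empathy) -- hypothetical
      model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                            LogisticRegression(max_iter=1000))
      # model.fit(transcripts, empathy_codes)
      # model.predict_proba(["tell me more about how that felt"])[:, 1]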

  8. Do Older Listeners With Hearing Loss Benefit From Dynamic Pitch for Speech Recognition in Noise?

    PubMed

    Shen, Jing; Souza, Pamela E

    2017-10-12

    Dynamic pitch, the variation in the fundamental frequency of speech, aids older listeners' speech perception in noise. It is unclear, however, whether some older listeners with hearing loss benefit from strengthened dynamic pitch cues for recognizing speech in certain noise scenarios, and how this relative benefit may be associated with individual factors. We first examined older individuals' relative benefit from natural versus strong dynamic pitch for speech recognition in noise. Further, we report the individual factors of the 2 groups of listeners who benefit differently from natural and strong dynamic pitch. Speech reception thresholds of 13 older listeners with mild-moderate hearing loss were measured using target speech with 3 levels of dynamic pitch strength. An individual's ability to benefit from dynamic pitch was defined as the speech reception threshold difference between speech with and without dynamic pitch cues. The relative benefit of natural versus strong dynamic pitch varied across individuals. However, this relative benefit remained consistent for the same individuals across background noises with temporal modulation. Those listeners who benefited more from strong dynamic pitch reported better subjective speech perception abilities. Strong dynamic pitch may be more beneficial than natural dynamic pitch for some older listeners recognizing speech in noise, particularly when the noise has temporal modulation.

  9. Automatic initial and final segmentation in cleft palate speech of Mandarin speakers

    PubMed Central

    Liu, Yin; Yin, Heng; Zhang, Junpeng; Zhang, Jing; Zhang, Jiang

    2017-01-01

    The speech unit segmentation is an important pre-processing step in the analysis of cleft palate speech. In Mandarin, one syllable is composed of two parts: initial and final. In cleft palate speech, resonance disorders occur at the finals and the voiced initials, while articulation disorders occur at the unvoiced initials. Thus, the initials and finals are the minimum speech units that can reflect the characteristics of cleft palate speech disorders. In this work, an automatic initial/final segmentation method is proposed as an important preprocessing step in cleft palate speech signal processing. The tested cleft palate speech utterances were collected from the Cleft Palate Speech Treatment Center in the Hospital of Stomatology, Sichuan University, which treats the largest number of cleft palate patients in China. The cleft palate speech data include 824 speech segments, and the control samples contain 228 speech segments. First, the syllables are extracted from the speech utterances. The proposed syllable extraction method avoids a training stage and achieves good performance for both voiced and unvoiced speech. Then, the syllables are classified as having “quasi-unvoiced” or “quasi-voiced” initials, and respective initial/final segmentation methods are proposed for these two types of syllables. Moreover, a two-step segmentation method is proposed in which the rough locations of syllable and initial/final boundaries are refined in the second segmentation step, in order to improve the robustness of the segmentation accuracy. The experiments show that the initial/final segmentation accuracies for syllables with quasi-unvoiced initials are higher than for syllables with quasi-voiced initials. For the cleft palate speech, the mean time error is 4.4 ms for syllables with quasi-unvoiced initials and 25.7 ms for syllables with quasi-voiced initials, and the correct segmentation accuracy P30 for all syllables is 91.69%. For the control samples, P30 for all syllables is 91.24%. PMID:28926572

  10. Automatic initial and final segmentation in cleft palate speech of Mandarin speakers.

    PubMed

    He, Ling; Liu, Yin; Yin, Heng; Zhang, Junpeng; Zhang, Jing; Zhang, Jiang

    2017-01-01

    The speech unit segmentation is an important pre-processing step in the analysis of cleft palate speech. In Mandarin, one syllable is composed of two parts: initial and final. In cleft palate speech, resonance disorders occur at the finals and the voiced initials, while articulation disorders occur at the unvoiced initials. Thus, the initials and finals are the minimum speech units that can reflect the characteristics of cleft palate speech disorders. In this work, an automatic initial/final segmentation method is proposed as an important preprocessing step in cleft palate speech signal processing. The tested cleft palate speech utterances were collected from the Cleft Palate Speech Treatment Center in the Hospital of Stomatology, Sichuan University, which treats the largest number of cleft palate patients in China. The cleft palate speech data include 824 speech segments, and the control samples contain 228 speech segments. First, the syllables are extracted from the speech utterances. The proposed syllable extraction method avoids a training stage and achieves good performance for both voiced and unvoiced speech. Then, the syllables are classified as having "quasi-unvoiced" or "quasi-voiced" initials, and respective initial/final segmentation methods are proposed for these two types of syllables. Moreover, a two-step segmentation method is proposed in which the rough locations of syllable and initial/final boundaries are refined in the second segmentation step, in order to improve the robustness of the segmentation accuracy. The experiments show that the initial/final segmentation accuracies for syllables with quasi-unvoiced initials are higher than for syllables with quasi-voiced initials. For the cleft palate speech, the mean time error is 4.4 ms for syllables with quasi-unvoiced initials and 25.7 ms for syllables with quasi-voiced initials, and the correct segmentation accuracy P30 for all syllables is 91.69%. For the control samples, P30 for all syllables is 91.24%.

  11. Development and Perceptual Evaluation of Amplitude-Based F0 Control in Electrolarynx Speech

    ERIC Educational Resources Information Center

    Saikachi, Yoko; Stevens, Kenneth N.; Hillman, Robert E.

    2009-01-01

    Purpose: Current electrolarynx (EL) devices produce a mechanical speech quality that has been largely attributed to the lack of natural fundamental frequency (F0) variation. In order to improve the quality of EL speech, in the present study the authors aimed to develop and evaluate an automatic F0 control scheme, in which F0 was modulated based on…

  12. A chimpanzee recognizes synthetic speech with significantly reduced acoustic cues to phonetic content.

    PubMed

    Heimbauer, Lisa A; Beran, Michael J; Owren, Michael J

    2011-07-26

    A long-standing debate concerns whether humans are specialized for speech perception, which some researchers argue is demonstrated by the ability to understand synthetic speech with significantly reduced acoustic cues to phonetic content. We tested a chimpanzee (Pan troglodytes) that recognizes 128 spoken words, asking whether she could understand such speech. Three experiments presented 48 individual words, with the animal selecting a corresponding visuographic symbol from among four alternatives. Experiment 1 tested spectrally reduced, noise-vocoded (NV) synthesis, originally developed to simulate input received by human cochlear-implant users. Experiment 2 tested "impossibly unspeechlike" sine-wave (SW) synthesis, which reduces speech to just three moving tones. Although receiving only intermittent and noncontingent reward, the chimpanzee performed well above chance level, including when hearing synthetic versions for the first time. Recognition of SW words was least accurate but improved in experiment 3 when natural words in the same session were rewarded. The chimpanzee was more accurate with NV than SW versions, as were 32 human participants hearing these items. The chimpanzee's ability to spontaneously recognize acoustically reduced synthetic words suggests that experience rather than specialization is critical for speech-perception capabilities that some have suggested are uniquely human. Copyright © 2011 Elsevier Ltd. All rights reserved.
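
    Sine-wave synthesis replaces the speech signal with a few tones that track the formant center frequencies. A minimal sketch, assuming formant and amplitude tracks are already available at a 10-ms frame rate:

      import numpy as np

      def sine_wave_speech(formants_hz, amps, fs=16000, hop=160):
          # formants_hz, amps: arrays of shape (3, n_frames), one row per tone
          n = formants_hz.shape[1] * hop
          frame_pos = np.arange(formants_hz.shape[1]) * hop
          out = np.zeros(n)
          for f_trk, a_trk in zip(formants_hz, amps):
              f = np.interp(np.arange(n), frame_pos, f_trk)  # per-sample frequency
              a = np.interp(np.arange(n), frame_pos, a_trk)
              out += a * np.sin(2 * np.pi * np.cumsum(f) / fs)  # frequency -> phase
          return out / np.max(np.abs(out))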

  13. Speech perception of sine-wave signals by children with cochlear implants

    PubMed Central

    Nittrouer, Susan; Kuess, Jamie; Lowenstein, Joanna H.

    2015-01-01

    Children need to discover linguistically meaningful structures in the acoustic speech signal. Being attentive to recurring, time-varying formant patterns helps in that process. However, that kind of acoustic structure may not be available to children with cochlear implants (CIs), thus hindering development. The major goal of this study was to examine whether children with CIs are as sensitive to time-varying formant structure as children with normal hearing (NH) by asking them to recognize sine-wave speech. The same materials were presented as speech in noise, as well, to evaluate whether any group differences might simply reflect general perceptual deficits on the part of children with CIs. Vocabulary knowledge, phonemic awareness, and “top-down” language effects were all also assessed. Finally, treatment factors were examined as possible predictors of outcomes. Results showed that children with CIs were as accurate as children with NH at recognizing sine-wave speech, but poorer at recognizing speech in noise. Phonemic awareness was related to that recognition. Top-down effects were similar across groups. Having had a period of bimodal stimulation near the time of receiving a first CI facilitated these effects. Results suggest that children with CIs have access to the important time-varying structure of vocal-tract formants. PMID:25994709

  14. Use of Speech Analyses within a Mobile Application for the Assessment of Cognitive Impairment in Elderly People.

    PubMed

    Konig, Alexandra; Satt, Aharon; Sorin, Alex; Hoory, Ran; Derreumaux, Alexandre; David, Renaud; Robert, Phillippe H

    2018-01-01

    Various types of dementia and Mild Cognitive Impairment (MCI) are manifested as irregularities in human speech and language, which have proven to be strong predictors of disease presence and progression. Therefore, automatic speech analytics provided by a mobile application may be a useful tool for providing additional indicators for the assessment and detection of early-stage dementia and MCI. 165 participants (subjects with subjective cognitive impairment (SCI), MCI patients, Alzheimer's disease (AD) and mixed dementia (MD) patients) were recorded with a mobile application while performing several short vocal cognitive tasks during a regular consultation. These tasks included verbal fluency, picture description, counting down, and a free speech task. The voice recordings were processed in two steps: in the first step, vocal markers were extracted using speech signal processing techniques; in the second, the vocal markers were tested to assess their 'power' to distinguish between SCI, MCI, AD and MD. The second step included training automatic classifiers for detecting MCI and AD, based on machine learning methods, and testing the detection accuracy. The fluency and free speech tasks obtained the highest accuracy rates in classifying AD vs. MD vs. MCI vs. SCI. Using the data, we demonstrated classification accuracy as follows: SCI vs. AD = 92% accuracy; SCI vs. MD = 92% accuracy; SCI vs. MCI = 86% accuracy; and MCI vs. AD = 86%. Our results indicate the potential value of vocal analytics and the use of a mobile application for accurate automatic differentiation between SCI, MCI and AD. This tool can provide the clinician with meaningful information for the assessment and monitoring of people with MCI and AD based on a non-invasive, simple and low-cost method. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  15. Speech recognition for embedded automatic positioner for laparoscope

    NASA Astrophysics Data System (ADS)

    Chen, Xiaodong; Yin, Qingyun; Wang, Yi; Yu, Daoyin

    2014-07-01

    In this paper a novel speech recognition methodology based on hidden Markov models (HMMs) is proposed for an embedded Automatic Positioner for Laparoscope (APL), which includes a fixed-point ARM processor as its core. The APL system is designed to assist the doctor in laparoscopic surgery by implementing the doctor's vocal control of the laparoscope. Real-time response to voice commands demands an efficient speech recognition algorithm for the APL. In order to reduce computation cost without significant loss in recognition accuracy, both arithmetic and algorithmic optimizations are applied in the presented method. First, relying mostly on arithmetic optimizations, a fixed-point front end for speech feature analysis is built to suit the characteristics of the ARM processor. Then a fast likelihood computation algorithm is used to reduce the computational complexity of the HMM-based recognition algorithm. The experimental results show that the method keeps the recognition time within 0.5 s while maintaining accuracy above 99%, demonstrating its ability to achieve real-time vocal control of the APL.

  16. Hierarchical singleton-type recurrent neural fuzzy networks for noisy speech recognition.

    PubMed

    Juang, Chia-Feng; Chiou, Chyi-Tian; Lai, Chun-Lung

    2007-05-01

    This paper proposes noisy speech recognition using hierarchical singleton-type recurrent neural fuzzy networks (HSRNFNs). The proposed HSRNFN is a hierarchical connection of two singleton-type recurrent neural fuzzy networks (SRNFNs), where one is used for noise filtering and the other for recognition. The SRNFN is constructed from recurrent fuzzy if-then rules with fuzzy singletons in the consequents, and its recurrent properties make it suitable for processing speech patterns with temporal characteristics. In n-word recognition, n SRNFNs are created to model the n words, where each SRNFN receives the current frame feature and predicts the next frame of the word it models. The prediction error of each SRNFN is used as the recognition criterion. In filtering, one SRNFN is created, and each SRNFN recognizer is connected to the same SRNFN filter, which filters noisy speech patterns in the feature domain before feeding them to the SRNFN recognizer. Experiments on Mandarin word recognition under different types of noise were performed. Other recognizers, including multilayer perceptrons (MLPs), time-delay neural networks (TDNNs), and hidden Markov models (HMMs), were also tested and compared. These experiments and comparisons demonstrate good results with HSRNFN for noisy speech recognition tasks.
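
    The recognition criterion reduces to picking the word model whose frame predictions fit best. A model-agnostic sketch of that decision rule, with an assumed predictor interface rather than actual SRNFNs:

      import numpy as np

      def recognize(frames, predictors):
          # frames: (T, D) feature sequence; predictors: {word: callable}.
          # Each callable maps the frames seen so far to a guess at the next
          # frame; the word with the least accumulated squared error wins.
          errors = {}
          for word, predict in predictors.items():
              err = 0.0
              for t in range(len(frames) - 1):
                  err += np.sum((predict(frames[: t + 1]) - frames[t + 1]) ** 2)
              errors[word] = err
          return min(errors, key=errors.get)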

  17. Analysis of Factors Affecting System Performance in the ASpIRE Challenge

    DTIC Science & Technology

    2015-12-13

    performance in the ASpIRE (Automatic Speech recognition In Reverberant Environments) challenge. In particular, overall word error rate (WER) of the solver...systems is analyzed as a function of room, distance between talker and microphone, and microphone type. We also analyze speech activity detection...analysis will inform the design of future challenges and provide insight into the efficacy of current solutions addressing noisy reverberant speech

  18. Identifying Speech Acts in E-Mails: Toward Automated Scoring of the "TOEIC"® E-Mail Task. Research Report. ETS RR-12-16

    ERIC Educational Resources Information Center

    De Felice, Rachele; Deane, Paul

    2012-01-01

    This study proposes an approach to automatically score the "TOEIC"® Writing e-mail task. We focus on one component of the scoring rubric, which notes whether the test-takers have used particular speech acts such as requests, orders, or commitments. We developed a computational model for automated speech act identification and tested it…

  19. Automated Intelligibility Assessment of Pathological Speech Using Phonological Features

    NASA Astrophysics Data System (ADS)

    Middag, Catherine; Martens, Jean-Pierre; Van Nuffelen, Gwen; De Bodt, Marc

    2009-12-01

    It is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. People have therefore put a lot of effort into the design of perceptual intelligibility rating tests. These tests usually have the drawback that they employ unnatural speech material (e.g., nonsense words) and that they cannot fully exclude errors due to listener bias. Therefore, there is a growing interest in the application of objective automatic speech recognition technology to automate the intelligibility assessment. Current research is headed towards the design of automated methods which can be shown to produce ratings that correspond well with those emerging from a well-designed and well-performed perceptual test. In this paper, a novel methodology that builds on previous work (Middag et al., 2008) is presented. It utilizes phonological features, automatic speech alignment based on acoustic models trained on normal speech, context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system reveals that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.

  20. Tuning time-frequency methods for the detection of metered HF speech

    NASA Astrophysics Data System (ADS)

    Nelson, Douglas J.; Smith, Lawrence H.

    2002-12-01

    Speech is metered if the stresses occur at a nearly regular rate. Metered speech is common in poetry, and it can occur naturally in speech, if the speaker is spelling a word or reciting words or numbers from a list. In radio communications, the CQ request, call sign and other codes are frequently metered. In tactical communications and air traffic control, location, heading and identification codes may be metered. Moreover metering may be expected to survive even in HF communications, which are corrupted by noise, interference and mistuning. For this environment, speech recognition and conventional machine-based methods are not effective. We describe Time-Frequency methods which have been adapted successfully to the problem of mitigation of HF signal conditions and detection of metered speech. These methods are based on modeled time and frequency correlation properties of nearly harmonic functions. We derive these properties and demonstrate a performance gain over conventional correlation and spectral methods. Finally, in addressing the problem of HF single sideband (SSB) communications, the problems of carrier mistuning, interfering signals, such as manual Morse, and fast automatic gain control (AGC) must be addressed. We demonstrate simple methods which may be used to blindly mitigate mistuning and narrowband interference, and effectively invert the fast automatic gain function.

  1. Automatic Speech Recognition Predicts Speech Intelligibility and Comprehension for Listeners With Simulated Age-Related Hearing Loss.

    PubMed

    Fontan, Lionel; Ferrané, Isabelle; Farinas, Jérôme; Pinquier, Julien; Tardieu, Julien; Magnen, Cynthia; Gaillard, Pascal; Aumont, Xavier; Füllgrabe, Christian

    2017-09-18

    The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids. Sixty young participants with normal hearing listened to speech materials mimicking the perceptual consequences of ARHL at different levels of severity. Two intelligibility tests (repetition of words and sentences) and 1 comprehension test (responding to oral commands by moving virtual objects) were administered. Several language models were developed and used by the ASR system in order to fit human performances. Strong significant positive correlations were observed between human and ASR scores, with coefficients up to .99. However, the spectral smearing used to simulate losses in frequency selectivity caused larger declines in ASR performance than in human performance. Both intelligibility and comprehension scores for listeners with simulated ARHL are highly correlated with the performances of an ASR-based system. In the future, it needs to be determined if the ASR system is similarly successful in predicting speech processing in noise and by older people with ARHL.

  2. Is automatic speech-to-text transcription ready for use in psychological experiments?

    PubMed

    Ziman, Kirsten; Heusser, Andrew C; Fitzpatrick, Paxton C; Field, Campbell E; Manning, Jeremy R

    2018-04-23

    Verbal responses are a convenient and naturalistic way for participants to provide data in psychological experiments (Salzinger, The Journal of General Psychology, 61(1), 65-94, 1959). However, audio recordings of verbal responses typically require additional processing, such as transcribing the recordings into text, as compared with other behavioral response modalities (e.g., typed responses, button presses, etc.). Further, the transcription process is often tedious and time-intensive, requiring human listeners to manually examine each moment of recorded speech. Here we evaluate the performance of a state-of-the-art speech recognition algorithm (Halpern et al., 2016) in transcribing audio data into text during a list-learning experiment. We compare transcripts made by human annotators to the computer-generated transcripts. Both sets of transcripts matched to a high degree and exhibited similar statistical properties, in terms of the participants' recall performance and the recall dynamics that the transcripts captured. This proof-of-concept study suggests that speech-to-text engines could provide a cheap, reliable, and rapid means of automatically transcribing speech data in psychological experiments. Further, our findings open the door for verbal response experiments that scale to thousands of participants (e.g., administered online), as well as a new generation of experiments that decode speech on the fly and adapt experimental parameters based on participants' prior responses.

  3. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.

    PubMed

    Agarwalla, Swapna; Sarma, Kandarpa Kumar

    2016-06-01

    Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human Computer Interaction (HCI). With the assimilation of emerging concepts like big data and the Internet of Things (IoT) as extended elements of HCI, ASR techniques are passing through a paradigm shift. Of late, learning-based techniques have started to receive greater attention from research communities related to ASR, owing to the fact that they possess a natural ability to mimic biological behavior and thereby aid ASR modeling and processing. Current learning-based ASR techniques are evolving further with the incorporation of big data and IoT-like concepts. Here, in this paper, we report certain approaches based on machine learning (ML) for the extraction of relevant samples from a big data space, and apply them to ASR using certain soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms, using raw speech, extracted features, and frequency-domain forms, is considered. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, relevant samples are selected and assimilated from a large store. Next, a few conventional methods are used for feature extraction of a few selected types. The features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker, and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and the computation time. It is found that the proposed ML-based sentence extraction techniques and the composite feature set used with an RNN as classifier outperform all other approaches. By using an ANN in FF form as feature extractor, the performance of the system is evaluated and a comparison is made. Experimental results show that the application of big data samples has enhanced the learning of the ASR system. Further, the ANN-based sample and feature extraction techniques are found to be efficient enough to enable the application of ML techniques to big data aspects of ASR systems. Copyright © 2015 Elsevier Ltd. All rights reserved.

  4. "Docs vs. Glocks" and the Regulation of Physicians' Speech.

    PubMed

    Appelbaum, Paul S

    2017-07-01

    Physicians increasingly have recognized the importance of asking patients whether they own firearms and suggesting safe means of storage. Florida's legislature perceived these questions as threats to patients' rights to keep guns and passed a law restricting physicians from making such inquiries. When a number of physicians and their organizations challenged the law in 2011, a six-year odyssey through the courts ensued. In the end, the U.S. 11th Circuit Court of Appeals struck down the statute, recognizing that physicians' free speech rights extend to communications with patients, a decision that may influence other attempts to restrict clinicians' speech.

  5. Speech recognition in one- and two-talker maskers in school-age children and adults: Development of perceptual masking and glimpsing

    PubMed Central

    Buss, Emily; Leibold, Lori J.; Porter, Heather L.; Grose, John H.

    2017-01-01

    Children perform more poorly than adults on a wide range of masked speech perception paradigms, but this effect is particularly pronounced when the masker itself is also composed of speech. The present study evaluated two factors that might contribute to this effect: the ability to perceptually isolate the target from masker speech, and the ability to recognize target speech based on sparse cues (glimpsing). Speech reception thresholds (SRTs) were estimated for closed-set, disyllabic word recognition in children (5–16 years) and adults in a one- or two-talker masker. Speech maskers were 60 dB sound pressure level (SPL), and they were either presented alone or in combination with a 50-dB-SPL speech-shaped noise masker. There was an age effect overall, but performance was adult-like at a younger age for the one-talker than the two-talker masker. Noise tended to elevate SRTs, particularly for older children and adults, and when summed with the one-talker masker. Removing time-frequency epochs associated with a poor target-to-masker ratio markedly improved SRTs, with larger effects for younger listeners; the age effect was not eliminated, however. Results were interpreted as indicating that the development of speech-in-speech recognition is likely impacted by the development of both perceptual masking and the ability to recognize speech based on sparse cues. PMID:28464682
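
    The epoch-removal manipulation is essentially an ideal binary mask computed from the separately known target and masker signals. A sketch with scipy, where the 0-dB criterion is an assumption:

      import numpy as np
      from scipy.signal import stft, istft

      def keep_glimpses(target, masker, fs, tmr_db=0.0):
          # Zero time-frequency bins of the mixture where the
          # target-to-masker ratio falls below tmr_db.
          f, t, T = stft(target, fs=fs)
          _, _, M = stft(masker, fs=fs)
          tmr = 20 * np.log10(np.abs(T) + 1e-12) - 20 * np.log10(np.abs(M) + 1e-12)
          _, y = istft((T + M) * (tmr >= tmr_db), fs=fs)
          return y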

  6. Understanding the abstract role of speech in communication at 12 months.

    PubMed

    Martin, Alia; Onishi, Kristine H; Vouloumanos, Athena

    2012-04-01

    Adult humans recognize that even unfamiliar speech can communicate information between third parties, demonstrating an ability to separate communicative function from linguistic content. We examined whether 12-month-old infants understand that speech can communicate before they understand the meanings of specific words. Specifically, we test the understanding that speech permits the transfer of information about a Communicator's target object to a Recipient. Initially, the Communicator selectively grasped one of two objects. In test, the Communicator could no longer reach the objects. She then turned to the Recipient and produced speech (a nonsense word) or non-speech (coughing). Infants looked longer when the Recipient selected the non-target than the target object when the Communicator had produced speech but not coughing (Experiment 1). Looking time patterns differed from the speech condition when the Recipient rather than the Communicator produced the speech (Experiment 2), and when the Communicator produced a positive emotional vocalization (Experiment 3), but did not differ when the Recipient had previously received information about the target by watching the Communicator's selective grasping (Experiment 4). Thus infants understand the information-transferring properties of speech and recognize some of the conditions under which others' information states can be updated. These results suggest that infants possess an abstract understanding of the communicative function of speech, providing an important potential mechanism for language and knowledge acquisition. Copyright © 2011 Elsevier B.V. All rights reserved.

  7. Parametric Representation of the Speaker's Lips for Multimodal Sign Language and Speech Recognition

    NASA Astrophysics Data System (ADS)

    Ryumin, D.; Karpov, A. A.

    2017-05-01

    In this article, we propose a new method for the parametric representation of the human lip region. The functional diagram of the method is described, and implementation details with an explanation of its key stages and features are given. The results of automatic detection of the regions of interest are illustrated. The processing speed of the method on several computers with different performance levels is reported. This universal method allows applying a parametric representation of the speaker's lips to the tasks of biometrics, computer vision, machine learning, and automatic recognition of faces, elements of sign languages, and audio-visual speech, including lip-reading.

  8. Objective speech quality evaluation of real-time speech coders

    NASA Astrophysics Data System (ADS)

    Viswanathan, V. R.; Russell, W. H.; Huggins, A. W. F.

    1984-02-01

    This report describes the work performed in two areas: subjective testing of a real-time 16 kbit/s adaptive predictive coder (APC) and objective speech quality evaluation of real-time coders. The speech intelligibility of the APC coder was tested using the Diagnostic Rhyme Test (DRT), and the speech quality was tested using the Diagnostic Acceptability Measure (DAM) test, under eight operating conditions involving channel error, acoustic background noise, and tandem link with two other coders. The test results showed that the DRT and DAM scores of the APC coder equalled or exceeded the corresponding test scores of the 32 kbit/s CVSD coder. In the area of objective speech quality evaluation, the report describes the development, testing, and validation of a procedure for automatically computing several objective speech quality measures, given only the tape-recordings of the input speech and the corresponding output speech of a real-time speech coder.

  9. "Rate My Therapist": Automated Detection of Empathy in Drug and Alcohol Counseling via Speech and Language Processing

    PubMed Central

    Xiao, Bo; Imel, Zac E.; Georgiou, Panayiotis G.; Atkins, David C.; Narayanan, Shrikanth S.

    2015-01-01

    The technology for evaluating patient-provider interactions in psychotherapy, observational coding, has not changed in 70 years. It is labor-intensive, error prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks, including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of the computationally derived empathy ratings was evaluated against human ratings for each provider. Computationally derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and an F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracy, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies. PMID:26630392

  10. A Quantitative and Qualitative Evaluation of an Automatic Occlusion Device for Tracheoesophageal Speech: The Provox FreeHands HME

    ERIC Educational Resources Information Center

    Hamade, Rachel; Hewlett, Nigel; Scanlon, Emer

    2006-01-01

    This study aimed to evaluate a new automatic tracheostoma valve: the Provox FreeHands HME (manufactured by Atos Medical AB, Sweden). Data from four laryngectomee participants using automatic and also manual occlusion were subjected to acoustic and perceptual analysis. The main results were a significant decrease, from the manual to automatic…

  11. Thai Automatic Speech Recognition

    DTIC Science & Technology

    2005-01-01

    used in an external DARPA evaluation involving medical scenarios between an American Doctor and a naïve monolingual Thai patient. 2. Thai Language... dictionary generation more challenging, and (3) the lack of word segmentation, which calls for automatic segmentation approaches to make n-gram language...requires a dictionary and provides various segmentation algorithms to automatically select suitable segmentations. Here we used a maximal matching

  12. The Relationship Between Speech Production and Speech Perception Deficits in Parkinson's Disease.

    PubMed

    De Keyser, Kim; Santens, Patrick; Bockstael, Annelies; Botteldooren, Dick; Talsma, Durk; De Vos, Stefanie; Van Cauwenberghe, Mieke; Verheugen, Femke; Corthals, Paul; De Letter, Miet

    2016-10-01

    This study investigated the possible relationship between hypokinetic speech production and speech intensity perception in patients with Parkinson's disease (PD). Participants included 14 patients with idiopathic PD and 14 matched healthy controls (HCs) with normal hearing and cognition. First, speech production was objectified through a standardized speech intelligibility assessment, acoustic analysis, and speech intensity measurements. Second, an overall estimation task and an intensity estimation task were administered to evaluate overall speech perception and speech intensity perception, respectively. Finally, correlation analysis was performed between the speech characteristics of the overall estimation task and the corresponding acoustic analysis. The interaction between speech production and speech intensity perception was investigated by an intensity imitation task. Acoustic analysis and speech intensity measurements demonstrated significant differences in speech production between patients with PD and the HCs. A different pattern in the auditory perception of speech and speech intensity was found in the PD group. Auditory perceptual deficits may influence speech production in patients with PD. The present results suggest a disturbed auditory perception related to an automatic monitoring deficit in PD.

  13. Voice Preprocessor for Digital Voice Applications

    DTIC Science & Technology

    1989-09-11

    …transformers are marketed for use with multitone modems and are acceptable for voice applications. 3. Automatic Gain Control: A…variations of speech spectral tilt to improve the quality of the extracted speech parameters. More importantly, the only analog circuit we use is a

  14. Fifty years of progress in speech and speaker recognition

    NASA Astrophysics Data System (ADS)

    Furui, Sadaoki

    2004-10-01

    Speech and speaker recognition technology has made very significant progress in the past 50 years. The progress can be summarized by the following changes: (1) from template matching to corpus-based statistical modeling, e.g., HMMs and n-grams, (2) from filter bank/spectral resonance to cepstral features (cepstrum + Δcepstrum + ΔΔcepstrum), (3) from heuristic time-normalization to DTW/DP matching, (4) from "distance"-based to likelihood-based methods, (5) from maximum likelihood to discriminative approaches, e.g., MCE/GPD and MMI, (6) from isolated word to continuous speech recognition, (7) from small vocabulary to large vocabulary recognition, (8) from context-independent units to context-dependent units for recognition, (9) from clean speech to noisy/telephone speech recognition, (10) from single speaker to speaker-independent/adaptive recognition, (11) from monologue to dialogue/conversation recognition, (12) from read speech to spontaneous speech recognition, (13) from recognition to understanding, (14) from single-modality (audio signal only) to multimodal (audio/visual) speech recognition, (15) from hardware recognizer to software recognizer, and (16) from no commercial application to many practical commercial applications. Most of these advances have taken place in both the fields of speech recognition and speaker recognition. The majority of technological changes have been directed toward the purpose of increasing the robustness of recognition, including many other important techniques not noted above.

  15. Automatic measurement of prosody in behavioral variant FTD.

    PubMed

    Nevler, Naomi; Ash, Sharon; Jester, Charles; Irwin, David J; Liberman, Mark; Grossman, Murray

    2017-08-15

    To help understand speech changes in behavioral variant frontotemporal dementia (bvFTD), we developed and implemented automatic methods of speech analysis for quantification of prosody, and evaluated clinical and anatomical correlations. We analyzed semi-structured, digitized speech samples from 32 patients with bvFTD (21 male, mean age 63 ± 8.5, mean disease duration 4 ± 3.1 years) and 17 matched healthy controls (HC). We automatically extracted fundamental frequency (f0, the physical property of sound most closely correlating with perceived pitch) and computed pitch range on a logarithmic scale (semitone) that controls for individual and sex differences. We correlated f0 range with neuropsychiatric tests, and related f0 range to gray matter (GM) atrophy using 3T T1 MRI. We found significantly reduced f0 range in patients with bvFTD (mean 4.3 ± 1.8 ST) compared to HC (5.8 ± 2.1 ST; p = 0.03). Regression related reduced f0 range in bvFTD to GM atrophy in bilateral inferior and dorsomedial frontal as well as left anterior cingulate and anterior insular regions. Reduced f0 range reflects impaired prosody in bvFTD. This is associated with neuroanatomic networks implicated in language production and social disorders centered in the frontal lobe. These findings support the feasibility of automated speech analysis in frontotemporal dementia and other disorders. © 2017 American Academy of Neurology.
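
    The semitone-range computation itself is compact: estimate f0 over voiced frames and take the log-scale spread. A sketch using librosa's pYIN tracker, where the 50-400 Hz search range and the percentile limits are assumptions:

      import numpy as np
      import librosa

      def f0_range_semitones(path):
          y, sr = librosa.load(path, sr=None)
          f0, _, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)
          f0 = f0[np.isfinite(f0)]             # keep voiced, tracked frames only
          lo, hi = np.percentile(f0, [5, 95])  # robust range endpoints
          return 12 * np.log2(hi / lo)         # semitones: invariant to baseline f0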

  16. Stylistic gait synthesis based on hidden Markov models

    NASA Astrophysics Data System (ADS)

    Tilmanne, Joëlle; Moinet, Alexis; Dutoit, Thierry

    2012-12-01

    In this work we present an expressive gait synthesis system based on hidden Markov models (HMMs), following and modifying a procedure originally developed for speaking style adaptation, in speech synthesis. A large database of neutral motion capture walk sequences was used to train an HMM of average walk. The model was then used for automatic adaptation to a particular style of walk using only a small amount of training data from the target style. The open source toolkit that we adapted for motion modeling also enabled us to take into account the dynamics of the data and to model accurately the duration of each HMM state. We also address the assessment issue and propose a procedure for qualitative user evaluation of the synthesized sequences. Our tests show that the style of these sequences can easily be recognized and look natural to the evaluators.

  17. Building Searchable Collections of Enterprise Speech Data.

    ERIC Educational Resources Information Center

    Cooper, James W.; Viswanathan, Mahesh; Byron, Donna; Chan, Margaret

    The study has applied speech recognition and text-mining technologies to a set of recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, a number of post-processing…

  18. The Meaning and Use of Folk Speech in Art Criticism.

    ERIC Educational Resources Information Center

    Congdon, Kristin G.

    1986-01-01

    This article investigates the use of folk speech in the art criticism of people who are not art professionals. Maintains that if folk speech is recognized and evaluated in the art classroom, art educators may help expand both the visual and verbal perceptions and expressions of students. (JDH)

  19. Understanding the Abstract Role of Speech in Communication at 12 Months

    ERIC Educational Resources Information Center

    Martin, Alia; Onishi, Kristine H.; Vouloumanos, Athena

    2012-01-01

    Adult humans recognize that even unfamiliar speech can communicate information between third parties, demonstrating an ability to separate communicative function from linguistic content. We examined whether 12-month-old infants understand that speech can communicate before they understand the meanings of specific words. Specifically, we test the…

  20. The Cross-Cultural Study of Language Acquisition.

    ERIC Educational Resources Information Center

    Heath, Shirley Brice

    1985-01-01

    One approach to studying the nature of diverse speech exchange systems across sociocultural groups starts from the premise that all learning is cultural learning, and that language socialization is the way individuals become members of both their primary speech community and their secondary speech communities. Researchers must recognize that the…

  1. How Autism Affects Speech Understanding in Multitalker Environments

    DTIC Science & Technology

    2015-12-01

    Award Number: W81XWH-12-1-0363. Title: How Autism Affects Speech Understanding in Multitalker Environments. Period covered: 30 Sep 2012 - 29 Sep 2015. …that adults with Autism Spectrum Disorders have particular difficulty recognizing speech in acoustically-hostile environments (e.g., Alcantara et al

  2. The effect of language experience on perceptual normalization of Mandarin tones and non-speech pitch contours.

    PubMed

    Luo, Xin; Ashmore, Krista B

    2014-06-01

    Context-dependent pitch perception helps listeners recognize tones produced by speakers with different fundamental frequencies (f0s). The role of language experience in tone normalization remains unclear. In this cross-language study of tone normalization, native Mandarin and English listeners were asked to recognize Mandarin Tone 1 (high-flat) and Tone 2 (mid-rising) with a preceding Mandarin sentence. To further test whether context-dependent pitch perception is speech-specific or domain-general, both language groups were asked to identify non-speech flat and rising pitch contours with a preceding non-speech flat pitch contour. Results showed that both Mandarin and English listeners made more rising responses with non-speech than with speech stimuli, due to differences in spectral complexity and listening task between the two stimulus types. English listeners made more rising responses than Mandarin listeners with both speech and non-speech stimuli. Contrastive context effects (more rising responses in the high-f0 context than in the low-f0 context) were found with both speech and non-speech stimuli for Mandarin listeners, but not for English listeners. English listeners' lack of tone experience may have caused more rising responses and limited use of context f0 cues. These results suggest that context-dependent pitch perception in tone normalization is domain-general, but influenced by long-term language experience.

  3. Automatically Detecting Likely Edits in Clinical Notes Created Using Automatic Speech Recognition

    PubMed Central

    Lybarger, Kevin; Ostendorf, Mari; Yetisgen, Meliha

    2017-01-01

    The use of automatic speech recognition (ASR) to create clinical notes has the potential to reduce costs associated with note creation for electronic medical records, but at current system accuracy levels, post-editing by practitioners is needed to ensure note quality. Aiming to reduce the time required to edit ASR transcripts, this paper investigates novel methods for automatic detection of edit regions within the transcripts, including not only putative ASR errors but also regions that are targets for cleanup or rephrasing. We create detection models using logistic regression and conditional random field models, exploring a variety of text-based features that consider the structure of clinical notes and exploit the medical context. Different medical text resources are used to improve feature extraction. Experimental results on a large corpus of practitioner-edited clinical notes show that 67% of sentence-level edits and 45% of word-level edits can be detected with a false detection rate of 15%. PMID:29854187
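
    A minimal sketch of the word-level detection idea, using scikit-learn's logistic regression over simple text features; the feature set and toy labels below are illustrative stand-ins, not the authors' actual features or data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def word_features(words, i):
    """Context features for the i-th word of an ASR transcript."""
    return {
        "word": words[i].lower(),
        "prev": words[i - 1].lower() if i > 0 else "<s>",
        "next": words[i + 1].lower() if i < len(words) - 1 else "</s>",
        "is_digit": words[i].isdigit(),
        "length": len(words[i]),
    }

# Toy training data: transcript words with word-level edit labels
# (1 = word fell inside a practitioner-edited region).
transcripts = [
    ("patient denies chest pain".split(), [0, 0, 0, 0]),
    ("start coumadin ten mg mg".split(), [0, 0, 0, 1, 1]),
]

X = [word_features(w, i) for w, _ in transcripts for i in range(len(w))]
y = [lab for _, labs in transcripts for lab in labs]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Flag likely edit regions in a new transcript:
new = "stop coumadin mg mg".split()
print(model.predict([word_features(new, i) for i in range(len(new))]))
```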

  4. Factors influencing relative speech intelligibility in patients with oral squamous cell carcinoma: a prospective study using automatic, computer-based speech analysis.

    PubMed

    Stelzle, F; Knipfer, C; Schuster, M; Bocklet, T; Nöth, E; Adler, W; Schempf, L; Vieler, P; Riemann, M; Neukam, F W; Nkenke, E

    2013-11-01

    Oral squamous cell carcinoma (OSCC) and its treatment impair speech intelligibility by alteration of the vocal tract. The aim of this study was to identify the factors of oral cancer treatment that influence speech intelligibility by means of an automatic, standardized speech-recognition system. The study group comprised 71 patients (mean age 59.89, range 35-82 years) with OSCC ranging from stage T1 to T4 (TNM staging). Tumours were located on the tongue (n=23), lower alveolar crest (n=27), and floor of the mouth (n=21). Reconstruction was conducted through local tissue plasty or microvascular transplants. Adjuvant radiotherapy was performed in 49 patients. Speech intelligibility was evaluated before, and at 3, 6, and 12 months after tumour resection, and compared to that of a healthy control group (n=40). Postoperatively, significant influences on speech intelligibility were tumour localization (P=0.010) and resection volume (P=0.019). Additionally, adjuvant radiotherapy (P=0.049) influenced intelligibility at 3 months after surgery. At 6 months after surgery, influences were resection volume (P=0.028) and adjuvant radiotherapy (P=0.034). The influence of tumour localization (P=0.001) and adjuvant radiotherapy (P=0.022) persisted after 12 months. Tumour localization, resection volume, and radiotherapy are crucial factors for speech intelligibility. Radiotherapy significantly impaired word recognition rate (WR) values with a progression of the impairment for up to 12 months after surgery. Copyright © 2013 International Association of Oral and Maxillofacial Surgeons. Published by Elsevier Ltd. All rights reserved.

  5. Slowed Speech Input has a Differential Impact on On-line and Off-line Processing in Children’s Comprehension of Pronouns

    PubMed Central

    Walenski, Matthew; Swinney, David

    2009-01-01

    The central question underlying this study revolves around how children process co-reference relationships—such as those evidenced by pronouns (him) and reflexives (himself)—and how a slowed rate of speech input may critically affect this process. Previous studies of child language processing have demonstrated that typical language developing (TLD) children as young as 4 years of age process co-reference relations in a manner similar to adults on-line. In contrast, off-line measures of pronoun comprehension suggest a developmental delay for pronouns (relative to reflexives). The present study examines dependency relations in TLD children (ages 5–13) and investigates how a slowed rate of speech input affects the unconscious (on-line) and conscious (off-line) parsing of these constructions. For the on-line investigations (using a cross-modal picture priming paradigm), results indicate that at a normal rate of speech TLD children demonstrate adult-like syntactic reflexes. At a slowed rate of speech the typical language developing children displayed a breakdown in automatic syntactic parsing (again, similar to the pattern seen in unimpaired adults). As demonstrated in the literature, our off-line investigations (sentence/picture matching task) revealed that these children performed much better on reflexives than on pronouns at a regular speech rate. However, at the slow speech rate, performance on pronouns was substantially improved, whereas performance on reflexives was not different than at the regular speech rate. We interpret these results in light of a distinction between fast automatic processes (relied upon for on-line processing in real time) and conscious reflective processes (relied upon for off-line processing), such that slowed speech input disrupts the former, yet improves the latter. PMID:19343495

  6. Automatic analysis of slips of the tongue: Insights into the cognitive architecture of speech production.

    PubMed

    Goldrick, Matthew; Keshet, Joseph; Gustafson, Erin; Heller, Jordana; Needle, Jeremy

    2016-04-01

    Traces of the cognitive mechanisms underlying speaking can be found within subtle variations in how we pronounce sounds. While speech errors have traditionally been seen as categorical substitutions of one sound for another, acoustic/articulatory analyses show they partially reflect the intended sound. When "pig" is mispronounced as "big," the resulting /b/ sound differs from correct productions of "big," moving towards intended "pig"-revealing the role of graded sound representations in speech production. Investigating the origins of such phenomena requires detailed estimation of speech sound distributions; this has been hampered by reliance on subjective, labor-intensive manual annotation. Computational methods can address these issues by providing for objective, automatic measurements. We develop a novel high-precision computational approach, based on a set of machine learning algorithms, for measurement of elicited speech. The algorithms are trained on existing manually labeled data to detect and locate linguistically relevant acoustic properties with high accuracy. Our approach is robust, is designed to handle mis-productions, and overall matches the performance of expert coders. It allows us to analyze a very large dataset of speech errors (containing far more errors than the total in the existing literature), illuminating properties of speech sound distributions previously impossible to reliably observe. We argue that this provides novel evidence that two sources both contribute to deviations in speech errors: planning processes specifying the targets of articulation and articulatory processes specifying the motor movements that execute this plan. These findings illustrate how a much richer picture of speech provides an opportunity to gain novel insights into language processing. Copyright © 2016 Elsevier B.V. All rights reserved.
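
    As one illustration of the kind of analysis such automatic measurement enables, the sketch below tests whether hypothetical voice-onset-time (VOT) measurements of /p/ to /b/ error tokens are shifted toward the intended /p/, relative to correct /b/ productions; all numbers are invented placeholders.

```python
import numpy as np
from scipy import stats

# Hypothetical voice-onset times (ms) as produced by automatic measurement:
# correct /b/ productions vs. /p/ -> /b/ error tokens.
correct_b = np.array([12.0, 15.0, 10.0, 14.0, 13.0, 11.0, 16.0, 12.0])
error_b = np.array([18.0, 22.0, 19.0, 25.0, 21.0, 20.0, 24.0, 23.0])

# One-sided test of the "traces of the intended sound" claim: error tokens
# should show longer, more /p/-like VOT than correct /b/ productions.
t, p = stats.ttest_ind(error_b, correct_b, alternative="greater")
print(f"t = {t:.2f}, one-sided p = {p:.4f}")
```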

  7. Automatic intelligibility classification of sentence-level pathological speech

    PubMed Central

    Kim, Jangwon; Kumar, Naveen; Tsiartas, Andreas; Li, Ming; Narayanan, Shrikanth S.

    2014-01-01

    Pathological speech usually refers to the condition of speech distortion resulting from atypicalities in voice and/or in the articulatory mechanisms owing to disease, illness or other physical or biological insult to the production system. Although automatic evaluation of speech intelligibility and quality could come in handy in these scenarios to assist experts in diagnosis and treatment design, the many sources and types of variability often make it a very challenging computational processing problem. In this work we propose novel sentence-level features to capture abnormal variation in the prosodic, voice quality and pronunciation aspects in pathological speech. In addition, we propose a post-classification posterior smoothing scheme which refines the posterior of a test sample based on the posteriors of other test samples. Finally, we perform feature-level fusions and subsystem decision fusion for arriving at a final intelligibility decision. The performances are tested on two pathological speech datasets, the NKI CCRT Speech Corpus (advanced head and neck cancer) and the TORGO database (cerebral palsy or amyotrophic lateral sclerosis), by evaluating classification accuracy without overlapping subjects’ data among training and test partitions. Results show that the feature sets of each of the voice quality subsystem, prosodic subsystem, and pronunciation subsystem, offer significant discriminating power for binary intelligibility classification. We observe that the proposed posterior smoothing in the acoustic space can further reduce classification errors. The smoothed posterior score fusion of subsystems shows the best classification performance (73.5% for unweighted, and 72.8% for weighted, average recalls of the binary classes). PMID:25414544
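
    The posterior smoothing step can be pictured as follows: a numpy sketch that refines each test sample's posterior by averaging it with the posteriors of its nearest neighbours in acoustic feature space. The neighbour count, mixing weight, and Euclidean metric are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

def smooth_posteriors(feats, posteriors, k=5, alpha=0.5):
    """Refine each test sample's posterior by averaging it with the mean
    posterior of its k nearest neighbours in acoustic feature space.
    alpha and the Euclidean metric are illustrative choices."""
    feats = np.asarray(feats, dtype=float)
    posteriors = np.asarray(posteriors, dtype=float)
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)              # exclude self-neighbours
    idx = np.argsort(dists, axis=1)[:, :k]       # k nearest neighbours
    return alpha * posteriors + (1 - alpha) * posteriors[idx].mean(axis=1)

# e.g. "intelligible" posteriors for 8 test utterances and their features:
rng = np.random.default_rng(1)
feats = rng.normal(size=(8, 12))
post = rng.uniform(size=8)
print(smooth_posteriors(feats, post, k=3))
```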

  8. iFER: facial expression recognition using automatically selected geometric eye and eyebrow features

    NASA Astrophysics Data System (ADS)

    Oztel, Ismail; Yolcu, Gozde; Oz, Cemil; Kazan, Serap; Bunyak, Filiz

    2018-03-01

    Facial expressions have an important role in interpersonal communications and estimation of emotional states or intentions. Automatic recognition of facial expressions has led to many practical applications and became one of the important topics in computer vision. We present a facial expression recognition system that relies on geometry-based features extracted from eye and eyebrow regions of the face. The proposed system detects keypoints on frontal face images and forms a feature set using geometric relationships among groups of detected keypoints. The obtained feature set is refined and reduced using the sequential forward selection (SFS) algorithm and fed to a support vector machine classifier to recognize five facial expression classes. The proposed system, iFER (eye-eyebrow only facial expression recognition), is robust to lower face occlusions that may be caused by beards, mustaches, scarves, etc. and to lower face motion during speech production. Preliminary experiments on benchmark datasets produced promising results, outperforming previous facial expression recognition studies using partial face features and achieving results comparable to studies using whole face information, only ~2.5% lower than the best whole-face system while using only ~1/3 of the facial region.
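
    A rough sketch of the SFS-plus-SVM pipeline using scikit-learn's SequentialFeatureSelector; the random arrays below stand in for the geometric eye/eyebrow features and the five expression labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector

# Stand-ins for geometric eye/eyebrow features (rows: face images).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))          # 20 candidate geometric features
y = rng.integers(0, 5, size=100)        # five expression classes

# Forward SFS greedily grows the feature subset that maximizes
# cross-validated SVM accuracy; the SVM is then trained on that subset.
svm = SVC(kernel="rbf")
sfs = SequentialFeatureSelector(svm, n_features_to_select=8,
                                direction="forward", cv=5)
sfs.fit(X, y)
svm.fit(sfs.transform(X), y)
print("selected feature indices:", np.flatnonzero(sfs.get_support()))
```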

  9. Transcribe Your Class: Using Speech Recognition to Improve Access for At-Risk Students

    ERIC Educational Resources Information Center

    Bain, Keith; Lund-Lucas, Eunice; Stevens, Janice

    2012-01-01

    Through a project supported by Canada's Social Development Partnerships Program, a team of leading National Disability Organizations, universities, and industry partners are piloting a prototype Hosted Transcription Service that uses speech recognition to automatically create multimedia transcripts that can be used by students for study purposes.…

  10. The Promise of NLP and Speech Processing Technologies in Language Assessment

    ERIC Educational Resources Information Center

    Chapelle, Carol A.; Chung, Yoo-Ree

    2010-01-01

    Advances in natural language processing (NLP) and automatic speech recognition and processing technologies offer new opportunities for language testing. Despite their potential uses on a range of language test item types, relatively little work has been done in this area, and it is therefore not well understood by test developers, researchers or…

  11. Validation of Automated Scoring of Oral Reading

    ERIC Educational Resources Information Center

    Balogh, Jennifer; Bernstein, Jared; Cheng, Jian; Van Moere, Alistair; Townshend, Brent; Suzuki, Masanori

    2012-01-01

    A two-part experiment is presented that validates a new measurement tool for scoring oral reading ability. Data collected by the U.S. government in a large-scale literacy assessment of adults were analyzed by a system called VersaReader that uses automatic speech recognition and speech processing technologies to score oral reading fluency. In the…

  12. Voice Interactive Analysis System Study. Final Report, August 28, 1978 through March 23, 1979.

    ERIC Educational Resources Information Center

    Harry, D. P.; And Others

    The Voice Interactive Analysis System study continued research and development of the LISTEN real-time, minicomputer based connected speech recognition system, within NAVTRAEQUIPCEN'S program of developing automatic speech technology in support of training. An attempt was made to identify the most effective features detected by the TTI-500 model…

  13. Visual-auditory integration during speech imitation in autism.

    PubMed

    Williams, Justin H G; Massaro, Dominic W; Peel, Natalie J; Bosseler, Alexis; Suddendorf, Thomas

    2004-01-01

    Children with autistic spectrum disorder (ASD) may have poor audio-visual integration, possibly reflecting dysfunctional 'mirror neuron' systems which have been hypothesised to be at the core of the condition. In the present study, a computer program, utilizing speech synthesizer software and a 'virtual' head (Baldi), delivered speech stimuli for identification in auditory, visual or bimodal conditions. Children with ASD were poorer than controls at recognizing stimuli in the unimodal conditions, but once performance on this measure was controlled for, no group difference was found in the bimodal condition. A group of participants with ASD were also trained to develop their speech-reading ability. Training improved visual accuracy and this also improved the children's ability to utilize visual information in their processing of speech. Overall results were compared to predictions from mathematical models based on integration and non-integration, and were most consistent with the integration model. We conclude that, whilst they are less accurate in recognizing stimuli in the unimodal condition, children with ASD show normal integration of visual and auditory speech stimuli. Given that training in recognition of visual speech was effective, children with ASD may benefit from multi-modal approaches in imitative therapy and language training.

  14. Audio-video feature correlation: faces and speech

    NASA Astrophysics Data System (ADS)

    Durand, Gwenael; Montacie, Claude; Caraty, Marie-Jose; Faudemay, Pascal

    1999-08-01

    This paper presents a study of the correlation of features automatically extracted from the audio stream and the video stream of audiovisual documents. In particular, we were interested in finding out whether speech analysis tools could be combined with face detection methods, and to what extent they should be combined. A generic audio signal partitioning algorithm was first used to detect Silence/Noise/Music/Speech segments in a full-length movie. A generic object detection method was applied to the keyframes extracted from the movie in order to detect the presence or absence of faces. The correlation between the presence of a face in the keyframes and of the corresponding voice in the audio stream was studied. A third stream, the script of the movie, is warped onto the speech channel in order to automatically label faces appearing in the keyframes with the name of the corresponding character. We found that the extracted audio and video features were indeed related in many cases, and that significant benefits can be obtained from the joint use of audio and video analysis methods.
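
    The face/voice correlation can be approximated with two binary indicator streams, one value per keyframe; the sketch below computes their Pearson correlation (the phi coefficient) on invented data.

```python
import numpy as np

# Hypothetical per-keyframe indicators from the two analysis streams:
# face[i] = 1 if a face was detected in keyframe i; speech[i] = 1 if the
# audio around keyframe i was labelled Speech (vs. silence/noise/music).
face = np.array([1, 1, 0, 0, 1, 1, 1, 0, 1, 0])
speech = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 0])

# Pearson correlation of two binary streams is the phi coefficient.
phi = np.corrcoef(face, speech)[0, 1]
print(f"audio-video correlation: {phi:.2f}")
```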

  15. Articulatory Mediation of Speech Perception: A Causal Analysis of Multi-Modal Imaging Data

    ERIC Educational Resources Information Center

    Gow, David W., Jr.; Segawa, Jennifer A.

    2009-01-01

    The inherent confound between the organization of articulation and the acoustic-phonetic structure of the speech signal makes it exceptionally difficult to evaluate the competing claims of motor and acoustic-phonetic accounts of how listeners recognize coarticulated speech. Here we use Granger causation analysis of high spatiotemporal resolution…

  16. Automatic Speech Recognition in Air Traffic Control: a Human Factors Perspective

    NASA Technical Reports Server (NTRS)

    Karlsson, Joakim

    1990-01-01

    The introduction of Automatic Speech Recognition (ASR) technology into the Air Traffic Control (ATC) system has the potential to improve overall safety and efficiency. However, because ASR technology is inherently a part of the man-machine interface between the user and the system, the human factors issues involved must be addressed. Here, some of the human factors problems are identified and related methods of investigation are presented. Research at M.I.T.'s Flight Transportation Laboratory is being conducted from a human factors perspective, focusing on intelligent parser design, presentation of feedback, error correction strategy design, and optimal choice of input modalities.

  17. Multimodal Translation System Using Texture-Mapped Lip-Sync Images for Video Mail and Automatic Dubbing Applications

    NASA Astrophysics Data System (ADS)

    Morishima, Shigeo; Nakamura, Satoshi

    2004-12-01

    We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. This system also introduces both a face synthesis technique that can generate any viseme lip shape and a face tracking technique that can estimate the original position and rotation of a speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the speech organ's image with the synthesized one, which is made by a 3D wire-frame model that is adaptable to any speaker. Our approach provides translated image synthesis with an extremely small database. The tracking motion of the face from a video image is performed by template matching. In this system, the translation and rotation of the face are detected by using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model by using our GUI tool. By combining these techniques and the translated voice synthesis technique, an automatic multimodal translation can be achieved that is suitable for video mail or automatic dubbing systems into other languages.

  18. Noise Suppression Based on Multi-Model Compositions Using Multi-Pass Search with Multi-Label N-gram Models

    NASA Astrophysics Data System (ADS)

    Jitsuhiro, Takatoshi; Toriyama, Tomoji; Kogure, Kiyoshi

    We propose a noise suppression method based on multi-model compositions and multi-pass search. In real environments, the input speech for speech recognition includes many kinds of noise signals. To obtain good recognition candidates, it is important to suppress the many kinds of noise signals at once and to find the target speech. Before noise suppression, to find speech and noise label sequences, we introduce multi-pass search with acoustic models that include many kinds of noise models and their compositions, their n-gram models, and their lexicon. Noise suppression is performed frame-synchronously using the multiple models selected by the recognized label sequences with time alignments. We evaluated this method using the E-Nightingale task, which contains voice memoranda spoken by nurses during actual work at hospitals. The proposed method obtained higher performance than the conventional method.

  19. Learner Attention to Form in ACCESS Task-Based Interaction

    ERIC Educational Resources Information Center

    Dao, Phung; Iwashita, Noriko; Gatbonton, Elizabeth

    2017-01-01

    This study explored the potential effects of communicative tasks developed using a reformulation of a task-based language teaching called Automatization in Communicative Contexts of Essential Speech Sequences (ACCESS) that includes automatization of language elements as one of its goals on learner attention to form in task-based interaction. The…

  20. Auditory-Motor Processing of Speech Sounds

    PubMed Central

    Möttönen, Riikka; Dutton, Rebekah; Watkins, Kate E.

    2013-01-01

    The motor regions that control movements of the articulators activate during listening to speech and contribute to performance in demanding speech recognition and discrimination tasks. Whether the articulatory motor cortex modulates auditory processing of speech sounds is unknown. Here, we aimed to determine whether the articulatory motor cortex affects the auditory mechanisms underlying discrimination of speech sounds in the absence of demanding speech tasks. Using electroencephalography, we recorded responses to changes in sound sequences, while participants watched a silent video. We also disrupted the lip or the hand representation in left motor cortex using transcranial magnetic stimulation. Disruption of the lip representation suppressed responses to changes in speech sounds, but not piano tones. In contrast, disruption of the hand representation had no effect on responses to changes in speech sounds. These findings show that disruptions within, but not outside, the articulatory motor cortex impair automatic auditory discrimination of speech sounds. The findings provide evidence for the importance of auditory-motor processes in efficient neural analysis of speech sounds. PMID:22581846

  1. Alternative Speech Communication System for Persons with Severe Speech Disorders

    NASA Astrophysics Data System (ADS)

    Selouani, Sid-Ahmed; Sidi Yakoub, Mohammed; O'Shaughnessy, Douglas

    2009-12-01

    Assistive speech-enabled systems are proposed to help both French- and English-speaking persons with various speech disorders. The proposed assistive systems use automatic speech recognition (ASR) and speech synthesis in order to enhance the quality of communication. These systems aim at improving the intelligibility of pathologic speech, making it as natural as possible and close to the original voice of the speaker. The resynthesized utterances use new basic units, a new concatenating algorithm, and a grafting technique to correct the poorly pronounced phonemes. The ASR responses are uttered by the new speech synthesis system in order to convey an intelligible message to listeners. Experiments involving four American speakers with severe dysarthria and two Acadian French speakers with sound substitution disorders (SSDs) are carried out to demonstrate the efficiency of the proposed methods. Improvements in the Perceptual Evaluation of Speech Quality (PESQ) value of 5% and of more than 20% are achieved by the speech synthesis systems that deal with SSD and dysarthria, respectively.
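
    For reference, a PESQ score of the kind reported above can be computed with the open-source pesq package (an assumption; the authors' exact tooling is not stated), given a reference recording and a resynthesized utterance:

```python
import soundfile as sf
from pesq import pesq

# Hypothetical file names; both signals must share the sampling rate,
# and mode "wb" (wideband PESQ) requires 16 kHz audio.
ref, fs = sf.read("original_voice.wav")
deg, _ = sf.read("resynthesized_voice.wav")

score = pesq(fs, ref, deg, "wb")
print(f"PESQ: {score:.2f}")
```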

  2. Evaluation of synthesized voice approach callouts /SYNCALL/

    NASA Technical Reports Server (NTRS)

    Simpson, C. A.

    1981-01-01

    The two basic approaches to the generation of 'synthesized' speech are the use of analog recordings of human speech and the construction of speech entirely from algorithms applied to constants describing speech sounds. Given the availability of synthesized speech displays for man-machine systems, research is needed to study suggested applications for speech and design principles for speech displays. The present investigation is concerned with a study for which new performance measures were developed. A number of air carrier approach and landing accidents during low or impaired visibility have been associated with the absence of approach callouts. The purpose of the study was to compare a pilot-not-flying (PNF) approach callout system to a system composed of PNF callouts augmented by an automatic synthesized voice callout system (SYNCALL). Pilots were found to favor the use of a SYNCALL system containing certain modifications.

  3. Quantitative acoustic measurements for characterization of speech and voice disorders in early untreated Parkinson's disease.

    PubMed

    Rusz, J; Cmejla, R; Ruzickova, H; Ruzicka, E

    2011-01-01

    An assessment of vocal impairment is presented for separating healthy people from persons with early untreated Parkinson's disease (PD). This study's main purpose was to (a) determine whether voice and speech disorder are present from early stages of PD before starting dopaminergic pharmacotherapy, (b) ascertain the specific characteristics of the PD-related vocal impairment, (c) identify PD-related acoustic signatures for the major part of traditional clinically used measurement methods with respect to their automatic assessment, and (d) design new automatic measurement methods of articulation. The varied speech data were collected from 46 Czech native speakers, 23 with PD. Subsequently, 19 representative measurements were pre-selected, and Wald sequential analysis was then applied to assess the efficiency of each measure and the extent of vocal impairment of each subject. It was found that measurement of the fundamental frequency variations applied to two selected tasks was the best method for separating healthy from PD subjects. On the basis of objective acoustic measures, statistical decision-making theory, and validation from practicing speech therapists, it has been demonstrated that 78% of early untreated PD subjects indicate some form of vocal impairment. The speech defects thus uncovered differ individually in various characteristics including phonation, articulation, and prosody.
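
    A minimal sketch of one such automatic measure, fundamental frequency variability, using librosa's pYIN tracker on a hypothetical recording; the semitone standard deviation used here is one common variability statistic, not necessarily the study's exact definition.

```python
import numpy as np
import librosa

# Hypothetical recording of one speech task from one subject.
y, sr = librosa.load("speech_task.wav", sr=None)

# pYIN returns a per-frame f0 track with NaN in unvoiced frames.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
f0 = f0[~np.isnan(f0)]

# One common variability statistic: the standard deviation of f0
# expressed in semitones around the speaker's mean.
semitones = 12 * np.log2(f0 / f0.mean())
print(f"f0 variability: {semitones.std():.2f} semitones")
```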

  4. Automatic processing of tones and speech stimuli in children with specific language impairment.

    PubMed

    Uwer, Ruth; Albrecht, Ronald; von Suchodoletz, W

    2002-08-01

    It is well known from behavioural experiments that children with specific language impairment (SLI) have difficulties discriminating consonant-vowel (CV) syllables such as /ba/, /da/, and /ga/. Mismatch negativity (MMN) is an auditory event-related potential component that represents the outcome of an automatic comparison process. It could, therefore, be a promising tool for assessing central auditory processing deficits for speech and non-speech stimuli in children with SLI. MMN is typically evoked by occasionally occurring 'deviant' stimuli in a sequence of identical 'standard' sounds. In this study MMN was elicited using simple tone stimuli, which differed in frequency (1000 versus 1200 Hz) and duration (175 versus 100 ms) and to digitized CV syllables which differed in place of articulation (/ba/, /da/, and /ga/) in children with expressive and receptive SLI and healthy control children (n=21 in each group, 46 males and 17 females; age range 5 to 10 years). Mean MMN amplitudes between groups were compared. Additionally, the behavioural discrimination performance was assessed. Children with SLI had attenuated MMN amplitudes to speech stimuli, but there was no significant difference between the two diagnostic subgroups. MMN to tone stimuli did not differ between the groups. Children with SLI made more errors in the discrimination task, but discrimination scores did not correlate with MMN amplitudes. The present data suggest that children with SLI show a specific deficit in automatic discrimination of CV syllables differing in place of articulation, whereas the processing of simple tone differences seems to be unimpaired.
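
    MMN itself is straightforward to compute once the EEG is epoched: it is the deviant-minus-standard difference wave, often quantified as the mean amplitude in a post-stimulus window. A sketch on stand-in data:

```python
import numpy as np

# Stand-ins for epoched EEG at one electrode (e.g., Fz): shape
# (n_trials, n_samples), sampled at 250 Hz, epochs starting at stimulus onset.
fs = 250
standard_epochs = np.random.randn(200, 175)
deviant_epochs = np.random.randn(60, 175)

# MMN is the deviant-minus-standard difference wave; its amplitude is
# often quantified as the mean voltage in a 100-250 ms post-onset window.
diff_wave = deviant_epochs.mean(axis=0) - standard_epochs.mean(axis=0)
window = slice(int(0.100 * fs), int(0.250 * fs))
print(f"MMN amplitude: {diff_wave[window].mean():.3f} (arbitrary units)")
```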

  5. Relationship between listeners' nonnative speech recognition and categorization abilities

    PubMed Central

    Atagi, Eriko; Bent, Tessa

    2015-01-01

    Enhancement of the perceptual encoding of talker characteristics (indexical information) in speech can facilitate listeners' recognition of linguistic content. The present study explored this indexical-linguistic relationship in nonnative speech processing by examining listeners' performance on two tasks: nonnative accent categorization and nonnative speech-in-noise recognition. Results indicated substantial variability across listeners in their performance on both the accent categorization and nonnative speech recognition tasks. Moreover, listeners' accent categorization performance correlated with their nonnative speech-in-noise recognition performance. These results suggest that having more robust indexical representations for nonnative accents may allow listeners to more accurately recognize the linguistic content of nonnative speech. PMID:25618098

  6. Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training

    NASA Technical Reports Server (NTRS)

    Olorenshaw, Lex; Trawick, David

    1991-01-01

    The purpose was to develop a speech recognition system able to detect speech that is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Better mechanisms are thereby provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was provided. In continuous speech, the system was able to provide above 80 pct. correct acceptance of words, while correctly rejecting over 80 pct. of incorrectly pronounced words.
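
    The accept/reject decision can be sketched as a duration- and background-normalized log-likelihood threshold; the normalization scheme and the numbers below are illustrative, not the report's actual scoring method.

```python
def accept_word(word_loglik, background_loglik, n_frames, threshold=-0.5):
    """Accept a word if its forced-alignment log-likelihood, normalized by
    a background (free phone loop) score and by duration, clears a
    threshold. Normalization and threshold are illustrative only."""
    normalized = (word_loglik - background_loglik) / n_frames
    return normalized >= threshold

# e.g. a word aligned over 42 frames:
print(accept_word(word_loglik=-310.0, background_loglik=-295.0, n_frames=42))
```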

  7. Preliminary Analysis of Automatic Speech Recognition and Synthesis Technology.

    DTIC Science & Technology

    1983-05-01

    …INDUSTRIAL/MILITARY SPEECH SYNTHESIS PRODUCTS…The SC-01 Speech Synthesizer contains 64 different phonemes which are accessed by a 6-bit code; the proper sequential combinations of those…connected speech input with widely differing emotional states, diverse accents, and substantial nonperiodic background noise input. As noted previously…

  8. Start-Up Rhetoric in Eight Speeches of Barack Obama

    ERIC Educational Resources Information Center

    O'Connell, Daniel C.; Kowal, Sabine; Sabin, Edward J.; Lamia, John F.; Dannevik, Margaret

    2010-01-01

    Our purpose in the following was to investigate the start-up rhetoric employed by U.S. President Barack Obama in his speeches. The initial 5 min from eight of his speeches from May to September of 2009 were selected for their variety of setting, audience, theme, and purpose. It was generally hypothesized that Barack Obama, widely recognized for…

  9. Listening to Accented Speech in a Second Language: First Language and Age of Acquisition Effects

    ERIC Educational Resources Information Center

    Larraza, Saioa; Samuel, Arthur G.; Oñederra, Miren Lourdes

    2016-01-01

    Bilingual speakers must acquire the phonemic inventory of 2 languages and need to recognize spoken words cross-linguistically; a demanding job potentially made even more difficult due to dialectal variation, an intrinsic property of speech. The present work examines how bilinguals perceive second language (L2) accented speech and where…

  10. Clinical Use of AEVP- and AERP-Measures in Childhood Speech Disorders

    ERIC Educational Resources Information Center

    Maassen, Ben; Pasman, Jaco; Nijland, Lian; Rotteveel, Jan

    2006-01-01

    It has long been recognized that from the first months of life auditory perception plays a crucial role in speech and language development. Only in recent years, however, is the precise mechanism of auditory development and its interaction with the acquisition of speech and language beginning to be systematically revealed. This paper presents the…

  11. Teaching Evidence-Based Practice (EBP) to Speech-Language Therapy Students: Are Students Competent and Confident EBP Users?

    ERIC Educational Resources Information Center

    Spek, B.; Wieringa-de Waard, M.; Lucas, C.; van Dijk, N.

    2013-01-01

    Background: The importance and value of the principles of evidence-based practice (EBP) in the decision-making process is recognized by speech-language therapists (SLTs) worldwide and as a result curricula for speech-language therapy students incorporated EBP principles. However, the willingness actually to use EBP principles in their future…

  12. Measuring Prevalence of Other-Oriented Transactive Contributions Using an Automated Measure of Speech Style Accommodation

    ERIC Educational Resources Information Center

    Gweon, Gahgene; Jain, Mahaveer; McDonough, John; Raj, Bhiksha; Rose, Carolyn P.

    2013-01-01

    This paper contributes to a theory-grounded methodological foundation for automatic collaborative learning process analysis. It does this by illustrating how insights from the social psychology and sociolinguistics of speech style provide a theoretical framework to inform the design of a computational model. The purpose of that model is to detect…

  13. Detecting and Understanding the Impact of Cognitive and Interpersonal Conflict in Computer Supported Collaborative Learning Environments

    ERIC Educational Resources Information Center

    Prata, David Nadler; Baker, Ryan S. J. d.; Costa, Evandro d. B.; Rose, Carolyn P.; Cui, Yue; de Carvalho, Adriana M. J. B.

    2009-01-01

    This paper presents a model which can automatically detect a variety of student speech acts as students collaborate within a computer supported collaborative learning environment. In addition, an analysis is presented which gives substantial insight as to how students' learning is associated with students' speech acts, knowledge that will…

  14. Evaluation of an Enhanced Stimulus-Stimulus Pairing Procedure to Increase Early Vocalizations of Children with Autism

    ERIC Educational Resources Information Center

    Esch, Barbara E.; Carr, James E.; Grow, Laura L.

    2009-01-01

    Evidence to support stimulus-stimulus pairing (SSP) in speech acquisition is less than robust, calling into question the ability of SSP to reliably establish automatically reinforcing properties of speech and limiting the procedure's clinical utility for increasing vocalizations. We evaluated the effects of a modified SSP procedure on…

  15. Yaounde French Speech Corpus

    DTIC Science & Technology

    2017-03-01

    …the Center for Technology Enhanced Language Learning (CTELL), a research cell in the Department of Foreign Languages, United States Military Academy…models for automatic speech recognition (ASR), and to thereby investigate the utility of ASR in pedagogical technology. The corpus is a sample of…lexical resources, language technology…

  16. User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning

    ERIC Educational Resources Information Center

    Ahn, Tae youn; Lee, Sangmin-Michelle

    2016-01-01

    With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…

  17. Are the Literacy Difficulties That Characterize Developmental Dyslexia Associated with a Failure to Integrate Letters and Speech Sounds?

    ERIC Educational Resources Information Center

    Nash, Hannah M.; Gooch, Debbie; Hulme, Charles; Mahajan, Yatin; McArthur, Genevieve; Steinmetzger, Kurt; Snowling, Margaret J.

    2017-01-01

    The "automatic letter-sound integration hypothesis" (Blomert, [Blomert, L., 2011]) proposes that dyslexia results from a failure to fully integrate letters and speech sounds into automated audio-visual objects. We tested this hypothesis in a sample of English-speaking children with dyslexic difficulties (N = 13) and samples of…

  18. EduSpeak[R]: A Speech Recognition and Pronunciation Scoring Toolkit for Computer-Aided Language Learning Applications

    ERIC Educational Resources Information Center

    Franco, Horacio; Bratt, Harry; Rossier, Romain; Rao Gadde, Venkata; Shriberg, Elizabeth; Abrash, Victor; Precoda, Kristin

    2010-01-01

    SRI International's EduSpeak[R] system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to…

  19. Transcription and Annotation of a Japanese Accented Spoken Corpus of L2 Spanish for the Development of CAPT Applications

    ERIC Educational Resources Information Center

    Carranza, Mario

    2016-01-01

    This paper addresses the process of transcribing and annotating spontaneous non-native speech with the aim of compiling a training corpus for the development of Computer Assisted Pronunciation Training (CAPT) applications, enhanced with Automatic Speech Recognition (ASR) technology. To better adapt ASR technology to CAPT tools, the recognition…

  20. Northeast Artificial Intelligence Consortium annual report. Volume 2. 1988. Discussing, using, and recognizing plans (NLP). Interim report, January-December 1988

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shapiro, S.C.; Woolf, B.

    The Northeast Artificial Intelligence Consortium (NAIC) was created by the Air Force Systems Command, Rome Air Development Center, and the Office of Scientific Research. Its purpose is to conduct pertinent research in artificial intelligence and to perform activities ancillary to this research. This report describes progress that has been made in the fourth year of the existence of the NAIC on the technical research tasks undertaken at the member universities. The topics covered in general are: versatile expert system for equipment maintenance, distributed AI for communications system control, automatic photointerpretation, time-oriented problem solving, speech understanding systems, knowledge base maintenance, hardware architectures for very large systems, knowledge-based reasoning and planning, and a knowledge acquisition, assistance, and explanation system. The specific topic for this volume is the recognition of plans expressed in natural language, followed by their discussion and use.

  1. Syntax-directed content analysis of videotext: application to a map detection recognition system

    NASA Astrophysics Data System (ADS)

    Aradhye, Hrishikesh; Herson, James A.; Myers, Gregory

    2003-01-01

    Video is an increasingly important and ever-growing source of information to the intelligence and homeland defense analyst. A capability to automatically identify the contents of video imagery would enable the analyst to index relevant foreign and domestic news videos in a convenient and meaningful way. To this end, the proposed system aims to help determine the geographic focus of a news story directly from video imagery by detecting and geographically localizing political maps from news broadcasts, using the results of videotext recognition in lieu of a computationally expensive, scale-independent shape recognizer. Our novel method for the geographic localization of a map is based on the premise that the relative placement of text superimposed on a map roughly corresponds to the geographic coordinates of the locations the text represents. Our scheme extracts and recognizes videotext, and iteratively identifies the geographic area, while allowing for OCR errors and artistic freedom. The fast and reliable recognition of such maps by our system may provide valuable context and supporting evidence for other sources, such as speech recognition transcripts. The concepts of syntax-directed content analysis of videotext presented here can be extended to other content analysis systems.
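
    The geographic-localization premise can be sketched as a least-squares affine fit from label pixel positions to gazetteer coordinates; the place names, pixel positions, and coordinates below are invented for illustration.

```python
import numpy as np

# Recognized map labels (hypothetical OCR output): pixel position on the
# map image and gazetteer coordinates (lon, lat) of the named place.
labels = [
    ((140, 130), (2.35, 48.85)),    # "Paris"
    ((310, 60), (13.40, 52.52)),    # "Berlin"
    ((60, 300), (-3.70, 40.42)),    # "Madrid"
    ((290, 270), (12.50, 41.90)),   # "Rome"
]
px = np.array([p for p, _ in labels], dtype=float)
geo = np.array([g for _, g in labels], dtype=float)

# Fit an affine map pixel -> (lon, lat) by least squares. A small residual
# means the candidate region is geographically consistent with the text
# layout, which tolerates OCR errors and some artistic freedom.
A = np.hstack([px, np.ones((len(px), 1))])
coef, residual, *_ = np.linalg.lstsq(A, geo, rcond=None)
print("fit residual:", residual)
```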

  2. The transition to increased automaticity during finger sequence learning in adult males who stutter.

    PubMed

    Smits-Bandstra, Sarah; De Nil, Luc; Rochon, Elizabeth

    2006-01-01

    The present study compared the automaticity levels of persons who stutter (PWS) and persons who do not stutter (PNS) on a practiced finger sequencing task under dual task conditions. Automaticity was defined as the amount of attention required for task performance. Twelve PWS and 12 control subjects practiced finger tapping sequences under single and then dual task conditions. Control subjects performed the sequencing task significantly faster and less variably under single versus dual task conditions while PWS' performance was consistently slow and variable (comparable to the dual task performance of control subjects) under both conditions. Control subjects were significantly more accurate on a colour recognition distracter task than PWS under dual task conditions. These results suggested that control subjects transitioned to quick, accurate and increasingly automatic performance on the sequencing task after practice, while PWS did not. Because most stuttering treatment programs for adults include practice and automatization of new motor speech skills, findings of this finger sequencing study and future studies of speech sequence learning may have important implications for how to maximize stuttering treatment effectiveness. As a result of this activity, the participant will be able to: (1) Define automaticity and explain the importance of dual task paradigms to investigate automaticity; (2) Relate the proposed relationship between motor learning and automaticity as stated by the authors; (3) Summarize the reviewed literature concerning the performance of PWS on dual tasks; and (4) Explain why the ability to transition to automaticity during motor learning may have important clinical implications for stuttering treatment effectiveness.

  3. Automatic Classification of Question & Answer Discourse Segments from Teacher's Speech in Classrooms

    ERIC Educational Resources Information Center

    Blanchard, Nathaniel; D'Mello, Sidney; Olney, Andrew M.; Nystrand, Martin

    2015-01-01

    Question-answer (Q&A) is fundamental for dialogic instruction, an important pedagogical technique based on the free exchange of ideas and open-ended discussion. Automatically detecting Q&A is key to providing teachers with feedback on appropriate use of dialogic instructional strategies. In line with this, this paper studies the…

  4. Automatic Online Lecture Highlighting Based on Multimedia Analysis

    ERIC Educational Resources Information Center

    Che, Xiaoyin; Yang, Haojin; Meinel, Christoph

    2018-01-01

    Textbook highlighting is widely considered to be beneficial for students. In this paper, we propose a comprehensive solution to highlight the online lecture videos in both sentence- and segment-level, just as is done with paper books. The solution is based on automatic analysis of multimedia lecture materials, such as speeches, transcripts, and…

  5. Learning diagnostic models using speech and language measures.

    PubMed

    Peintner, Bart; Jarrold, William; Vergyri, Dimitra; Richey, Colleen; Tempini, Maria Luisa Gorno; Ogar, Jennifer

    2008-01-01

    We describe results that show the effectiveness of machine learning in the automatic diagnosis of certain neurodegenerative diseases, several of which alter speech and language production. We analyzed audio from 9 control subjects and 30 patients diagnosed with one of three subtypes of Frontotemporal Lobar Degeneration. From this data, we extracted features of the audio signal and the words the patient used, which were obtained using our automated transcription technologies. We then automatically learned models that predict the diagnosis of the patient using these features. Our results show that learned models over these features predict diagnosis with accuracy significantly better than random. Future studies using higher quality recordings will likely improve these results.

  6. Study of Discussion Record Analysis Using Temporal Data Crystallization and Its Application to TV Scene Analysis

    DTIC Science & Technology

    2015-03-31

    analysis. For scene analysis, we use Temporal Data Crystallization (TDC), and for logical analysis, we use Speech Act theory and Toulmin Argumentation...utterance in the discussion record. (i) An utterance ID, and a speaker ID (ii) Speech acts (iii) Argument structure Speech act denotes...mediator is expected to use more OQs than CQs. When the speech act of an utterance is an argument, furthermore, we recognize the conclusion part

  7. Estimating psycho-physiological state of a human by speech analysis

    NASA Astrophysics Data System (ADS)

    Ronzhin, A. L.

    2005-05-01

    Adverse effects of intoxication, fatigue and boredom could degrade the performance of highly trained operators of complex technical systems, with potentially catastrophic consequences. Existing physiological fitness-for-duty tests are time consuming, costly, invasive, and highly unpopular. Known non-physiological tests constitute a secondary task and interfere with the busy workload of the tested operator. Various attempts to assess the current status of the operator by processing of "normal operational data" often lead to excessive amounts of computation, poorly justified metrics, and ambiguity of results. At the same time, speech analysis presents a natural, non-invasive approach based upon well-established, efficient data processing. In addition, it supports both behavioral and physiological biometrics. This paper presents an approach facilitating a robust speech analysis/understanding process in spite of natural speech variability and background noise. Automatic speech recognition is suggested as a technique for detecting changes in the psycho-physiological state of a human that typically manifest themselves in changes in the characteristics of the vocal tract and in the semantic-syntactic connectivity of conversation. Preliminary tests have confirmed that a statistically significant correlation between the error rate of automatic speech recognition and the extent of alcohol intoxication does exist. In addition, the obtained data allowed exploring some interesting correlations and establishing some quantitative models. It is proposed to utilize this approach as part of a fitness-for-duty test and to compare its efficiency with analyses of iris, face geometry, thermography and other popular non-invasive biometric techniques.

  8. From Birdsong to Human Speech Recognition: Bayesian Inference on a Hierarchy of Nonlinear Dynamical Systems

    PubMed Central

    Yildiz, Izzet B.; von Kriegstein, Katharina; Kiebel, Stefan J.

    2013-01-01

    Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents—an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments. PMID:24068902

  10. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1

    NASA Astrophysics Data System (ADS)

    Garofolo, J. S.; Lamel, L. F.; Fisher, W. M.; Fiscus, J. G.; Pallett, D. S.

    1993-02-01

    The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing 8 major dialect divisions of American English, each speaking 10 phonetically-rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence. The release of TIMIT contains several improvements over the Prototype CD-ROM released in December, 1988: (1) full 630-speaker corpus, (2) checked and corrected transcriptions, (3) word-alignment transcriptions, (4) NIST SPHERE-headered waveform files and header manipulation software, (5) phonemic dictionary, (6) new test and training subsets balanced for dialectal and phonetic coverage, and (7) more extensive documentation.
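
    TIMIT's time-aligned transcriptions are plain text files with one segment per line, in the form <start sample> <end sample> <label> at 16 kHz, so a reader is only a few lines of Python:

```python
def read_timit_alignment(path, sample_rate=16000):
    """Read a TIMIT .PHN or .WRD file: '<start> <end> <label>' per line,
    with boundaries given in samples at 16 kHz; returns seconds."""
    segments = []
    with open(path) as f:
        for line in f:
            start, end, label = line.split()
            segments.append((int(start) / sample_rate,
                             int(end) / sample_rate, label))
    return segments

# e.g. phones = read_timit_alignment("TRAIN/DR1/FCJF0/SA1.PHN")
```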

  11. Acoustic Event Detection and Classification

    NASA Astrophysics Data System (ADS)

    Temko, Andrey; Nadeu, Climent; Macho, Dušan; Malkin, Robert; Zieger, Christian; Omologo, Maurizio

    The human activity that takes place in meeting rooms or classrooms is reflected in a rich variety of acoustic events (AE), produced either by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity. Indeed, speech is usually the most informative sound, but other kinds of AEs may also carry useful information, for example, clapping or laughing inside a speech, a strong yawn in the middle of a lecture, a chair moving or a door slam when the meeting has just started. Additionally, detection and classification of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition.

  12. Pragmatic Difficulties in the Production of the Speech Act of Apology by Iraqi EFL Learners

    ERIC Educational Resources Information Center

    Al-Ghazalli, Mehdi Falih; Al-Shammary, Mohanad A. Amert

    2014-01-01

    The purpose of this paper is to investigate the pragmatic difficulties encountered by Iraqi EFL university students in producing the speech act of apology. Although the act of apology is easy for native speakers of English to recognize or use, non-native speakers generally encounter difficulties in discriminating one speech act from another. The…

  13. Spontaneous Speech Events in Two Speech Databases of Human-Computer and Human-Human Dialogs in Spanish

    ERIC Educational Resources Information Center

    Rodriguez, Luis J.; Torres, M. Ines

    2006-01-01

    Previous works in English have revealed that disfluencies follow regular patterns and that incorporating them into the language model of a speech recognizer leads to lower perplexities and sometimes to a better performance. Although work on disfluency modeling has been applied outside the English community (e.g., in Japanese), as far as we know…

  14. Speech to Text: Today and Tomorrow. Proceedings of a Conference at Gallaudet University (Washington, D.C., September, 1988). GRI Monograph Series B, No. 2.

    ERIC Educational Resources Information Center

    Harkins, Judith E., Ed.; Virvan, Barbara M., Ed.

    The conference proceedings contains 23 papers on telephone relay service, real-time captioning, and automatic speech recognition, and a glossary. The keynote address, by Representative Major R. Owens, examines current issues in federal legislation. Other papers have the following titles and authors: "Telephone Relay Service: Rationale and…

  15. Acoustic landmarks contain more information about the phone string than other frames for automatic speech recognition with deep neural network acoustic model

    NASA Astrophysics Data System (ADS)

    He, Di; Lim, Boon Pang; Yang, Xuesong; Hasegawa-Johnson, Mark; Chen, Deming

    2018-06-01

    Most mainstream Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory is based on a contradictory idea: that some frames are more important than others. Acoustic landmark theory exploits quantal non-linearities in the articulatory-acoustic and acoustic-perceptual relations to define landmark times at which the speech spectrum abruptly changes or reaches an extremum; frames overlapping landmarks have been demonstrated to be sufficient for speech perception. In this work, we conduct experiments on the TIMIT corpus, with both GMM and DNN based ASR systems, and find that frames containing landmarks are more informative for ASR than others. We find that altering the level of emphasis on landmarks by re-weighting acoustic likelihood tends to reduce the phone error rate (PER). Furthermore, by leveraging landmarks as a heuristic, one of our hybrid DNN frame dropping strategies maintained a PER within 0.44% of optimal when scoring less than half (45.8%, to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing computation.
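
    Re-weighting acoustic likelihoods by landmark status can be sketched as a per-frame scaling applied before decoding; the weight values below are illustrative, whereas the paper tunes the level of emphasis.

```python
import numpy as np

def reweight_loglikes(frame_loglikes, landmark_frames,
                      w_landmark=1.2, w_other=0.9):
    """Scale per-frame acoustic log-likelihoods so that frames overlapping
    landmarks count more during decoding. The weight values here are
    illustrative; the paper tunes the level of emphasis."""
    weights = np.full(len(frame_loglikes), w_other)
    weights[landmark_frames] = w_landmark
    return frame_loglikes * weights

# e.g. 10 frames with landmarks detected at frames 2 and 7:
ll = np.array([-4.1, -3.9, -2.2, -5.0, -4.4, -3.8, -4.0, -2.5, -4.6, -4.2])
print(reweight_loglikes(ll, [2, 7]))
```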

  16. The Role of Categorical Speech Perception and Phonological Processing in Familial Risk Children With and Without Dyslexia.

    PubMed

    Hakvoort, Britt; de Bree, Elise; van der Leij, Aryan; Maassen, Ben; van Setten, Ellie; Maurits, Natasha; van Zuijen, Titia L

    2016-12-01

    This study assessed whether a categorical speech perception (CP) deficit is associated with dyslexia or familial risk for dyslexia, by exploring a possible cascading relation from speech perception to phonology to reading and by identifying whether speech perception distinguishes familial risk (FR) children with dyslexia (FRD) from those without dyslexia (FRND). Data were collected from 9-year-old FRD (n = 37) and FRND (n = 41) children and age-matched controls (n = 49) on CP identification and discrimination and on the phonological processing measures rapid automatized naming, phoneme awareness, and nonword repetition. The FRD group performed more poorly on CP than the FRND and control groups. Findings on phonological processing align with the literature in that (a) phonological processing was related to reading and (b) the FRD group showed the lowest phonological processing outcomes. Furthermore, CP correlated weakly with reading, but this relationship was fully mediated by rapid automatized naming. Although CP and phonological skills are related to dyslexia, there was no strong evidence for a cascade from CP to phonology to reading. Deficits in CP at the behavioral level are not directly associated with dyslexia.

  17. An attention-gating recurrent working memory architecture for emergent speech representation

    NASA Astrophysics Data System (ADS)

    Elshaw, Mark; Moore, Roger K.; Klein, Michael

    2010-06-01

    This paper describes an attention-gating recurrent self-organising map approach for emergent speech representation. Inspired by evidence from human cognitive processing, the architecture combines two main neural components. The first component, the attention-gating mechanism, uses actor-critic learning to perform selective attention towards speech. Through this selective attention approach, the attention-gating mechanism controls access to working memory processing. The second component, the recurrent self-organising map memory, develops a temporal-distributed representation of speech using phone-like structures. Representing speech in terms of phonetic features in an emergent, self-organised fashion recreates, according to research on child cognitive development, the approach found in infants. Adopting this representational approach should improve the performance of automatic recognition systems by aiding speech segmentation and fast word learning.
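
    The recurrent self-organising map component can be sketched compactly. The RSOM-style update below (a leaky integrated difference per map unit, so the winner reflects temporal context) is a generic sketch of the technique, not the authors' implementation, and it omits the actor-critic attention gate.

```python
# Minimal RSOM-style sketch, assuming a leaky-integrator update rule;
# map size, leak rate, and learning rate are illustrative.
import numpy as np

rng = np.random.default_rng(7)
n_units, dim = 16, 12                           # map units, feature dimensionality
W = rng.standard_normal((n_units, dim)) * 0.1   # unit weight vectors
Y = np.zeros((n_units, dim))                    # leaky difference memory per unit
alpha, lr = 0.5, 0.05                           # leak rate, learning rate

def step(x):
    global Y
    Y = (1 - alpha) * Y + alpha * (x - W)          # recurrent integration
    winner = int(np.argmin(np.linalg.norm(Y, axis=1)))
    W[winner] += lr * Y[winner]                    # pull winner along its memory
    return winner

frames = rng.standard_normal((100, dim))   # stand-in for speech frame features
states = [step(x) for x in frames]         # emergent phone-like state sequence
print(states[:10])
```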

  18. Strategies for distant speech recognition in reverberant environments

    NASA Astrophysics Data System (ADS)

    Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro

    2015-12-01

    Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.
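
    The long-term linear-prediction idea behind the dereverberation front-end can be sketched in a few lines. The actual system works per frequency band in the STFT domain with iterative variance weighting (WPE-style); the plain time-domain least-squares version below is a simplified, hedged illustration with made-up filter settings.

```python
# Hedged sketch of delayed linear-prediction dereverberation: predict the
# late reverberation from samples at least `delay` taps in the past and
# subtract it, preserving the direct sound and early reflections.
import numpy as np

def delayed_lp_dereverb(x, order=30, delay=8):
    N = len(x)
    start = delay + order - 1
    # Regressors: past samples at lags delay .. delay+order-1.
    X = np.stack([x[order - 1 - k : N - delay - k] for k in range(order)], axis=1)
    y = x[start:]
    g, *_ = np.linalg.lstsq(X, y, rcond=None)   # long-term prediction filter
    out = x.copy()
    out[start:] = y - X @ g                     # subtract predicted reverberation
    return out

# Toy usage: a signal with a single strong echo.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
reverberant = clean + 0.5 * np.roll(clean, 400)
enhanced = delayed_lp_dereverb(reverberant)
```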

  19. And then I saw her race: Race-based expectations affect infants' word processing.

    PubMed

    Weatherhead, Drew; White, Katherine S

    2018-08-01

    How do our expectations about speakers shape speech perception? Adults' speech perception is influenced by social properties of the speaker (e.g., race). When in development do these influences begin? In the current study, 16-month-olds heard familiar words produced in their native accent (e.g., "dog") and in an unfamiliar accent involving a vowel shift (e.g., "dag"), in the context of an image of either a same-race speaker or an other-race speaker. Infants' interpretation of the words depended on the speaker's race. For the same-race speaker, infants only recognized words produced in the familiar accent; for the other-race speaker, infants recognized both versions of the words. Two additional experiments showed that infants only recognized an other-race speaker's atypical pronunciations when they differed systematically from the native accent. These results provide the first evidence that expectations driven by unspoken properties of speakers, such as race, influence infants' speech processing. Copyright © 2018 Elsevier B.V. All rights reserved.

  20. Automatic measurement of voice onset time using discriminative structured prediction.

    PubMed

    Sonderegger, Morgan; Keshet, Joseph

    2012-12-01

    A discriminative large-margin algorithm for automatic measurement of voice onset time (VOT) is described, considered as a case of predicting structured output from speech. Manually labeled data are used to train a function that takes as input a speech segment of an arbitrary length containing a voiceless stop, and outputs its VOT. The function is explicitly trained to minimize the difference between predicted and manually measured VOT; it operates on a set of acoustic feature functions designed based on spectral and temporal cues used by human VOT annotators. The algorithm is applied to initial voiceless stops from four corpora, representing different types of speech. Using several evaluation methods, the algorithm's performance is near human intertranscriber reliability, and compares favorably with previous work. Furthermore, the algorithm's performance is minimally affected by training and testing on different corpora, and remains essentially constant as the amount of training data is reduced to 50-250 manually labeled examples, demonstrating the method's practical applicability to new datasets.
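
    The structured-prediction idea can be sketched as scoring every candidate (burst, voicing-onset) pair with a linear function of feature maps and returning the argmax. The feature map below is an illustrative stand-in, not the paper's acoustic feature functions, and in the real system the weight vector is trained with a large-margin objective on manually measured VOTs.

```python
# Hedged sketch of structured VOT prediction: argmax over (burst, onset)
# pairs of w . phi(x, b, v). `phi` and `w` are illustrative assumptions.
import numpy as np

def phi(high_e, low_e, b, v):
    return np.array([
        high_e[b:v].mean(),          # aspiration energy between burst and voicing
        low_e[v:].mean(),            # periodic energy after voicing onset
        high_e[b] - high_e[b - 1],   # energy jump at the burst
        float(v - b),                # duration term
    ])

def predict_vot(high_e, low_e, w, max_vot=60):
    n = len(high_e)
    best, best_score = None, -np.inf
    for b in range(1, n - 2):
        for v in range(b + 1, min(b + max_vot, n - 1)):
            s = w @ phi(high_e, low_e, b, v)
            if s > best_score:
                best, best_score = (b, v), s
    return best   # (burst frame, voicing-onset frame); VOT = (v - b) frames

rng = np.random.default_rng(1)
high_e, low_e = rng.random(80), rng.random(80)   # toy energy contours
print(predict_vot(high_e, low_e, w=np.array([1.0, 1.0, 2.0, -0.01])))
```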

  1. On the Selection of Non-Invasive Methods Based on Speech Analysis Oriented to Automatic Alzheimer Disease Diagnosis

    PubMed Central

    López-de-Ipiña, Karmele; Alonso, Jesus-Bernardino; Travieso, Carlos Manuel; Solé-Casals, Jordi; Egiraun, Harkaitz; Faundez-Zanuy, Marcos; Ezeiza, Aitzol; Barroso, Nora; Ecay-Torres, Miriam; Martinez-Lage, Pablo; de Lizardui, Unai Martinez

    2013-01-01

    The work presented here is part of a larger study to identify novel technologies and biomarkers for early Alzheimer disease (AD) detection and it focuses on evaluating the suitability of a new approach for early AD diagnosis by non-invasive methods. The purpose is to examine in a pilot study the potential of applying intelligent algorithms to speech features obtained from suspected patients in order to contribute to the improvement of diagnosis of AD and its degree of severity. In this sense, Artificial Neural Networks (ANN) have been used for the automatic classification of the two classes (AD and control subjects). Two human issues have been analyzed for feature selection: Spontaneous Speech and Emotional Response. Not only linear features but also non-linear ones, such as Fractal Dimension, have been explored. The approach is non-invasive, low-cost, and without any side effects. The experimental results obtained were very satisfactory and promising for early diagnosis and classification of AD patients. PMID:23698268
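
    As an example of the non-linear feature family mentioned, Higuchi's algorithm is one standard fractal dimension estimator for a sampled signal; the paper does not specify its estimator, so the sketch below is a hedged illustration rather than the authors' exact feature.

```python
# Higuchi fractal dimension: slope of log(curve length) vs. log(1/k).
import numpy as np

def higuchi_fd(x, kmax=10):
    x = np.asarray(x, dtype=float)
    N = len(x)
    L = []
    for k in range(1, kmax + 1):
        Lk = []
        for m in range(k):
            idx = np.arange(1, (N - m - 1) // k + 1)
            length = np.abs(x[m + idx * k] - x[m + (idx - 1) * k]).sum()
            Lk.append(length * (N - 1) / (len(idx) * k) / k)
        L.append(np.mean(Lk))
    slope, _ = np.polyfit(np.log(1.0 / np.arange(1, kmax + 1)), np.log(L), 1)
    return slope

# White noise has a fractal dimension near 2.
print(higuchi_fd(np.random.default_rng(2).standard_normal(1000)))
```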

  2. Behavioral and electrophysiological evidence for early and automatic detection of phonological equivalence in variable speech inputs.

    PubMed

    Kharlamov, Viktor; Campbell, Kenneth; Kazanina, Nina

    2011-11-01

    Speech sounds are not always perceived in accordance with their acoustic-phonetic content. For example, an early and automatic process of perceptual repair, which ensures conformity of speech inputs to the listener's native language phonology, applies to individual input segments that do not exist in the native inventory or to sound sequences that are illicit according to the native phonotactic restrictions on sound co-occurrences. The present study with Russian and Canadian English speakers shows that listeners may perceive phonetically distinct and licit sound sequences as equivalent when the native language system provides robust evidence for mapping multiple phonetic forms onto a single phonological representation. In Russian, due to an optional but productive t-deletion process that affects /stn/ clusters, the surface forms [sn] and [stn] may be phonologically equivalent and map to a single phonological form /stn/. In contrast, [sn] and [stn] clusters are usually phonologically distinct in (Canadian) English. Behavioral data from identification and discrimination tasks indicated that [sn] and [stn] clusters were more confusable for Russian than for English speakers. The EEG experiment employed an oddball paradigm with nonwords [asna] and [astna] used as the standard and deviant stimuli. A reliable mismatch negativity response was elicited approximately 100 msec postchange in the English group but not in the Russian group. These findings point to a perceptual repair mechanism that is engaged automatically at a prelexical level to ensure immediate encoding of speech inputs in phonological terms, which in turn enables efficient access to the meaning of a spoken utterance.

  3. An overview of neural function and feedback control in human communication.

    PubMed

    Hood, L J

    1998-01-01

    The speech and hearing mechanisms depend on accurate sensory information and intact feedback mechanisms to facilitate communication. This article provides a brief overview of some components of the nervous system important for human communication and some electrophysiological methods used to measure cortical function in humans. An overview of automatic control and feedback mechanisms in general and as they pertain to the speech motor system and control of the hearing periphery is also presented, along with a discussion of how the speech and auditory systems interact.

  4. A social feedback loop for speech development and its reduction in autism

    PubMed Central

    Warlaumont, Anne S.; Richards, Jeffrey A.; Gilkerson, Jill; Oller, D. Kimbrough

    2014-01-01

    We analyze the microstructure of child-adult interaction during naturalistic, daylong, automatically labeled audio recordings (13,836 hours total) of children (8- to 48-month-olds) with and without autism. We find that adult responses are more likely when child vocalizations are speech-related. In turn, a child vocalization is more likely to be speech-related if the previous speech-related child vocalization received an immediate adult response. Taken together, these results are consistent with the idea that there is a social feedback loop between child and caregiver that promotes speech-language development. Although this feedback loop applies in both typical development and autism, children with autism produce proportionally fewer speech-related vocalizations and the responses they receive are less contingent on whether their vocalizations are speech-related. We argue that such differences will diminish the strength of the social feedback loop with cascading effects on speech development over time. Differences related to socioeconomic status are also reported. PMID:24840717

  5. A Construction System for CALL Materials from TV News with Captions

    NASA Astrophysics Data System (ADS)

    Kobayashi, Satoshi; Tanaka, Takashi; Mori, Kazumasa; Nakagawa, Seiichi

    Many language learning materials have been published. In language learning, although repetition training is obviously necessary, it is difficult to maintain the learner's interest/motivation using existing learning materials, because those materials are limited in their scope and contents. In addition, we doubt whether the speech sounds used in most materials are natural in various situations. Nowadays, some TV news programs (CNN, ABC, PBS, NHK, etc.) have closed/open captions corresponding to the announcer's speech. We have developed a system that makes Computer Assisted Language Learning (CALL) materials for both English learning by Japanese and Japanese learning by foreign students from such captioned newscasts. This system computes the synchronization between captions and speech by using HMMs and a forced alignment algorithm. Materials made by the system offer the following functions: full/partial caption display, repeated listening, consultation of an electronic dictionary, display of the user's and announcer's waveforms and pitch contours, and automatic construction of dictation tests. The materials have the advantage of presenting polite, natural speech on varied and timely topics, and the system further allows the automatic creation of listening-comprehension tests and the storage and retrieval of large numbers of materials. In this paper, we first present the organization of the system and then describe the results of questionnaires from trial use of the materials. The synchronization between captions and speech proved sufficiently accurate, and the overall results encourage further development of the system.
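
    The paper synchronizes captions and speech with HMMs and forced alignment. As a simplified, hedged stand-in, the sketch below aligns two MFCC sequences with dynamic time warping via librosa, which yields a frame-level correspondence that could map caption-side timing onto broadcast audio; the file names are hypothetical.

```python
# DTW alignment between a reference rendition and the newscast audio,
# as a stand-in for HMM forced alignment. File names are made up.
import librosa

ref, sr = librosa.load("caption_reference.wav", sr=16000)
news, _ = librosa.load("newscast.wav", sr=16000)

X = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=13)
Y = librosa.feature.mfcc(y=news, sr=sr, n_mfcc=13)
D, wp = librosa.sequence.dtw(X=X, Y=Y, metric="euclidean")

hop = 512   # librosa's default MFCC hop length
for ref_f, news_f in wp[::-1][:5]:   # wp runs from end to start
    print(f"ref {ref_f * hop / sr:.2f}s -> news {news_f * hop / sr:.2f}s")
```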

  6. Cognitive Load in Voice Therapy Carry-Over Exercises.

    PubMed

    Iwarsson, Jenny; Morris, David Jackson; Balling, Laura Winther

    2017-01-01

    The cognitive load generated by online speech production may vary with the nature of the speech task. This article examines 3 speech tasks used in voice therapy carry-over exercises, in which a patient is required to adopt and automatize new voice behaviors, ultimately in daily spontaneous communication. Twelve subjects produced speech in 3 conditions: rote speech (weekdays), sentences in a set form, and semispontaneous speech. Subjects simultaneously performed a secondary visual discrimination task for which response times were measured. On completion of each speech task, subjects rated their experience on a questionnaire. Response times from the secondary, visual task were found to be shortest for the rote speech, longer for the semispontaneous speech, and longest for the sentences within the set framework. Principal components derived from the subjective ratings were found to be linked to response times on the secondary visual task. Acoustic measures reflecting fundamental frequency distribution and vocal fold compression varied across the speech tasks. The results indicate that consideration should be given to the selection of speech tasks during the process leading to automation of revised speech behavior and that self-reports may be a reliable index of cognitive load.

  7. The Teaching of Reading, Writing and Language in a Clinical Speech and Language Setting: A Blended Therapy Intervention Approach

    ERIC Educational Resources Information Center

    Ammons, Kerrie Allen

    2013-01-01

    With a growing body of research that supports a link between language and literacy, governing bodies in the field of speech and language pathology have recognized the need to reconsider the role of speech-language pathologists in addressing the emergent literacy needs of preschoolers who struggle with literacy and language concepts. This study…

  8. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    NASA Astrophysics Data System (ADS)

    Kayasith, Prakasith; Theeramunkong, Thanaruk

    Measuring the severity of dysarthria by manually evaluating a speaker's speech with standard perception-based assessment methods is a tedious and subjective task. This paper presents an automated approach to assessing the speech quality of a dysarthric speaker with cerebral palsy. Considering two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce consistent speech signals for a given word and distinct speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate for an individual dysarthric speaker before an automatic speech recognition system is exhaustively implemented for that speaker. The effectiveness of Ψ as a recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square difference. The evaluations were done by comparing its predicted recognition rates with those of the standard articulatory and intelligibility tests on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of data from eight normal speakers and eight dysarthric speakers.
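
    One hedged reading of the two factors: consistency as the average distance between repetitions of the same word, distinction as the average distance between different words, and the index as their ratio. The formula below is an illustrative stand-in for Ψ, assuming fixed-length feature vectors per utterance, not the paper's definition.

```python
# Illustrative clarity ratio: inter-word distance / intra-word distance.
import numpy as np
from itertools import combinations

def clarity_index(word_features):
    """word_features: dict word -> list of 1-D feature vectors (repetitions)."""
    intra, inter = [], []
    words = list(word_features)
    for w in words:
        for a, b in combinations(word_features[w], 2):
            intra.append(np.linalg.norm(a - b))
    for w1, w2 in combinations(words, 2):
        for a in word_features[w1]:
            for b in word_features[w2]:
                inter.append(np.linalg.norm(a - b))
    return np.mean(inter) / np.mean(intra)   # higher = clearer speech

rng = np.random.default_rng(3)
feats = {w: [rng.standard_normal(12) + 2 * i for _ in range(5)]
         for i, w in enumerate(["yes", "no", "stop"])}
print(clarity_index(feats))
```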

  9. Science 101: How Does Speech-Recognition Software Work?

    ERIC Educational Resources Information Center

    Robertson, Bill

    2016-01-01

    This column provides background science information for elementary teachers. Many innovations with computer software begin with analysis of how humans do a task. This article takes a look at how humans recognize spoken words and explains the origins of speech-recognition software.

  10. Multistage audiovisual integration of speech: dissociating identification and detection.

    PubMed

    Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias S

    2011-02-01

    Speech perception integrates auditory and visual information. This is evidenced by the McGurk illusion where seeing the talking face influences the auditory phonetic percept and by the audiovisual detection advantage where seeing the talking face influences the detectability of the acoustic speech signal. Here, we show that identification of phonetic content and detection can be dissociated as speech-specific and non-specific audiovisual integration effects. To this end, we employed synthetically modified stimuli, sine wave speech (SWS), which is an impoverished speech signal that only observers informed of its speech-like nature recognize as speech. While the McGurk illusion only occurred for informed observers, the audiovisual detection advantage occurred for naïve observers as well. This finding supports a multistage account of audiovisual integration of speech in which the many attributes of the audiovisual speech signal are integrated by separate integration processes.

  11. Robot Command Interface Using an Audio-Visual Speech Recognition System

    NASA Astrophysics Data System (ADS)

    Ceballos, Alexánder; Gómez, Juan; Prieto, Flavio; Redarce, Tanneguy

    In recent years audio-visual speech recognition has emerged as an active field of research thanks to advances in pattern recognition, signal processing and machine vision. Its ultimate goal is to allow human-computer communication using voice, taking into account the visual information contained in the audio-visual speech signal. This document presents an automatic command-recognition system using audio-visual information. The system is expected to control the laparoscopic robot da Vinci. The audio signal is treated using the Mel Frequency Cepstral Coefficients parametrization method. In addition, features based on the points that define the mouth's outer contour according to the MPEG-4 standard are used in order to extract the visual speech information.
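
    MFCC parametrization is a standard front-end; here is a minimal sketch with librosa, assuming a 16 kHz mono command recording (the file name is hypothetical).

```python
# 13 MFCCs with 25 ms windows and 10 ms hops, a common ASR configuration.
import librosa

y, sr = librosa.load("command.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
delta = librosa.feature.delta(mfcc)      # temporal derivatives, often appended
print(mfcc.shape)                        # (13, n_frames)
```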

  12. A social feedback loop for speech development and its reduction in autism.

    PubMed

    Warlaumont, Anne S; Richards, Jeffrey A; Gilkerson, Jill; Oller, D Kimbrough

    2014-07-01

    We analyzed the microstructure of child-adult interaction during naturalistic, daylong, automatically labeled audio recordings (13,836 hr total) of children (8- to 48-month-olds) with and without autism. We found that an adult was more likely to respond when the child's vocalization was speech related rather than not speech related. In turn, a child's vocalization was more likely to be speech related if the child's previous speech-related vocalization had received an immediate adult response rather than no response. Taken together, these results are consistent with the idea that there is a social feedback loop between child and caregiver that promotes speech development. Although this feedback loop applies in both typical development and autism, children with autism produced proportionally fewer speech-related vocalizations, and the responses they received were less contingent on whether their vocalizations were speech related. We argue that such differences will diminish the strength of the social feedback loop and have cascading effects on speech development over time. Differences related to socioeconomic status are also reported. © The Author(s) 2014.

  13. Processing of speech signals for physical and sensory disabilities.

    PubMed Central

    Levitt, H

    1995-01-01

    Assistive technology involving voice communication is used primarily by people who are deaf, hard of hearing, or who have speech and/or language disabilities. It is also used to a lesser extent by people with visual or motor disabilities. A very wide range of devices has been developed for people with hearing loss. These devices can be categorized not only by the modality of stimulation [i.e., auditory, visual, tactile, or direct electrical stimulation of the auditory nerve (auditory-neural)] but also in terms of the degree of speech processing that is used. At least four such categories can be distinguished: assistive devices (a) that are not designed specifically for speech, (b) that take the average characteristics of speech into account, (c) that process articulatory or phonetic characteristics of speech, and (d) that embody some degree of automatic speech recognition. Assistive devices for people with speech and/or language disabilities typically involve some form of speech synthesis or symbol generation for severe forms of language disability. Speech synthesis is also used in text-to-speech systems for sightless persons. Other applications of assistive technology involving voice communication include voice control of wheelchairs and other devices for people with mobility disabilities. PMID:7479816

  14. Processing of Speech Signals for Physical and Sensory Disabilities

    NASA Astrophysics Data System (ADS)

    Levitt, Harry

    1995-10-01

    Assistive technology involving voice communication is used primarily by people who are deaf, hard of hearing, or who have speech and/or language disabilities. It is also used to a lesser extent by people with visual or motor disabilities. A very wide range of devices has been developed for people with hearing loss. These devices can be categorized not only by the modality of stimulation [i.e., auditory, visual, tactile, or direct electrical stimulation of the auditory nerve (auditory-neural)] but also in terms of the degree of speech processing that is used. At least four such categories can be distinguished: assistive devices (a) that are not designed specifically for speech, (b) that take the average characteristics of speech into account, (c) that process articulatory or phonetic characteristics of speech, and (d) that embody some degree of automatic speech recognition. Assistive devices for people with speech and/or language disabilities typically involve some form of speech synthesis or symbol generation for severe forms of language disability. Speech synthesis is also used in text-to-speech systems for sightless persons. Other applications of assistive technology involving voice communication include voice control of wheelchairs and other devices for people with mobility disabilities.

  15. Speech Emotion Feature Selection Method Based on Contribution Analysis Algorithm of Neural Network

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wang Xiaojia; Mao Qirong; Zhan Yongzhao

    There are many emotion features. If all of them are employed to recognize emotions, redundant features may be included; furthermore, recognition results may be unsatisfactory and the cost of feature extraction is high. In this paper, a method for selecting speech emotion features based on the contribution analysis algorithm of a neural network (NN) is presented. The emotion features are selected from the 95 extracted features by using the contribution analysis algorithm. Cluster analysis is applied to analyze the effectiveness of the selected features, and the time for feature extraction is evaluated. Finally, the 24 selected emotion features are used to recognize six speech emotions. The experiments show that this method can improve the recognition rate and reduce the feature extraction time.

  16. On Curbing Racial Speech.

    ERIC Educational Resources Information Center

    Gale, Mary Ellen

    1991-01-01

    An alternative interpretation of the First Amendment guarantee of free speech suggests that universities may prohibit and punish direct verbal assaults on specific individuals if the speaker intends to do harm and if a reasonable person would recognize the potential for serious interference with the victim's educational rights. (MSE)

  17. Joint Spatial-Spectral Feature Space Clustering for Speech Activity Detection from ECoG Signals

    PubMed Central

    Kanas, Vasileios G.; Mporas, Iosif; Benz, Heather L.; Sgarbas, Kyriakos N.; Bezerianos, Anastasios; Crone, Nathan E.

    2014-01-01

    Brain machine interfaces for speech restoration have been extensively studied for more than two decades. The success of such a system will depend in part on selecting the best brain recording sites and signal features corresponding to speech production. The purpose of this study was to detect speech activity automatically from electrocorticographic signals based on joint spatial-frequency clustering of the ECoG feature space. For this study, the ECoG signals were recorded while a subject performed two different syllable repetition tasks. We found that the optimal frequency resolution to detect speech activity from ECoG signals was 8 Hz, achieving 98.8% accuracy by employing support vector machines (SVM) as a classifier. We also defined the cortical areas that held the most information about the discrimination of speech and non-speech time intervals. Additionally, the results shed light on the distinct cortical areas associated with the two syllable repetition tasks and may contribute to the development of portable ECoG-based communication. PMID:24658248
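
    The classification stage described (band-power features at roughly 8 Hz resolution fed to an SVM) can be sketched as follows; the synthetic epochs stand in for real ECoG recordings, and the sampling rate and class difference are assumptions.

```python
# Hedged sketch: Welch band powers per channel -> RBF-kernel SVM.
import numpy as np
from scipy.signal import welch
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

fs = 1000                                  # assumed ECoG sampling rate (Hz)
rng = np.random.default_rng(4)

def band_powers(epoch, fs, res=8):
    """Log power spectrum with ~`res` Hz resolution per channel, flattened."""
    _, pxx = welch(epoch, fs=fs, nperseg=int(fs / res), axis=-1)
    return np.log(pxx).ravel()

# 200 one-second epochs x 8 channels: half 'speech', half 'non-speech'.
epochs = rng.standard_normal((200, 8, fs))
labels = np.repeat([0, 1], 100)
epochs[labels == 1] *= 1.5                 # toy class difference

X = np.array([band_powers(e, fs) for e in epochs])
Xtr, Xte, ytr, yte = train_test_split(X, labels, test_size=0.3, random_state=0)
print("accuracy:", SVC(kernel="rbf").fit(Xtr, ytr).score(Xte, yte))
```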

  18. [Creating a language model of the forensic medicine domain for developing an autopsy recording system by automatic speech recognition].

    PubMed

    Niijima, H; Ito, N; Ogino, S; Takatori, T; Iwase, H; Kobayashi, M

    2000-11-01

    To put speech recognition technology to practical use for recording forensic autopsies, a language model specialized for the forensic autopsy domain was developed. The language model was created by applying a 3-gram model and, together with a Hidden Markov Model acoustic model for Japanese speech recognition, was used to customize the speech recognition engine for forensic autopsy. A forensic vocabulary set of over 10,000 words was compiled and some 300,000 sentence patterns were generated to build the forensic language model, which was then mixed in appropriate proportion with a general language model to attain high accuracy. When tested by dictating autopsy findings, the system achieved a recognition rate of about 95%, which appears practically usable as far as the recognition software is concerned, though room remains for improving the hardware and application-layer software.
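
    The mixing step, interpolating a domain trigram model with a general one, can be sketched as P(w|h) = λ·P_dom(w|h) + (1−λ)·P_gen(w|h). The corpora, vocabulary, and mixing weight below are illustrative, and a real system would also smooth unseen n-grams.

```python
# Hedged sketch of linear interpolation of two trigram language models.
from collections import Counter

def trigram_model(corpus):
    tri, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return lambda w1, w2, w3: (tri[(w1, w2, w3)] / bi[(w1, w2)]
                               if bi[(w1, w2)] else 0.0)

domain = trigram_model([["subcutaneous", "hemorrhage", "noted"]])   # toy corpora
general = trigram_model([["hemorrhage", "was", "noted"]])

def mixed(w1, w2, w3, lam=0.8):
    return lam * domain(w1, w2, w3) + (1 - lam) * general(w1, w2, w3)

print(mixed("subcutaneous", "hemorrhage", "noted"))   # 0.8
```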

  19. Auditory Training with Frequent Communication Partners

    ERIC Educational Resources Information Center

    Tye-Murray, Nancy; Spehar, Brent; Sommers, Mitchell; Barcroft, Joe

    2016-01-01

    Purpose: Individuals with hearing loss engage in auditory training to improve their speech recognition. They typically practice listening to utterances spoken by unfamiliar talkers but never to utterances spoken by their most frequent communication partner (FCP)--speech they most likely desire to recognize--under the assumption that familiarity…

  20. Optimal pattern synthesis for speech recognition based on principal component analysis

    NASA Astrophysics Data System (ADS)

    Korsun, O. N.; Poliyev, A. V.

    2018-02-01

    The algorithm for building an optimal pattern for the purpose of automatic speech recognition, which increases the probability of correct recognition, is developed and presented in this work. The optimal pattern forming is based on the decomposition of an initial pattern into principal components, which reduces the dimensionality of the multi-parameter optimization problem. At the next step the training samples are introduced and the optimal estimates for the principal component decomposition coefficients are obtained by a numeric parameter optimization algorithm. Finally, we consider the experimental results, which show the improvement in speech recognition introduced by the proposed optimization algorithm.
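
    The dimensionality-reduction step can be sketched with scikit-learn: project reference patterns onto a few principal components so that the subsequent optimization runs over a handful of coefficients instead of the full pattern. The data and component count below are illustrative.

```python
# Hedged sketch: PCA decomposition of patterns, optimization over coefficients.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
patterns = rng.standard_normal((50, 200))     # 50 training patterns, 200 dims

pca = PCA(n_components=5).fit(patterns)
coeffs = pca.transform(patterns)              # optimize over 5 numbers, not 200
# Rebuild a candidate "optimal" pattern from (possibly tuned) coefficients:
optimal = pca.inverse_transform(coeffs.mean(axis=0, keepdims=True))
print(optimal.shape)                          # (1, 200)
```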

  1. Auditory models for speech analysis

    NASA Astrophysics Data System (ADS)

    Maybury, Mark T.

    This paper reviews the psychophysical basis for auditory models and discusses their application to automatic speech recognition. First an overview of the human auditory system is presented, followed by a review of current knowledge gleaned from neurological and psychoacoustic experimentation. Next, a general framework describes established peripheral auditory models which are based on well-understood properties of the peripheral auditory system. This is followed by a discussion of current enhancements to these models to include nonlinearities and synchrony information as well as other higher auditory functions. Finally, the initial performance of auditory models in the task of speech recognition is examined and additional applications are mentioned.

  2. Pathways of the inferior frontal occipital fasciculus in overt speech and reading.

    PubMed

    Rollans, Claire; Cheema, Kulpreet; Georgiou, George K; Cummine, Jacqueline

    2017-11-19

    In this study, we examined the relationship between tractography-based measures of white matter integrity (e.g., fractional anisotropy [FA]) from diffusion tensor imaging (DTI) and five reading-related tasks, including rapid automatized naming (RAN) of letters, digits, and objects, and reading of real words and nonwords. Twenty university students with no reported history of reading difficulties were tested on all five tasks and their performance was correlated with diffusion measures extracted through DTI tractography. A secondary analysis using whole-brain Tract-Based Spatial Statistics (TBSS) was also used to find clusters showing significant negative correlations between reaction time and FA. Results showed a significant relationship between the left inferior fronto-occipital fasciculus FA and performance on the RAN of objects task, as well as a strong relationship to nonword reading, which suggests a role for this tract in slower, non-automatic and/or resource-demanding speech tasks. There were no significant relationships between FA and the faster, more automatic speech tasks (RAN of letters and digits, and real word reading). These findings provide evidence for the role of the inferior fronto-occipital fasciculus in tasks that are highly demanding of orthography-phonology translation (e.g., nonword reading) and semantic processing (e.g., RAN object). This demonstrates the importance of the inferior fronto-occipital fasciculus in basic naming and suggests that this tract may be a sensitive predictor of rapid naming performance within the typical population. We discuss the findings in the context of current models of reading and speech production to further characterize the white matter pathways associated with basic reading processes. Copyright © 2017 IBRO. Published by Elsevier Ltd. All rights reserved.

  3. Adaptive method of recognition of signals for one and two-frequency signal system in the telephony on the background of speech

    NASA Astrophysics Data System (ADS)

    Kuznetsov, Michael V.

    2006-05-01

    Reliable interworking of automatic telecommunication systems, including the transmission systems of optical communication networks, requires dependable recognition of the signals of one- and two-frequency service signaling systems. Analysis of the time parameters of a received signal makes it possible to increase the reliability of detecting and recognizing service signals against a background of speech.

  4. Open-Source Multi-Language Audio Database for Spoken Language Processing Applications

    DTIC Science & Technology

    2012-12-01

    Mandarin, and Russian. Approximately 30 hours of speech were collected for each language. Each passage has been carefully transcribed at the...manual and automatic methods. The Russian passages have not yet been marked at the phonetic level. Another phase of the work was to explore...YouTube. 300 passages were collected in each of three languages—English, Mandarin, and Russian. Approximately 30 hours of speech were

  5. A Cross-Lingual Mobile Medical Communication System Prototype for Foreigners and Subjects with Speech, Hearing, and Mental Disabilities Based on Pictograms

    PubMed Central

    Wołk, Agnieszka; Glinkowski, Wojciech

    2017-01-01

    People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they are dependent on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition are dependent on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer. PMID:29230254

  6. A Cross-Lingual Mobile Medical Communication System Prototype for Foreigners and Subjects with Speech, Hearing, and Mental Disabilities Based on Pictograms.

    PubMed

    Wołk, Krzysztof; Wołk, Agnieszka; Glinkowski, Wojciech

    2017-01-01

    People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they are dependent on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition are dependent on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer.

  7. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    PubMed Central

    Caballero-Morales, Santiago-Omar

    2013-01-01

    An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, an accuracy of 87–100% was achieved for the recognition of emotional state of Mexican Spanish speech. PMID:23935410
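
    The decision rule described, counting emotion-specific vowels in the ASR output, reduces to a few lines; the tagging scheme below (vowel symbols suffixed with an emotion label) is an assumption for illustration.

```python
# Hedged sketch: pick the emotion whose tagged vowels dominate the output.
from collections import Counter

def classify_emotion(phone_string):
    counts = Counter(p.split("_")[1] for p in phone_string.split() if "_" in p)
    return counts.most_common(1)[0][0] if counts else "neutral"

# ASR output where emotion-specific vowels carry tags such as "a_anger":
print(classify_emotion("k a_anger s a_anger m e_neutral"))   # -> "anger"
```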

  8. SNR-adaptive stream weighting for audio-MES ASR.

    PubMed

    Lee, Ki-Seung

    2008-08-01

    Myoelectric signals (MESs) from the speaker's mouth region have been shown to improve the noise robustness of automatic speech recognizers (ASRs), promising to extend their usability to noise-robust ASR. In the recognition system presented herein, extracted audio and facial MES features were integrated by a decision fusion method, where the likelihood score of the audio-MES observation vector was given by a linear combination of the class-conditional observation log-likelihoods of two classifiers, using appropriate weights. We developed a weighting process adaptive to SNRs. The main objective of the paper is determining the optimal SNR classification boundaries and constructing a set of optimum stream weights for each SNR class. These two parameters were determined by a method based on a maximum mutual information criterion. Acoustic and facial MES data were collected from five subjects, using a 60-word vocabulary. Four types of acoustic noise, including babble, car, aircraft, and white noise, were acoustically added to clean speech signals with SNR ranging from -14 to 31 dB. The classification accuracy of the audio-only ASR fell as low as 25.5%, whereas that of the MES ASR was 85.2%. The classification accuracy could be further improved by employing the proposed audio-MES weighting method, reaching as high as 89.4% in the case of babble noise. Similar results were found for the other types of noise.
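
    The fusion rule can be sketched directly: a linear combination of the two classifiers' log-likelihoods whose weight depends on the estimated SNR class. The boundaries and weights below are illustrative, not the values the paper derives from its maximum mutual information criterion.

```python
# Hedged sketch of SNR-adaptive decision fusion for audio and MES streams.
import numpy as np

SNR_BOUNDS = [0.0, 15.0]          # dB boundaries splitting low/mid/high SNR
AUDIO_WEIGHTS = [0.2, 0.5, 0.8]   # trust audio more as SNR improves

def fuse(audio_ll, mes_ll, snr_db):
    w = AUDIO_WEIGHTS[np.searchsorted(SNR_BOUNDS, snr_db)]
    return w * audio_ll + (1 - w) * mes_ll

audio_ll = np.log([0.1, 0.7, 0.2])   # toy 3-word class log-likelihoods
mes_ll = np.log([0.5, 0.3, 0.2])
print("word:", np.argmax(fuse(audio_ll, mes_ll, snr_db=-5.0)))   # MES dominates
```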

  9. Advances to the development of a basic Mexican sign-to-speech and text language translator

    NASA Astrophysics Data System (ADS)

    Garcia-Bautista, G.; Trujillo-Romero, F.; Diaz-Gonzalez, G.

    2016-09-01

    Sign Language (SL) is the basic alternative communication method between deaf people. However, most hearing people have trouble understanding SL, making communication with deaf people almost impossible and excluding them from daily activities. In this work we present an automatic basic real-time sign language translator capable of recognizing a basic list of Mexican Sign Language (MSL) signs for 10 meaningful words, the letters (A-Z), and the numbers (1-10), and translating them into speech and text. The signs were collected from a group of 35 MSL signers and executed in front of a Microsoft Kinect™ sensor. The hand gesture recognition system uses the RGB-D camera to build and store point clouds, color, and skeleton tracking information. In this work we propose a method to obtain representative hand trajectory pattern information. We use the Euclidean segmentation method to obtain the hand shape and the hierarchical centroid as the feature extraction method for images of numbers and letters. A pattern recognition method based on a back-propagation artificial neural network (ANN) is used to interpret the hand gestures. Finally, we use the K-fold cross-validation method for the training and testing stages. Our results achieve an accuracy of 95.71% on words, 98.57% on numbers, and 79.71% on letters. In addition, an interactive user interface was designed to present the results in voice and text format.
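
    The evaluation protocol, K-fold cross-validation of an ANN classifier, can be sketched with scikit-learn; the synthetic feature vectors below stand in for the hand trajectory and shape features.

```python
# Hedged sketch: 5-fold cross-validation of an MLP gesture classifier.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(6)
X = rng.standard_normal((350, 64))    # 350 recorded signs, 64-D features (toy)
y = rng.integers(0, 10, size=350)     # 10 sign classes (random placeholders)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
print("mean accuracy:", cross_val_score(clf, X, y, cv=cv).mean())
```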

  10. The Psychologist as an Interlocutor in Autism Spectrum Disorder Assessment: Insights From a Study of Spontaneous Prosody

    PubMed Central

    Bone, Daniel; Lee, Chi-Chun; Black, Matthew P.; Williams, Marian E.; Lee, Sungbok; Levitt, Pat; Narayanan, Shrikanth

    2015-01-01

    Purpose The purpose of this study was to examine relationships between prosodic speech cues and autism spectrum disorder (ASD) severity, hypothesizing a mutually interactive relationship between the speech characteristics of the psychologist and the child. The authors objectively quantified acoustic-prosodic cues of the psychologist and of the child with ASD during spontaneous interaction, establishing a methodology for future large-sample analysis. Method Speech acoustic-prosodic features were semiautomatically derived from segments of semistructured interviews (Autism Diagnostic Observation Schedule, ADOS; Lord, Rutter, DiLavore, & Risi, 1999; Lord et al., 2012) with 28 children who had previously been diagnosed with ASD. Prosody was quantified in terms of intonation, volume, rate, and voice quality. Research hypotheses were tested via correlation as well as hierarchical and predictive regression between ADOS severity and prosodic cues. Results Automatically extracted speech features demonstrated prosodic characteristics of dyadic interactions. As rated ASD severity increased, both the psychologist and the child demonstrated effects for turn-end pitch slope, and both spoke with atypical voice quality. The psychologist’s acoustic cues predicted the child’s symptom severity better than did the child’s acoustic cues. Conclusion The psychologist, acting as evaluator and interlocutor, was shown to adjust his or her behavior in predictable ways based on the child’s social-communicative impairments. The results support future study of speech prosody of both interaction partners during spontaneous conversation, while using automatic computational methods that allow for scalable analysis on much larger corpora. PMID:24686340
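
    One of the cues quantified, turn-end pitch slope, can be sketched with librosa: extract f0 over the final stretch of a turn and fit a line. The file name and window length are assumptions.

```python
# Hedged sketch: fit a line to f0 over the last 500 ms of a turn.
import numpy as np
import librosa

y, sr = librosa.load("turn.wav", sr=16000)        # hypothetical turn audio
tail = y[-int(0.5 * sr):]                         # final 500 ms
f0, voiced, _ = librosa.pyin(tail, fmin=75, fmax=400, sr=sr)
t = librosa.times_like(f0, sr=sr)
mask = voiced & ~np.isnan(f0)
slope = np.polyfit(t[mask], f0[mask], 1)[0]       # Hz per second
print(f"turn-end pitch slope: {slope:.1f} Hz/s")
```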

  11. Barista: A Framework for Concurrent Speech Processing by USC-SAIL

    PubMed Central

    Can, Doğan; Gibson, James; Vaz, Colin; Georgiou, Panayiotis G.; Narayanan, Shrikanth S.

    2016-01-01

    We present Barista, an open-source framework for concurrent speech processing based on the Kaldi speech recognition toolkit and the libcppa actor library. With Barista, we aim to provide an easy-to-use, extensible framework for constructing highly customizable concurrent (and/or distributed) networks for a variety of speech processing tasks. Each Barista network specifies a flow of data between simple actors, concurrent entities communicating by message passing, modeled after Kaldi tools. Leveraging the fast and reliable concurrency and distribution mechanisms provided by libcppa, Barista lets demanding speech processing tasks, such as real-time speech recognizers and complex training workflows, be scheduled and executed on parallel (and/or distributed) hardware. Barista is released under the Apache License v2.0. PMID:27610047

  12. Barista: A Framework for Concurrent Speech Processing by USC-SAIL.

    PubMed

    Can, Doğan; Gibson, James; Vaz, Colin; Georgiou, Panayiotis G; Narayanan, Shrikanth S

    2014-05-01

    We present Barista, an open-source framework for concurrent speech processing based on the Kaldi speech recognition toolkit and the libcppa actor library. With Barista, we aim to provide an easy-to-use, extensible framework for constructing highly customizable concurrent (and/or distributed) networks for a variety of speech processing tasks. Each Barista network specifies a flow of data between simple actors, concurrent entities communicating by message passing, modeled after Kaldi tools. Leveraging the fast and reliable concurrency and distribution mechanisms provided by libcppa, Barista lets demanding speech processing tasks, such as real-time speech recognizers and complex training workflows, be scheduled and executed on parallel (and/or distributed) hardware. Barista is released under the Apache License v2.0.

  13. Recognizing emotional speech in Persian: a validated database of Persian emotional speech (Persian ESD).

    PubMed

    Keshtiari, Niloofar; Kuhlmann, Michael; Eslami, Moharram; Klann-Delius, Gisela

    2015-03-01

    Research on emotional speech often requires valid stimuli for assessing perceived emotion through prosody and lexical content. To date, no comprehensive emotional speech database for Persian is officially available. The present article reports the process of designing, compiling, and evaluating a comprehensive emotional speech database for colloquial Persian. The database contains a set of 90 validated novel Persian sentences classified in five basic emotional categories (anger, disgust, fear, happiness, and sadness), as well as a neutral category. These sentences were validated in two experiments by a group of 1,126 native Persian speakers. The sentences were articulated by two native Persian speakers (one male, one female) in three conditions: (1) congruent (emotional lexical content articulated in a congruent emotional voice), (2) incongruent (neutral sentences articulated in an emotional voice), and (3) baseline (all emotional and neutral sentences articulated in neutral voice). The speech materials comprise about 470 sentences. The validity of the database was evaluated by a group of 34 native speakers in a perception test. Utterances recognized better than five times chance performance (71.4 %) were regarded as valid portrayals of the target emotions. Acoustic analysis of the valid emotional utterances revealed differences in pitch, intensity, and duration, attributes that may help listeners to correctly classify the intended emotion. The database is designed to be used as a reliable material source (for both text and speech) in future cross-cultural or cross-linguistic studies of emotional speech, and it is available for academic research purposes free of charge. To access the database, please contact the first author.

  14. Brain Diseases

    MedlinePlus

    The brain is the control center of the body. It controls thoughts, memory, speech, and movement. It regulates the function of many organs. When the brain is healthy, it works quickly and automatically. However, ...

  15. Recognizing Speech under a Processing Load: Dissociating Energetic from Informational Factors

    ERIC Educational Resources Information Center

    Mattys, Sven L.; Brooks, Joanna; Cooke, Martin

    2009-01-01

    Effects of perceptual and cognitive loads on spoken-word recognition have so far largely escaped investigation. This study lays the foundations of a psycholinguistic approach to speech recognition in adverse conditions that draws upon the distinction between energetic masking, i.e., listening environments leading to signal degradation, and…

  16. The Diagnostic Conference Planning Questionnaire for Speech-Language Pathology.

    ERIC Educational Resources Information Center

    Houle, Gail Ruppert

    1990-01-01

    The article describes a tool to increase professional effectiveness in supervisory conferencing in speech-language pathology based on the dual areas of role expectations for clinicians and personal needs as derived from Maslow's hierarchy of needs. The conferencing questionnaire aids in recognizing the needs of the supervisee, stating problems,…

  17. Recognition of Rapid Speech by Blind and Sighted Older Adults

    ERIC Educational Resources Information Center

    Gordon-Salant, Sandra; Friedman, Sarah A.

    2011-01-01

    Purpose: To determine whether older blind participants recognize time-compressed speech better than older sighted participants. Method: Three groups of adults with normal hearing participated (n = 10/group): (a) older sighted, (b) older blind, and (c) younger sighted listeners. Low-predictability sentences that were uncompressed (0% time…

  18. Individuals with autism spectrum disorders do not use social stereotypes in irony comprehension.

    PubMed

    Zalla, Tiziana; Amsellem, Frederique; Chaste, Pauline; Ervas, Francesca; Leboyer, Marion; Champagne-Lavau, Maud

    2014-01-01

    Social and communication impairments are part of the essential diagnostic criteria used to define Autism Spectrum Disorders (ASDs). Difficulties in appreciating non-literal speech, such as irony, in ASDs have been explained as due to impairments in social understanding and in recognizing the speaker's communicative intention. It has been shown that social-interactional factors, such as a listener's beliefs about the speaker's attitudinal propensities (e.g., a tendency to use sarcasm, to be mocking, less sincere and more prone to criticism), as conveyed by an occupational stereotype, do influence a listener's interpretation of potentially ironic remarks. We investigated the effect of occupational stereotype on irony detection in adults with High Functioning Autism or Asperger Syndrome (HFA/AS) and a comparison group of typically developed adults. We used a series of verbally presented stories containing ironic or literal utterances produced by a speaker having either a "sarcastic" or a "non-sarcastic" occupation. Although individuals with HFA/AS were able to recognize ironic intent and occupational stereotypes when the latter are made salient, stereotype information enhanced irony detection and modulated its social meaning (i.e., mockery and politeness) only in comparison participants. We concluded that when stereotype knowledge is not made salient, it does not automatically affect pragmatic communicative processes in individuals with HFA/AS.

  19. 38 CFR 51.31 - Automatic recognition.

    Code of Federal Regulations, 2012 CFR

    2012-07-01

    ...) PER DIEM FOR NURSING HOME CARE OF VETERANS IN STATE HOMES Obtaining Per Diem for Nursing Home Care in... that already is recognized by VA as a State home for nursing home care at the time this part becomes effective, automatically will continue to be recognized as a State home for nursing home care but will be...

  20. 38 CFR 51.31 - Automatic recognition.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ...) PER DIEM FOR NURSING HOME CARE OF VETERANS IN STATE HOMES Obtaining Per Diem for Nursing Home Care in... that already is recognized by VA as a State home for nursing home care at the time this part becomes effective, automatically will continue to be recognized as a State home for nursing home care but will be...

  1. 38 CFR 51.31 - Automatic recognition.

    Code of Federal Regulations, 2013 CFR

    2013-07-01

    ...) PER DIEM FOR NURSING HOME CARE OF VETERANS IN STATE HOMES Obtaining Per Diem for Nursing Home Care in... that already is recognized by VA as a State home for nursing home care at the time this part becomes effective, automatically will continue to be recognized as a State home for nursing home care but will be...

  2. 38 CFR 51.31 - Automatic recognition.

    Code of Federal Regulations, 2014 CFR

    2014-07-01

    ...) PER DIEM FOR NURSING HOME CARE OF VETERANS IN STATE HOMES Obtaining Per Diem for Nursing Home Care in... that already is recognized by VA as a State home for nursing home care at the time this part becomes effective, automatically will continue to be recognized as a State home for nursing home care but will be...

  3. 38 CFR 51.31 - Automatic recognition.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ...) PER DIEM FOR NURSING HOME CARE OF VETERANS IN STATE HOMES Obtaining Per Diem for Nursing Home Care in... that already is recognized by VA as a State home for nursing home care at the time this part becomes effective, automatically will continue to be recognized as a State home for nursing home care but will be...

  4. A Joint Time-Frequency and Matrix Decomposition Feature Extraction Methodology for Pathological Voice Classification

    NASA Astrophysics Data System (ADS)

    Ghoraani, Behnaz; Krishnan, Sridhar

    2009-12-01

    The number of people affected by speech problems is increasing as the modern world places growing demands on the human voice via mobile telephones, voice recognition software, and interpersonal verbal communications. In this paper, we propose a novel methodology for automatic pattern classification of pathological voices. The main contribution of this paper is the extraction of meaningful and unique features using an Adaptive time-frequency distribution (TFD) and nonnegative matrix factorization (NMF). We construct the Adaptive TFD as an effective signal analysis domain to dynamically track the nonstationarity in the speech and utilize NMF as a matrix decomposition (MD) technique to quantify the constructed TFD. The proposed method extracts meaningful and unique features from the joint TFD of the speech and automatically identifies and measures the abnormality of the signal. Depending on the abnormality measure of each signal, we classify the signal as normal or pathological. The proposed method is applied to the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database, which consists of 161 pathological and 51 normal speakers, and an overall classification accuracy of 98.6% was achieved.
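
    The matrix-decomposition step can be sketched with scikit-learn's NMF applied to a magnitude spectrogram; note this uses an ordinary STFT for simplicity, whereas the paper decomposes an adaptive TFD, and the component count is illustrative.

```python
# Hedged sketch: NMF of a magnitude spectrogram into spectral bases (W)
# and time activations (H); summary statistics of H serve as features.
import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load(librosa.example("trumpet"))
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

model = NMF(n_components=4, init="nndsvd", max_iter=400)
W = model.fit_transform(S)        # (freq bins x components) spectral bases
H = model.components_             # (components x frames) activations
print(H.mean(axis=1))             # one summary feature per component
```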

  5. Computational Modeling of Emotions and Affect in Social-Cultural Interaction

    DTIC Science & Technology

    2013-10-02

    acoustic and textual information sources. Second, a cross-lingual study was performed that shed light on how human perception and automatic recognition...speech is produced, a speaker's pitch and intonational pattern, and word usage. Better feature representation and advanced approaches were used to...recognition performance, and improved our understanding of language/cultural impact on human perception of emotion and automatic classification.

  6. Automatic Barometric Updates from Ground-Based Navigational Aids

    DTIC Science & Technology

    1990-03-12

    Automatic Barometric Updates from Ground-Based Navigational Aids. US Department of Transportation, Federal Aviation Administration, Office of Safety...tighter vertical spacing controls, particularly for operations near Terminal Control Areas (TCAs), Airport Radar Service Areas (ARSAs), military climb and...E.F., Ruth, J.C., and Williges, B.H. (1987). Speech Controls and Displays. In Salvendy, G., Ed., Handbook of Human Factors/Ergonomics. New York: John

  7. International aspirations for speech-language pathologists' practice with multilingual children with speech sound disorders: development of a position paper.

    PubMed

    McLeod, Sharynne; Verdon, Sarah; Bowen, Caroline

    2013-01-01

    A major challenge for the speech-language pathology profession in many cultures is to address the mismatch between the "linguistic homogeneity of the speech-language pathology profession and the linguistic diversity of its clientele" (Caesar & Kohler, 2007, p. 198). This paper outlines the development of the Multilingual Children with Speech Sound Disorders: Position Paper created to guide speech-language pathologists' (SLPs') facilitation of multilingual children's speech. An international expert panel was assembled comprising 57 researchers (SLPs, linguists, phoneticians, and speech scientists) with knowledge about multilingual children's speech, or children with speech sound disorders. Combined, they had worked in 33 countries and used 26 languages in professional practice. Fourteen panel members met for a one-day workshop to identify key points for inclusion in the position paper. Subsequently, 42 additional panel members participated online to contribute to drafts of the position paper. A thematic analysis was undertaken of the major areas of discussion using two data sources: (a) face-to-face workshop transcript (133 pages) and (b) online discussion artifacts (104 pages). Finally, a moderator with international expertise in working with children with speech sound disorders facilitated the incorporation of the panel's recommendations. The following themes were identified: definitions, scope, framework, evidence, challenges, practices, and consideration of a multilingual audience. The resulting position paper contains guidelines for providing services to multilingual children with speech sound disorders (http://www.csu.edu.au/research/multilingual-speech/position-paper). The paper is structured using the International Classification of Functioning, Disability and Health: Children and Youth Version (World Health Organization, 2007) and incorporates recommendations for (a) children and families, (b) SLPs' assessment and intervention, (c) SLPs' professional practice, and (d) SLPs' collaboration with other professionals. Readers will (1) recognize that multilingual children with speech sound disorders have both similar and different needs to monolingual children when working with speech-language pathologists; (2) describe the challenges for speech-language pathologists who work with multilingual children; (3) recall the importance of cultural competence for speech-language pathologists; (4) identify methods for international collaboration and consultation; and (5) recognize the importance of engaging with families and people within their local communities for supporting multilingual children in context. Copyright © 2013 Elsevier Inc. All rights reserved.

  8. Childhood apraxia of speech: A survey of praxis and typical speech characteristics.

    PubMed

    Malmenholt, Ann; Lohmander, Anette; McAllister, Anita

    2017-07-01

    The purpose of this study was to investigate current knowledge of the diagnosis of childhood apraxia of speech (CAS) in Sweden and to compare speech characteristics and symptoms to those of earlier survey findings in mainly English speakers. In a web-based questionnaire, 178 Swedish speech-language pathologists (SLPs) anonymously answered questions about their perception of typical speech characteristics for CAS. They graded their own assessment skills and estimated clinical occurrence. The seven top speech characteristics reported as typical for children with CAS were: inconsistent speech production (85%), sequencing difficulties (71%), oro-motor deficits (63%), vowel errors (62%), voicing errors (61%), consonant cluster deletions (54%), and prosodic disturbance (53%). Motor-programming deficits, described as a lack of automatization of speech movements, were perceived by 82%. All listed characteristics were consistent with the American Speech-Language-Hearing Association (ASHA) consensus-based features, Strand's 10-point checklist, and the diagnostic model proposed by Ozanne. The mode for clinical occurrence was 5%. The number of suspected CAS cases in the clinical caseload was approximately one new patient per year per SLP. The results support and add to findings from studies of CAS in English-speaking children, with similar speech characteristics regarded as typical. Possibly, these findings could contribute to cross-linguistic consensus on CAS characteristics.

  9. Segmental intelligibility of synthetic speech produced by rule.

    PubMed

    Logan, J S; Greene, B G; Pisoni, D B

    1989-08-01

    This paper reports the results of an investigation that employed the modified rhyme test (MRT) to measure the segmental intelligibility of synthetic speech generated automatically by rule. Synthetic speech produced by ten text-to-speech systems was studied and compared to natural speech. A variation of the standard MRT was also used to study the effects of response set size on perceptual confusions. Results indicated that the segmental intelligibility scores formed a continuum. Several systems displayed very high levels of performance that were close to or equal to scores obtained with natural speech; other systems displayed substantially worse performance compared to natural speech. The overall performance of the best system, DECtalk--Paul, was equivalent to the data obtained with natural speech for consonants in syllable-initial position. The findings from this study are discussed in terms of the use of a set of standardized procedures for measuring intelligibility of synthetic speech under controlled laboratory conditions. Recent work investigating the perception of synthetic speech under more severe conditions in which greater demands are made on the listener's processing resources is also considered. The wide range of intelligibility scores obtained in the present study demonstrates important differences in perception and suggests that not all synthetic speech is perceptually equivalent to the listener.

  10. Segmental intelligibility of synthetic speech produced by rule

    PubMed Central

    Logan, John S.; Greene, Beth G.; Pisoni, David B.

    2012-01-01

    This paper reports the results of an investigation that employed the modified rhyme test (MRT) to measure the segmental intelligibility of synthetic speech generated automatically by rule. Synthetic speech produced by ten text-to-speech systems was studied and compared to natural speech. A variation of the standard MRT was also used to study the effects of response set size on perceptual confusions. Results indicated that the segmental intelligibility scores formed a continuum. Several systems displayed very high levels of performance that were close to or equal to scores obtained with natural speech; other systems displayed substantially worse performance compared to natural speech. The overall performance of the best system, DECtalk—Paul, was equivalent to the data obtained with natural speech for consonants in syllable-initial position. The findings from this study are discussed in terms of the use of a set of standardized procedures for measuring intelligibility of synthetic speech under controlled laboratory conditions. Recent work investigating the perception of synthetic speech under more severe conditions in which greater demands are made on the listener’s processing resources is also considered. The wide range of intelligibility scores obtained in the present study demonstrates important differences in perception and suggests that not all synthetic speech is perceptually equivalent to the listener. PMID:2527884

  11. Speech rate in Parkinson's disease: A controlled study.

    PubMed

    Martínez-Sánchez, F; Meilán, J J G; Carro, J; Gómez Íñiguez, C; Millian-Morell, L; Pujante Valverde, I M; López-Alburquerque, T; López, D E

    2016-09-01

    Speech disturbances will affect most patients with Parkinson's disease (PD) over the course of the disease. The origin and severity of these symptoms are of clinical and diagnostic interest. To evaluate the clinical pattern of speech impairment in PD patients and identify significant differences in speech rate and articulation compared to control subjects, speech rate and articulation in a reading task were measured using an automatic analytical method. A total of 39 PD patients in the 'on' state and 45 age- and sex-matched asymptomatic controls participated in the study. None of the patients experienced dyskinesias or motor fluctuations during the test. The patients with PD displayed a significant reduction in speech and articulation rates; there were no significant correlations between the studied speech parameters and patient characteristics such as L-dopa dose, duration of the disorder, age, UPDRS III scores, or Hoehn & Yahr stage. Patients with PD show a characteristic pattern of declining speech rate. These results suggest that in PD, disfluencies are the result of the movement disorder affecting the physiology of speech production systems. Copyright © 2014 Sociedad Española de Neurología. Published by Elsevier España, S.L.U. All rights reserved.

  12. Modular Neural Networks for Speech Recognition.

    DTIC Science & Technology

    1996-08-01

    automatic speech recognition, understanding and translation since the early 1950's. Although researchers have demonstrated impressive results with...nodes. It serves only as a data source for the following hidden layer(s). Finally, the network's output is computed by neurons in the output layer. The...following update rule for weights in the hidden layer: w_ji^(n+1) = w_ji^(n) - eta * (dE/dw_ji). It is easy to generalize the backpropagation

  13. Application of artificial intelligence principles to the analysis of "crazy" speech.

    PubMed

    Garfield, D A; Rapp, C

    1994-04-01

    Artificial intelligence computer simulation methods can be used to investigate psychotic or "crazy" speech. Here, symbolic reasoning algorithms establish semantic networks that schematize speech. These semantic networks consist of two main structures: case frames and object taxonomies. Node-based reasoning rules apply to object taxonomies and pathway-based reasoning rules apply to case frames. Normal listeners may recognize speech as "crazy talk" based on violations of node- and pathway-based reasoning rules. In this article, three separate segments of schizophrenic speech illustrate violations of these rules. This artificial intelligence approach is compared and contrasted with other neurolinguistic approaches and is discussed as a conceptual link between neurobiological and psychodynamic understandings of psychopathology.
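
    To make the two structures described above concrete, the following minimal sketch, written purely for illustration, encodes a hypothetical object taxonomy and a case frame with slot restrictions, and flags the node-based rule violations that a listener might perceive as "crazy talk". All names and rules here are invented, not taken from Garfield and Rapp's system.

        # Hypothetical object taxonomy (is-a hierarchy); node-based reasoning
        # walks upward through the hierarchy.
        TAXONOMY = {
            "dog": "animal",
            "animal": "physical_object",
            "physical_object": "entity",
            "idea": "abstraction",
            "abstraction": "entity",
        }

        def is_a(node, ancestor):
            """Node-based reasoning rule: does `node` lie under `ancestor`?"""
            while node is not None:
                if node == ancestor:
                    return True
                node = TAXONOMY.get(node)
            return False

        # A case frame for "eat": the agent slot must be filled by an animal,
        # the patient slot by a physical object.
        CASE_FRAMES = {"eat": {"agent": "animal", "patient": "physical_object"}}

        def check_utterance(verb, agent, patient):
            """Collect the slot restrictions violated by an utterance."""
            frame = CASE_FRAMES[verb]
            violations = []
            if not is_a(agent, frame["agent"]):
                violations.append(f"agent '{agent}' violates the '{frame['agent']}' restriction")
            if not is_a(patient, frame["patient"]):
                violations.append(f"patient '{patient}' violates the '{frame['patient']}' restriction")
            return violations

        print(check_utterance("eat", "dog", "idea"))   # patient violation
        print(check_utterance("eat", "idea", "dog"))   # agent violation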

  14. Speech recognition: Acoustic phonetic and lexical knowledge representation

    NASA Astrophysics Data System (ADS)

    Zue, V. W.

    1983-02-01

    The purpose of this program is to develop a speech data base facility under which the acoustic characteristics of speech sounds in various contexts can be studied conveniently; investigate the phonological properties of a large lexicon of, say, 10,000 words, and determine to what extent the phonotactic constraints can be utilized in speech recognition; study the acoustic cues that are used to mark word boundaries; develop a test bed in the form of a large-vocabulary, isolated word recognition (IWR) system to study the interactions of acoustic, phonetic and lexical knowledge; and develop a limited continuous speech recognition system with the goal of recognizing any English word from its spelling in order to assess the interactions of higher-level knowledge sources.

  15. Neuroscience-inspired computational systems for speech recognition under noisy conditions

    NASA Astrophysics Data System (ADS)

    Schafer, Phillip B.

    Humans routinely recognize speech in challenging acoustic environments with background music, engine sounds, competing talkers, and other acoustic noise. However, today's automatic speech recognition (ASR) systems perform poorly in such environments. In this dissertation, I present novel methods for ASR designed to approach human-level performance by emulating the brain's processing of sounds. I exploit recent advances in auditory neuroscience to compute neuron-based representations of speech, and design novel methods for decoding these representations to produce word transcriptions. I begin by considering speech representations modeled on the spectrotemporal receptive fields of auditory neurons. These representations can be tuned to optimize a variety of objective functions, which characterize the response properties of a neural population. I propose an objective function that explicitly optimizes the noise invariance of the neural responses, and find that it gives improved performance on an ASR task in noise compared to other objectives. The method as a whole, however, fails to significantly close the performance gap with humans. I next consider speech representations that make use of spiking model neurons. The neurons in this method are feature detectors that selectively respond to spectrotemporal patterns within short time windows in speech. I consider a number of methods for training the response properties of the neurons. In particular, I present a method using linear support vector machines (SVMs) and show that this method produces spikes that are robust to additive noise. I compute the spectrotemporal receptive fields of the neurons for comparison with previous physiological results. To decode the spike-based speech representations, I propose two methods designed to work on isolated word recordings. The first method uses a classical ASR technique based on the hidden Markov model. The second method is a novel template-based recognition scheme that takes advantage of the neural representation's invariance in noise. The scheme centers on a speech similarity measure based on the longest common subsequence between spike sequences. The combined encoding and decoding scheme outperforms a benchmark system in extremely noisy acoustic conditions. Finally, I consider methods for decoding spike representations of continuous speech. To help guide the alignment of templates to words, I design a syllable detection scheme that robustly marks the locations of syllabic nuclei. The scheme combines SVM-based training with a peak selection algorithm designed to improve noise tolerance. By incorporating syllable information into the ASR system, I obtain strong recognition results in noisy conditions, although the performance in noiseless conditions is below the state of the art. The work presented here constitutes a novel approach to the problem of ASR that can be applied in the many challenging acoustic environments in which we use computer technologies today. The proposed spike-based processing methods can potentially be exploited in efficient hardware implementations and could significantly reduce the computational costs of ASR. The work also provides a framework for understanding the advantages of spike-based acoustic coding in the human brain.
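
    The template-matching decoder above rests on a similarity measure built from the longest common subsequence (LCS) between spike sequences. A minimal sketch of such a measure follows, with hypothetical spike sequences given as neuron IDs in firing order; the dissertation's exact normalization is not assumed here.

        def lcs_length(a, b):
            """Standard dynamic-programming LCS, O(len(a) * len(b))."""
            dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
            for i, x in enumerate(a, 1):
                for j, y in enumerate(b, 1):
                    dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
            return dp[-1][-1]

        def spike_similarity(a, b):
            """Normalize by the longer sequence so the score lies in [0, 1]."""
            return lcs_length(a, b) / max(len(a), len(b))

        template = [3, 7, 7, 1, 9, 4, 2]   # spikes evoked by a stored word template
        test = [3, 5, 7, 1, 4, 2]          # spikes evoked by a noisy test utterance
        print(spike_similarity(template, test))   # 5/7, about 0.71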

  16. [Speech syndrome and its course in patients who have sustained an ischemic stroke (clinico-tomographic study)].

    PubMed

    Stoliarova, L G; Varakin, Iu Ia; Vavilov, S B

    1981-01-01

    Clinical and tomographic examinations of 40 patients with aphasia developed after an ischemic stroke were carried out. In more than half of them, no correlation was noted between the gravity and character of the aphasia on the one hand and the size and localization of the ischemic focus (or foci) in the brain on the other. With a similar character and gravity of the speech disorder, the size and localization of the ischemic foci may differ, and vice versa. It has been shown that the interrelations between the focal pathology of the brain and the character and gravity of speech disorders are very complicated. One should take into consideration the possibility of individual organization of the speech functions, the degree of automatism of speech activity before the disease, and the state of the cerebrovascular system as a whole.

  17. An experimental version of the MZT (speech-from-text) system with external F(sub 0) control

    NASA Astrophysics Data System (ADS)

    Nowak, Ignacy

    1994-12-01

    The version of the Polish MZT speech-from-text system described in this article adds functions that make it possible to enter commands in the edited orthographic text to control the phrase component and accentuation parameters. This makes it possible to generate a series of modified intonation contours in the texts spoken by the system. The effects obtained are made easier to control by a graphic illustration of the fundamental frequency pattern in the phrases most recently 'spoken' by the system. This version of the system was designed as a test prototype which will help expand and refine the set of rules for automatic generation of intonation contours, which in turn will enable the fully automated speech-from-text system to generate speech with a more varied and precisely formed fundamental frequency pattern.

  18. Older Adults Expend More Listening Effort than Young Adults Recognizing Speech in Noise

    ERIC Educational Resources Information Center

    Gosselin, Penny Anderson; Gagne, Jean-Pierre

    2011-01-01

    Purpose: Listening in noisy situations is a challenging experience for many older adults. The authors hypothesized that older adults exert more listening effort compared with young adults. Listening effort involves the attention and cognitive resources required to understand speech. The purpose was (a) to quantify the amount of listening effort…

  19. Speech, Language, and Audiology Services in Public Schools

    ERIC Educational Resources Information Center

    Sunderland, L.C.

    2004-01-01

    The prevalence of communication disorders (speech, language, and hearing) among school-age children continues to increase, making it imperative that the classroom teacher be able to identify children in need of services. This article provides information that will enable all teachers to recognize when a child is exhibiting signs of a communication…

  20. Word-Form Familiarity Bootstraps Infant Speech Segmentation

    ERIC Educational Resources Information Center

    Altvater-Mackensen, Nicole; Mani, Nivedita

    2013-01-01

    At about 7 months of age, infants listen longer to sentences containing familiar words--but not deviant pronunciations of familiar words (Jusczyk & Aslin, 1995). This finding suggests that infants are able to segment familiar words from fluent speech and that they store words in sufficient phonological detail to recognize deviations from a…

  1. Speech Recognition in Fluctuating and Continuous Maskers: Effects of Hearing Loss and Presentation Level.

    ERIC Educational Resources Information Center

    Summers, Van; Molis, Michelle R.

    2004-01-01

    Listeners with normal-hearing sensitivity recognize speech more accurately in the presence of fluctuating background sounds, such as a single competing voice, than in unmodulated noise at the same overall level. These performance differences ore greatly reduced in listeners with hearing impairment, who generally receive little benefit from…

  2. Words from spontaneous conversational speech can be recognized with human-like accuracy by an error-driven learning algorithm that discriminates between meanings straight from smart acoustic features, bypassing the phoneme as recognition unit.

    PubMed

    Arnold, Denis; Tomaschek, Fabian; Sering, Konstantin; Lopez, Florence; Baayen, R Harald

    2017-01-01

    Sound units play a pivotal role in cognitive models of auditory comprehension. The general consensus is that during perception listeners break down speech into auditory words and subsequently phones. Indeed, cognitive speech recognition is typically taken to be computationally intractable without phones. Here we present a computational model trained on 20 hours of conversational speech that recognizes word meanings within the range of human performance (model 25%, native speakers 20-44%), without making use of phone or word form representations. Our model also successfully generates predictions about the speed and accuracy of human auditory comprehension. At the heart of the model is a 'wide' yet sparse two-layer artificial neural network with some hundred thousand input units representing summaries of changes in acoustic frequency bands, and proxies for lexical meanings as output units. We believe that our model holds promise for resolving longstanding theoretical problems surrounding the notion of the phone in linguistic theory.
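
    The "error-driven learning algorithm" of the title can be illustrated, under the assumption of a delta-rule (Rescorla-Wagner-style) update, as a wide linear mapping from acoustic cues to meaning outcomes. The cue and outcome names below are invented placeholders for the model's frequency-band summaries and lexical meanings.

        import numpy as np

        cues = ["band1_rise", "band2_fall", "band3_flat"]   # hypothetical acoustic cues
        outcomes = ["HAND", "BAND"]                          # hypothetical lexical meanings
        W = np.zeros((len(cues), len(outcomes)))             # cue-to-outcome weights
        eta = 0.1                                            # learning rate

        def learn(active_cues, present_outcome):
            """One error-driven update: move predictions toward what occurred."""
            x = np.array([c in active_cues for c in cues], dtype=float)
            t = np.array([o == present_outcome for o in outcomes], dtype=float)
            W[:] += eta * np.outer(x, t - x @ W)   # delta rule on the prediction error

        for _ in range(100):
            learn({"band1_rise", "band3_flat"}, "HAND")
            learn({"band2_fall", "band3_flat"}, "BAND")

        # At recognition time, the meaning with the highest activation wins.
        x = np.array([c in {"band1_rise", "band3_flat"} for c in cues], dtype=float)
        print(outcomes[int(np.argmax(x @ W))])   # -> HAND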

  3. Text-to-audiovisual speech synthesizer for children with learning disabilities.

    PubMed

    Mendi, Engin; Bayrak, Coskun

    2013-01-01

    Learning disabilities affect the ability of children to learn, despite their having normal intelligence. Assistive tools can greatly increase the functional capabilities of children with learning disorders in areas such as writing, reading, or listening. In this article, we describe a text-to-audiovisual synthesizer that can serve as an assistive tool for such children. The system automatically converts an input text to audiovisual speech, synchronizing the head, eye, and lip movements of the three-dimensional face model with appropriate facial expressions and the word flow of the text. The proposed system can enhance speech perception and help children with learning deficits improve their chances of success.

  4. Speaker emotion recognition: from classical classifiers to deep neural networks

    NASA Astrophysics Data System (ADS)

    Mezghani, Eya; Charfeddine, Maha; Nicolas, Henri; Ben Amar, Chokri

    2018-04-01

    Speaker emotion recognition is considered among the most challenging tasks in recent years. In fact, automatic systems for security, medicine, or education can be improved when the affective state of speech is considered. In this paper, a twofold approach for speech emotion classification is proposed: first, a relevant set of features is adopted; second, numerous supervised training techniques, involving classic methods as well as deep learning, are evaluated. Experimental results indicate that deep architectures can improve classification performance on two affective databases, the Berlin Database of Emotional Speech and the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset.

  5. SAM: speech-aware applications in medicine to support structured data entry.

    PubMed Central

    Wormek, A. K.; Ingenerf, J.; Orthner, H. F.

    1997-01-01

    In the last two years, improvement in speech recognition technology has directed the medical community's interest to porting and using such innovations in clinical systems. The acceptance of speech recognition systems in clinical domains increases with recognition speed, large medical vocabulary, high accuracy, continuous speech recognition, and speaker independence. Although some commercial speech engines approach these requirements, the greatest benefit can be achieved in adapting a speech recognizer to a specific medical application. The goals of our work are first, to develop a speech-aware core component which is able to establish connections to speech recognition engines of different vendors. This is realized in SAM. Second, with applications based on SAM we want to support the physician in his/her routine clinical care activities. Within the STAMP project (STAndardized Multimedia report generator in Pathology), we extend SAM by combining a structured data entry approach with speech recognition technology. Another speech-aware application in the field of Diabetes care is connected to a terminology server. The server delivers a controlled vocabulary which can be used for speech recognition. PMID:9357730

  6. A Development of a System Enables Character Input and PC Operation via Voice for a Physically Disabled Person with a Speech Impediment

    NASA Astrophysics Data System (ADS)

    Tanioka, Toshimasa; Egashira, Hiroyuki; Takata, Mayumi; Okazaki, Yasuhisa; Watanabe, Kenzi; Kondo, Hiroki

    We have designed and implemented a voice-operated PC support system for a physically disabled person with a speech impediment. Voice operation is an effective method for a physically disabled person with involuntary movement of the limbs and head. We have applied a commercial speech recognition engine to develop our system for practical purposes. Adopting a commercial engine reduces development cost and will help make our system useful to other people with speech impediments. We have customized the commercial speech recognition engine so that it can recognize the utterances of a person with a speech impediment. To avoid misrecognition, we have restricted the words that the recognition engine recognizes and separated target words from words with similar pronunciations. The huge number of words registered in commercial speech recognition engines causes frequent misrecognition of utterances by people with speech impediments, because their utterances are unclear and unstable. We have solved this problem by narrowing the input choices down to a small number and by registering ambiguous pronunciations in addition to the original ones. To realize full character input and PC operation with a small number of words, we have designed multiple input modes with categorized dictionaries and have introduced two-step input in every mode except numeral input, enabling correct operation with a small vocabulary. The system we have developed is at a practical level. The first author of this paper is physically disabled with a speech impediment. Using this system, he has been able not only to input characters into a PC but also to operate the Windows system smoothly. He uses the system in his daily life, and this paper was written by him with it. At present, the speech recognition is customized to him; it is, however, possible to customize it for other users by changing the words and registering new pronunciations according to each user's utterances.

  7. Adaptation of hidden Markov models for recognizing speech of reduced frame rate.

    PubMed

    Lee, Lee-Min; Jean, Fu-Rong

    2013-12-01

    The frame rate of the observation sequence in distributed speech recognition applications may be reduced to suit a resource-limited front-end device. In order to use models trained on full-frame-rate data in the recognition of reduced frame-rate (RFR) data, we propose a method for adapting the transition probabilities of hidden Markov models (HMMs) to match the frame rate of the observation. Experiments on the recognition of clean and noisy connected digits are conducted to evaluate the proposed method. Experimental results show that the proposed method can effectively compensate for the frame-rate mismatch between the training and the test data. Using our adapted model to recognize the RFR speech data, one can significantly reduce the computation time and achieve the same level of accuracy as that of a method which restores the frame rate using data interpolation.
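
    One plausible reading of the adaptation (a simplified sketch, not necessarily the paper's exact formulation): if the front end keeps only every k-th frame, one observation step of the RFR data spans k steps of the original Markov chain, so the full-frame-rate transition matrix A can be adapted as A^k.

        import numpy as np

        # Left-to-right HMM transition matrix trained at the full frame rate.
        A = np.array([[0.9, 0.1, 0.0],
                      [0.0, 0.8, 0.2],
                      [0.0, 0.0, 1.0]])

        def adapt_transitions(A, k):
            """Transition matrix matched to a frame rate reduced by factor k."""
            return np.linalg.matrix_power(A, k)

        A_rfr = adapt_transitions(A, 2)   # e.g., half the original frame rate
        print(A_rfr)                      # rows remain valid probability distributions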

  8. Visual Benefits in Apparent Motion Displays: Automatically Driven Spatial and Temporal Anticipation Are Partially Dissociated

    PubMed Central

    Ahrens, Merle-Marie; Veniero, Domenica; Gross, Joachim; Harvey, Monika; Thut, Gregor

    2015-01-01

    Many behaviourally relevant sensory events such as motion stimuli and speech have an intrinsic spatio-temporal structure. This will engage intentional and most likely unintentional (automatic) prediction mechanisms enhancing the perception of upcoming stimuli in the event stream. Here we sought to probe the anticipatory processes that are automatically driven by rhythmic input streams in terms of their spatial and temporal components. To this end, we employed an apparent visual motion paradigm testing the effects of pre-target motion on lateralized visual target discrimination. The motion stimuli either moved towards or away from peripheral target positions (valid vs. invalid spatial motion cueing) at a rhythmic or arrhythmic pace (valid vs. invalid temporal motion cueing). Crucially, we emphasized automatic motion-induced anticipatory processes by rendering the motion stimuli non-predictive of upcoming target position (by design) and task-irrelevant (by instruction), and by creating instead endogenous (orthogonal) expectations using symbolic cueing. Our data revealed that the apparent motion cues automatically engaged both spatial and temporal anticipatory processes, but that these processes were dissociated. We further found evidence for lateralisation of anticipatory temporal but not spatial processes. This indicates that distinct mechanisms may drive automatic spatial and temporal extrapolation of upcoming events from rhythmic event streams. This contrasts with previous findings that instead suggest an interaction between spatial and temporal attention processes when endogenously driven. Our results further highlight the need for isolating intentional from unintentional processes for better understanding the various anticipatory mechanisms engaged in processing behaviourally relevant stimuli with predictable spatio-temporal structure such as motion and speech. PMID:26623650

  9. Are You Talking to Me? Dialogue Systems Supporting Mixed Teams of Humans and Robots

    NASA Technical Reports Server (NTRS)

    Dowding, John; Clancey, William J.; Graham, Jeffrey

    2006-01-01

    This position paper describes an approach to building spoken dialogue systems for environments containing multiple human speakers and hearers, and multiple robotic speakers and hearers. We address the issue, for robotic hearers, of whether the speech they hear is intended for them or is more likely to be intended for some other hearer. We describe data collected during a series of experiments involving teams of multiple humans and robots (and other software participants), and some preliminary results for distinguishing robot-directed speech from human-directed speech. The domain of these experiments is Mars-analogue planetary exploration. These Mars-analogue field studies involve two subjects in simulated planetary space suits doing geological exploration with the help of 1-2 robots, supporting software agents, a habitat communicator, and links to a remote science team. The two subjects perform a task (geological exploration) which requires them to speak with each other while also speaking with their assistants. The technique used here is a probabilistic context-free grammar language model in the speech recognizer that is trained on prior robot-directed speech. Intuitively, the recognizer will give higher confidence to an utterance if it is similar to utterances that have been directed to the robot in the past.
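
    A hedged sketch of that idea: score each utterance with a language model trained on prior robot-directed speech, and treat low average per-word likelihood as evidence that some other hearer was addressed. For brevity, an add-alpha-smoothed unigram model stands in for the paper's probabilistic context-free grammar, and the robot name, corpus, and threshold are all invented.

        import math
        from collections import Counter

        robot_directed_corpus = [
            "boudreaux take a picture", "boudreaux come here",
            "take a sample here", "boudreaux follow me",
        ]
        counts = Counter(w for utt in robot_directed_corpus for w in utt.split())
        total, vocab = sum(counts.values()), len(counts)

        def avg_logprob(utterance, alpha=1.0):
            """Add-alpha smoothed unigram log-probability per word."""
            words = utterance.split()
            lp = sum(math.log((counts[w] + alpha) / (total + alpha * vocab)) for w in words)
            return lp / len(words)

        def robot_directed(utterance, threshold=-2.5):
            return avg_logprob(utterance) > threshold

        print(robot_directed("boudreaux take a picture"))        # True: looks robot-directed
        print(robot_directed("what a strange rock formation"))   # False: likely human-directed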

  10. The Development of the Speaker Independent ARM Continuous Speech Recognition System

    DTIC Science & Technology

    1992-01-01

    spoken airborne reconnaissance reports using a speech recognition system based on phoneme-level hidden Markov models (HMMs). Previous versions of the ARM...will involve automatic selection from multiple model sets, corresponding to different speaker types, and that the most rudimentary partition of a...The vocabulary size for the ARM task is 497 words. These words are related to the phoneme-level symbols corresponding to the models in the model set

  11. Stochastic Modeling as a Means of Automatic Speech Recognition

    DTIC Science & Technology

    1975-04-01

    comparing the features of different speech recognition systems, attention is often focused on the control structures and the methods of communication...with no need to use secondary storage. Note that we go from a group of separate knowledge sources to an integrated network representation in...exhaust the available time or storage...Chapter 1 - INTRODUCTION...On the other hand

  12. Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel

    PubMed Central

    Kleinschmidt, Dave F.; Jaeger, T. Florian

    2016-01-01

    Successful speech perception requires that listeners map the acoustic signal to linguistic categories. These mappings are not only probabilistic, but change depending on the situation. For example, one talker's /p/ might be physically indistinguishable from another talker's /b/ (cf. lack of invariance). We characterize the computational problem posed by such a subjectively non-stationary world and propose that the speech perception system overcomes this challenge by (1) recognizing previously encountered situations, (2) generalizing to other situations based on previous similar experience, and (3) adapting to novel situations. We formalize this proposal in the ideal adapter framework: (1) to (3) can be understood as inference under uncertainty about the appropriate generative model for the current talker, thereby facilitating robust speech perception despite the lack of invariance. We focus on two critical aspects of the ideal adapter. First, in situations that clearly deviate from previous experience, listeners need to adapt. We develop a distributional (belief-updating) learning model of incremental adaptation. The model provides a good fit against known and novel phonetic adaptation data, including perceptual recalibration and selective adaptation. Second, robust speech recognition requires that listeners learn to represent the structured component of cross-situation variability in the speech signal. We discuss how these two aspects of the ideal adapter provide a unifying explanation for adaptation, talker-specificity, and generalization across talkers and groups of talkers (e.g., accents and dialects). The ideal adapter provides a guiding framework for future investigations into speech perception and adaptation, and, more broadly, language comprehension. PMID:25844873
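
    The belief-updating component can be illustrated with a conjugate normal-normal update: the listener's belief about a talker's category mean (say, the voice onset time of /p/) is a Gaussian revised with each incoming token. This is a minimal sketch with hypothetical numbers and an assumed known within-category variance; the published model is richer (it also tracks category variances).

        def update_belief(mu, tau2, x, sigma2):
            """One token x: posterior mean and variance of the category mean."""
            w = tau2 / (tau2 + sigma2)   # how much to trust the new token
            return mu + w * (x - mu), (1 - w) * tau2

        mu, tau2 = 60.0, 100.0   # prior: /p/ VOT near 60 ms, talker mean uncertain
        sigma2 = 25.0            # assumed within-category variance (ms^2)

        for vot in [48.0, 50.0, 47.0, 51.0]:   # this talker's /p/ runs short
            mu, tau2 = update_belief(mu, tau2, vot, sigma2)
            print(f"belief about talker's /p/ mean: {mu:.1f} ms (variance {tau2:.1f})")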

  13. Noise-immune multisensor transduction of speech

    NASA Astrophysics Data System (ADS)

    Viswanathan, Vishu R.; Henry, Claudia M.; Derr, Alan G.; Roucos, Salim; Schwartz, Richard M.

    1986-08-01

    Two types of configurations of multiple sensors were developed, tested, and evaluated in speech recognition applications for robust performance in high levels of acoustic background noise: one type combines the individual sensor signals to provide a single speech signal input, and the other provides several parallel inputs. For single-input systems, several configurations of multiple sensors were developed and tested. Results from formal speech intelligibility and quality tests in simulated fighter aircraft cockpit noise show that each of the two-sensor configurations tested outperforms the constituent individual sensors in high noise. Also presented are results comparing the performance of two-sensor configurations and individual sensors in speaker-dependent, isolated-word speech recognition tests performed using a commercial recognizer (Verbex 4000) in simulated fighter aircraft cockpit noise.

  14. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar

    PubMed Central

    Shin, Young Hoon; Seo, Jiwon

    2016-01-01

    People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing. PMID:27801867

  15. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar.

    PubMed

    Shin, Young Hoon; Seo, Jiwon

    2016-10-29

    People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.

  16. Speech endpoint detection with non-language speech sounds for generic speech processing applications

    NASA Astrophysics Data System (ADS)

    McClain, Matthew; Romanowski, Brian

    2009-05-01

    Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known a priori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detecting certain types of NLSS, such as breaths and clicks, while detection of other types of NLSS such as filled pauses will require future research.
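
    One way to realize this kind of classification is to train one HMM on frames of LSS and another on frames of NLSS, then label each segment by which model assigns the higher log-likelihood. The sketch below uses the third-party hmmlearn package and random placeholder features in place of the paper's language-agnostic acoustic features; the state counts and feature dimension are arbitrary.

        import numpy as np
        from hmmlearn.hmm import GaussianHMM   # pip install hmmlearn

        rng = np.random.default_rng(0)
        lss_frames = rng.normal(0.0, 1.0, size=(500, 13))    # placeholder 13-dim features
        nlss_frames = rng.normal(1.5, 1.0, size=(300, 13))

        lss_hmm = GaussianHMM(n_components=3, covariance_type="diag").fit(lss_frames)
        nlss_hmm = GaussianHMM(n_components=2, covariance_type="diag").fit(nlss_frames)

        def classify_segment(frames):
            """Log-likelihood comparison between the two segment models."""
            return "LSS" if lss_hmm.score(frames) > nlss_hmm.score(frames) else "NLSS"

        print(classify_segment(rng.normal(1.5, 1.0, size=(40, 13))))   # likely "NLSS"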

  17. Syntactic error modeling and scoring normalization in speech recognition

    NASA Technical Reports Server (NTRS)

    Olorenshaw, Lex

    1991-01-01

    The objective was to develop a speech recognition system able to detect speech that is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Research was performed in the following areas: (1) syntactic error modeling; (2) score normalization; and (3) phoneme error modeling. The study of the types of errors that a reader makes will provide the basis for creating tests which approximate the use of the system in the real world. NASA-Johnson will develop this technology into a 'Literacy Tutor' in order to bring innovative concepts to the task of teaching adults to read.
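
    Score normalization for this kind of task is often realized as a duration-normalized log-likelihood ratio between the expected word's model and a background (anti) model, with low ratios flagging likely mispronunciations. The sketch below illustrates that general recipe with invented scores; it is not the report's actual method.

        def normalized_score(loglik_word, loglik_background, n_frames):
            """Duration-normalized log-likelihood ratio."""
            return (loglik_word - loglik_background) / n_frames

        def pronounced_correctly(loglik_word, loglik_background, n_frames, thresh=-0.2):
            return normalized_score(loglik_word, loglik_background, n_frames) > thresh

        # A well-read word is better explained by the expected word model...
        print(pronounced_correctly(-510.0, -530.0, 80))   # True
        # ...while a misread word is better explained by the background model.
        print(pronounced_correctly(-560.0, -540.0, 80))   # False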

  18. Worster-Drought Syndrome: Poorly Recognized despite Severe and Persistent Difficulties with Feeding and Speech

    ERIC Educational Resources Information Center

    Clark, Maria; Harris, Rebecca; Jolleff, Nicola; Price, Katie; Neville, Brian G. R.

    2010-01-01

    Aim: Worster-Drought syndrome (WDS), or congenital suprabulbar paresis, is a permanent movement disorder of the bulbar muscles causing persistent difficulties with swallowing, feeding, speech, and saliva control owing to a non-progressive disturbance in early brain development. As such, it falls within the cerebral palsies. The aim of this study…

  19. Children's Recognition of Their Own Recorded Voice: Influence of Age and Phonological Impairment

    ERIC Educational Resources Information Center

    Strombergsson, Sofia

    2013-01-01

    Children with phonological impairment (PI) often have difficulties perceiving insufficiencies in their own speech. The use of recordings has been suggested as a way of directing the child's attention toward his/her own speech, despite a lack of evidence that children actually recognize their recorded voice as their own. We present two studies of…

  20. Public Address, Cultural Diversity, and Tolerance: Teaching Cultural Diversity in Speech Classes.

    ERIC Educational Resources Information Center

    Byrd, Marquita L.

    While speech instructors work to design appropriate diversity goals in the public speaking class, few have the training for such a task. A review of course objectives and assignments for the basic course may be helpful. Suggestions for instructors working to incorporate diversity in the basic course include: (1) recognize the dominance of the…

  1. Motor Speech Disorders Associated with Primary Progressive Aphasia

    PubMed Central

    Duffy, Joseph R.; Strand, Edythe A.; Josephs, Keith A.

    2014-01-01

    Background Primary progressive aphasia (PPA) and conditions that overlap with it can be accompanied by motor speech disorders. Recognition and understanding of motor speech disorders can contribute to a fuller clinical understanding of PPA and its management as well as its localization and underlying pathology. Aims To review the types of motor speech disorders that may occur with PPA, its primary variants, and its overlap syndromes (progressive supranuclear palsy syndrome, corticobasal syndrome, motor neuron disease), as well as with primary progressive apraxia of speech. Main Contribution The review should assist clinicians' and researchers' understanding of the relationship between motor speech disorders and PPA and its major variants. It also highlights the importance of recognizing neurodegenerative apraxia of speech as a condition that can occur with little or no evidence of aphasia. Conclusion Motor speech disorders can occur with PPA. Their recognition can contribute to clinical diagnosis and management of PPA and to understanding and predicting the localization and pathology associated with PPA variants and conditions that can overlap with them. PMID:25309017

  2. The somatotopy of speech: Phonation and articulation in the human motor cortex

    PubMed Central

    Brown, Steven; Laird, Angela R.; Pfordresher, Peter Q.; Thelen, Sarah M.; Turkeltaub, Peter; Liotti, Mario

    2010-01-01

    A sizable literature on the neuroimaging of speech production has reliably shown activations in the orofacial region of the primary motor cortex. These activations have invariably been interpreted as reflecting “mouth” functioning and thus articulation. We used functional magnetic resonance imaging to compare an overt speech task with tongue movement, lip movement, and vowel phonation. The results showed that the strongest motor activation for speech was the somatotopic larynx area of the motor cortex, thus reflecting the significant contribution of phonation to speech production. In order to analyze further the phonatory component of speech, we performed a voxel-based meta-analysis of neuroimaging studies of syllable-singing (11 studies) and compared the results with a previously-published meta-analysis of oral reading (11 studies), showing again a strong overlap in the larynx motor area. Overall, these findings highlight the under-recognized presence of phonation in imaging studies of speech production, and support the role of the larynx motor cortex in mediating the “melodicity” of speech. PMID:19162389

  3. Auditory-perceptual speech analysis in children with cerebellar tumours: a long-term follow-up study.

    PubMed

    De Smet, Hyo Jung; Catsman-Berrevoets, Coriene; Aarsen, Femke; Verhoeven, Jo; Mariën, Peter; Paquier, Philippe F

    2012-09-01

    Mutism and Subsequent Dysarthria (MSD) and the Posterior Fossa Syndrome (PFS) have become well-recognized clinical entities which may develop after resection of cerebellar tumours. However, speech characteristics following a period of mutism have not been documented in much detail. This study carried out a perceptual speech analysis in 24 children and adolescents (of whom 12 became mute in the immediate postoperative phase) 1-12.2 years after cerebellar tumour resection. The most prominent speech deficits in this study were distorted vowels, slow rate, voice tremor, and monopitch. Factors influencing long-term speech disturbances are presence or absence of postoperative PFS, the localisation of the surgical lesion and the type of adjuvant treatment. Long-term speech deficits may be present up to 12 years post-surgery. The speech deficits found in children and adolescents with cerebellar lesions following cerebellar tumour surgery do not necessarily resemble adult speech characteristics of ataxic dysarthria. Copyright © 2012 European Paediatric Neurology Society. Published by Elsevier Ltd. All rights reserved.

  4. Robust Recognition of Loud and Lombard speech in the Fighter Cockpit Environment

    DTIC Science & Technology

    1988-08-01

    the latter as inter-speaker variability. According to Zue [Z85], inter-speaker variabilities can be attributed to sociolinguistic background, dialect...Journal of the Acoustical Society of America, Vol 50, 1971. [At74] B. S. Atal, "Linear prediction for speaker identification," Journal of the Acoustical Society of America, Vol 55, 1974. [B77] B. Beek, E. P. Neuberg, and D. C. Hodge, "An Assessment of the Technology of Automatic Speech Recognition for

  5. A hypothetical neurological association between dehumanization and human rights abuses.

    PubMed

    Murrow, Gail B; Murrow, Richard

    2015-07-01

    Dehumanization is anecdotally and historically associated with reduced empathy for the pain of dehumanized individuals and groups and with psychological and legal denial of their human rights and extreme violence against them. We hypothesize that 'empathy' for the pain and suffering of dehumanized social groups is automatically reduced because, as the research we review suggests, an individual's neural mechanisms of pain empathy best respond to (or produce empathy for) the pain of people whom the individual automatically or implicitly associates with her or his own species. This theory has implications for the philosophical conception of 'human' and of 'legal personhood' in human rights jurisprudence. It further has implications for First Amendment free speech jurisprudence, including the doctrine of 'corporate personhood' and consideration of the potential harm caused by dehumanizing hate speech. We suggest that the new, social neuroscience of empathy provides evidence that both the vagaries of the legal definition or legal fiction of 'personhood' and hate speech that explicitly and implicitly dehumanizes may (in their respective capacities to artificially humanize or dehumanize) manipulate the neural mechanisms of pain empathy in ways that could pose more of a true threat to human rights and rights-based democracy than previously appreciated.

  6. Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies.

    PubMed

    Mitra, Vikramjit; Nam, Hosung; Espy-Wilson, Carol Y; Saltzman, Elliot; Goldstein, Louis

    2010-09-13

    Many different studies have claimed that articulatory information can be used to improve the performance of automatic speech recognition systems. Unfortunately, such articulatory information is not readily available in typical speaker-listener situations. Consequently, such information has to be estimated from the acoustic signal in a process which is usually termed "speech-inversion." This study aims to propose and compare various machine learning strategies for speech inversion: Trajectory mixture density networks (TMDNs), feedforward artificial neural networks (FF-ANN), support vector regression (SVR), autoregressive artificial neural network (AR-ANN), and distal supervised learning (DSL). Further, using a database generated by the Haskins Laboratories speech production model, we test the claim that information regarding constrictions produced by the distinct organs of the vocal tract (vocal tract variables) is superior to flesh-point information (articulatory pellet trajectories) for the inversion process.
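
    Two of the compared strategies, a feedforward ANN and support vector regression, can be sketched with scikit-learn as regressors from acoustic feature vectors to a tract variable such as lip aperture. The synthetic data below merely stands in for acoustics paired with articulatory targets from the Haskins model; the architecture and hyperparameters are illustrative, not those of the study.

        import numpy as np
        from sklearn.neural_network import MLPRegressor
        from sklearn.svm import SVR

        rng = np.random.default_rng(1)
        X = rng.normal(size=(2000, 20))                                   # acoustic features
        y = 0.8 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, 2000)    # tract variable

        ann = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X[:1500], y[:1500])
        svr = SVR(kernel="rbf", C=1.0).fit(X[:1500], y[:1500])

        print("FF-ANN R^2:", ann.score(X[1500:], y[1500:]))
        print("SVR    R^2:", svr.score(X[1500:], y[1500:]))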

  7. Exploring expressivity and emotion with artificial voice and speech technologies.

    PubMed

    Pauletto, Sandra; Balentine, Bruce; Pidcock, Chris; Jones, Kevin; Bottaci, Leonardo; Aretoulaki, Maria; Wells, Jez; Mundy, Darren P; Balentine, James

    2013-10-01

    Emotion in audio-voice signals, as synthesized by text-to-speech (TTS) technologies, was investigated to formulate a theory of expression for user interface design. Emotional parameters were specified with markup tags, and the resulting audio was further modulated with post-processing techniques. Software was then developed to link a selected TTS synthesizer with an automatic speech recognition (ASR) engine, producing a chatbot that could speak and listen. Using these two artificial voice subsystems, investigators explored both artistic and psychological implications of artificial speech emotion. Goals of the investigation were interdisciplinary, with interest in musical composition, augmentative and alternative communication (AAC), commercial voice announcement applications, human-computer interaction (HCI), and artificial intelligence (AI). The work-in-progress points towards an emerging interdisciplinary ontology for artificial voices. As one study output, HCI tools are proposed for future collaboration.

  8. The benefit of combining a deep neural network architecture with ideal ratio mask estimation in computational speech segregation to improve speech intelligibility.

    PubMed

    Bentsen, Thomas; May, Tobias; Kressner, Abigail A; Dau, Torsten

    2018-01-01

    Computational speech segregation attempts to automatically separate speech from noise. This is challenging in conditions with interfering talkers and low signal-to-noise ratios. Recent approaches have adopted deep neural networks and successfully demonstrated speech intelligibility improvements. A selection of components may be responsible for the success with these state-of-the-art approaches: the system architecture, a time frame concatenation technique and the learning objective. The aim of this study was to explore the roles and the relative contributions of these components by measuring speech intelligibility in normal-hearing listeners. A substantial improvement of 25.4 percentage points in speech intelligibility scores was found going from a subband-based architecture, in which a Gaussian Mixture Model-based classifier predicts the distributions of speech and noise for each frequency channel, to a state-of-the-art deep neural network-based architecture. Another improvement of 13.9 percentage points was obtained by changing the learning objective from the ideal binary mask, in which individual time-frequency units are labeled as either speech- or noise-dominated, to the ideal ratio mask, where the units are assigned a continuous value between zero and one. Therefore, both components play significant roles and by combining them, speech intelligibility improvements were obtained in a six-talker condition at a low signal-to-noise ratio.
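
    The two learning objectives contrasted above can be written down directly from speech and noise magnitude spectrograms S and N of equal shape: the ideal binary mask thresholds the local SNR, while the ideal ratio mask assigns each time-frequency unit a continuous value between zero and one. The sketch uses the commonly cited forms; the study's exact parameters are not assumed.

        import numpy as np

        def ideal_binary_mask(S, N, lc_db=0.0):
            """1 where the local SNR exceeds the criterion, else 0."""
            snr_db = 20 * np.log10(np.maximum(S, 1e-12) / np.maximum(N, 1e-12))
            return (snr_db > lc_db).astype(float)

        def ideal_ratio_mask(S, N, beta=0.5):
            """Soft speech-to-mixture energy ratio, commonly with beta = 0.5."""
            return (S**2 / (S**2 + N**2)) ** beta

        rng = np.random.default_rng(2)
        S = np.abs(rng.normal(size=(257, 100)))   # placeholder speech magnitudes
        N = np.abs(rng.normal(size=(257, 100)))   # placeholder noise magnitudes
        ibm = ideal_binary_mask(S, N)             # hard 0/1 label per unit
        irm = ideal_ratio_mask(S, N)              # continuous values in [0, 1]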

  9. Speech intelligibility enhancement after maxillary denture treatment and its impact on quality of life.

    PubMed

    Knipfer, Christian; Riemann, Max; Bocklet, Tobias; Noeth, Elmar; Schuster, Maria; Sokol, Biljana; Eitner, Stephan; Nkenke, Emeka; Stelzle, Florian

    2014-01-01

    Tooth loss and its prosthetic rehabilitation significantly affect speech intelligibility. However, little is known about the influence of speech deficiencies on oral health-related quality of life (OHRQoL). The aim of this study was to investigate whether speech intelligibility enhancement through prosthetic rehabilitation significantly influences OHRQoL in patients wearing complete maxillary dentures. Speech intelligibility by means of an automatic speech recognition (ASR) system was prospectively evaluated and compared with subjectively assessed Oral Health Impact Profile (OHIP) scores. Speech was recorded in 28 edentulous patients 1 week prior to the fabrication of new complete maxillary dentures and 6 months thereafter. Speech intelligibility was computed based on the word accuracy (WA) by means of an ASR and compared with a matched control group. One week before and 6 months after rehabilitation, patients assessed their OHRQoL. Speech intelligibility improved significantly after 6 months. Subjects reported a significantly higher OHRQoL after maxillary rehabilitation with complete dentures. No significant correlation was found between the OHIP sum score or its subscales and the WA. Speech intelligibility enhancement achieved through the fabrication of new complete maxillary dentures might not be at the forefront of patients' perception of their quality of life. For the improvement of OHRQoL in patients wearing complete maxillary dentures, food intake and mastication as well as freedom from pain play a more prominent role.
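
    The word accuracy (WA) measure referred to above can be computed by aligning the ASR hypothesis to the reference transcript with a word-level edit distance: WA = (N - S - D - I) / N, where N is the number of reference words and S, D, I count substitutions, deletions, and insertions. A minimal sketch:

        def word_accuracy(reference, hypothesis):
            """WA = 1 - (word-level edit distance / reference length)."""
            ref, hyp = reference.split(), hypothesis.split()
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                    d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
            return 1.0 - d[-1][-1] / len(ref)

        print(word_accuracy("please read the text aloud",
                            "please reap the text"))   # 0.6: one substitution, one deletion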

  10. Speech recognition-based and automaticity programs to help students with severe reading and spelling problems.

    PubMed

    Higgins, Eleanor L; Raskind, Marshall H

    2004-12-01

    This study was conducted to assess the effectiveness of two programs developed by the Frostig Center Research Department to improve the reading and spelling of students with learning disabilities (LD): a computer Speech Recognition-based Program (SRBP) and a computer and text-based Automaticity Program (AP). Twenty-eight LD students with reading and spelling difficulties (aged 8 to 18) received each program for 17 weeks and were compared with 16 students in a contrast group who did not receive either program. After adjusting for age and IQ, both the SRBP and AP groups showed significant differences over the contrast group in improving word recognition and reading comprehension. Neither program showed significant differences over contrasts in spelling. The SRBP also improved the performance of the target group when compared with the contrast group on phonological elision and nonword reading efficiency tasks. The AP showed significant differences in all process and reading efficiency measures.

  11. Do perceived context pictures automatically activate their phonological code?

    PubMed

    Jescheniak, Jörg D; Oppermann, Frank; Hantsch, Ansgar; Wagner, Valentin; Mädebach, Andreas; Schriefers, Herbert

    2009-01-01

    Morsella and Miozzo (Morsella, E., & Miozzo, M. (2002). Evidence for a cascade model of lexical access in speech production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 555-563) have reported that the to-be-ignored context pictures become phonologically activated when participants name a target picture, and took this finding as support for cascaded models of lexical retrieval in speech production. In a replication and extension of their experiment in German, we failed to obtain priming effects from context pictures phonologically related to a to-be-named target picture. By contrast, corresponding context words (i.e., the names of the respective pictures) and the same context pictures, when used in an identity condition, did reliably facilitate the naming process. This pattern calls into question the generality of the claim advanced by Morsella and Miozzo that perceptual processing of pictures in the context of a naming task automatically leads to the activation of corresponding lexical-phonological codes.

  12. Carry-over fluency induced by extreme prolongations: A new behavioral paradigm.

    PubMed

    Briley, P M; Barnes, M P; Kalinowski, J S

    2016-04-01

    Extreme prolongations, which can be generated via extreme delayed auditory feedback (DAF) (e.g., 250-500 ms) or mediated cognitively with timing applications (e.g., an analog stopwatch) at 2 s per syllable, have long been behavioral techniques used to inhibit stuttering. Some therapies have used this rate solely to establish initial fluency, while others use extremely slowed speech to establish fluency and add other strategic techniques such as easy onsets and diaphragmatic breathing. Extreme prolongations generate effective, efficient, and immediate forward-flowing fluent speech, removing the signature behaviors of discrete stuttering (i.e., syllable repetitions and audible and inaudible postural fixations). Prolonged use of extreme prolongations establishes carry-over fluency, which is spontaneous, effortless speech absent of most, if not all, overt and covert manifestations of stuttering. The creation of this immediate fluency and the immense potential of extreme prolongations to generate long periods of carry-over fluency have been overlooked by researchers and clinicians alike. Clinicians depart from these longer prolongation durations as they attempt to achieve the same fluent results at a near-normal rate of speech. Clinicians assume they are re-teaching fluency and that slow rates will give rise to more normal rates with less control; but without carry-over fluency, controls and cognitive mediation are always needed for the inherently unstable speech systems of persons who stutter to experience fluent speech. The assumption is that the speech system is untenable without some level of continuous cognitive and motoric monitoring. The goal becomes omnipresent "near normal rate sounding fluency" with continuous mediation via cognitive and motoric processes. This pursuit of "normal sounding fluency" continues despite ever-present relapse. Relapse has become so common that acceptance of stuttering is the new therapy modality, because relapse has come to be understood as somewhat inevitable. Researchers and clinicians fail to recognize that immediate amelioration of stuttering and its attendant carry-over fluency are signs of a different pathway to fluency. In this path, clinicians focus on extreme prolongations and the extent of their carry-over. While fluency is automatically generated under these extreme prolongations, communication at this rate in routine speaking tasks is not feasible. The perceived solution is a systematic reduction in the duration of these prolongations, which attempts to approximate "normal speech." Typically, the reintroduction of speech at a normalized rate precipitates a laborious style that is undesirable to the person who stutters (PWS) and is discontinued once the PWS departs from the comforts of the clinical setting. The inevitable typically occurs: the well-intentioned therapist instructs the PWS to focus on the techniques while speaking at a rate that is nearest normal speech, but the overlooked extreme prolongations are unlikely ever to be revisited. The foundation of this hypothesis is that the departure from fluency generators (e.g., extreme prolongations) is the cause of regression to the stuttering set point. In turn, we postulate that the continued use of extreme prolongations, as a solitary practice method, will establish and nurture different neural pathways that will create a modality of fluent speech able to be experienced without cognitive or motoric mediation. This would result in fewer occurrences of stuttering due to a phenomenon called carry-over fluency. Thus, we hypothesize that the use of extreme prolongations fosters neural pathways for fluent speech, which will result in carry-over fluency that does not require mediation by the speaker. Copyright © 2016 Elsevier Ltd. All rights reserved.

  13. Smart command recognizer (SCR) - For development, test, and implementation of speech commands

    NASA Technical Reports Server (NTRS)

    Simpson, Carol A.; Bunnell, John W.; Krones, Robert R.

    1988-01-01

    The SCR, a rapid prototyping system for the development, testing, and implementation of speech commands in a flight simulator or test aircraft, is described. A single unit performs all functions needed during these three phases of system development, while the use of common software and speech command data structure files greatly reduces the preparation time for successive development phases. As a smart peripheral to a simulation or flight host computer, the SCR interprets the pilot's spoken input and passes command codes to the simulation or flight computer.

  14. Effects of emotion on different phoneme classes

    NASA Astrophysics Data System (ADS)

    Lee, Chul Min; Yildirim, Serdar; Bulut, Murtaza; Busso, Carlos; Kazemzadeh, Abe; Lee, Sungbok; Narayanan, Shrikanth

    2004-10-01

    This study investigates the effects of emotion on different phoneme classes using short-term spectral features. In the research on emotion in speech, most studies have focused on prosodic features of speech. In this study, based on the hypothesis that different emotions have varying effects on the properties of the different speech sounds, we investigate the usefulness of phoneme-class level acoustic modeling for automatic emotion classification. Hidden Markov models (HMMs) based on short-term spectral features for five broad phonetic classes are trained for this purpose on data obtained from recordings of two actresses. Each speaker produces 211 sentences with four different emotions (neutral, sad, angry, happy). Using the speech material, we trained and compared the performances of two sets of HMM classifiers: a generic set of "emotional speech" HMMs (one for each emotion) and a set of broad phonetic-class based HMMs (vowel, glide, nasal, stop, fricative) for each emotion type considered. Comparison of classification results indicates that different phoneme classes were affected differently by emotional change and that the vowel sounds are the most important indicator of emotions in speech. Detailed results and their implications on the underlying speech articulation will be discussed.
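
    The classifier design described above can be approximated in a few lines. The sketch below assumes precomputed MFCC segments grouped by emotion and phonetic class (the data loading and phone alignment are hypothetical) and uses hmmlearn's GaussianHMM in place of whatever HMM toolkit the authors used; classification picks the emotion whose class-specific models give the highest total log-likelihood.

        import numpy as np
        from hmmlearn.hmm import GaussianHMM

        EMOTIONS = ["neutral", "sad", "angry", "happy"]

        def train_models(segments):
            """segments[(emotion, phon_class)] -> list of (n_frames, n_mfcc) arrays."""
            models = {}
            for key, segs in segments.items():
                X = np.vstack(segs)                  # concatenated frames
                lengths = [len(s) for s in segs]     # per-segment frame counts
                m = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
                m.fit(X, lengths)
                models[key] = m
            return models

        def classify(models, utterance):
            """utterance: list of (phon_class, features) pairs from a phone alignment."""
            def score(emo):
                return sum(models[(emo, c)].score(f) for c, f in utterance)
            return max(EMOTIONS, key=score)          # highest total log-likelihood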

  15. Study of acoustic correlates associated with emotional speech

    NASA Astrophysics Data System (ADS)

    Yildirim, Serdar; Lee, Sungbok; Lee, Chul Min; Bulut, Murtaza; Busso, Carlos; Kazemzadeh, Ebrahim; Narayanan, Shrikanth

    2004-10-01

    This study investigates the acoustic characteristics of four different emotions expressed in speech. The aim is to obtain detailed acoustic knowledge on how a speech signal is modulated by changes from neutral to a certain emotional state. Such knowledge is necessary for automatic emotion recognition and classification and emotional speech synthesis. Speech data obtained from two semi-professional actresses are analyzed and compared. Each subject produces 211 sentences with four different emotions: neutral, sad, angry, happy. We analyze changes in temporal and acoustic parameters such as magnitude and variability of segmental duration, fundamental frequency, and the first three formant frequencies as a function of emotion. Acoustic differences among the emotions are also explored with mutual information computation, multidimensional scaling, and acoustic likelihood comparison with normal speech. Results indicate that speech associated with anger and happiness is characterized by longer duration, shorter interword silence, and higher pitch and rms energy with wider ranges. Sadness is distinguished from other emotions by lower rms energy and longer interword silence. Interestingly, the differences in formant pattern between [happiness/anger] and [neutral/sadness] are better reflected in back vowels such as /a/ (as in 'father') than in front vowels. Detailed results on intra- and interspeaker variability will be reported.
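
    The utterance-level measurements described here (duration, F0 statistics, rms energy) are straightforward to compute. The sketch below uses librosa as a stand-in for the authors' analysis tools; the file path and parameter ranges are placeholders.

        import numpy as np
        import librosa

        def acoustic_profile(wav_path):
            y, sr = librosa.load(wav_path, sr=16000)
            f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)  # F0 track
            rms = librosa.feature.rms(y=y)[0]                          # frame energy
            return {"duration_s": len(y) / sr,
                    "f0_mean": float(np.nanmean(f0)),
                    "f0_range": float(np.nanmax(f0) - np.nanmin(f0)),
                    "rms_mean": float(rms.mean())}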

  16. Toward dynamic magnetic resonance imaging of the vocal tract during speech production.

    PubMed

    Ventura, Sandra M Rua; Freitas, Diamantino Rui S; Tavares, João Manuel R S

    2011-07-01

    The most recent and significant magnetic resonance imaging (MRI) improvements allow for the visualization of the vocal tract during speech production, which has been revealed to be a powerful tool in dynamic speech research. However, a synchronization technique with enhanced temporal resolution is still required. The study design was transversal in nature. Throughout this work, a technique for the dynamic study of the vocal tract with MRI, using the heart's signal to synchronize and trigger the imaging-acquisition process, is presented and described. The technique in question is then used in the measurement of four speech articulatory parameters to assess three different syllables (articulatory gestures) of the European Portuguese language. The acquired MR images are automatically reconstructed so as to result in a variable sequence of images (slices) of different vocal tract shapes in articulatory positions associated with Portuguese speech sounds. The knowledge obtained as a result of the proposed technique represents a direct contribution to the improvement of speech synthesis algorithms, allowing for novel insights in coarticulation studies, in addition to providing more efficient clinical guidelines for speech rehabilitation. Copyright © 2011 The Voice Foundation. Published by Mosby, Inc. All rights reserved.

  17. Phonologically-based biomarkers for major depressive disorder

    NASA Astrophysics Data System (ADS)

    Trevino, Andrea Carolina; Quatieri, Thomas Francis; Malyska, Nicolas

    2011-12-01

    Of increasing importance in the civilian and military population is the recognition of major depressive disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we introduce vocal biomarkers that are derived automatically from phonologically-based measures of speech rate. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a 6-week duration. We find that dissecting average measures of speech rate into phone-specific characteristics and, in particular, combined phone-duration measures uncovers stronger relationships between speech rate and depression severity than global measures previously reported for a speech-rate biomarker. Results of this study are supported by correlation of our measures with depression severity and classification of depression state with these vocal measures. Our approach provides a general framework for analyzing individual symptom categories through phonological units, and supports the premise that speaking rate can be an indicator of psychomotor retardation severity.
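
    A hedged sketch of the feature-extraction step: given a forced alignment as (phone, start, end) triples (the alignment format, phone inventory, and severity scores below are assumptions, not the study's data), phone-specific durations and a global speech rate can be computed and correlated with severity.

        import numpy as np
        from scipy.stats import spearmanr

        def phone_rate_features(alignment, phones=("AA", "IY", "S")):
            durs = {p: [] for p in phones}
            for phone, start, end in alignment:
                if phone in durs:
                    durs[phone].append(end - start)
            total_s = alignment[-1][2] - alignment[0][1]
            feats = {"dur_" + p: float(np.mean(d)) if d else 0.0
                     for p, d in durs.items()}
            feats["phones_per_s"] = len(alignment) / total_s   # global speech rate
            return feats

        # Across speakers (hypothetical lists):
        # rho, p = spearmanr(dur_AA_per_speaker, severity_scores)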

  18. Characteristics of speaking style and implications for speech recognition.

    PubMed

    Shinozaki, Takahiro; Ostendorf, Mari; Atlas, Les

    2009-09-01

    Differences in speaking style are associated with more or less spectral variability, as well as different modulation characteristics. The greater variation in some styles (e.g., spontaneous speech and infant-directed speech) poses challenges for recognition but possibly also opportunities for learning more robust models, as evidenced by prior work and motivated by child language acquisition studies. In order to investigate this possibility, this work proposes a new method for characterizing speaking style (the modulation spectrum), examines spontaneous, read, adult-directed, and infant-directed styles in this space, and conducts pilot experiments in style detection and sampling for improved speech recognizer training. Speaking style classification is improved by using the modulation spectrum in combination with standard pitch and energy variation. Speech recognition experiments on a small vocabulary conversational speech recognition task show that sampling methods for training with a small amount of data benefit from the new features.
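
    One common formulation of the modulation spectrum, sketched below with the caveat that it may only approximate the paper's exact definition, is the spectrum of the low-pass-filtered amplitude envelope of the signal.

        import numpy as np
        from scipy.signal import hilbert, butter, filtfilt

        def modulation_spectrum(y, sr, cutoff_hz=32.0):
            env = np.abs(hilbert(y))                    # amplitude envelope
            b, a = butter(4, cutoff_hz / (sr / 2))      # low-pass the envelope
            env = filtfilt(b, a, env)
            spec = np.abs(np.fft.rfft(env - env.mean()))
            freqs = np.fft.rfftfreq(len(env), d=1.0 / sr)
            keep = freqs <= cutoff_hz                   # modulation rates of interest
            return freqs[keep], spec[keep]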

  19. System transfer modelling for automatic target recognizer evaluations

    NASA Astrophysics Data System (ADS)

    Clark, Lloyd G.

    1991-11-01

    Image processing to accomplish automatic recognition of military vehicles has promised increased weapons systems effectiveness and reduced timelines for a number of Department of Defense missions. Automatic Target Recognizers (ATRs) are often claimed to be able to recognize many different ground vehicles as possible targets in military air-to-surface targeting applications. The targeting scenario conditions include different vehicle poses and histories as well as a variety of imaging geometries, intervening atmospheres, and background environments. Testing these ATR subsystems has in most cases been limited to a handful of the scenario conditions of interest, as represented by imagery collected with the desired imaging sensor. The question naturally arises as to how robust the performance of the ATR is for all scenario conditions of interest, not just for the set of imagery upon which an algorithm was trained.

  20. Automatic measurement and representation of prosodic features

    NASA Astrophysics Data System (ADS)

    Ying, Goangshiuan Shawn

    Effective measurement and representation of prosodic features of the acoustic signal for use in automatic speech recognition and understanding systems is the goal of this work. Prosodic features (stress, duration, and intonation) are variations of the acoustic signal whose domains are beyond the boundaries of each individual phonetic segment. Listeners perceive prosodic features through a complex combination of acoustic correlates such as intensity, duration, and fundamental frequency (F0). We have developed new tools to measure F0 and intensity features. We apply a probabilistic global error correction routine to an Average Magnitude Difference Function (AMDF) pitch detector. A new short-term frequency-domain Teager energy algorithm is used to measure the energy of a speech signal. We have conducted a series of experiments performing lexical stress detection on words in continuous English speech from two speech corpora. We have experimented with two different approaches, a segment-based approach and a rhythm-unit-based approach, in lexical stress detection. The first approach uses pattern recognition with energy- and duration-based measurements as features to build Bayesian classifiers to detect the stress level of a vowel segment. In the second approach we define a rhythm unit and use only the F0-based measurement and a scoring system to determine the stressed segment in the rhythm unit. A duration-based segmentation routine was developed to break polysyllabic words into rhythm units. The long-term goal of this work is to develop a system that can effectively detect the stress pattern for each word in continuous speech utterances. Stress information will be integrated as a constraint for pruning the word hypotheses in a word recognition system based on hidden Markov models.
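
    The AMDF pitch detector at the heart of this work is easy to sketch for a single frame; the probabilistic global error correction and the Teager energy measure are not reproduced here, and the F0 search range is an assumption.

        import numpy as np

        def amdf_pitch(frame, sr, f0_min=60.0, f0_max=400.0):
            lags = np.arange(int(sr / f0_max), int(sr / f0_min))
            amdf = np.array([np.mean(np.abs(frame[l:] - frame[:-l])) for l in lags])
            period = lags[np.argmin(amdf)]   # deepest AMDF valley = pitch period
            return sr / period               # F0 estimate in Hz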

  1. Severity-Based Adaptation with Limited Data for ASR to Aid Dysarthric Speakers

    PubMed Central

    Mustafa, Mumtaz Begum; Salim, Siti Salwah; Mohamed, Noraini; Al-Qatab, Bassam; Siong, Chng Eng

    2014-01-01

    Automatic speech recognition (ASR) is currently used in many assistive technologies, such as helping individuals with speech impairment in their communication ability. One challenge in ASR for speech-impaired individuals is the difficulty in obtaining a good speech database of impaired speakers for building an effective speech acoustic model. Because there are very few existing databases of impaired speech, which are also limited in size, the obvious solution for building a speech acoustic model of impaired speech is to employ adaptation techniques. However, two issues have not been addressed in existing studies in the area of adaptation for speech impairment: (1) identifying the most effective adaptation technique for impaired speech; and (2) the use of suitable source models to build an effective impaired-speech acoustic model. This research investigates these two issues for dysarthria, a type of speech impairment affecting millions of people. We applied both unimpaired and impaired speech as the source model with well-known adaptation techniques, namely maximum likelihood linear regression (MLLR) and constrained MLLR (C-MLLR). The recognition accuracy of each impaired-speech acoustic model is measured in terms of word error rate (WER), with further assessments including phoneme insertion, substitution and deletion rates. Unimpaired speech, when combined with limited high-quality speech-impaired data, improves the performance of ASR systems in recognising severely impaired dysarthric speech. The C-MLLR adaptation technique was also found to be better than MLLR in recognising mildly and moderately impaired speech, based on statistical analysis of the WER. Phoneme substitution was found to be the biggest contributing factor to WER in dysarthric speech at all levels of severity. The results show that speech acoustic models derived from suitable adaptation techniques improve the performance of ASR systems in recognising impaired speech with limited adaptation data. PMID:24466004
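
    As a rough illustration of the adaptation idea, MLLR transforms the source model's Gaussian means with a shared affine transform estimated from adaptation data. The sketch below is a deliberately minimal version that solves for the transform by unweighted least squares; a full MLLR solution also accounts for state occupancies and covariances.

        import numpy as np

        def estimate_mllr_transform(source_means, adaptation_means):
            """Both arguments: (n_gaussians, dim) arrays of paired means."""
            ext = np.hstack([np.ones((len(source_means), 1)), source_means])
            W, *_ = np.linalg.lstsq(ext, adaptation_means, rcond=None)
            return W                         # shape (dim + 1, dim)

        def adapt_means(source_means, W):
            ext = np.hstack([np.ones((len(source_means), 1)), source_means])
            return ext @ W                   # adapted means for the impaired model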

  2. Individuals with Autism Spectrum Disorders Do Not Use Social Stereotypes in Irony Comprehension

    PubMed Central

    Zalla, Tiziana; Amsellem, Frederique; Chaste, Pauline; Ervas, Francesca; Leboyer, Marion; Champagne-Lavau, Maud

    2014-01-01

    Social and communication impairments are part of the essential diagnostic criteria used to define Autism Spectrum Disorders (ASDs). Difficulties in appreciating non-literal speech, such as irony in ASDs have been explained as due to impairments in social understanding and in recognizing the speaker’s communicative intention. It has been shown that social-interactional factors, such as a listener’s beliefs about the speaker’s attitudinal propensities (e.g., a tendency to use sarcasm, to be mocking, less sincere and more prone to criticism), as conveyed by an occupational stereotype, do influence a listener’s interpretation of potentially ironic remarks. We investigate the effect of occupational stereotype on irony detection in adults with High Functioning Autism or Asperger Syndrome (HFA/AS) and a comparison group of typically developed adults. We used a series of verbally presented stories containing ironic or literal utterances produced by a speaker having either a “sarcastic” or a “non-sarcastic” occupation. Although individuals with HFA/AS were able to recognize ironic intent and occupational stereotypes when the latter are made salient, stereotype information enhanced irony detection and modulated its social meaning (i.e., mockery and politeness) only in comparison participants. We concluded that when stereotype knowledge is not made salient, it does not automatically affect pragmatic communicative processes in individuals with HFA/AS. PMID:24748103

  3. On the Development of Speech Resources for the Mixtec Language

    PubMed Central

    2013-01-01

    The Mixtec language is one of the main native languages in Mexico. In general, due to urbanization, discrimination, and limited attempts to promote the culture, the native languages are disappearing. Most of the information available about the Mixtec language is in written form, as in dictionaries which, although including examples of how to pronounce the Mixtec words, are not as reliable as listening to the correct pronunciation from a native speaker. Formal acoustic resources, such as speech corpora, are almost non-existent for Mixtec, and no speech technologies are known to have been developed for it. This paper presents the development of the following resources for the Mixtec language: (1) a speech database of traditional narratives of the Mixtec culture spoken by a native speaker (labelled at the phonetic and orthographic levels by means of spectral analysis) and (2) a native speaker-adaptive automatic speech recognition (ASR) system (trained with the speech database) integrated with a Mixtec-to-Spanish/Spanish-to-Mixtec text translator. The speech database, although small and limited to a single variant, was reliable enough to build the multiuser speech application, which achieved a mean recognition/translation performance of up to 94.36% in experiments with non-native speakers (the target users). PMID:23710134

  4. Is talking to an automated teller machine natural and fun?

    PubMed

    Chan, F Y; Khalid, H M

    Usability and affective issues of using automatic speech recognition technology to interact with an automated teller machine (ATM) are investigated in two experiments. The first uncovered dialogue patterns of ATM users for the purpose of designing the user interface for a simulated speech ATM system. Applying the Wizard-of-Oz methodology and multiple mapping and word-spotting techniques, the speech-driven ATM accommodates bilingual users of Bahasa Melayu and English. The second experiment evaluates the usability of a hybrid speech ATM, comparing it with a simulated manual ATM. The aim is to investigate how natural and fun talking to a speech ATM can be for these first-time users. Subjects performed withdrawal and balance-enquiry tasks. An ANOVA was performed on the usability and affective data. The results showed significant differences between systems in the ability to complete the tasks as well as in transaction errors. Performance was measured by the time taken by subjects to complete the task and the number of speech recognition errors that occurred. On the basis of user emotions, it can be said that the hybrid speech system enabled pleasurable interaction. Despite the limitations of speech recognition technology, users are set to talk to the ATM when it becomes available for public use.

  5. Clinical evidence of the role of the cerebellum in the suppression of overt articulatory movements during reading. A study of reading in children and adolescents treated for cerebellar pilocytic astrocytoma.

    PubMed

    Ait Khelifa-Gallois, N; Puget, S; Longaud, A; Laroussinie, F; Soria, C; Sainte-Rose, C; Dellatolas, G

    2015-04-01

    It has been suggested that the cerebellum is involved in reading acquisition and in particular in the progression from automatic grapheme-phoneme conversion to the internalization of speech required for silent reading. This idea is in line with clinical and neuroimaging data showing a cerebellar role in subvocal rehearsal for printed verbalizable material, and with computational "internal models" of the cerebellum suggesting its role in inner speech (i.e. covert speech without mouthing the words). However, studies examining a possible cerebellar role in the suppression of articulatory movements during silent reading acquisition in children are lacking. Here, we report clinical evidence that the cerebellum plays a part in this transition. Reading performances were compared between a group of 17 paediatric patients treated for benign cerebellar tumours and a group of controls matched for age, gender, and parental socio-educational level. The patients scored significantly lower on all reading measures, but the most striking difference concerned silent reading: it was perfectly acquired by almost all controls, whereas 41% of the patients were unable to read any item silently. Silent reading was correlated with the Working Memory Index. The present findings converge with previous reports on an implication of the cerebellum in inner speech and in the automatization of reading. This cerebellar implication is probably not specific to reading, as it also seems to affect non-reading tasks such as counting.

  6. Ongoing slow oscillatory phase modulates speech intelligibility in cooperation with motor cortical activity.

    PubMed

    Onojima, Takayuki; Kitajo, Keiichi; Mizuhara, Hiroaki

    2017-01-01

    Neural oscillation is attracting attention as an underlying mechanism of speech recognition. Speech intelligibility is enhanced by the synchronization of speech rhythms and slow neural oscillation, which is typically observed in human scalp electroencephalography (EEG). In addition to the effect of neural oscillation, it has been proposed that speech recognition is enhanced by the identification of a speaker's motor signals, which are used for speech production. To verify the relationship between the effect of neural oscillation and motor cortical activity, we measured scalp EEG, and simultaneous EEG and functional magnetic resonance imaging (fMRI), during a speech recognition task in which participants were required to recognize spoken words embedded in noise. We propose an index to quantitatively evaluate the effect of EEG phase on behavioral performance. The results showed that the delta and theta EEG phase before speech input modulated the participants' response times in the speech recognition task. The simultaneous EEG-fMRI experiment showed that slow EEG activity was correlated with motor cortical activity. These results suggest that the effect of the slow oscillatory phase is associated with the activity of the motor cortex during speech recognition.
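
    A sketch of the pre-stimulus phase analysis, assuming epoched EEG of shape (n_trials, n_samples) and per-trial response times; the band edges and binning are illustrative, and the paper's actual phase-effect index is not reproduced.

        import numpy as np
        from scipy.signal import butter, filtfilt, hilbert

        def phase_at_onset(epochs, sr, band=(1.0, 8.0), onset_idx=0):
            nyq = sr / 2.0
            b, a = butter(3, [band[0] / nyq, band[1] / nyq], btype="band")
            phase = np.angle(hilbert(filtfilt(b, a, epochs, axis=1), axis=1))
            return phase[:, onset_idx]            # delta/theta phase per trial

        def mean_rt_by_phase(phases, rts, n_bins=6):
            edges = np.linspace(-np.pi, np.pi, n_bins + 1)
            idx = np.clip(np.digitize(phases, edges) - 1, 0, n_bins - 1)
            return [float(np.mean(rts[idx == k])) for k in range(n_bins)]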

  7. Neuromorphic crossbar circuit with nanoscale filamentary-switching binary memristors for speech recognition.

    PubMed

    Truong, Son Ngoc; Ham, Seok-Jin; Min, Kyeong-Sik

    2014-01-01

    In this paper, a neuromorphic crossbar circuit with binary memristors is proposed for speech recognition. Binary memristors, which are based on a filamentary-switching mechanism, are more readily available and easier to fabricate than analog memristors, which require rare materials and a more complicated fabrication process. Thus, we develop a neuromorphic crossbar circuit using filamentary-switching binary memristors rather than interface-switching analog memristors. The proposed binary memristor crossbar can recognize five vowels with 4-bit 64 input channels. The proposed crossbar is tested on 2,500 speech samples and verified to be able to recognize 89.2% of the tested samples. From the statistical simulation, the recognition rate of the binary memristor crossbar is estimated to degrade only slightly, from 89.2% to 80%, as the percentage variation in memristance increases from 0% to 15%. In contrast, the analog memristor crossbar loses its recognition rate dramatically, from 96% to 9%, for the same percentage variation in memristance.
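
    The read-out of such a crossbar is essentially a matrix-vector product with binary conductances. The toy sketch below (random weights and inputs with illustrative on/off conductances, not the paper's trained network) shows how device variation can be injected to test robustness.

        import numpy as np

        rng = np.random.default_rng(0)
        G_ON, G_OFF = 1.0, 0.01                    # normalized binary conductances
        W = rng.integers(0, 2, size=(5, 64))       # 5 vowel rows x 64 inputs

        def crossbar_currents(v_in, variation=0.0):
            g = np.where(W == 1, G_ON, G_OFF)
            g = g * (1.0 + variation * rng.standard_normal(g.shape))  # device spread
            return g @ v_in                        # summed column currents per row

        v = rng.random(64)
        print(np.argmax(crossbar_currents(v, variation=0.15)))  # winning vowel row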

  8. Creating speech-synchronized animation.

    PubMed

    King, Scott A; Parent, Richard E

    2005-01-01

    We present a facial model designed primarily to support animated speech. Our facial model takes facial geometry as input and transforms it into a parametric deformable model. The facial model uses a muscle-based parameterization, allowing for easier integration between speech synchrony and facial expressions. Our facial model has a highly deformable lip model that is grafted onto the input facial geometry to provide the necessary geometric complexity needed for creating lip shapes and high-quality renderings. Our facial model also includes a highly deformable tongue model that can represent the shapes the tongue undergoes during speech. We add teeth, gums, and upper palate geometry to complete the inner mouth. To decrease the processing time, we hierarchically deform the facial surface. We also present a method to animate the facial model over time to create animated speech using a model of coarticulation that blends visemes together using dominance functions. We treat visemes as a dynamic shaping of the vocal tract by describing visemes as curves instead of keyframes. We show the utility of the techniques described in this paper by implementing them in a text-to-audiovisual-speech system that creates animation of speech from unrestricted text. The facial and coarticulation models must first be interactively initialized. The system then automatically creates accurate real-time animated speech from the input text. It is capable of cheaply producing tremendous amounts of animated speech with very low resource requirements.
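
    The coarticulation step can be illustrated with exponential dominance functions in the spirit of Cohen and Massaro's model, which this approach resembles; the decay rate and viseme targets below are illustrative values, not the paper's.

        import numpy as np

        def dominance(t, center, magnitude=1.0, rate=8.0):
            return magnitude * np.exp(-rate * np.abs(t - center))

        def blended_parameter(t, visemes):
            """visemes: list of (center_time_s, target_value) for one parameter."""
            w = np.array([dominance(t, c) for c, _ in visemes])
            targets = np.array([v for _, v in visemes])
            return float((w * targets).sum() / w.sum())   # weighted viseme average

        ts = np.linspace(0.0, 1.0, 100)   # lip-opening track for three visemes
        track = [blended_parameter(t, [(0.1, 0.2), (0.4, 0.9), (0.8, 0.4)])
                 for t in ts]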

  9. Effects of stimulus response compatibility on covert imitation of vowels.

    PubMed

    Adank, Patti; Nuttall, Helen; Bekkering, Harold; Maegherman, Gwijde

    2018-03-13

    When we observe someone else speaking, we tend to automatically activate the corresponding speech motor patterns. When listening, we therefore covertly imitate the observed speech. Simulation theories of speech perception propose that covert imitation of speech motor patterns supports speech perception. Covert imitation of speech has been studied with interference paradigms, including the stimulus-response compatibility paradigm (SRC). The SRC paradigm measures covert imitation by comparing articulation of a prompt following exposure to a distracter. Responses tend to be faster for congruent than for incongruent distracters; thus, showing evidence of covert imitation. Simulation accounts propose a key role for covert imitation in speech perception. However, covert imitation has thus far only been demonstrated for a select class of speech sounds, namely consonants, and it is unclear whether covert imitation extends to vowels. We aimed to demonstrate that covert imitation effects as measured with the SRC paradigm extend to vowels, in two experiments. We examined whether covert imitation occurs for vowels in a consonant-vowel-consonant context in visual, audio, and audiovisual modalities. We presented the prompt at four time points to examine how covert imitation varied over the distracter's duration. The results of both experiments clearly demonstrated covert imitation effects for vowels, thus supporting simulation theories of speech perception. Covert imitation was not affected by stimulus modality and was maximal for later time points.

  10. Sound Classification in Hearing Aids Inspired by Auditory Scene Analysis

    NASA Astrophysics Data System (ADS)

    Büchler, Michael; Allegro, Silvia; Launer, Stefan; Dillier, Norbert

    2005-12-01

    A sound classification system for the automatic recognition of the acoustic environment in a hearing aid is discussed. The system distinguishes the four sound classes "clean speech," "speech in noise," "noise," and "music." A number of features that are inspired by auditory scene analysis are extracted from the sound signal. These features describe amplitude modulations, spectral profile, harmonicity, amplitude onsets, and rhythm. They are evaluated together with different pattern classifiers. Simple classifiers, such as rule-based and minimum-distance classifiers, are compared with more complex approaches, such as the Bayes classifier, neural network, and hidden Markov model. Sounds from a large database are employed for both training and testing of the system. The achieved recognition rates are very high except for the class "speech in noise." Problems arise in the classification of compressed pop music, strongly reverberated speech, and tonal or fluctuating noises.
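
    Two of the feature families named above can be sketched compactly; the formulas below (envelope modulation depth and an autocorrelation-based harmonicity) are plausible simplifications rather than the system's exact definitions, and the class centroids would come from training.

        import numpy as np
        from scipy.signal import hilbert

        def features(y, sr):
            env = np.abs(hilbert(y))
            mod_depth = env.std() / (env.mean() + 1e-9)    # amplitude modulation
            ac = np.correlate(y, y, mode="full")[len(y) - 1:]
            harm = ac[int(sr / 400):int(sr / 60)].max() / (ac[0] + 1e-9)
            return np.array([mod_depth, harm])

        def minimum_distance_class(feat, centroids):
            """centroids: dict class_name -> mean feature vector."""
            return min(centroids, key=lambda c: np.linalg.norm(feat - centroids[c]))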

  11. Abnormal laughter-like vocalisations replacing speech in primary progressive aphasia

    PubMed Central

    Rohrer, Jonathan D.; Warren, Jason D.; Rossor, Martin N.

    2009-01-01

    We describe ten patients with a clinical diagnosis of primary progressive aphasia (PPA) (pathologically confirmed in three cases) who developed abnormal laughter-like vocalisations in the context of progressive speech output impairment leading to mutism. Failure of speech output was accompanied by increasing frequency of the abnormal vocalisations until ultimately they constituted the patient's only extended utterance. The laughter-like vocalisations did not show contextual sensitivity but occurred as an automatic vocal output that replaced speech. Acoustic analysis of the vocalisations in two patients revealed abnormal motor features including variable note duration and inter-note interval, loss of temporal symmetry of laugh notes and loss of the normal decrescendo. Abnormal laughter-like vocalisations may be a hallmark of a subgroup in the PPA spectrum with impaired control and production of nonverbal vocal behaviour due to disruption of fronto-temporal networks mediating vocalisation. PMID:19435636

  12. Abnormal laughter-like vocalisations replacing speech in primary progressive aphasia.

    PubMed

    Rohrer, Jonathan D; Warren, Jason D; Rossor, Martin N

    2009-09-15

    We describe ten patients with a clinical diagnosis of primary progressive aphasia (PPA) (pathologically confirmed in three cases) who developed abnormal laughter-like vocalisations in the context of progressive speech output impairment leading to mutism. Failure of speech output was accompanied by increasing frequency of the abnormal vocalisations until ultimately they constituted the patient's only extended utterance. The laughter-like vocalisations did not show contextual sensitivity but occurred as an automatic vocal output that replaced speech. Acoustic analysis of the vocalisations in two patients revealed abnormal motor features including variable note duration and inter-note interval, loss of temporal symmetry of laugh notes and loss of the normal decrescendo. Abnormal laughter-like vocalisations may be a hallmark of a subgroup in the PPA spectrum with impaired control and production of nonverbal vocal behaviour due to disruption of fronto-temporal networks mediating vocalisation.

  13. Multi-stream LSTM-HMM decoding and histogram equalization for noise robust keyword spotting.

    PubMed

    Wöllmer, Martin; Marchi, Erik; Squartini, Stefano; Schuller, Björn

    2011-09-01

    Highly spontaneous, conversational, and potentially emotional and noisy speech is known to be a challenge for today's automatic speech recognition (ASR) systems, which highlights the need for advanced algorithms that improve speech features and models. Histogram equalization is an efficient method to reduce the mismatch between clean and noisy conditions by normalizing all moments of the probability distribution of the feature vector components. In this article, we propose to combine histogram equalization and multi-condition training for robust keyword detection in noisy speech. To better cope with conversational speaking styles, we show how contextual information can be effectively exploited in a multi-stream ASR framework that dynamically models context-sensitive phoneme estimates generated by a long short-term memory neural network. The proposed techniques are evaluated on the SEMAINE database, a corpus containing emotionally colored conversations with a cognitive system for "Sensitive Artificial Listening".
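
    Histogram equalization of a feature dimension reduces to quantile mapping: the empirical CDF of the noisy features is mapped onto a clean reference CDF. A minimal sketch, applied independently per feature dimension:

        import numpy as np

        def histogram_equalize(test_feat, ref_feat, n_quantiles=100):
            qs = np.linspace(0.0, 1.0, n_quantiles)
            test_q = np.quantile(test_feat, qs)    # noisy empirical CDF
            ref_q = np.quantile(ref_feat, qs)      # clean reference CDF
            return np.interp(test_feat, test_q, ref_q)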

  14. Cost-Effectiveness of Interventions for Children with Speech, Language and Communication Needs (SLCN): A Review Using the Drummond and Jefferson (1996) "Referee's Checklist"

    ERIC Educational Resources Information Center

    Law, James; Zeng, Biao; Lindsay, Geoff; Beecham, Jennifer

    2012-01-01

    Background: Although economic evaluation has been widely recognized as a key feature of both health services and educational research, for many years there has been a paucity of such studies relevant to services for children with speech, language and communication needs (SLCN), making the application of economic arguments to the development of…

  15. Analysis of speech sounds is left-hemisphere predominant at 100-150ms after sound onset.

    PubMed

    Rinne, T; Alho, K; Alku, P; Holi, M; Sinkkonen, J; Virtanen, J; Bertrand, O; Näätänen, R

    1999-04-06

    Hemispheric specialization of human speech processing has been found in brain imaging studies using fMRI and PET. Due to the restricted time resolution, these methods cannot, however, determine the stage of auditory processing at which this specialization first emerges. We used a dense electrode array covering the whole scalp to record the mismatch negativity (MMN), an event-related brain potential (ERP) automatically elicited by occasional changes in sounds, which ranged from non-phonetic (tones) to phonetic (vowels). MMN can be used to probe auditory central processing on a millisecond scale with no attention-dependent task requirements. Our results indicate that speech processing occurs predominantly in the left hemisphere at the early, pre-attentive level of auditory analysis.

  16. Speech and language support: How physicians can identify and treat speech and language delays in the office setting.

    PubMed

    Moharir, Madhavi; Barnett, Noel; Taras, Jillian; Cole, Martha; Ford-Jones, E Lee; Levin, Leo

    2014-01-01

    Failure to recognize and intervene early in speech and language delays can lead to multifaceted and potentially severe consequences for early child development and later literacy skills. While routine evaluations of speech and language during well-child visits are recommended, there is no standardized (office) approach to facilitate this. Furthermore, extensive wait times for speech and language pathology consultation represent valuable lost time for the child and family. Using speech and language expertise and paediatric collaboration, key content for an office-based tool was developed, with three goals: early and accurate identification of speech and language delays as well as children at risk for literacy challenges; appropriate referral to speech and language services when required; and teaching and, thus, empowering parents to create rich and responsive language environments at home. Using this tool, in combination with the Canadian Paediatric Society's Read, Speak, Sing and Grow Literacy Initiative, physicians will be better positioned to offer practical strategies to caregivers to enhance children's speech and language capabilities. The tool represents a strategy to evaluate speech and language delays. It depicts age-specific linguistic/phonetic milestones, suggests interventions, and serves as a practical interim treatment while the family is waiting for formal speech and language therapy consultation.

  17. Some Neurocognitive Correlates of Noise-Vocoded Speech Perception in Children With Normal Hearing: A Replication and Extension of Eisenberg et al. (2002).

    PubMed

    Roman, Adrienne S; Pisoni, David B; Kronenberger, William G; Faulkner, Kathleen F

    Noise-vocoded speech is a valuable research tool for testing experimental hypotheses about the effects of spectral degradation on speech recognition in adults with normal hearing (NH). However, very little research has utilized noise-vocoded speech with children with NH. Earlier studies with children with NH focused primarily on the amount of spectral information needed for speech recognition without assessing the contribution of neurocognitive processes to speech perception and spoken word recognition. In this study, we first replicated the seminal findings reported by Eisenberg et al. (2002), who investigated effects of lexical density and word frequency on noise-vocoded speech perception in a small group of children with NH. We then extended the research to investigate relations between noise-vocoded speech recognition abilities and five neurocognitive measures: auditory attention (AA) and response set, talker discrimination, and verbal and nonverbal short-term working memory. Thirty-one children with NH between 5 and 13 years of age were assessed on their ability to perceive lexically controlled words in isolation and in sentences that were noise-vocoded to four spectral channels. Children were also administered vocabulary assessments (Peabody Picture Vocabulary Test-4th Edition and Expressive Vocabulary Test-2nd Edition) and measures of AA (NEPSY AA and response set and a talker discrimination task) and short-term memory (visual digit and symbol spans). Consistent with the findings reported in the original Eisenberg et al. (2002) study, we found that children perceived noise-vocoded lexically easy words better than lexically hard words. Words in sentences were also recognized better than the same words presented in isolation. No significant correlations were observed between noise-vocoded speech recognition scores and the Peabody Picture Vocabulary Test-4th Edition using language quotients to control for age effects. However, children who scored higher on the Expressive Vocabulary Test-2nd Edition recognized lexically easy words better than lexically hard words in sentences. Older children perceived noise-vocoded speech better than younger children. Finally, we found that measures of AA and short-term memory capacity were significantly correlated with a child's ability to perceive noise-vocoded isolated words and sentences. First, we successfully replicated the major findings from the Eisenberg et al. (2002) study. Because familiarity, phonological distinctiveness and lexical competition affect word recognition, these findings provide additional support for the proposal that several foundational elementary neurocognitive processes underlie the perception of spectrally degraded speech. Second, we found strong and significant correlations between performance on neurocognitive measures and children's ability to recognize words and sentences noise-vocoded to four spectral channels. These findings extend earlier research suggesting that perception of spectrally degraded speech reflects early peripheral auditory processes, as well as additional contributions of executive function, specifically, selective attention and short-term memory processes in spoken word recognition. The present findings suggest that AA and short-term memory support robust spoken word recognition in children with NH even under compromised and challenging listening conditions. These results are relevant to research carried out with listeners who have hearing loss, because they are routinely required to encode, process, and understand spectrally degraded acoustic signals.
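
    For readers unfamiliar with the stimuli, a four-channel noise vocoder can be sketched as band-pass analysis, envelope extraction, and envelope-modulated noise resynthesis; the band edges below are illustrative, not those used to create the study's materials.

        import numpy as np
        from scipy.signal import butter, filtfilt, hilbert

        def noise_vocode(y, sr, edges=(100, 562, 1113, 2149, 4000)):
            rng = np.random.default_rng(0)
            out = np.zeros_like(y)
            for lo, hi in zip(edges[:-1], edges[1:]):
                b, a = butter(3, [lo / (sr / 2), hi / (sr / 2)], btype="band")
                band = filtfilt(b, a, y)
                env = np.abs(hilbert(band))                    # channel envelope
                noise = filtfilt(b, a, rng.standard_normal(len(y)))
                out += env * noise                             # modulated noise band
            return out / (np.max(np.abs(out)) + 1e-9)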

  18. Some Neurocognitive Correlates of Noise-Vocoded Speech Perception in Children with Normal Hearing: A Replication and Extension of Eisenberg et al., 2002

    PubMed Central

    Roman, Adrienne S.; Pisoni, David B.; Kronenberger, William G.; Faulkner, Kathleen F.

    2016-01-01

    Objectives Noise-vocoded speech is a valuable research tool for testing experimental hypotheses about the effects of spectral-degradation on speech recognition in adults with normal hearing (NH). However, very little research has utilized noise-vocoded speech with children with NH. Earlier studies with children with NH focused primarily on the amount of spectral information needed for speech recognition without assessing the contribution of neurocognitive processes to speech perception and spoken word recognition. In this study, we first replicated the seminal findings reported by Eisenberg et al. (2002) who investigated effects of lexical density and word frequency on noise-vocoded speech perception in a small group of children with NH. We then extended the research to investigate relations between noise-vocoded speech recognition abilities and five neurocognitive measures: auditory attention and response set, talker discrimination and verbal and nonverbal short-term working memory. Design Thirty-one children with NH between 5 and 13 years of age were assessed on their ability to perceive lexically controlled words in isolation and in sentences that were noise-vocoded to four spectral channels. Children were also administered vocabulary assessments (PPVT-4 and EVT-2) and measures of auditory attention (NEPSY Auditory Attention (AA) and Response Set (RS) and a talker discrimination task (TD)) and short-term memory (visual digit and symbol spans). Results Consistent with the findings reported in the original Eisenberg et al. (2002) study, we found that children perceived noise-vocoded lexically easy words better than lexically hard words. Words in sentences were also recognized better than the same words presented in isolation. No significant correlations were observed between noise-vocoded speech recognition scores and the PPVT-4 using language quotients to control for age effects. However, children who scored higher on the EVT-2 recognized lexically easy words better than lexically hard words in sentences. Older children perceived noise-vocoded speech better than younger children. Finally, we found that measures of auditory attention and short-term memory capacity were significantly correlated with a child’s ability to perceive noise-vocoded isolated words and sentences. Conclusions First, we successfully replicated the major findings from the Eisenberg et al. (2002) study. Because familiarity, phonological distinctiveness and lexical competition affect word recognition, these findings provide additional support for the proposal that several foundational elementary neurocognitive processes underlie the perception of spectrally-degraded speech. Second, we found strong and significant correlations between performance on neurocognitive measures and children’s ability to recognize words and sentences noise-vocoded to four spectral channels. These findings extend earlier research suggesting that perception of spectrally-degraded speech reflects early peripheral auditory processes as well as additional contributions of executive function, specifically, selective attention and short-term memory processes in spoken word recognition. The present findings suggest that auditory attention and short-term memory support robust spoken word recognition in children with NH even under compromised and challenging listening conditions. 
These results are relevant to research carried out with listeners who have hearing loss, since they are routinely required to encode, process and understand spectrally-degraded acoustic signals. PMID:28045787

  19. Speech recognition features for EEG signal description in detection of neonatal seizures.

    PubMed

    Temko, A; Boylan, G; Marnane, W; Lightbody, G

    2010-01-01

    In this work, features that are usually employed in automatic speech recognition (ASR) are used for the detection of neonatal seizures in newborn EEG. Three conventional ASR feature sets are compared to the feature set that was previously developed for this task. The results indicate that the thoroughly studied spectral-envelope-based ASR features perform reasonably well on their own. Additionally, the SVM Recursive Feature Elimination routine is applied to all extracted features pooled together. It is shown that ASR features consistently appear among the top-ranked features.
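
    The feature-ranking step maps directly onto scikit-learn's RFE with a linear-kernel SVM (which exposes the weights RFE needs); the EEG feature matrix and labels are assumed to be precomputed.

        from sklearn.svm import SVC
        from sklearn.feature_selection import RFE

        def rank_features(X, y, n_keep=10):
            """X: (n_epochs, n_features) pooled features; y: seizure labels."""
            rfe = RFE(estimator=SVC(kernel="linear"),
                      n_features_to_select=n_keep, step=1)
            rfe.fit(X, y)
            return rfe.ranking_        # rank 1 = retained / top features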

  20. Implementation and Performance Exploration of a Cross-Genre Part of Speech Tagging Methodology to Determine Dialog Act Tags in the Chat Domain

    DTIC Science & Technology

    2010-09-01

the flies.”) or a present tense verb when describing what an airplane does (“An airplane flies.”) This disambiguation is, in general, computationally...as part-of-speech and dialog-act tagging, and yet the volume of data created makes human analysis impractical. We present a cross-genre part-of...acceptable automatic dialog-act determination. Furthermore, we show that a simple naïve Bayes classifier achieves the same performance in a fraction of

  1. Robotics control using isolated word recognition of voice input

    NASA Technical Reports Server (NTRS)

    Weiner, J. M.

    1977-01-01

    A speech input/output system is presented that can be used to communicate with a task-oriented system. Human speech commands and synthesized voice output extend conventional information exchange capabilities between man and machine by utilizing audio input and output channels. The speech input facility comprises a hardware feature extractor and a microprocessor-implemented isolated word or phrase recognition system. The recognizer offers a medium-sized (100 commands), syntactically constrained vocabulary and exhibits close to real-time performance. The major portion of the recognition processing required is accomplished through software, minimizing the complexity of the hardware feature extractor.
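
    Recognizers of this vintage were commonly built around template matching with dynamic time warping; the sketch below shows that design as an illustration (the feature extractor is assumed, and DTW is a stand-in since the article does not name its matching algorithm).

        import numpy as np

        def dtw_distance(A, B):
            """A, B: (n_frames, n_feats) feature sequences."""
            n, m = len(A), len(B)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(A[i - 1] - B[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m] / (n + m)       # length-normalized alignment cost

        def recognize(features, templates):
            """templates: dict word -> stored feature sequence; returns best word."""
            return min(templates, key=lambda w: dtw_distance(features, templates[w]))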

  2. Corporate Speech and the Constitution: The Deregulation of Tobacco Advertising

    PubMed Central

    Gostin, Lawrence O.

    2002-01-01

    In a series of recent cases, the Supreme Court has given businesses powerful new First Amendment rights to advertise hazardous products. Most recently, in Lorillard Tobacco Co v Reilly (121 SCt 2404 [2001]), the court invalidated Massachusetts regulations intended to reduce underage smoking. The future prospects for commercial speech regulation appear dim, but the reasoning in commercial speech cases is supported by only a plurality of the court. A different First Amendment theory should recognize the importance of population health and the low value of corporate speech. In particular, a future court should consider the low informational value of tobacco advertising, the availability of alternative channels of communication, the unlawful practice of targeting minors, and the magnitude of the social harms. PMID:11867306

  3. Design and development of an AAC app based on a speech-to-symbol technology.

    PubMed

    Radici, Elena; Bonacina, Stefano; De Leo, Gianluca

    2016-08-01

    The purpose of this paper is to present the design and development of an Augmentative and Alternative Communication (AAC) app that uses a speech-to-symbol technology to model language, i.e., to recognize speech and display the text or graphic content related to it. Our app is intended to be adopted by communication partners who want to engage in interventions focused on improving communication skills. Our app has the goal of translating simple spoken sentences into a set of symbols that are understandable by children with complex communication needs. We moderated a focus group among six AAC communication partners. Then, we developed a prototype. We are currently starting to test our app in an AAC centre in Milan, Italy.
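
    The symbol-mapping step alone is simple to illustrate; the vocabulary and symbol file names below are hypothetical, and the upstream recognizer is assumed to return a plain transcript.

        SYMBOLS = {"eat": "symb_eat.png", "drink": "symb_drink.png",
                   "ball": "symb_ball.png", "more": "symb_more.png"}

        def speech_to_symbols(transcript):
            """Replace recognized words with symbols; keep the rest as text."""
            return [SYMBOLS.get(w, w) for w in transcript.lower().split()]

        print(speech_to_symbols("I want more ball"))
        # ['i', 'want', 'symb_more.png', 'symb_ball.png']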

  4. Corporate speech and the Constitution: the deregulation of tobacco advertising.

    PubMed

    Gostin, Lawrence O

    2002-03-01

    In a series of recent cases, the Supreme Court has given businesses powerful new First Amendment rights to advertise hazardous products. Most recently, in Lorillard Tobacco Co v Reilly (121 SCt 2404 [2001]), the court invalidated Massachusetts regulations intended to reduce underage smoking. The future prospects for commercial speech regulation appear dim, but the reasoning in commercial speech cases is supported by only a plurality of the court. A different First Amendment theory should recognize the importance of population health and the low value of corporate speech. In particular, a future court should consider the low informational value of tobacco advertising, the availability of alternative channels of communication, the unlawful practice of targeting minors, and the magnitude of the social harms.

  5. A Generative Model of Speech Production in Broca’s and Wernicke’s Areas

    PubMed Central

    Price, Cathy J.; Crinion, Jenny T.; MacSweeney, Mairéad

    2011-01-01

    Speech production involves the generation of an auditory signal from the articulators and vocal tract. When the intended auditory signal does not match the produced sounds, subsequent articulatory commands can be adjusted to reduce the difference between the intended and produced sounds. This requires an internal model of the intended speech output that can be compared to the produced speech. The aim of this functional imaging study was to identify brain activation related to the internal model of speech production after activation related to vocalization, auditory feedback, and movement in the articulators had been controlled. There were four conditions: silent articulation of speech, non-speech mouth movements, finger tapping, and visual fixation. In the speech conditions, participants produced the mouth movements associated with the words “one” and “three.” We eliminated auditory feedback from the spoken output by instructing participants to articulate these words without producing any sound. The non-speech mouth movement conditions involved lip pursing and tongue protrusions to control for movement in the articulators. The main difference between our speech and non-speech mouth movement conditions is that prior experience producing speech sounds leads to the automatic and covert generation of auditory and phonological associations that may play a role in predicting auditory feedback. We found that, relative to non-speech mouth movements, silent speech activated Broca’s area in the left dorsal pars opercularis and Wernicke’s area in the left posterior superior temporal sulcus. We discuss these results in the context of a generative model of speech production and propose that Broca’s and Wernicke’s areas may be involved in predicting the speech output that follows articulation. These predictions could provide a mechanism by which rapid movement of the articulators is precisely matched to the intended speech outputs during future articulations. PMID:21954392

  6. Alterations in attention capture to auditory emotional stimuli in job burnout: an event-related potential study.

    PubMed

    Sokka, Laura; Huotilainen, Minna; Leinikka, Marianne; Korpela, Jussi; Henelius, Andreas; Alain, Claude; Müller, Kiti; Pakarinen, Satu

    2014-12-01

    Job burnout is a significant cause of work absenteeism. Evidence from behavioral studies and patient reports suggests that job burnout is associated with impairments of attention and decreased working capacity, and it has overlapping elements with depression, anxiety and sleep disturbances. Here, we examined the electrophysiological correlates of automatic sound change detection and involuntary attention allocation in job burnout using scalp recordings of event-related potentials (ERP). Volunteers with job burnout symptoms but without severe depression and anxiety disorders, and their non-burnout controls, were presented with natural speech sound stimuli (a standard and nine deviants), as well as three rarely occurring speech sounds with strong emotional prosody. All stimuli elicited mismatch negativity (MMN) responses that were comparable in both groups. The groups differed with respect to the P3a, an ERP component reflecting involuntary shifts of attention: the job burnout group showed a shorter P3a latency in response to the emotionally negative stimulus, and a longer latency in response to the positive stimulus. The results indicate that in job burnout, automatic speech sound discrimination is intact, but there is an attention capture tendency that is faster for negative, and slower for positive, information compared to that of controls. Copyright © 2014 Elsevier B.V. All rights reserved.

  7. Rhythm in disguise: why singing may not hold the key to recovery from aphasia

    PubMed Central

    Kotz, Sonja A.; Henseler, Ilona; Turner, Robert; Geyer, Stefan

    2011-01-01

    The question of whether singing may be helpful for stroke patients with non-fluent aphasia has been debated for many years. However, the role of rhythm in speech recovery appears to have been neglected. In the current lesion study, we aimed to assess the relative importance of melody and rhythm for speech production in 17 non-fluent aphasics. Furthermore, we systematically alternated the lyrics to test for the influence of long-term memory and preserved motor automaticity in formulaic expressions. We controlled for vocal frequency variability, pitch accuracy, rhythmicity, syllable duration, phonetic complexity and other relevant factors, such as learning effects or the acoustic setting. Contrary to some opinion, our data suggest that singing may not be decisive for speech production in non-fluent aphasics. Instead, our results indicate that rhythm may be crucial, particularly for patients with lesions including the basal ganglia. Among the patients we studied, basal ganglia lesions accounted for more than 50% of the variance related to rhythmicity. Our findings therefore suggest that benefits typically attributed to melodic intoning in the past could actually have their roots in rhythm. Moreover, our data indicate that lyric production in non-fluent aphasics may be strongly mediated by long-term memory and motor automaticity, irrespective of whether lyrics are sung or spoken. PMID:21948939

  8. A hypothetical neurological association between dehumanization and human rights abuses

    PubMed Central

    Murrow, Gail B.; Murrow, Richard

    2015-01-01

    Dehumanization is anecdotally and historically associated with reduced empathy for the pain of dehumanized individuals and groups and with psychological and legal denial of their human rights and extreme violence against them. We hypothesize that ‘empathy’ for the pain and suffering of dehumanized social groups is automatically reduced because, as the research we review suggests, an individual's neural mechanisms of pain empathy best respond to (or produce empathy for) the pain of people whom the individual automatically or implicitly associates with her or his own species. This theory has implications for the philosophical conception of ‘human’ and of ‘legal personhood’ in human rights jurisprudence. It further has implications for First Amendment free speech jurisprudence, including the doctrine of ‘corporate personhood’ and consideration of the potential harm caused by dehumanizing hate speech. We suggest that the new, social neuroscience of empathy provides evidence that both the vagaries of the legal definition or legal fiction of ‘personhood’ and hate speech that explicitly and implicitly dehumanizes may (in their respective capacities to artificially humanize or dehumanize) manipulate the neural mechanisms of pain empathy in ways that could pose more of a true threat to human rights and rights-based democracy than previously appreciated. PMID:27774198

  9. The software for automatic creation of the formal grammars used by speech recognition, computer vision, editable text conversion systems, and some new functions

    NASA Astrophysics Data System (ADS)

    Kardava, Irakli; Tadyszak, Krzysztof; Gulua, Nana; Jurga, Stefan

    2017-02-01

    To make environmental perception by artificial intelligence more flexible, supporting software modules are needed that can automate the creation of a specific language syntax and perform further analysis for relevant decisions based on semantic functions. Under our proposed approach, couples of formal rules can be created for given sentences (in the case of natural languages) or statements (in the case of special languages) with the help of computer vision, speech recognition, or editable-text conversion systems, for further automatic improvement. In other words, we have developed an approach that can significantly improve the automation of the training process of artificial intelligence, which as a result yields a higher level of self-development skills, independent of users. Based on this approach, we have developed a demo version of the software, which includes the algorithm and code implementing all of the above-mentioned components (computer vision, speech recognition, and editable-text conversion). The program can work in multi-stream mode and simultaneously create a syntax based on information received from several sources.

  10. Neural network-based recognition of whistlers on spectrograms detected by satellite

    NASA Astrophysics Data System (ADS)

    Conti, Livio

    2016-04-01

    We present a system to automatically recognize and classify the occurrence of whistler waves on spectrograms of electric field measurements performed by satellite. Whistlers, VLF waves generated by lightning with a specific spectral dispersion relation, can induce precipitation of trapped Van Allen particles and play a role in the chemistry of some atmospheric components (mainly NOx). Moreover, it has also been suggested that an increase in the number of anomalous whistlers (i.e., whistlers with a high value of the dispersion constant) could be induced by disturbances in the Earth-ionosphere waveguide generated by seismo-electromagnetic emissions. On board a satellite, the recognition of whistlers requires analyzing high-resolution spectrograms that cannot be downloaded to Earth because of data-transmission limits. For this reason, real-time identification and classification must be performed on the satellite, avoiding the download of all the unprocessed data. The procedure that we have developed is based on a Time Delay Neural Network (TDNN). The TDNN, proposed some years ago for speech recognition, can also be fruitfully applied in real-time analysis of electromagnetic spectrograms to detect phenomena characterized by a specific shape or signature, such as whistler waves. Studies have been performed with the RNF experiment on board the DEMETER satellite, and our algorithm could be adopted on board the CSES satellite (China Seismo-Electromagnetic Satellite), whose launch is scheduled for the end of 2016. Moreover, the procedure can also be adapted to the automatic analysis of whistlers detected on the ground.
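
    A time-delay network amounts to 1-D convolutions over spectrogram frames; the PyTorch sketch below is a generic example of the architecture, with layer sizes and the two-class output chosen for illustration rather than taken from the RNF or CSES configuration.

        import torch
        import torch.nn as nn

        class WhistlerTDNN(nn.Module):
            def __init__(self, n_freq_bins=64, n_classes=2):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv1d(n_freq_bins, 32, kernel_size=5),     # +/-2 frame context
                    nn.ReLU(),
                    nn.Conv1d(32, 32, kernel_size=3, dilation=2),  # wider time delay
                    nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1),                       # pool over time
                    nn.Flatten(),
                    nn.Linear(32, n_classes))

            def forward(self, spec):       # spec: (batch, n_freq_bins, n_frames)
                return self.net(spec)

        logits = WhistlerTDNN()(torch.randn(8, 64, 200))   # 8 spectrogram windows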

  11. Using voice to create hospital progress notes: Description of a mobile application and supporting system integrated with a commercial electronic health record.

    PubMed

    Payne, Thomas H; Alonso, W David; Markiel, J Andrew; Lybarger, Kevin; White, Andrew A

    2018-01-01

    We describe the development and design of a smartphone app-based system for creating inpatient progress notes using voice: commercial automatic speech recognition software, text processing to recognize spoken voice commands and format the note, and integration with a commercial EHR. This new system fits hospital rounding workflow and was used to support a randomized clinical trial testing whether use of voice to create notes improves timeliness of note availability, note quality, and physician satisfaction with the note creation process. The system was used to create 709 notes, which were placed in the corresponding patients' EHR records. The median time from pressing the Send button to appearance of the formatted note in the Inbox was 8.8 min. The system was generally very reliable, accepted by physician users, and secure. This approach provides an alternative to the use of keyboard and templates to create progress notes and may appeal to physicians who prefer voice to typing. Copyright © 2017 Elsevier Inc. All rights reserved.
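
    The abstract does not publish the command grammar, but the text-processing step can be pictured as below: spoken formatting commands embedded in the ASR transcript are rewritten into note structure. The command phrases and their replacements are hypothetical, not the study's list.

        import re

        # Illustrative mapping of spoken commands to note formatting.
        COMMANDS = {
            r"\bnew paragraph\b": "\n\n",
            r"\bnext section\b": "\n\n== ",
        }

        def format_note(transcript):
            """Replace spoken formatting commands with note structure."""
            for pattern, replacement in COMMANDS.items():
                transcript = re.sub(pattern, replacement, transcript, flags=re.I)
            return transcript

        print(format_note("Assessment new paragraph Continue diuresis"))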

  12. Automatic recognition and analysis of synapses. [in brain tissue

    NASA Technical Reports Server (NTRS)

    Ungerleider, J. A.; Ledley, R. S.; Bloom, F. E.

    1976-01-01

    An automatic system for recognizing synaptic junctions would allow analysis of large samples of tissue for the possible classification of specific well-defined sets of synapses based upon structural morphometric indices. In this paper the three steps of our system are described: (1) cytochemical tissue preparation to allow easy recognition of the synaptic junctions; (2) transmission of the tissue information to a computer; and (3) analysis of each field to recognize the synapses and make measurements on them.

  13. A novel probabilistic framework for event-based speech recognition

    NASA Astrophysics Data System (ADS)

    Juneja, Amit; Espy-Wilson, Carol

    2003-10-01

    One of the reasons for the unsatisfactory performance of state-of-the-art automatic speech recognition (ASR) systems is the inferior acoustic modeling of low-level acoustic-phonetic information in the speech signal. An acoustic-phonetic approach to ASR, on the other hand, explicitly targets linguistic information in the speech signal, but such a system for continuous speech recognition (CSR) is not known to exist. A probabilistic and statistical framework for CSR based on the representation of speech sounds by bundles of binary-valued articulatory phonetic features is proposed. Multiple probabilistic sequences of linguistically motivated landmarks are obtained using binary classifiers of manner phonetic features (syllabic, sonorant, and continuant) and the knowledge-based acoustic parameters (APs) that are acoustic correlates of those features. The landmarks are then used for the extraction of knowledge-based APs for source and place phonetic features and their binary classification. Probabilistic landmark sequences are constrained using manner class language models for isolated or connected word recognition. The proposed method could overcome the disadvantages encountered by the early acoustic-phonetic knowledge-based systems that led the ASR community to switch to systems highly dependent on statistical pattern analysis methods and probabilistic language or grammar models.

  14. Feasibility of automated speech sample collection with stuttering children using interactive voice response (IVR) technology.

    PubMed

    Vogel, Adam P; Block, Susan; Kefalianos, Elaina; Onslow, Mark; Eadie, Patricia; Barth, Ben; Conway, Laura; Mundt, James C; Reilly, Sheena

    2015-04-01

    To investigate the feasibility of adopting automated interactive voice response (IVR) technology for remotely capturing standardized speech samples from stuttering children. Participants were ten 6-year-old stuttering children. Their parents called a toll-free number from their homes and were prompted to elicit speech from their children using a standard protocol involving conversation, picture description, and games. The automated IVR system was implemented using an off-the-shelf telephony software program and delivered by a standard desktop computer. The software infrastructure uses voice over internet protocol. Speech samples were automatically recorded during the calls. Video recordings were simultaneously acquired in the home at the time of the call to evaluate the fidelity of the telephone-collected samples. Key outcome measures included syllables spoken, percentage of syllables stuttered, and an overall rating of stuttering severity on a 10-point scale. Data revealed a high level of relative reliability, in terms of intra-class correlation between the video- and telephone-acquired samples, on all outcome measures during the conversation task. Findings were less consistent for speech samples during picture description and games. Results suggest that IVR technology can be used successfully to automate remote capture of child speech samples.

  15. Automatic detection of obstructive sleep apnea using speech signals.

    PubMed

    Goldshtein, Evgenia; Tarasiuk, Ariel; Zigel, Yaniv

    2011-05-01

    Obstructive sleep apnea (OSA) is a common disorder, associated with anatomical abnormalities of the upper airways, that affects 5% of the population. Acoustic parameters may be influenced by the vocal tract structure and soft tissue properties. We hypothesize that the speech signal properties of OSA patients differ from those of control subjects without OSA. Using speech signal processing techniques, we explored acoustic speech features of 93 subjects who were recorded with a text-dependent speech protocol and a digital audio recorder immediately prior to a polysomnography study. Following analysis of the polysomnography, subjects were divided into OSA (n=67) and non-OSA (n=26) groups. A Gaussian mixture model-based system was developed to model and classify between the groups; discriminative features such as vocal tract length and linear prediction coefficients were selected using a feature selection technique. Specificity and sensitivity of 83% and 79% were achieved for the male OSA patients, and 86% and 84% for the female OSA patients, respectively. We conclude that acoustic features from speech signals during wakefulness can detect OSA patients with good specificity and sensitivity. Such a system can be used as a basis for future development of a tool for OSA screening. © 2011 IEEE.
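
    One plausible realization of the classification scheme described, a Gaussian mixture model per group with classification by higher average log-likelihood, is sketched below on synthetic stand-in features; the component count and data are assumptions, not the study's configuration.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        osa_feats = rng.normal(0.0, 1.0, size=(67, 5))   # stand-in acoustic features
        ctl_feats = rng.normal(0.5, 1.0, size=(26, 5))

        gmm_osa = GaussianMixture(n_components=2, random_state=0).fit(osa_feats)
        gmm_ctl = GaussianMixture(n_components=2, random_state=0).fit(ctl_feats)

        def classify(x):
            """Assign the group whose mixture gives the higher log-likelihood."""
            return "OSA" if gmm_osa.score(x) > gmm_ctl.score(x) else "non-OSA"

        print(classify(rng.normal(0.0, 1.0, size=(1, 5))))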

  16. Speech rhythm alterations in Spanish-speaking individuals with Alzheimer's disease.

    PubMed

    Martínez-Sánchez, Francisco; Meilán, Juan J G; Vera-Ferrandiz, Juan Antonio; Carro, Juan; Pujante-Valverde, Isabel M; Ivanova, Olga; Carcavilla, Nuria

    2017-07-01

    Rhythm is the speech property related to the temporal organization of sounds. Considerable evidence now suggests that dementia of the Alzheimer's type is associated with impairments in speech rhythm. The aim of this study is to assess the use of an automatic computerized system for measuring speech rhythm characteristics in an oral reading task performed by 45 patients with Alzheimer's disease (AD), compared with the same characteristics among 82 healthy older adults without a diagnosis of dementia, matched by age, sex, and cultural background. A range of rhythmic-metric and clinical measurements was applied. The results show rhythmic differences between the groups, with higher variability of syllabic intervals in AD patients. Signal processing algorithms applied to oral reading recordings proved capable of differentiating between AD patients and older adults without dementia with an accuracy of 87% (specificity 81.7%, sensitivity 82.2%), based on the standard deviation of the duration of syllabic intervals. Experimental results show that syllabic variability measurements extracted from the speech signal can be used to distinguish between older adults without a diagnosis of dementia and those with AD, and may be useful as a tool for the objective study and quantification of speech deficits in AD.
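
    As a rough sketch of the reported measure, the snippet below estimates syllabic intervals from acoustic onsets and reports the standard deviation of their durations. Using onset detection as a proxy for syllable boundaries is an assumption for illustration, not the paper's algorithm; the bundled example recording stands in for a reading sample.

        import librosa
        import numpy as np

        y, sr = librosa.load(librosa.ex("libri1"), duration=30)   # any speech file
        onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
        intervals = np.diff(onsets)            # seconds between successive onsets
        print("syllabic-interval SD: %.3f s" % intervals.std())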

  17. When speaker identity is unavoidable: Neural processing of speaker identity cues in natural speech.

    PubMed

    Tuninetti, Alba; Chládková, Kateřina; Peter, Varghese; Schiller, Niels O; Escudero, Paola

    2017-11-01

    Speech sound acoustic properties vary largely across speakers and accents. When perceiving speech, adult listeners normally disregard non-linguistic variation caused by speaker or accent differences, in order to comprehend the linguistic message, e.g. to correctly identify a speech sound or a word. Here we tested whether the process of normalizing speaker and accent differences, facilitating the recognition of linguistic information, is found at the level of neural processing, and whether it is modulated by the listeners' native language. In a multi-deviant oddball paradigm, native and nonnative speakers of Dutch were exposed to naturally-produced Dutch vowels varying in speaker, sex, accent, and phoneme identity. Unexpectedly, the analysis of mismatch negativity (MMN) amplitudes elicited by each type of change shows a large degree of early perceptual sensitivity to non-linguistic cues. This finding on perception of naturally-produced stimuli contrasts with previous studies examining the perception of synthetic stimuli wherein adult listeners automatically disregard acoustic cues to speaker identity. The present finding bears relevance to speech normalization theories, suggesting that at an unattended level of processing, listeners are indeed sensitive to changes in fundamental frequency in natural speech tokens. Copyright © 2017 Elsevier Inc. All rights reserved.

  18. Intelligibility Evaluation of Pathological Speech through Multigranularity Feature Extraction and Optimization.

    PubMed

    Fang, Chunying; Li, Haifeng; Ma, Lin; Zhang, Mancai

    2017-01-01

    Pathological speech usually refers to speech distortion resulting from illness or other biological insults. Assessment of pathological speech plays an important role in assisting experts, but automatic evaluation of speech intelligibility is difficult because such speech is usually nonstationary and mutational. In this paper, we develop a new approach to feature extraction and reduction, describing a multigranularity combined feature scheme optimized by a hierarchical visual method. A novel method of generating the feature set, based on the S-transform and chaotic analysis, is proposed: 430 basic acoustic features (BAFS), 84 Mel S-transform cepstrum coefficients (MSCC) capturing local spectral characteristics, and 12 chaotic features. Finally, a radar chart and the F-score are used to optimize the features through hierarchical visual fusion. The feature set could be reduced from 526 to 96 dimensions on the NKI-CCRT corpus and to 104 dimensions on the SVD corpus. Experimental results show that the new features, classified with a support vector machine (SVM), give the best performance, with a recognition rate of 84.4% on the NKI-CCRT corpus and 78.7% on the SVD corpus. The proposed method is thus shown to be effective and reliable for pathological speech intelligibility evaluation.
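
    The F-score criterion named above can be pictured as follows: a generic Fisher-style per-feature score over two classes, on synthetic data sized to match the 526-to-96 reduction mentioned. This is an illustration under stated assumptions, not the authors' exact pipeline (the radar-chart step is omitted).

        import numpy as np

        def f_scores(X_pos, X_neg):
            """Per-feature F-score: between-class spread over within-class spread."""
            m = np.vstack([X_pos, X_neg]).mean(0)
            num = (X_pos.mean(0) - m) ** 2 + (X_neg.mean(0) - m) ** 2
            den = X_pos.var(0, ddof=1) + X_neg.var(0, ddof=1)
            return num / den

        rng = np.random.default_rng(1)
        healthy = rng.normal(0.0, 1, (40, 526))        # stand-in feature matrices
        pathological = rng.normal(0.3, 1, (40, 526))
        top96 = np.argsort(f_scores(pathological, healthy))[::-1][:96]
        print("kept feature indices:", top96[:10])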

  19. Applications of sub-audible speech recognition based upon electromyographic signals

    NASA Technical Reports Server (NTRS)

    Jorgensen, C. Charles (Inventor); Betts, Bradley J. (Inventor)

    2009-01-01

    Method and system for generating electromyographic or sub-audible signals ('SAWPs') and for transmitting and recognizing the SAWPs that represent the original words and/or phrases. The SAWPs may be generated in an environment that interferes excessively with normal speech or that requires stealth communications, and may be transmitted using encoded, enciphered or otherwise transformed signals that are less subject to signal distortion or degradation in the ambient environment.

  20. How Are Pronunciation Variants of Spoken Words Recognized? A Test of Generalization to Newly Learned Words

    ERIC Educational Resources Information Center

    Pitt, Mark A.

    2009-01-01

    One account of how pronunciation variants of spoken words (center → "senner" or "sennah") are recognized is that sublexical processes use information about variation in the same phonological environments to recover the intended segments [Gaskell, G., & Marslen-Wilson, W. D. (1998). Mechanisms of phonological inference in speech perception.…

  1. Connotative Meaning of Military Chat Communications

    DTIC Science & Technology

    2009-09-01

    humans recognize connotative cues expressing uncertainty, perception of personal threat, and urgency; formulate linguistic and non-linguistic means for...built a matrix of speech “cues” representative of uncertainty, perception of personal threat, and urgency, but also applied maximum entropy analysis...results. This project proposed to: (1) conduct a study of how humans recognize connotative cues expressing uncertainty, perception of personal

  2. When Half a Word Is Enough: Infants Can Recognize Spoken Words Using Partial Phonetic Information.

    ERIC Educational Resources Information Center

    Fernald, Anne; Swingley, Daniel; Pinto, John P.

    2001-01-01

    Two experiments tracked infants' eye movements to examine use of word-initial information to understand fluent speech. Results indicated that 21- and 18-month-olds recognized partial words as quickly and reliably as whole words. Infants' productive vocabulary and reaction time were related to word recognition accuracy. Results show that…

  3. A speech-controlled environmental control system for people with severe dysarthria.

    PubMed

    Hawley, Mark S; Enderby, Pam; Green, Phil; Cunningham, Stuart; Brownsell, Simon; Carmichael, James; Parker, Mark; Hatzis, Athanassios; O'Neill, Peter; Palmer, Rebecca

    2007-06-01

    Automatic speech recognition (ASR) can provide a rapid means of controlling electronic assistive technology. Off-the-shelf ASR systems function poorly for users with severe dysarthria because of the increased variability of their articulations. We have developed a limited vocabulary speaker dependent speech recognition application which has greater tolerance to variability of speech, coupled with a computerised training package which assists dysarthric speakers to improve the consistency of their vocalisations and provides more data for recogniser training. These applications, and their implementation as the interface for a speech-controlled environmental control system (ECS), are described. The results of field trials to evaluate the training program and the speech-controlled ECS are presented. The user-training phase increased the recognition rate from 88.5% to 95.4% (p<0.001). Recognition rates were good for people with even the most severe dysarthria in everyday usage in the home (mean word recognition rate 86.9%). Speech-controlled ECS were less accurate (mean task completion accuracy 78.6% versus 94.8%) but were faster to use than switch-scanning systems, even taking into account the need to repeat unsuccessful operations (mean task completion time 7.7s versus 16.9s, p<0.001). It is concluded that a speech-controlled ECS is a viable alternative to switch-scanning systems for some people with severe dysarthria and would lead, in many cases, to more efficient control of the home.

  4. Automatic detection of service initiation signals used in bars

    PubMed Central

    Loth, Sebastian; Huth, Kerstin; De Ruiter, Jan P.

    2013-01-01

    Recognizing the intention of others is important in all social interactions, especially in the service domain. Enabling a bartending robot to serve customers is particularly challenging as the system has to recognize the social signals produced by customers and respond appropriately. Detecting whether a customer would like to order is essential for the service encounter to succeed. This detection is particularly challenging in a noisy environment with multiple customers. Thus, a bartending robot has to be able to distinguish between customers intending to order, chatting with friends or just passing by. In order to study which signals customers use to initiate a service interaction in a bar, we recorded real-life customer-staff interactions in several German bars. These recordings were used to generate initial hypotheses about the signals customers produce when bidding for the attention of bar staff. Two experiments using snapshots and short video sequences then tested the validity of these hypothesized candidate signals. The results revealed that bar staff responded to a set of two non-verbal signals: first, customers position themselves directly at the bar counter and, secondly, they look at a member of staff. Both signals were necessary and, when occurring together, sufficient. The participants also showed a strong agreement about when these cues occurred in the videos. Finally, a signal detection analysis revealed that ignoring a potential order is deemed worse than erroneously inviting customers to order. We conclude that (a) these two easily recognizable actions are sufficient for recognizing the intention of customers to initiate a service interaction, but other actions such as gestures and speech were not necessary, and (b) the use of reaction time experiments using natural materials is feasible and provides ecologically valid results. PMID:24009594

  5. Developing a bilingual "persian cued speech" website for parents and professionals of children with hearing impairment.

    PubMed

    Movallali, Guita; Sajedi, Firoozeh

    2014-03-01

    The use of the internet as a source of information gathering, self-help, and support is becoming increasingly recognized. Parents and professionals of children with hearing impairment have been shown to seek information about different communication approaches online. Cued Speech is a very new approach for Persian-speaking pupils. Our aim was to develop a useful website giving information about Persian Cued Speech to parents and professionals of children with hearing impairment. All Cued Speech websites from different countries that fell within the first ten pages of the Google and Yahoo search engines were assessed. Main subjects and links were studied. All related information was gathered from the websites, textbooks, articles, etc. Using a framework that combined several criteria for health-information websites, we developed the Persian Cued Speech website for three distinct audiences (parents, professionals, and children). An accurate, complete, accessible, and readable resource about Persian Cued Speech for parents and professionals is now available.

  6. Spectral analysis method and sample generation for real time visualization of speech

    NASA Astrophysics Data System (ADS)

    Hobohm, Klaus

    A method for translating speech signals into optical patterns, characterized by high sound discriminability and learnability and designed to give deaf persons feedback for controlling their own speech, is presented. Important properties of the speech production and perception processes, and of the organs involved, are recalled in order to define requirements for speech visualization. It is established that the spectral representation must do justice to the time, frequency, and amplitude resolution of hearing, and that continuous variations in the acoustic parameters of the speech signal must be depicted by continuous variations in the images. A color table was developed for dynamic display, and sonograms were generated with five spectral analysis methods, including Fourier transforms and linear predictive coding. To evaluate sonogram quality, test subjects had to recognize consonant-vowel-consonant words; an optimized analysis method was obtained with a fast Fourier transform and a postprocessor. A hardware concept for a real-time speech visualization system, based on multiprocessor technology in a personal computer, is presented.
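
    A minimal sketch of sonogram generation with a color mapping, in the spirit of the system described, is shown below; the test tone, FFT parameters, and colormap are illustrative assumptions standing in for a speech signal and the paper's optimized analysis.

        import numpy as np
        from scipy.signal import spectrogram
        import matplotlib.pyplot as plt

        fs = 16000
        t = np.arange(fs) / fs
        x = np.sin(2 * np.pi * 440 * t) * np.hanning(fs)   # 1 s test signal

        f, frames, Sxx = spectrogram(x, fs=fs, nperseg=512, noverlap=384)
        plt.pcolormesh(frames, f, 10 * np.log10(Sxx + 1e-12), cmap="inferno")
        plt.xlabel("time (s)")
        plt.ylabel("frequency (Hz)")
        plt.show()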

  7. Elizabeth Usher Memorial Lecture: How do we change our profession? Using the lens of behavioural economics to improve evidence-based practice in speech-language pathology.

    PubMed

    McCabe, Patricia J

    2018-06-01

    Evidence-based practice (EBP) is a well-accepted theoretical framework around which speech-language pathologists strive to build their clinical decisions. The profession's conceptualisation of EBP has been evolving over the last 20 years with the practice of EBP now needing to balance research evidence, clinical data and informed patient choices. However, although EBP is not a new concept, as a profession, we seem to be no closer to closing the gap between research evidence and practice than we were at the start of the movement toward EBP in the late 1990s. This paper examines why speech-language pathologists find it difficult to change our own practice when we are experts in changing the behaviour of others. Using the lens of behavioural economics to examine the heuristics and cognitive processes which facilitate and inhibit change, the paper explores research showing how inconsistency of belief and action, or cognitive dissonance, is inevitable unless we act reflectively instead of automatically. The paper argues that heuristics that prevent us changing our practice toward EBP include the sunk cost fallacy, loss aversion, social desirability bias, choice overload and inertia. These automatic cognitive processes work to inhibit change and may partially account for the slow translation of research into practice. Fortunately, understanding and using other heuristics such as the framing effect, reciprocity, social proof, consistency and commitment may help us to understand our own behaviour as speech-language pathologists and help the profession, and those we work with, move towards EBP.

  8. V2S: Voice to Sign Language Translation System for Malaysian Deaf People

    NASA Astrophysics Data System (ADS)

    Mean Foong, Oi; Low, Tang Jung; La, Wai Wan

    The process of learning and understanding sign language may be cumbersome to some, and therefore this paper proposes a solution to this problem by providing a voice (English language) to sign language translation system using speech and image processing techniques. Speech processing, which includes speech recognition, is the study of recognizing the words being spoken regardless of who the speaker is. This project uses template-based recognition as the main approach, in which the V2S system first needs to be trained with speech patterns based on a generic spectral parameter set. These spectral parameter sets are then stored as templates in a database. The system performs the recognition process by matching the parameter set of the input speech against the stored templates, finally displaying the sign language in video format. Empirical results show that the system has an 80.3% recognition rate.
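
    Template matching of this kind is classically done with dynamic time warping (DTW) over spectral parameter vectors. The sketch below is a generic illustration under that assumption; the V2S feature front end is not shown, and the (time, features) arrays are assumed to come from it.

        import numpy as np

        def dtw_distance(a, b):
            """DTW cost between two (time, features) sequences."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        def recognize(utterance, templates):
            """Return the label of the stored template nearest to the utterance."""
            return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))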

  9. Dysarthria and broader motor speech deficits in Dravet syndrome.

    PubMed

    Turner, Samantha J; Brown, Amy; Arpone, Marta; Anderson, Vicki; Morgan, Angela T; Scheffer, Ingrid E

    2017-02-21

    To analyze the oral motor, speech, and language phenotype in 20 children and adults with Dravet syndrome (DS) associated with mutations in SCN1A. Fifteen verbal and 5 minimally verbal DS patients with SCN1A mutations (aged 15 months to 28 years) underwent a tailored assessment battery. Speech was characterized by imprecise articulation; abnormalities of nasal resonance, voice, and pitch; and prosodic errors. Half of verbal patients had moderately to severely impaired conversational speech intelligibility. Oral motor impairment, motor planning/programming difficulties, and poor postural control were typical. Nonverbal individuals had intentional communication. Cognitive skills varied markedly, with intellectual functioning ranging from the low average range to severe intellectual disability. Language impairment was congruent with cognition. We describe a distinctive speech, language, and oral motor phenotype in children and adults with DS associated with mutations in SCN1A. Recognizing this phenotype will guide therapeutic intervention in patients with DS. © 2017 American Academy of Neurology.

  10. Dysarthria and broader motor speech deficits in Dravet syndrome

    PubMed Central

    Turner, Samantha J.; Brown, Amy; Arpone, Marta; Anderson, Vicki; Morgan, Angela T.

    2017-01-01

    Objective: To analyze the oral motor, speech, and language phenotype in 20 children and adults with Dravet syndrome (DS) associated with mutations in SCN1A. Methods: Fifteen verbal and 5 minimally verbal DS patients with SCN1A mutations (aged 15 months to 28 years) underwent a tailored assessment battery. Results: Speech was characterized by imprecise articulation; abnormalities of nasal resonance, voice, and pitch; and prosodic errors. Half of verbal patients had moderately to severely impaired conversational speech intelligibility. Oral motor impairment, motor planning/programming difficulties, and poor postural control were typical. Nonverbal individuals had intentional communication. Cognitive skills varied markedly, with intellectual functioning ranging from the low average range to severe intellectual disability. Language impairment was congruent with cognition. Conclusions: We describe a distinctive speech, language, and oral motor phenotype in children and adults with DS associated with mutations in SCN1A. Recognizing this phenotype will guide therapeutic intervention in patients with DS. PMID:28148630

  11. Speech waveform perturbation analysis: a perceptual-acoustical comparison of seven measures.

    PubMed

    Askenfelt, A G; Hammarberg, B

    1986-03-01

    The performance of seven acoustic measures of cycle-to-cycle variations (perturbations) in the speech waveform was compared. All measures were calculated automatically and applied to running speech. Three of the measures refer to the frequency of occurrence and severity of waveform perturbations in specially selected parts of the speech, identified by means of the rate of change in the fundamental frequency. Three other measures refer to statistical properties of the distribution of the relative frequency differences between adjacent pitch periods. One perturbation measure refers to the percentage of consecutive pitch period differences with alternating signs. The acoustic measures were tested on tape-recorded speech samples from 41 voice patients, before and after successful therapy. Scattergrams of acoustic waveform perturbation data versus an average of perceived deviant voice qualities, as rated by voice clinicians, are presented. The perturbation measures were compared with regard to their acoustic-perceptual correlation and their ability to discriminate between normal and pathological voice status. The standard deviation of the distribution of the relative frequency differences was suggested as the most useful acoustic measure of waveform perturbations for clinical applications.
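
    Two of the measure families described can be pictured with a few lines of arithmetic over pitch-period durations: the standard deviation of relative differences between adjacent periods (the measure the study found most useful) and the alternating-sign rate. The period values below are illustrative, not study data.

        import numpy as np

        periods = np.array([8.10, 8.15, 8.05, 8.20, 8.00]) / 1000.0  # seconds
        rel_diff = np.diff(periods) / periods[:-1]    # relative period change
        print("perturbation SD: %.4f" % rel_diff.std(ddof=1))
        print("alternating-sign rate: %.2f"
              % np.mean(np.sign(rel_diff[1:]) != np.sign(rel_diff[:-1])))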

  12. Application of automatic threshold in dynamic target recognition with low contrast

    NASA Astrophysics Data System (ADS)

    Miao, Hua; Guo, Xiaoming; Chen, Yu

    2014-11-01

    A hybrid photoelectric joint transform correlator can achieve automatic real-time recognition with high precision through the combination of optical and electronic devices. When recognizing low-contrast targets with a photoelectric joint transform correlator, differences in attitude, brightness, and grayscale between target and template mean that only four to five frames of a dynamic target can be recognized without any processing. A CCD camera is used to capture the dynamic target images at 25 frames per second. Automatic thresholding has many advantages, such as fast processing, effective suppression of noise, enhancement of the diffraction energy of useful information, and better preservation of the outlines of target and template, so it plays a very important role in target recognition by optical correlation. However, the threshold obtained automatically by a program does not achieve the best recognition results for dynamic targets, because outline information is broken to some extent; in most cases the optimal threshold is obtained by manual intervention. Aiming at the characteristics of dynamic targets, an improved automatic threshold procedure was implemented by multiplying the OTSU threshold of target and template by a scale coefficient of the processed image and combining this with mathematical morphology. The optimal threshold for dynamic low-contrast target images can thus be obtained automatically. The recognition rate for dynamic targets is improved through reduced background noise and increased correlation information. A series of dynamic tank images moving at about 70 km/h was adopted as target images. Without any processing, the 1st frame of this series can correlate only with the 3rd frame. With the OTSU threshold, the 80th frame can be recognized; with the improved automatic threshold processing of the joint images, this number increases to 89 frames. Experimental results show that the improved automatic threshold processing has special application value for recognizing dynamic targets with low contrast.
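
    A minimal sketch of the improved thresholding step, the Otsu threshold scaled by a coefficient and followed by a morphological clean-up, is given below; the coefficient value, structuring element, and random stand-in image are assumptions, not the paper's settings.

        import numpy as np
        from skimage.filters import threshold_otsu
        from skimage.morphology import binary_opening, disk

        def binarize(joint_image, scale=0.9):
            """Scaled Otsu threshold plus morphological opening."""
            t = scale * threshold_otsu(joint_image)
            return binary_opening(joint_image > t, disk(1))  # drop small noise

        frame = np.random.rand(128, 128)      # stand-in for a joint image
        mask = binarize(frame)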

  13. Neuronal networks and self-organizing maps: new computer techniques in the acoustic evaluation of the infant cry.

    PubMed

    Schönweiler, R; Kaese, S; Möller, S; Rinscheid, A; Ptok, M

    1996-12-05

    Neuronal networks are computer-based techniques for the evaluation and control of complex information systems and processes. So far, they have been used in engineering, telecommunications, artificial speech, and speech recognition. A newer neuronal network approach is the self-organizing map (Kohonen map). In the 'learning' phase, the map adapts to the patterns of the primary signals. In the 'using' phase, the map node whose field best matches the input signal is called the 'winner'. In our study, we recorded the cries of newborns and young infants using digital audio tape (DAT) and a high-quality microphone. The cries were elicited by tactile stimuli while the children wore headphones. In 27 cases, delayed auditory feedback was presented to the children through the headphones and an additional three-head tape recorder. Spectrographic characteristics of the cries were classified by 20-step Bark spectra and then fed to the neuronal networks. It was possible to recognize similarities between different cries of the same children as well as interindividual differences, which are also audible to experienced listeners. Differences were obvious in profound hearing loss. We know much about the cries of both healthy and sick infants, but a reliable investigation regimen that can be used for routine clinical purposes has not yet been developed. If, in the future, it becomes possible to classify spectrographic characteristics automatically, even those that are not audible, neuronal networks may be helpful in the early diagnosis of infant diseases.
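
    For readers new to Kohonen maps, the sketch below implements the learning phase on stand-in 20-dimensional spectral vectors: each input pulls the best-matching 'winner' node and its neighborhood toward itself. The grid size and learning schedule are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(0)
        data = rng.random((500, 20))          # stand-ins for Bark spectra
        grid = rng.random((8, 8, 20))         # 8x8 map of 20-dim weight vectors

        for step, x in enumerate(data):
            lr = 0.5 * (1 - step / len(data))                 # decaying rate
            d = np.linalg.norm(grid - x, axis=2)
            wi, wj = np.unravel_index(d.argmin(), d.shape)    # the 'winner'
            ii, jj = np.indices(d.shape)
            hood = np.exp(-((ii - wi) ** 2 + (jj - wj) ** 2) / 2.0)
            grid += lr * hood[..., None] * (x - grid)         # pull toward input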

  14. Approximated mutual information training for speech recognition using myoelectric signals.

    PubMed

    Guo, Hua J; Chan, A D C

    2006-01-01

    A new training algorithm called approximated maximum mutual information (AMMI) is proposed to improve the accuracy of myoelectric speech recognition using hidden Markov models (HMMs). Previous studies have demonstrated that automatic speech recognition can be performed using myoelectric signals from articulatory muscles of the face. Classification of facial myoelectric signals can be performed using HMMs that are trained with the maximum likelihood (ML) algorithm; however, this algorithm maximizes the likelihood of the observations in the training sequence, which is not directly associated with optimal classification accuracy. The AMMI training algorithm instead attempts to maximize the mutual information, training the HMMs to optimize their parameters for discrimination. Our results show that AMMI training consistently reduces the error rates compared to those obtained with ML training, increasing the accuracy by approximately 3% on average.
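
    The contrast between the ML and mutual-information criteria can be illustrated for a single observation as below: ML scores only the correct model, while the MI criterion also penalizes competing models. The log-likelihood values are made up, and the paper's specific AMMI approximation is not reproduced.

        import numpy as np

        log_lik = np.array([-110.0, -113.2, -112.5])   # per-word-model log p(O|w)
        log_prior = np.log(np.full(3, 1 / 3))
        correct = 0

        ml_objective = log_lik[correct]                # ML: correct model only
        mmi_objective = (log_lik[correct] + log_prior[correct]
                         - np.logaddexp.reduce(log_lik + log_prior))
        print(ml_objective, mmi_objective)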

  15. Highlight summarization in golf videos using audio signals

    NASA Astrophysics Data System (ADS)

    Kim, Hyoung-Gook; Kim, Jin Young

    2008-01-01

    In this paper, we present automatic summarization of highlights in golf videos based on audio information alone, without video information. The proposed highlight summarization system is based on semantic audio segmentation and the detection of action units from audio signals. Studio speech, field speech, music, and applause are segmented by means of sound classification. Swings are detected by impulse onset detection. Sounds like swing and applause form a complete action unit, while studio speech and music parts are used to anchor the program structure. With the advantage of highly precise detection of applause, highlights are extracted effectively. Our experiments yield high classification precision on 18 golf games, demonstrating that the proposed system is effective and computationally efficient enough to apply the technology to embedded consumer electronic devices.

  16. Identifying Residual Speech Sound Disorders in Bilingual Children: A Japanese-English Case Study

    PubMed Central

    Preston, Jonathan L.; Seki, Ayumi

    2012-01-01

    Purpose The purposes are to (1) describe the assessment of residual speech sound disorders (SSD) in bilinguals by distinguishing speech patterns associated with second language acquisition from patterns associated with misarticulations, and (2) describe how assessment of domains such as speech motor control and phonological awareness can provide a more complete understanding of SSDs in bilinguals. Method A review of Japanese phonology is provided to offer a context for understanding the transfer of Japanese to English productions. A case study of an 11-year-old is presented, demonstrating parallel speech assessments in English and Japanese. Speech motor and phonological awareness tasks were conducted in both languages. Results Several patterns were observed in the participant’s English that could be plausibly explained by the influence of Japanese phonology. However, errors indicating a residual SSD were observed in both Japanese and English. A speech motor assessment suggested possible speech motor control problems, and phonological awareness was judged to be within the typical range of performance in both languages. Conclusion Understanding the phonological characteristics of L1 can help clinicians recognize speech patterns in L2 associated with transfer. Once these differences are understood, patterns associated with a residual SSD can be identified. Supplementing a relational speech analysis with measures of speech motor control and phonological awareness can provide a more comprehensive understanding of a client’s strengths and needs. PMID:21386046

  17. Speech and language support: How physicians can identify and treat speech and language delays in the office setting

    PubMed Central

    Moharir, Madhavi; Barnett, Noel; Taras, Jillian; Cole, Martha; Ford-Jones, E Lee; Levin, Leo

    2014-01-01

    Failure to recognize and intervene early in speech and language delays can lead to multifaceted and potentially severe consequences for early child development and later literacy skills. While routine evaluations of speech and language during well-child visits are recommended, there is no standardized (office) approach to facilitate this. Furthermore, extensive wait times for speech and language pathology consultation represent valuable lost time for the child and family. Using speech and language expertise, and paediatric collaboration, key content for an office-based tool was developed. The tool aimed to help physicians achieve three main goals: early and accurate identification of speech and language delays as well as children at risk for literacy challenges; appropriate referral to speech and language services when required; and teaching and, thus, empowering parents to create rich and responsive language environments at home. Using this tool, in combination with the Canadian Paediatric Society’s Read, Speak, Sing and Grow Literacy Initiative, physicians will be better positioned to offer practical strategies to caregivers to enhance children’s speech and language capabilities. The tool represents a strategy to evaluate speech and language delays. It depicts age-specific linguistic/phonetic milestones and suggests interventions. The tool represents a practical interim treatment while the family is waiting for formal speech and language therapy consultation. PMID:24627648

  18. Loudness perception and speech intensity control in Parkinson's disease.

    PubMed

    Clark, Jenna P; Adams, Scott G; Dykstra, Allyson D; Moodie, Shane; Jog, Mandar

    2014-01-01

    The aim of this study was to examine loudness perception in individuals with hypophonia and Parkinson's disease. The participants included 17 individuals with hypophonia related to Parkinson's disease (PD) and 25 age-equivalent controls. The three loudness perception tasks included a magnitude estimation procedure involving a sentence spoken at 60, 65, 70, 75 and 80 dB SPL, an imitation task involving a sentence spoken at 60, 65, 70, 75 and 80 dB SPL, and a magnitude production procedure involving the production of a sentence at five different loudness levels (habitual, two and four times louder and two and four times quieter). The participants with PD produced a significantly different pattern and used a more restricted range than the controls in their perception of speech loudness, imitation of speech intensity, and self-generated estimates of speech loudness. The results support a speech loudness perception deficit in PD involving an abnormal perception of externally generated and self-generated speech intensity. Readers will recognize that individuals with hypophonia related to Parkinson's disease may demonstrate a speech loudness perception deficit involving the abnormal perception of externally generated and self-generated speech intensity. Copyright © 2014 Elsevier Inc. All rights reserved.

  19. A new time-adaptive discrete bionic wavelet transform for enhancing speech from adverse noise environment

    NASA Astrophysics Data System (ADS)

    Palaniswamy, Sumithra; Duraisamy, Prakash; Alam, Mohammad Showkat; Yuan, Xiaohui

    2012-04-01

    Automatic speech processing systems are widely used in everyday life, such as in mobile communication, speech and speaker recognition, and assistance for the hearing impaired. In speech communication systems, the quality and intelligibility of speech are of utmost importance for ease and accuracy of information exchange. To obtain a speech signal that is intelligible and more pleasant to listen to, noise reduction is essential. In this paper a new Time Adaptive Discrete Bionic Wavelet Thresholding (TADBWT) scheme is proposed. The proposed technique uses a Daubechies mother wavelet to achieve better enhancement of speech from additive non-stationary noises that occur in real life, such as street noise and factory noise. Due to the integration of a human auditory system model into the wavelet transform, the bionic wavelet transform (BWT) has great potential for speech enhancement and may open a new path in speech processing. In the proposed technique, discrete BWT is first applied to noisy speech to derive TADBWT coefficients. The adaptive nature of the BWT is then captured by introducing a time-varying linear factor which updates the coefficients at each scale over time. This approach shows better performance than existing algorithms at lower input SNR due to modified soft level-dependent thresholding on time-adaptive coefficients. Objective and subjective test results confirmed the competency of the TADBWT technique. The effectiveness of the proposed technique is also evaluated for a speaker recognition task under noisy environments. The recognition results show that the TADBWT technique yields better performance when compared to alternate methods, specifically at lower input SNR.
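
    The general mechanism, decomposition with a Daubechies mother wavelet followed by soft thresholding of detail coefficients, can be sketched as below. The plain discrete wavelet transform stands in for the bionic transform, and the universal-threshold rule is a common textbook assumption, not the paper's time-adaptive rule.

        import numpy as np
        import pywt

        rng = np.random.default_rng(0)
        clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1024))
        noisy = clean + 0.3 * rng.standard_normal(1024)

        coeffs = pywt.wavedec(noisy, "db8", level=5)
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # noise estimate
        thr = sigma * np.sqrt(2 * np.log(noisy.size))         # universal threshold
        denoised = pywt.waverec(
            [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]],
            "db8")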

  20. Gradient language dominance affects talker learning.

    PubMed

    Bregman, Micah R; Creel, Sarah C

    2014-01-01

    Traditional conceptions of spoken language assume that speech recognition and talker identification are computed separately. Neuropsychological and neuroimaging studies imply some separation between the two faculties, but recent perceptual studies suggest better talker recognition in familiar languages than unfamiliar languages. A familiar-language benefit in talker recognition potentially implies strong ties between the two domains. However, little is known about the nature of this language familiarity effect. The current study investigated the relationship between speech and talker processing by assessing bilingual and monolingual listeners' ability to learn voices as a function of language familiarity and age of acquisition. Two effects emerged. First, bilinguals learned to recognize talkers in their first language (Korean) more rapidly than they learned to recognize talkers in their second language (English), while English-speaking participants showed the opposite pattern (learning English talkers faster than Korean talkers). Second, bilinguals' learning rate for talkers in their second language (English) correlated with age of English acquisition. Taken together, these results suggest that language background materially affects talker encoding, implying a tight relationship between speech and talker representations. Copyright © 2013 Elsevier B.V. All rights reserved.

  1. Methods for eliciting, annotating, and analyzing databases for child speech development.

    PubMed

    Beckman, Mary E; Plummer, Andrew R; Munson, Benjamin; Reidy, Patrick F

    2017-09-01

    Methods from automatic speech recognition (ASR), such as segmentation and forced alignment, have facilitated the rapid annotation and analysis of very large adult speech databases and databases of caregiver-infant interaction, enabling advances in speech science that were unimaginable just a few decades ago. This paper centers on two main problems that must be addressed in order to have analogous resources for developing and exploiting databases of young children's speech. The first problem is to understand and appreciate the differences between adult and child speech that cause ASR models developed for adult speech to fail when applied to child speech. These differences include the fact that children's vocal tracts are smaller than those of adult males and also changing rapidly in size and shape over the course of development, leading to between-talker variability across age groups that dwarfs the between-talker differences between adult men and women. Moreover, children do not achieve fully adult-like speech motor control until they are young adults, and their vocabularies and phonological proficiency are developing as well, leading to considerably more within-talker variability as well as more between-talker variability. The second problem then is to determine what annotation schemas and analysis techniques can most usefully capture relevant aspects of this variability. Indeed, standard acoustic characterizations applied to child speech reveal that adult-centered annotation schemas fail to capture phenomena such as the emergence of covert contrasts in children's developing phonological systems, while also revealing children's nonuniform progression toward community speech norms as they acquire the phonological systems of their native languages. Both problems point to the need for more basic research into the growth and development of the articulatory system (as well as of the lexicon and phonological system) that is oriented explicitly toward the construction of age-appropriate computational models.

  2. An automatic speech recognition system with speaker-independent identification support

    NASA Astrophysics Data System (ADS)

    Caranica, Alexandru; Burileanu, Corneliu

    2015-02-01

    The novelty of this work lies in the application of an open source research software toolkit (CMU Sphinx) to train, build, and evaluate a speech recognition system, with speaker-independent support, for voice-controlled hardware applications. Moreover, we propose to use the trained acoustic model to successfully decode offline voice commands on embedded hardware, such as an ARMv6 low-cost SoC, the Raspberry Pi. This type of single-board computer, mainly used for educational and research activities, can serve as a proof-of-concept software and hardware stack for low-cost voice automation systems.
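
    By way of illustration, offline keyword decoding with CMU Sphinx's Python bindings can look like the snippet below; whether this exact API and keyphrase configuration match the authors' Raspberry Pi setup is an assumption.

        from pocketsphinx import LiveSpeech

        # Listen on the default microphone for a single keyphrase, offline.
        # The keyphrase and threshold are illustrative values.
        for phrase in LiveSpeech(lm=False, keyphrase="lights on",
                                 kws_threshold=1e-20):
            print("command recognized:", phrase)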

  3. Research Directory for Manpower, Personnel, Training, and Human Factors.

    DTIC Science & Technology

    1991-01-01

    Enhance Automatic Recognition of Speech in Noisy, Highly Stressful Environments Cofod R* Lica Systems Inc 703-359-0996 Smart Contract Preparation...Lab 301-278-2946 Smart Contract Preparation Expediter Frezell T LTCOL Human Engineering Lab 301-278-5998 Impulse Noise Hazard Information Processing R&D

  4. Variable frame rate transmission - A review of methodology and application to narrow-band LPC speech coding

    NASA Astrophysics Data System (ADS)

    Viswanathan, V. R.; Makhoul, J.; Schwartz, R. M.; Huggins, A. W. F.

    1982-04-01

    The variable frame rate (VFR) transmission methodology developed, implemented, and tested in the years 1973-1978 for efficiently transmitting linear predictive coding (LPC) vocoder parameters extracted from the input speech at a fixed frame rate is reviewed. With the VFR method, parameters are transmitted only when their values have changed sufficiently over the interval since their preceding transmission. Two distinct approaches to automatic implementation of the VFR method are discussed. The first bases the transmission decisions on comparisons between the parameter values of the present frame and the last transmitted frame. The second, which is based on a functional perceptual model of speech, compares the parameter values of all the frames that lie in the interval between the present frame and the last transmitted frame against a linear model of parameter variation over that interval. Also considered is the application of VFR transmission to the design of narrow-band LPC speech coders with average bit rates of 2000-2400 bits/s.
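
    The first approach can be sketched in a few lines: a frame's parameters are transmitted only if they differ sufficiently from the last transmitted frame. The distance measure, threshold, and random-walk stand-in parameters below are illustrative assumptions.

        import numpy as np

        def select_frames(params, threshold=0.15):
            """params: (n_frames, n_params). Return indices of transmitted frames."""
            sent = [0]                        # always transmit the first frame
            for i in range(1, len(params)):
                if np.abs(params[i] - params[sent[-1]]).max() > threshold:
                    sent.append(i)
            return sent

        frames = np.cumsum(np.random.default_rng(2).normal(0, 0.05, (100, 10)),
                           axis=0)
        print("transmitted %d of %d frames" % (len(select_frames(frames)),
                                               len(frames)))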

  5. Sound and speech detection and classification in a Health Smart Home.

    PubMed

    Fleury, A; Noury, N; Vacher, M; Glasson, H; Seri, J F

    2008-01-01

    Improvements in medicine increase life expectancy worldwide and create a new bottleneck at the entrance of specialized and equipped institutions. To allow elderly people to stay at home, researchers are working on ways to monitor them in their own environment with non-invasive sensors. To meet this goal, smart homes equipped with many sensors deliver information on the activities of the person and can help detect distress situations. In this paper, we present a global speech and sound recognition system that can be set up in a flat. We placed eight microphones in the Health Smart Home of Grenoble (a real living flat of 47 m2) and we automatically analyze and sort the different sounds recorded in the flat and the speech uttered (to detect normal or distress French sentences). We introduce the methods for sound and speech recognition and for post-processing of the data, and finally the experimental results obtained under real conditions in the flat.

  6. Study of wavelet packet energy entropy for emotion classification in speech and glottal signals

    NASA Astrophysics Data System (ADS)

    He, Ling; Lech, Margaret; Zhang, Jing; Ren, Xiaomei; Deng, Lihua

    2013-07-01

    Automatic speech emotion recognition has important applications in human-machine communication. The majority of current research in this area focuses on finding optimal feature parameters. In recent studies, several glottal features were examined as potential cues for emotion differentiation. In this study, a new type of feature parameter is proposed, which calculates energy entropy on values within selected Wavelet Packet frequency bands. The modeling and classification tasks are conducted using the classical GMM algorithm. The experiments use two data sets: the Speech Under Simulated Emotion (SUSE) data set, annotated with three different emotions (angry, neutral, and soft), and the Berlin Emotional Speech (BES) database, annotated with seven different emotions (angry, bored, disgust, fear, happy, sad, and neutral). The average classification accuracy achieved for the SUSE data (74%-76%) is significantly higher than the accuracy achieved for the BES data (51%-54%). In both cases, the accuracy was significantly higher than the respective random guessing levels (33% for SUSE and 14.3% for BES).
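
    The proposed feature can be pictured as below: Shannon entropy over the normalized energies of wavelet packet bands. The stand-in signal, wavelet choice, and decomposition depth are illustrative assumptions, not the study's configuration.

        import numpy as np
        import pywt

        x = np.random.default_rng(3).standard_normal(2048)   # stand-in for speech
        wp = pywt.WaveletPacket(x, "db4", maxlevel=4)
        energies = np.array([np.sum(node.data ** 2) for node in wp.get_level(4)])
        p = energies / energies.sum()                        # band energy shares
        entropy = -np.sum(p * np.log2(p + 1e-12))
        print("wavelet packet energy entropy: %.3f bits" % entropy)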

  7. Language-independent talker-specificity in first-language and second-language speech production by bilingual talkers: L1 speaking rate predicts L2 speaking rate

    PubMed Central

    Bradlow, Ann R.; Kim, Midam; Blasingame, Michael

    2017-01-01

    Second-language (L2) speech is consistently slower than first-language (L1) speech, and L1 speaking rate varies within- and across-talkers depending on many individual, situational, linguistic, and sociolinguistic factors. It is asked whether speaking rate is also determined by a language-independent talker-specific trait such that, across a group of bilinguals, L1 speaking rate significantly predicts L2 speaking rate. Two measurements of speaking rate were automatically extracted from recordings of read and spontaneous speech by English monolinguals (n = 27) and bilinguals from ten L1 backgrounds (n = 86): speech rate (syllables/second), and articulation rate (syllables/second excluding silent pauses). Replicating prior work, L2 speaking rates were significantly slower than L1 speaking rates both across-groups (monolinguals' L1 English vs bilinguals' L2 English), and across L1 and L2 within bilinguals. Critically, within the bilingual group, L1 speaking rate significantly predicted L2 speaking rate, suggesting that a significant portion of inter-talker variation in L2 speech is derived from inter-talker variation in L1 speech, and that individual variability in L2 spoken language production may be best understood within the context of individual variability in L1 spoken language production. PMID:28253679

  8. Relating dynamic brain states to dynamic machine states: Human and machine solutions to the speech recognition problem

    PubMed Central

    Liu, Xunying; Zhang, Chao; Woodland, Phil; Fonteneau, Elisabeth

    2017-01-01

    There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental ‘machine states’, generated as the ASR analysis progresses over time, to the incremental ‘brain states’, measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain. PMID:28945744

  9. Technological evaluation of gesture and speech interfaces for enabling dismounted soldier-robot dialogue

    NASA Astrophysics Data System (ADS)

    Kattoju, Ravi Kiran; Barber, Daniel J.; Abich, Julian; Harris, Jonathan

    2016-05-01

    With the increasing necessity for intuitive Soldier-robot communication in military operations and advancements in interactive technologies, autonomous robots have transitioned from assistance tools to functional and operational teammates able to support an array of military operations. Despite improvements in gesture and speech recognition technologies, their effectiveness in supporting Soldier-robot communication is still uncertain. The purpose of the present study was to evaluate the performance of gesture and speech interface technologies to facilitate Soldier-robot communication during a spatial-navigation task with an autonomous robot. Semantically based gesture and speech spatial-navigation commands leveraged existing lexicons for visual and verbal communication from the U.S. Army field manual for visual signaling and a previously established Squad Level Vocabulary (SLV). Speech commands were recorded by a lapel microphone and a Microsoft Kinect, and classified by commercial off-the-shelf automatic speech recognition (ASR) software. Visual signals were captured and classified using a custom wireless gesture glove and software. Participants in the experiment commanded a robot to complete a simulated ISR mission in a scaled-down urban scenario by delivering a sequence of gesture and speech commands, both individually and simultaneously, to the robot. Performance and reliability of the gesture and speech hardware interfaces and recognition tools were analyzed and reported. Analysis of the experimental results demonstrated that the employed gesture technology has significant potential for enabling bidirectional Soldier-robot team dialogue, based on its high classification accuracy and the minimal training required to perform gesture commands.

  10. A real-time phoneme counting algorithm and application for speech rate monitoring.

    PubMed

    Aharonson, Vered; Aharonson, Eran; Raichlin-Levi, Katia; Sotzianu, Aviv; Amir, Ofer; Ovadia-Blechman, Zehava

    2017-03-01

    Adults who stutter can learn to control and improve their speech fluency by modifying their speaking rate. Existing speech therapy technologies can assist this practice by monitoring speaking rate and providing feedback to the patient, but cannot provide an accurate, quantitative measurement of speaking rate. Moreover, most technologies are too complex and costly to be used for home practice. We developed an algorithm and a smartphone application that monitor a patient's speaking rate in real time and provide user-friendly feedback to both patient and therapist. Our speaking rate computation is performed by a phoneme counting algorithm which implements spectral transition measure extraction to estimate phoneme boundaries. The algorithm is implemented in real time in a mobile application that presents its results in a user-friendly interface. The application incorporates two modes: one provides the patient with visual feedback of his/her speech rate for self-practice and another provides the speech therapist with recordings, speech rate analysis and tools to manage the patient's practice. The algorithm's phoneme counting accuracy was validated on ten healthy subjects who read a paragraph at slow, normal and fast paces, and was compared to manual counting of speech experts. Test-retest and intra-counter reliability were assessed. Preliminary results indicate differences of -4% to 11% between automatic and human phoneme counting. Differences were largest for slow speech. The application can thus provide reliable, user-friendly, real-time feedback for speaking rate control practice. Copyright © 2017 Elsevier Inc. All rights reserved.
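
    A rough sketch of boundary estimation from a spectral transition measure is given below, with MFCC deltas standing in for the spectral parameters and peak picking marking candidate phoneme boundaries. This is an illustration under stated assumptions, not the authors' algorithm; the bundled example recording stands in for a patient's speech.

        import librosa
        import numpy as np
        from scipy.signal import find_peaks

        y, sr = librosa.load(librosa.ex("libri2"), duration=5)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)
        stm = np.sum(librosa.feature.delta(mfcc) ** 2, axis=0)  # transition measure
        peaks, _ = find_peaks(stm, distance=5, prominence=np.median(stm))
        print("estimated phoneme count:", len(peaks))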

  11. Interactive segmentation of tongue contours in ultrasound video sequences using quality maps

    NASA Astrophysics Data System (ADS)

    Ghrenassia, Sarah; Ménard, Lucie; Laporte, Catherine

    2014-03-01

    Ultrasound (US) imaging is an effective and non-invasive way of studying the tongue motions involved in normal and pathological speech, and the results of US studies are of interest for the development of new strategies in speech therapy. State-of-the-art tongue shape analysis techniques based on US images depend on semi-automated tongue segmentation and tracking techniques. Recent work has mostly focused on improving the accuracy of the tracking techniques themselves. However, occasional errors remain inevitable, regardless of the technique used, and the tongue tracking process must thus be supervised by a speech scientist who will correct these errors manually or semi-automatically. This paper proposes an interactive framework to facilitate this process. In this framework, the user is guided towards potentially problematic portions of the US image sequence by a segmentation quality map that is based on the normalized energy of an active contour model and automatically produced during tracking. When a problematic segmentation is identified, corrections to the segmented contour can be made on one image and propagated both forward and backward in the problematic subsequence, thereby improving the user experience. The interactive tools were tested in combination with two different tracking algorithms. Preliminary results illustrate the potential of the framework, suggesting that it generally reduces user interaction time, with little change in segmentation repeatability.
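
    The quality-map idea can be sketched minimally: assuming the tracker exposes one active-contour energy per frame (lower meaning a better fit), normalized qualities are computed and low-quality frames are queued for the user's attention. The energy source and the threshold below are assumptions, not the authors' code:

        import numpy as np

        def quality_map(energies, threshold=0.7):
            """Flag potentially problematic frames from per-frame contour energies.

            energies: one active-contour energy per ultrasound frame (assumed
            available from the tracker; lower = better fit). Returns qualities
            in [0, 1] and the indices of frames the user should inspect first.
            """
            e = np.asarray(energies, dtype=float)
            e_norm = (e - e.min()) / (e.max() - e.min() + 1e-12)
            quality = 1.0 - e_norm              # 1 = best fit, 0 = worst fit
            suspect = np.where(quality < threshold)[0]
            return quality, suspect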

  12. Modeling the Perceptual Learning of Novel Dialect Features

    ERIC Educational Resources Information Center

    Tatman, Rachael

    2017-01-01

    All language use reflects the user's social identity in systematic ways. While humans can easily adapt to this sociolinguistic variation, automatic speech recognition (ASR) systems continue to struggle with it. This dissertation makes three main contributions. The first is to provide evidence that modern state-of-the-art commercial ASR systems…

  13. Morphosyntactic Neural Analysis for Generalized Lexical Normalization

    ERIC Educational Resources Information Center

    Leeman-Munk, Samuel Paul

    2016-01-01

    The phenomenal growth of social media, web forums, and online reviews has spurred a growing interest in automated analysis of user-generated text. At the same time, a proliferation of voice recordings and efforts to archive culture heritage documents are fueling demand for effective automatic speech recognition (ASR) and optical character…

  14. Automatic Intention Recognition in Conversation Processing

    ERIC Educational Resources Information Center

    Holtgraves, Thomas

    2008-01-01

    A fundamental assumption of many theories of conversation is that comprehension of a speaker's utterance involves recognition of the speaker's intention in producing that remark. However, the nature of intention recognition is not clear. One approach is to conceptualize a speaker's intention in terms of speech acts [Searle, J. (1969). "Speech…

  15. Multilingual Videos for MOOCs and OER

    ERIC Educational Resources Information Center

    Valor Miró, Juan Daniel; Baquero-Arnal, Pau; Civera, Jorge; Turró, Carlos; Juan, Alfons

    2018-01-01

    Massive Open Online Courses (MOOCs) and Open Educational Resources (OER) are rapidly growing, but are not usually offered in multiple languages due to the lack of cost-effective solutions to translate the different objects comprising them and particularly videos. However, current state-of-the-art automatic speech recognition (ASR) and machine…

  16. Information Extraction from Multiple Syntactic Sources

    DTIC Science & Technology

    2004-05-01

    Performance of SVM and KNN (k=3) on different kernel setups. Types are ordered in decreasing order of frequency of occurrence in the ACE corpus. For SVM, the...name. But it is not easy to recognize “A Real New York Bargain” as a company name. In other languages or transcripts of English speech where...symbolic rules for extraction of posted computer jobs. It only assumed simple syntactic preprocessing such as tokenization and Part-of-Speech tagging

  17. Open Microphone Speech Understanding: Correct Discrimination Of In Domain Speech

    NASA Technical Reports Server (NTRS)

    Hieronymus, James; Aist, Greg; Dowding, John

    2006-01-01

    An ideal spoken dialogue system listens continually and determines which utterances were spoken to it, understands them, and responds appropriately while ignoring the rest. This paper outlines a simple method for achieving this goal, which involves trading a slightly higher false rejection rate of in-domain utterances for a higher correct rejection rate of Out of Domain (OOD) utterances. The system recognizes semantic entities specified by a unification grammar which is specialized by Explanation Based Learning (EBL), so that it only uses rules which are seen in the training data. The resulting grammar has probabilities assigned to each construct so that overgeneralizations are not a problem. The resulting system only recognizes utterances which reduce to a valid logical form which has meaning for the system, and rejects the rest. A class N-gram grammar has been trained on the same training data. This system gives good recognition performance and offers good Out of Domain discrimination when combined with the semantic analysis. The resulting systems were tested on a Space Station Robot Dialogue Speech Database and a subset of the OGI conversational speech database. Both systems run in real time on a PC laptop, and the present performance allows continuous listening with an acceptably low false acceptance rate. This type of open microphone system has been used in the Clarissa procedure reading and navigation spoken dialogue system which is being tested on the International Space Station.
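
    The accept/reject logic can be caricatured in a few lines: a recognition hypothesis is acted on only when it reduces to a valid logical form under the EBL-specialized grammar, and everything else is rejected as out of domain. The recognizer and parser interfaces below are hypothetical stand-ins, not the Clarissa code:

        def handle_utterance(audio, recognize, parse_to_logical_form):
            """Open-microphone gating sketch.

            recognize: hypothetical class N-gram recognizer, audio -> word string
            parse_to_logical_form: hypothetical EBL-specialized grammar parser,
                returning a logical form, or None when no valid parse exists
            """
            hypothesis = recognize(audio)
            logical_form = parse_to_logical_form(hypothesis)
            if logical_form is None:
                return None           # out of domain: stay silent
            return logical_form       # in domain: hand to the dialogue manager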

  18. Accurate visible speech synthesis based on concatenating variable length motion capture data.

    PubMed

    Ma, Jiyong; Cole, Ron; Pellom, Bryan; Ward, Wayne; Wise, Barbara

    2006-01-01

    We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is described. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective evaluation and objective evaluation are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
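
    Searching for an optimal sequence of concatenated units is classically posed as a Viterbi-style dynamic program over target and join costs; the generic sketch below illustrates that formulation with placeholder cost functions, and is not claimed to match the paper's actual search:

        import numpy as np

        def select_units(candidates, target_cost, join_cost):
            """Pick one unit per position minimizing total target + join cost.

            candidates: list over positions, each a list of candidate units
            target_cost(u, i): fit of unit u at position i (placeholder)
            join_cost(u, v): cost of concatenating u then v (placeholder)
            """
            n = len(candidates)
            cost = [[target_cost(u, 0) for u in candidates[0]]]
            back = [[-1] * len(candidates[0])]
            for i in range(1, n):
                row, ptr = [], []
                for u in candidates[i]:
                    prev = range(len(candidates[i - 1]))
                    j = min(prev, key=lambda j: cost[i - 1][j]
                            + join_cost(candidates[i - 1][j], u))
                    row.append(cost[i - 1][j]
                               + join_cost(candidates[i - 1][j], u)
                               + target_cost(u, i))
                    ptr.append(j)
                cost.append(row)
                back.append(ptr)
            j = int(np.argmin(cost[-1]))          # backtrack from cheapest end
            path = [j]
            for i in range(n - 1, 0, -1):
                j = back[i][j]
                path.append(j)
            path.reverse()
            return [candidates[i][path[i]] for i in range(n)]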

  19. Studies in automatic speech recognition and its application in aerospace

    NASA Astrophysics Data System (ADS)

    Taylor, Michael Robinson

    Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.
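
    Of the two recognition strategies named above, Dynamic Time Warping is easy to show in miniature: an unknown digit's feature sequence is aligned against each stored template, and the label of the cheapest alignment wins. Feature extraction is omitted and the template store is assumed; this is a generic textbook sketch, not the systems studied here:

        import numpy as np

        def dtw_distance(a, b):
            """DTW distance between feature sequences a (m, d) and b (n, d)."""
            m, n = len(a), len(b)
            D = np.full((m + 1, n + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    d = np.linalg.norm(a[i - 1] - b[j - 1])   # local distance
                    D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[m, n]

        def recognize_digit(utterance, templates):
            """templates: dict mapping digit label -> reference feature sequence."""
            return min(templates,
                       key=lambda lab: dtw_distance(utterance, templates[lab]))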

  20. Immediate integration of prosodic information from speech and visual information from pictures in the absence of focused attention: a mismatch negativity study.

    PubMed

    Li, X; Yang, Y; Ren, G

    2009-06-16

    Language is often perceived together with visual information. Recent experimental evidence indicates that, during spoken language comprehension, the brain can immediately integrate visual information with semantic or syntactic information from speech. Here we used the mismatch negativity to further investigate whether prosodic information from speech can be immediately integrated into a visual scene context, and in particular the time course and automaticity of this integration process. Sixteen Chinese native speakers participated in the study. The materials included Chinese spoken sentences and picture pairs. In the audiovisual situation, relative to the concomitant pictures, the spoken sentence was appropriately accented in the standard stimuli, but inappropriately accented in the two kinds of deviant stimuli. In the purely auditory situation, the speech sentences were presented without pictures. It was found that the deviants evoked mismatch responses in both audiovisual and purely auditory situations; the mismatch negativity in the purely auditory situation peaked at the same time as, but was weaker than, that evoked by the same deviant speech sounds in the audiovisual situation. This pattern of results suggests immediate integration of prosodic information from speech and visual information from pictures in the absence of focused attention.

  1. Effects of a metronome on the filled pauses of fluent speakers.

    PubMed

    Christenfeld, N

    1996-12-01

    Filled pauses (the "ums" and "uhs" that litter spontaneous speech) seem to be a product of the speaker paying deliberate attention to the normally automatic act of talking. This is the same sort of explanation that has been offered for stuttering. In this paper we explore whether a manipulation that has long been known to decrease stuttering, synchronizing speech to the beats of a metronome, will then also decrease filled pauses. Two experiments indicate that a metronome has a dramatic effect on the production of filled pauses. This effect is not due to any simplification or slowing of the speech and supports the view that a metronome causes speakers to attend more to how they are talking and less to what they are saying. It also lends support to the connection between stutters and filled pauses.

  2. GRIN2A: an aptly named gene for speech dysfunction.

    PubMed

    Turner, Samantha J; Mayes, Angela K; Verhoeven, Andrea; Mandelstam, Simone A; Morgan, Angela T; Scheffer, Ingrid E

    2015-02-10

    To delineate the specific speech deficits in individuals with epilepsy-aphasia syndromes associated with mutations in the glutamate receptor subunit gene GRIN2A. We analyzed the speech phenotype associated with GRIN2A mutations in 11 individuals, aged 16 to 64 years, from 3 families. Standardized clinical speech assessments and perceptual analyses of conversational samples were conducted. Individuals showed a characteristic phenotype of dysarthria and dyspraxia with lifelong impact on speech intelligibility in some. Speech was typified by imprecise articulation (11/11, 100%), impaired pitch (monopitch 10/11, 91%) and prosody (stress errors 7/11, 64%), and hypernasality (7/11, 64%). Oral motor impairments and poor performance on maximum vowel duration (8/11, 73%) and repetition of monosyllables (10/11, 91%) and trisyllables (7/11, 64%) supported conversational speech findings. The speech phenotype was present in one individual who did not have seizures. Distinctive features of dysarthria and dyspraxia are found in individuals with GRIN2A mutations, often in the setting of epilepsy-aphasia syndromes; dysarthria has not been previously recognized in these disorders. Of note, the speech phenotype may occur in the absence of a seizure disorder, reinforcing an important role for GRIN2A in motor speech function. Our findings highlight the need for precise clinical speech assessment and intervention in this group. By understanding the mechanisms involved in GRIN2A disorders, targeted therapy may be designed to improve chronic lifelong deficits in intelligibility. © 2015 American Academy of Neurology.

  3. Neural Recruitment for the Production of Native and Novel Speech Sounds

    PubMed Central

    Moser, Dana; Fridriksson, Julius; Bonilha, Leonardo; Healy, Eric W.; Baylis, Gordon; Baker, Julie; Rorden, Chris

    2010-01-01

    Two primary areas of damage have been implicated in apraxia of speech (AOS) based on the time post-stroke: (1) the left inferior frontal gyrus (IFG) in acute patients, and (2) the left anterior insula (aIns) in chronic patients. While AOS is widely characterized as a disorder in motor speech planning, little is known about the specific contributions of each of these regions in speech. The purpose of this study was to investigate cortical activation during speech production with a specific focus on the aIns and the IFG in normal adults. While undergoing sparse fMRI, 30 normal adults completed a 30-minute speech-repetition task consisting of three-syllable nonwords that contained either (a) English (native) syllables or (b) Non-English (novel) syllables. When the novel syllable productions were compared to the native syllable productions, greater neural activation was observed in the aIns and IFG, particularly during the first 10 minutes of the task when novelty was the greatest. Although activation in the aIns remained high throughout the task for novel productions, greater activation was clearly demonstrated when the initial 10 minutes were compared to the final 10 minutes of the task. These results suggest increased activity within an extensive neural network, including the aIns and IFG, when the motor speech system is taxed, such as during the production of novel speech. We speculate that the amount of left aIns recruitment during speech production may be related to the internal construction of the motor speech unit such that the degree of novelty/automaticity would result in more or less demands respectively. The role of the IFG as a storehouse and integrative processor for previously acquired routines is also discussed. PMID:19385020

  4. Multi-sensory learning and learning to read.

    PubMed

    Blomert, Leo; Froyen, Dries

    2010-09-01

    The basis of literacy acquisition in alphabetic orthographies is the learning of the associations between the letters and the corresponding speech sounds. In spite of this primacy in learning to read, there is only scarce knowledge on how this audiovisual integration process works and which mechanisms are involved. Recent electrophysiological studies of letter-speech sound processing have revealed that normally developing readers take years to automate these associations and dyslexic readers hardly exhibit automation of these associations. It is argued that the reason for this effortful learning may reside in the nature of the audiovisual process that is recruited for the integration of in principle arbitrarily linked elements. It is shown that letter-speech sound integration does not resemble the processes involved in the integration of natural audiovisual objects such as audiovisual speech. The automatic symmetrical recruitment of the assumedly uni-sensory visual and auditory cortices in audiovisual speech integration does not occur for letter and speech sound integration. It is also argued that letter-speech sound integration only partly resembles the integration of arbitrarily linked unfamiliar audiovisual objects. Letter-sound integration and artificial audiovisual objects share the necessity of a narrow time window for integration to occur. However, they differ from these artificial objects, because they constitute an integration of partly familiar elements which acquire meaning through the learning of an orthography. Although letter-speech sound pairs share similarities with audiovisual speech processing as well as with unfamiliar, arbitrary objects, it seems that letter-speech sound pairs develop into unique audiovisual objects that furthermore have to be processed in a unique way in order to enable fluent reading and thus very likely recruit other neurobiological learning mechanisms than the ones involved in learning natural or arbitrary unfamiliar audiovisual associations. Copyright 2010 Elsevier B.V. All rights reserved.

  5. Automated Electroglottographic Inflection Events Detection. A Pilot Study.

    PubMed

    Codino, Juliana; Torres, María Eugenia; Rubin, Adam; Jackson-Menaldi, Cristina

    2016-11-01

    Vocal-fold vibration can be analyzed in a noninvasive way by registering impedance changes within the glottis, through electroglottography. The morphology of the electroglottographic (EGG) signal is related to different vibratory patterns. In the literature, a characteristic knee in the descending portion of the signal has been reported. Some EGG signals do not exhibit this particular knee and have other types of events (inflection events) throughout the ascending and/or descending portion of the vibratory cycle. The goal of this work is to propose an automatic method to identify and classify these events. A computational algorithm was developed based on the mathematical properties of the EGG signal, which detects and reports events throughout the contact phase. Retrospective analysis of EGG signals obtained during routine voice evaluation of adult individuals with a variety of voice disorders was performed using the algorithm as well as human raters. Two judges, both experts in clinical voice analysis, and three general speech pathologists performed manual and visual evaluation of the sample set. The results obtained by the automatic method were compared with those of the human raters. Statistical analysis revealed a significant level of agreement. This automatic tool could allow professionals in the clinical setting to obtain an automatic quantitative and qualitative report of such events present in a voice sample, without having to manually analyze the whole EGG signal. In addition, it might provide the speech pathologist with more information that would complement the standard voice evaluation. It could also be a valuable tool in voice research. Copyright © 2016 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
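
    The abstract does not spell out the detection criteria, but one plausible reading, finding inflection events as sign changes of the smoothed second derivative within the contact phase, can be sketched as follows; the smoothing window and threshold are assumed values, not those of the published algorithm:

        import numpy as np

        def inflection_events(egg_cycle, smooth=5, min_jump=1e-4):
            """Candidate inflection events in one EGG vibratory cycle.

            egg_cycle: 1-D samples covering the contact phase of a cycle.
            Returns indices where the smoothed second derivative changes
            sign with a jump above `min_jump` (an assumed threshold).
            """
            x = np.convolve(egg_cycle, np.ones(smooth) / smooth, mode="same")
            d2 = np.gradient(np.gradient(x))
            sign = np.sign(d2)
            return [i for i in range(1, len(d2))
                    if sign[i] != sign[i - 1]
                    and abs(d2[i] - d2[i - 1]) > min_jump]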

  6. Robust relationship between reading span and speech recognition in noise

    PubMed Central

    Souza, Pamela; Arehart, Kathryn

    2015-01-01

    Objective: Working memory refers to a cognitive system that manages information processing and temporary storage. Recent work has demonstrated that individual differences in working memory capacity measured using a reading span task are related to ability to recognize speech in noise. In this project, we investigated whether the specific implementation of the reading span task influenced the strength of the relationship between working memory capacity and speech recognition. Design: The relationship between speech recognition and working memory capacity was examined for two different working memory tests that varied in approach, using a within-subject design. Data consisted of audiometric results along with the two different working memory tests; one speech-in-noise test; and a reading comprehension test. Study sample: The test group included 94 older adults with varying hearing loss and 30 younger adults with normal hearing. Results: Listeners with poorer working memory capacity had more difficulty understanding speech in noise after accounting for age and degree of hearing loss. That relationship did not differ significantly between the two different implementations of reading span. Conclusions: Our findings suggest that different implementations of a verbal reading span task do not affect the strength of the relationship between working memory capacity and speech recognition. PMID:25975360

  7. Robust relationship between reading span and speech recognition in noise.

    PubMed

    Souza, Pamela; Arehart, Kathryn

    2015-01-01

    Working memory refers to a cognitive system that manages information processing and temporary storage. Recent work has demonstrated that individual differences in working memory capacity measured using a reading span task are related to ability to recognize speech in noise. In this project, we investigated whether the specific implementation of the reading span task influenced the strength of the relationship between working memory capacity and speech recognition. The relationship between speech recognition and working memory capacity was examined for two different working memory tests that varied in approach, using a within-subject design. Data consisted of audiometric results along with the two different working memory tests; one speech-in-noise test; and a reading comprehension test. The test group included 94 older adults with varying hearing loss and 30 younger adults with normal hearing. Listeners with poorer working memory capacity had more difficulty understanding speech in noise after accounting for age and degree of hearing loss. That relationship did not differ significantly between the two different implementations of reading span. Our findings suggest that different implementations of a verbal reading span task do not affect the strength of the relationship between working memory capacity and speech recognition.

  8. An Analysis of Individual Differences in Recognizing Monosyllabic Words Under the Speech Intelligibility Index Framework

    PubMed Central

    Shen, Yi; Kern, Allison B.

    2018-01-01

    Individual differences in the recognition of monosyllabic words, either in isolation (NU6 test) or in sentence context (SPIN test), were investigated under the theoretical framework of the speech intelligibility index (SII). An adaptive psychophysical procedure, namely the quick-band-importance-function procedure, was developed to enable the fitting of the SII model to individual listeners. Using this procedure, the band importance function (i.e., the relative weights of speech information across the spectrum) and the link function relating the SII to recognition scores can be simultaneously estimated while requiring only 200 to 300 trials of testing. Octave-frequency band importance functions and link functions were estimated separately for NU6 and SPIN materials from 30 normal-hearing listeners who were naïve to speech recognition experiments. For each type of speech material, considerable individual differences in the spectral weights were observed in some but not all frequency regions. At frequencies where the greatest intersubject variability was found, the spectral weights were correlated between the two speech materials, suggesting that the variability in spectral weights reflected listener-originated factors. PMID:29532711
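
    For orientation, the SII framework combines a band importance function with per-band audibility. In its standard form (ANSI S3.5), and leaving aside the study's listener-specific fitting details, the index is

        S = \sum_{i=1}^{n} I_i \, A_i , \qquad \sum_{i=1}^{n} I_i = 1 , \qquad 0 \le A_i \le 1 ,

    where I_i is the importance weight of frequency band i (the quantity estimated per listener by the quick-band-importance-function procedure) and A_i is that band's audibility; a link function P = f(S) then maps the index to a predicted recognition score.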

  9. SPEEDY babies: A putative new behavioral syndrome of unbalanced motor-speech development

    PubMed Central

    Haapanen, Marja-Leena; Aro, Tuomo; Isotalo, Elina

    2008-01-01

    Even though difficulties in motor development in children with speech and language disorders are widely known, hardly any attention is paid to the association between atypically early unassisted walking and delayed speech development. The four children described here presented with a developmental behavioral triad: 1) atypically speedy motor development, 2) impaired expressive speech, and 3) tongue carriage dysfunction resulting in related misarticulations. These characteristics might be phenotypically or genetically clustered. The children did not have impaired cognition, neurological or mental disease, defective sense organs, craniofacial dysmorphology, or susceptibility to upper respiratory infections, particularly recurrent otitis media. Attention should be paid to discordant and unbalanced achievement of developmental milestones. The present children are termed SPEEDY babies, where SPEEDY refers to rapid independent walking, and SPEE and DY to dyspractic or dysfunctional speech development and lingual dysfunction resulting in linguoalveolar misarticulations. SPEEDY babies require health care that recognizes and respects their motor skills and supports their need for motor activities, and that, on the other hand, includes treatment for impaired speech. The parents may need advice and support with these children. PMID:19337462

  10. Musical background not associated with self-perceived hearing performance or speech perception in postlingual cochlear-implant users.

    PubMed

    Fuller, Christina; Free, Rolien; Maat, Bert; Başkent, Deniz

    2012-08-01

    In normal-hearing listeners, musical background has been observed to change the sound representation in the auditory system and produce enhanced performance in some speech perception tests. Based on these observations, it has been hypothesized that musical background can influence sound and speech perception by cochlear-implant users and, by extension, also their quality of life. To test this hypothesis, this study explored musical background [using the Dutch Musical Background Questionnaire (DMBQ)], and self-perceived sound and speech perception and quality of life [using the Nijmegen Cochlear Implant Questionnaire (NCIQ) and the Speech Spatial and Qualities of Hearing Scale (SSQ)] in 98 postlingually deafened adult cochlear-implant recipients. In addition to self-perceived measures, speech perception scores (percentage of phonemes recognized in words presented in quiet) were obtained from patient records. The self-perceived hearing performance was associated with the objective speech perception. Forty-one respondents (44% of 94 respondents) indicated some form of formal musical training. Fifteen respondents (18% of 83 respondents) judged themselves as having musical training, experience, and knowledge. No association was observed between musical background (quantified by the DMBQ) and self-perceived hearing-related performance or quality of life (quantified by the NCIQ and SSQ), or speech perception in quiet.

  11. Microscopic prediction of speech intelligibility in spatially distributed speech-shaped noise for normal-hearing listeners.

    PubMed

    Geravanchizadeh, Masoud; Fallah, Ali

    2015-12-01

    A binaural and psychoacoustically motivated intelligibility model, based on a well-known monaural microscopic model, is proposed. This model simulates a phoneme recognition task in the presence of spatially distributed speech-shaped noise in anechoic scenarios. In the proposed model, binaural advantage effects are considered by generating a feature vector for a dynamic-time-warping speech recognizer. This vector consists of three subvectors: two monaural subvectors to model better-ear hearing, and a binaural subvector to simulate the binaural unmasking effect. The binaural unit of the model is based on equalization-cancellation theory. The model operates blindly, meaning that separate recordings of speech and noise are not required for the predictions. Speech intelligibility tests were conducted with 12 normal-hearing listeners by collecting speech reception thresholds (SRTs) in the presence of single and multiple sources of speech-shaped noise. The comparison of the model predictions with the measured binaural SRTs, and with the predictions of a macroscopic binaural model called extended equalization-cancellation, shows that this approach predicts intelligibility in anechoic scenarios with good precision. The square of the correlation coefficient (r²) and the mean absolute error between the model predictions and the measurements are 0.98 and 0.62 dB, respectively.
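
    The binaural unmasking stage rests on equalization-cancellation (EC) theory, whose textbook form is easy to state: one ear's signal is scaled and delayed so that the interferer matches its image at the other ear, and the two are subtracted,

        y(t) = \alpha \, x_L(t + \tau) - x_R(t) ,

    with the gain \alpha and delay \tau chosen to minimize the residual interferer energy, leaving a target-dominated residue. This is the standard EC formulation, not necessarily the paper's exact parameterization of the binaural subvector.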

  12. Lexical influences on competing speech perception in younger, middle-aged, and older adults

    PubMed Central

    Helfer, Karen S.; Jesse, Alexandra

    2015-01-01

    The influence of lexical characteristics of words in to-be-attended and to-be-ignored speech streams was examined in a competing speech task. Older, middle-aged, and younger adults heard pairs of low-cloze probability sentences in which the frequency or neighborhood density of words was manipulated in either the target speech stream or the masking speech stream. All participants also completed a battery of cognitive measures. As expected, for all groups, target words that occur frequently or that are from sparse lexical neighborhoods were easier to recognize than words that are infrequent or from dense neighborhoods. Compared to other groups, these neighborhood density effects were largest for older adults; the frequency effect was largest for middle-aged adults. Lexical characteristics of words in the to-be-ignored speech stream also affected recognition of to-be-attended words, but only when overall performance was relatively good (that is, when younger participants listened to the speech streams at a more advantageous signal-to-noise ratio). For these listeners, to-be-ignored masker words from sparse neighborhoods interfered with recognition of target speech more than masker words from dense neighborhoods. Amount of hearing loss and cognitive abilities relating to attentional control modulated overall performance as well as the strength of lexical influences. PMID:26233036

  13. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hogden, J.

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  14. A hardware experimental platform for neural circuits in the auditory cortex

    NASA Astrophysics Data System (ADS)

    Rodellar-Biarge, Victoria; García-Dominguez, Pablo; Ruiz-Rizaldos, Yago; Gómez-Vilda, Pedro

    2011-05-01

    Speech processing in the human brain is a very complex process, far from being fully understood, although much progress has been made recently. Neuromorphic speech processing is a new research orientation within the bio-inspired systems approach that seeks solutions to the automatic treatment of specific problems (recognition, synthesis, segmentation, diarization, etc.) which cannot be adequately solved using classical algorithms. In this paper a neuromorphic speech processing architecture is presented. The systematic bottom-up synthesis of layered structures reproduces the dynamic feature detection of speech by plausible neural circuits working as interpretation centres located in the Auditory Cortex. The elementary model is based on Hebbian neuron-like units. For the computation of the architecture, a flexible framework is proposed in the Matlab®/Simulink®/HDL environment, which allows building models in different description styles and at different complexity and implementation levels. It provides a flexible platform for experimenting with the influence of the number of neurons and interconnections on the precision of the results and on overall performance. Experimentation with different architecture configurations may help both in better understanding how neural circuits may work in the brain and in understanding how speech processing can benefit from this knowledge.

  15. Computational validation of the motor contribution to speech perception.

    PubMed

    Badino, Leonardo; D'Ausilio, Alessandro; Fadiga, Luciano; Metta, Giorgio

    2014-07-01

    Action perception and recognition are core abilities fundamental for human social interaction. A parieto-frontal network (the mirror neuron system) matches visually presented biological motion information onto observers' motor representations. This process of matching the actions of others onto our own sensorimotor repertoire is thought to be important for action recognition, providing a non-mediated "motor perception" based on a bidirectional flow of information along the mirror parieto-frontal circuits. State-of-the-art machine learning strategies for hand action identification have shown better performances when sensorimotor data, as opposed to visual information only, are available during learning. As speech is a particular type of action (with acoustic targets), it is expected to activate a mirror neuron mechanism. Indeed, in speech perception, motor centers have been shown to be causally involved in the discrimination of speech sounds. In this paper, we review recent neurophysiological and machine learning-based studies showing (a) the specific contribution of the motor system to speech perception and (b) that automatic phone recognition is significantly improved when motor data are used during training of classifiers (as opposed to learning from purely auditory data). Copyright © 2014 Cognitive Science Society, Inc.

  16. Performance in noise: Impact of reduced speech intelligibility on Sailor performance in a Navy command and control environment.

    PubMed

    Keller, M David; Ziriax, John M; Barns, William; Sheffield, Benjamin; Brungart, Douglas; Thomas, Tony; Jaeger, Bobby; Yankaskas, Kurt

    2017-06-01

    Noise, hearing loss, and electronic signal distortion, which are common problems in military environments, can impair speech intelligibility and thereby jeopardize mission success. The current study investigated the impact that impaired communication has on operational performance in a command and control environment by parametrically degrading speech intelligibility in a simulated shipborne Combat Information Center. Experienced U.S. Navy personnel served as the study participants and were required to monitor information from multiple sources and respond appropriately to communications initiated by investigators playing the roles of other personnel involved in a realistic Naval scenario. In each block of the scenario, an adaptive intelligibility modification system employing automatic gain control was used to adjust the signal-to-noise ratio to achieve one of four speech intelligibility levels on a Modified Rhyme Test: No Loss, 80%, 60%, or 40%. Objective and subjective measures of operational performance suggested that performance systematically degraded with decreasing speech intelligibility, with the largest drop occurring between 80% and 60%. These results confirm the importance of noise reduction, good communication design, and effective hearing conservation programs to maximize the operational effectiveness of military personnel. Published by Elsevier B.V.

  17. Current trends in small vocabulary speech recognition for equipment control

    NASA Astrophysics Data System (ADS)

    Doukas, Nikolaos; Bardis, Nikolaos G.

    2017-09-01

    Speech recognition systems allow human-machine communication to acquire an intuitive nature that approaches the simplicity of inter-human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker-independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit from the use of robust voice-operated control components, as they would facilitate interaction with the users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state-machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
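
    A state-machine approach to voice guidance reduces to a transition table keyed by (state, recognized command); unknown or misrecognized commands simply leave the state unchanged, which keeps the equipment's behavior predictable. The states and commands below are invented for illustration:

        # Hypothetical states and commands for voice-operated field equipment
        TRANSITIONS = {
            ("idle", "power on"): "ready",
            ("ready", "start"): "running",
            ("running", "stop"): "ready",
            ("ready", "power off"): "idle",
        }

        def step(state, command):
            """Advance the controller; unrecognized input leaves the state
            as-is, which also absorbs recognizer errors predictably."""
            return TRANSITIONS.get((state, command), state)

        state = "idle"
        for cmd in ["power on", "start", "calibrate", "stop"]:
            state = step(state, cmd)   # "calibrate" is simply ignored here
        print(state)                   # -> "ready"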

  18. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    NASA Astrophysics Data System (ADS)

    Selouani, Sid-Ahmed; O'Shaughnessy, Douglas

    2003-12-01

    Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axes issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) extending downward from 16 dB. We also show the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.
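
    Setting the genetic optimization aside, the core KLT step, projecting noisy mel-frequency parameters onto a reduced set of principal axes and reconstructing, amounts to PCA denoising. The sketch below substitutes ordinary eigenvectors for the paper's genetically optimized axes, so it is a simplification rather than the published method:

        import numpy as np

        def klt_enhance(features, k):
            """Project feature vectors (T, d) onto the top-k principal axes
            and reconstruct; a plain-PCA stand-in for the GA-optimized KLT."""
            mu = features.mean(axis=0)
            X = features - mu
            cov = np.cov(X, rowvar=False)
            eigval, eigvec = np.linalg.eigh(cov)    # eigenvalues ascending
            axes = eigvec[:, ::-1][:, :k]           # top-k principal axes
            return (X @ axes) @ axes.T + mu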

  19. Classification of speech dysfluencies using LPC based parameterization techniques.

    PubMed

    Hariharan, M; Chee, Lim Sin; Ai, Ooi Chia; Yaacob, Sazali

    2012-06-01

    The goal of this paper is to discuss and compare three feature extraction methods: Linear Predictive Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC) and Weighted Linear Prediction Cepstral Coefficients (WLPCC) for recognizing stuttered events. Speech samples from the University College London Archive of Stuttered Speech (UCLASS) were used for the analysis. The stuttered events were identified through manual segmentation and were used for feature extraction. Two simple classifiers, namely k-nearest neighbour (kNN) and Linear Discriminant Analysis (LDA), were employed for speech dysfluency classification. A conventional validation method was used for testing the reliability of the classifier results. The effects of different frame lengths, percentages of overlap, values of the coefficient in the first-order pre-emphasizer, and different prediction orders p were discussed. The speech dysfluency classification accuracy was found to improve when statistical normalization was applied before feature extraction. The experimental investigation elucidated that LPC, LPCC and WLPCC features can be used for identifying stuttered events, with WLPCC features slightly outperforming LPCC and LPC features.
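
    The shared front end of the three feature sets can be reproduced in outline: first-order pre-emphasis followed by autocorrelation LPC of order p via the Levinson-Durbin recursion (the cepstral conversion and weighting that distinguish LPCC and WLPCC are omitted). Parameter values are illustrative:

        import numpy as np

        def lpc(frame, p=12, alpha=0.97):
            """Order-p LPC coefficients of one frame after pre-emphasis
            y[n] = x[n] - alpha * x[n-1], via Levinson-Durbin."""
            y = np.append(frame[0], frame[1:] - alpha * frame[:-1])
            r = np.correlate(y, y, mode="full")[len(y) - 1:len(y) + p]
            a = np.zeros(p + 1)
            a[0] = 1.0
            err = r[0]
            for i in range(1, p + 1):
                k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
                a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
                err *= 1.0 - k * k
            return a    # prediction filter A(z) = sum_k a[k] z^{-k}, a[0] = 1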

  20. Automatic recognition of lactating sow behaviors through depth image processing

    USDA-ARS?s Scientific Manuscript database

    Manual observation and classification of animal behaviors is laborious, time-consuming, and of limited ability to process large amount of data. A computer vision-based system was developed that automatically recognizes sow behaviors (lying, sitting, standing, kneeling, feeding, drinking, and shiftin...

  1. The attention-getting capacity of whines and child-directed speech.

    PubMed

    Chang, Rosemarie Sokol; Thompson, Nicholas S

    2010-06-03

    The current study tested the ability of whines and child-directed speech to attract the attention of listeners involved in a story repetition task. Twenty non-parents and 17 parents were presented with two dull stories, each playing to a separate ear, and asked to repeat one of the stories verbatim. The story that participants were instructed to ignore was interrupted occasionally by the reader whining and using child-directed speech. While repeating the passage, participants were monitored for galvanic skin response, heart rate, and blood pressure. Based on four measures, participants tuned in more to whining, and to a lesser extent child-directed speech, than to the neutral speech segments that served as a control. Participants, regardless of gender or parental status, made more mistakes when presented with the whine or child-directed speech; recalled hearing those vocalizations; recognized more words from the whining segment than from the neutral control segment; and exhibited higher galvanic skin response during the presence of whines and child-directed speech than during neutral speech segments. Whines and child-directed speech appear to be integral members of a suite of vocalizations designed to get the attention of attachment partners by playing to an auditory sensitivity among humans. Whines in particular may serve the function of eliciting care at a time when caregivers switch from primarily mothers to greater care from other caregivers.

  2. Automated Steering Control Design by Visual Feedback Approach —System Identification and Control Experiments with a Radio-Controlled Car—

    NASA Astrophysics Data System (ADS)

    Fujiwara, Yukihiro; Yoshii, Masakazu; Arai, Yasuhito; Adachi, Shuichi

    Advanced safety vehicles (ASVs) assist drivers' manipulation to avoid traffic accidents. A variety of research on automatic driving systems is necessary as an element of ASV. Among these, we focus on the visual feedback approach, in which the automatic driving system is realized by recognizing the road trajectory using image information. The purpose of this paper is to examine the validity of this approach by experiments using a radio-controlled car. First, a practical image processing algorithm to recognize white lines on the road is proposed. Second, a model of the radio-controlled car is built by system identification experiments. Third, an automatic steering control system is designed based on H∞ control theory. Finally, the effectiveness of the designed control system is examined via traveling experiments.
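
    The abstract only summarizes the image processing algorithm, so the sketch below is a generic stand-in for white-line recognition, brightness thresholding followed by a probabilistic Hough transform, of the kind such lane-following systems typically build on:

        import cv2
        import numpy as np

        def detect_white_lines(frame_bgr):
            """Generic white-line detector: threshold bright pixels, then Hough."""
            gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
            _, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
            edges = cv2.Canny(mask, 50, 150)
            lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=40,
                                    minLineLength=30, maxLineGap=10)
            return [] if lines is None else [l[0] for l in lines]  # (x1, y1, x2, y2)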

  3. An acoustic feature-based similarity scoring system for speech rehabilitation assistance.

    PubMed

    Syauqy, Dahnial; Wu, Chao-Min; Setyawati, Onny

    2016-08-01

    The purpose of this study is to develop a tool to assist speech therapy and rehabilitation, focused on automatic scoring based on the comparison of the patient's speech with normal speech on several aspects, including pitch, vowels, voiced-unvoiced segments, strident fricatives and sound intensity. The pitch estimation employed a cepstrum-based algorithm for its robustness; the vowel classification used a multilayer perceptron (MLP) to classify vowels from pitch and formants; and the strident fricative detection was based on the major peak spectral intensity, its location and the existence of pitch in the segment. In order to evaluate the performance of the system, this study analyzed eight patients' speech recordings (four males, four females; aged 4-58 years), which had been recorded in a previous study in cooperation with Taipei Veterans General Hospital and Taoyuan General Hospital. The experimental results for the pitch algorithm showed that the cepstrum method had a 5.3% gross pitch error over a total of 2086 frames. For the vowel classification algorithm, the MLP method provided 93% accuracy (men), 87% (women) and 84% (children). In total, the overall results showed that 156 of the tool's grading results (81%) were consistent with the 192 audio and visual observations made by four experienced respondents. Implications for Rehabilitation: Difficulties in communication may limit the ability of a person to transfer and exchange information. The fact that speech is one of the primary means of communication has encouraged the need for speech diagnosis and rehabilitation. Advances in technology in computer-assisted speech therapy (CAST) improve the quality and time efficiency of the diagnosis and treatment of these disorders. The present study attempted to develop a tool to assist speech therapy and rehabilitation that provides a simple interface, letting the assessment be done even by the patient himself without particular knowledge of speech processing, while at the same time providing deeper analysis of the speech, which can be useful for the speech therapist.
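
    The cepstrum-based pitch estimator follows the classic recipe: windowed FFT, log magnitude, inverse FFT, then a peak search over the quefrency range of plausible pitch periods. A minimal version, with illustrative search limits, might look like this (the frame must span at least two pitch periods):

        import numpy as np

        def cepstral_pitch(frame, sr, fmin=60.0, fmax=400.0):
            """Estimate F0 of one voiced frame via the real cepstrum."""
            spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
            cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
            qmin, qmax = int(sr / fmax), int(sr / fmin)   # pitch-period range
            peak = qmin + int(np.argmax(cepstrum[qmin:qmax]))
            return sr / peak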

  4. A discrete speech recognition system for dermatology: 8 years of daily experience in a medical dermatology office.

    PubMed

    Smith, Kevin C

    2002-09-01

    A discrete speech recognition system was customized to recognize approximately 9,000 phrases and terms commonly used in dermatology, and has been used on a daily basis since 1984 to produce as many as 65 consult and follow up letters daily. The systems and procedures necessary for the practical use of this system are described, together with a discussion of the advantages and disadvantages of using this system to automate the transcription of dermatological dictation.

  5. The Timing and Effort of Lexical Access in Natural and Degraded Speech

    PubMed Central

    Wagner, Anita E.; Toffanin, Paolo; Başkent, Deniz

    2016-01-01

    Understanding speech is effortless in ideal situations, and although adverse conditions, such as caused by hearing impairment, often render it an effortful task, they do not necessarily suspend speech comprehension. A prime example of this is speech perception by cochlear implant users, whose hearing prostheses transmit speech as a significantly degraded signal. It is yet unknown how mechanisms of speech processing deal with such degraded signals, and whether they are affected by effortful processing of speech. This paper compares the automatic process of lexical competition between natural and degraded speech, and combines gaze fixations, which capture the course of lexical disambiguation, with pupillometry, which quantifies the mental effort involved in processing speech. Listeners’ ocular responses were recorded during disambiguation of lexical embeddings with matching and mismatching durational cues. Durational cues were selected due to their substantial role in listeners’ quick limitation of the number of lexical candidates for lexical access in natural speech. Results showed that lexical competition increased mental effort in processing natural stimuli in particular in presence of mismatching cues. Signal degradation reduced listeners’ ability to quickly integrate durational cues in lexical selection, and delayed and prolonged lexical competition. The effort of processing degraded speech was increased overall, and because it had its sources at the pre-lexical level this effect can be attributed to listening to degraded speech rather than to lexical disambiguation. In sum, the course of lexical competition was largely comparable for natural and degraded speech, but showed crucial shifts in timing, and different sources of increased mental effort. We argue that well-timed progress of information from sensory to pre-lexical and lexical stages of processing, which is the result of perceptual adaptation during speech development, is the reason why in ideal situations speech is perceived as an undemanding task. Degradation of the signal or the receiver channel can quickly bring this well-adjusted timing out of balance and lead to increase in mental effort. Incomplete and effortful processing at the early pre-lexical stages has its consequences on lexical processing as it adds uncertainty to the forming and revising of lexical hypotheses. PMID:27065901

  6. Influence of phonetic context on the dysphonic event: contribution of new methodologies for the analysis of pathological voice.

    PubMed

    Revis, J; Galant, C; Fredouille, C; Ghio, A; Giovanni, A

    2012-01-01

    Widely studied in terms of perception, acoustics and aerodynamics, dysphonia nevertheless remains a speech phenomenon, closely related to the phonetic composition of the message conveyed by the voice. In this paper, we present a series of three works aimed at understanding the implications of the phonetic manifestation of dysphonia. Our first study proposes a new approach to the perceptual analysis of dysphonia (phonetic labeling), whose principle is to listen to and evaluate each phoneme in a sentence separately. This study confirms Laver's hypothesis that dysphonia is not a constant noise added to the speech signal, but a discontinuous phenomenon occurring on certain phonemes, depending on the phonetic context. However, the burden of executing the task led us to look to the techniques of automatic speaker recognition (ASR) to automate the procedure. With the collaboration of the LIA, we developed a system for the automatic classification of dysphonia based on ASR techniques; this is the subject of our second study. The first results obtained with this system suggest that the unvoiced consonants carry predominant information in the task of automatic classification of dysphonia. This result is surprising, since it is often assumed that dysphonia affects only laryngeal vibration. We have started looking for explanations of this phenomenon, and we present our assumptions and experiments in the third study.

  7. GRIN2A

    PubMed Central

    Turner, Samantha J.; Mayes, Angela K.; Verhoeven, Andrea; Mandelstam, Simone A.; Morgan, Angela T.

    2015-01-01

    Objective: To delineate the specific speech deficits in individuals with epilepsy-aphasia syndromes associated with mutations in the glutamate receptor subunit gene GRIN2A. Methods: We analyzed the speech phenotype associated with GRIN2A mutations in 11 individuals, aged 16 to 64 years, from 3 families. Standardized clinical speech assessments and perceptual analyses of conversational samples were conducted. Results: Individuals showed a characteristic phenotype of dysarthria and dyspraxia with lifelong impact on speech intelligibility in some. Speech was typified by imprecise articulation (11/11, 100%), impaired pitch (monopitch 10/11, 91%) and prosody (stress errors 7/11, 64%), and hypernasality (7/11, 64%). Oral motor impairments and poor performance on maximum vowel duration (8/11, 73%) and repetition of monosyllables (10/11, 91%) and trisyllables (7/11, 64%) supported conversational speech findings. The speech phenotype was present in one individual who did not have seizures. Conclusions: Distinctive features of dysarthria and dyspraxia are found in individuals with GRIN2A mutations, often in the setting of epilepsy-aphasia syndromes; dysarthria has not been previously recognized in these disorders. Of note, the speech phenotype may occur in the absence of a seizure disorder, reinforcing an important role for GRIN2A in motor speech function. Our findings highlight the need for precise clinical speech assessment and intervention in this group. By understanding the mechanisms involved in GRIN2A disorders, targeted therapy may be designed to improve chronic lifelong deficits in intelligibility. PMID:25596506

  8. Perception and performance in flight simulators: The contribution of vestibular, visual, and auditory information

    NASA Technical Reports Server (NTRS)

    1979-01-01

    The pilot's perception and performance in flight simulators is examined. The areas investigated include: vestibular stimulation, flight management and man-cockpit information interfacing, and visual perception in flight simulation. The effects of higher levels of rotary acceleration on response time to constant acceleration, tracking performance, and thresholds for angular acceleration are examined. Areas of flight management examined are cockpit display of traffic information, workload, synthetic speech callouts during the landing phase of flight, perceptual factors in the use of a microwave landing system, automatic speech recognition, automation of aircraft operation, and total simulation of flight training.

  9. Subliminal speech priming.

    PubMed

    Kouider, Sid; Dupoux, Emmanuel

    2005-08-01

    We present a novel subliminal priming technique that operates in the auditory modality. Masking is achieved by hiding a spoken word within a stream of time-compressed speechlike sounds with similar spectral characteristics. Participants were unable to consciously identify the hidden words, yet reliable repetition priming was found. This effect was unaffected by a change in the speaker's voice and remained restricted to lexical processing. The results show that the speech modality, like the written modality, involves the automatic extraction of abstract word-form representations that do not include nonlinguistic details. In both cases, priming operates at the level of discrete and abstract lexical entries and is little influenced by overlap in form or semantics.

  10. Neural networks: Alternatives to conventional techniques for automatic docking

    NASA Technical Reports Server (NTRS)

    Vinz, Bradley L.

    1994-01-01

    Automatic docking of orbiting spacecraft is a crucial operation involving the identification of vehicle orientation as well as complex approach dynamics. The chaser spacecraft must be able to recognize the target spacecraft within a scene and achieve accurate closing maneuvers. In a video-based system, a target scene must be captured and transformed into a pattern of pixels. Successful recognition lies in the interpretation of this pattern. Due to their powerful pattern recognition capabilities, artificial neural networks offer a potential role in interpretation and automatic docking processes. Neural networks can reduce the computational time required by existing image processing and control software. In addition, neural networks are capable of recognizing and adapting to changes in their dynamic environment, enabling enhanced performance, redundancy, and fault tolerance. Most neural networks are robust to failure, capable of continued operation with a slight degradation in performance after minor failures. This paper discusses the particular automatic docking tasks neural networks can perform as viable alternatives to conventional techniques.

  11. The Roles of Suprasegmental Features in Predicting English Oral Proficiency with an Automated System

    ERIC Educational Resources Information Center

    Kang, Okim; Johnson, David

    2018-01-01

    Suprasegmental features have received growing attention in the field of oral assessment. In this article we describe a set of computer algorithms that automatically scores the oral proficiency of non-native speakers using unconstrained English speech. The algorithms employ machine learning and 11 suprasegmental measures divided into four groups…

  12. Spoken Grammar Practice and Feedback in an ASR-Based CALL System

    ERIC Educational Resources Information Center

    de Vries, Bart Penning; Cucchiarini, Catia; Bodnar, Stephen; Strik, Helmer; van Hout, Roeland

    2015-01-01

    Speaking practice is important for learners of a second language. Computer assisted language learning (CALL) systems can provide attractive opportunities for speaking practice when combined with automatic speech recognition (ASR) technology. In this paper, we present a CALL system that offers spoken practice of word order, an important aspect of…

  13. Effectiveness of Feedback for Enhancing English Pronunciation in an ASR-Based CALL System

    ERIC Educational Resources Information Center

    Wang, Y.-H.; Young, S. S.-C.

    2015-01-01

    This paper presents a study on implementing the ASR-based CALL (computer-assisted language learning based upon automatic speech recognition) system embedded with both formative and summative feedback approaches and using implicit and explicit strategies to enhance adult and young learners' English pronunciation. Two groups of learners including 18…

  14. The Influence of Anticipation of Word Misrecognition on the Likelihood of Stuttering

    ERIC Educational Resources Information Center

    Brocklehurst, Paul H.; Lickley, Robin J.; Corley, Martin

    2012-01-01

    This study investigates whether the experience of stuttering can result from the speaker's anticipation of his words being misrecognized. Twelve adults who stutter (AWS) repeated single words into what appeared to be an automatic speech-recognition system. Following each iteration of each word, participants provided a self-rating of whether they…

  15. Exploration of Acoustic Features for Automatic Vowel Discrimination in Spontaneous Speech

    ERIC Educational Resources Information Center

    Tyson, Na'im R.

    2012-01-01

    In an attempt to understand what acoustic/auditory feature sets motivated transcribers towards certain labeling decisions, I built machine learning models that were capable of discriminating between canonical and non-canonical vowels excised from the Buckeye Corpus. Specifically, I wanted to model when the dictionary form and the transcribed-form…

  16. Rapid and automatic speech-specific learning mechanism in human neocortex.

    PubMed

    Kimppa, Lilli; Kujala, Teija; Leminen, Alina; Vainio, Martti; Shtyrov, Yury

    2015-09-01

    A unique feature of the human communication system is our ability to rapidly acquire new words and build large vocabularies. However, its neurobiological foundations remain largely unknown. In an electrophysiological study optimally designed to probe this rapid formation of new word memory circuits, we employed acoustically controlled novel word-forms incorporating native and non-native speech sounds, while manipulating the subjects' attention to the input. We found a robust index of neurolexical memory-trace formation: a rapid enhancement of the brain's activation elicited by novel words during a short (~30 min) perceptual exposure, underpinned by fronto-temporal cortical networks and, importantly, correlated with behavioural learning outcomes. Crucially, this neural memory-trace build-up took place regardless of focused attention on the input or any pre-existing or learnt semantics. Furthermore, it was found only for stimuli with native-language phonology, but not for acoustically closely matching non-native words. These findings demonstrate a specialised cortical mechanism for rapid, automatic and phonology-dependent formation of neural word memory circuits. Copyright © 2015. Published by Elsevier Inc.

  17. The influence of tone inventory on ERP without focal attention: a cross-language study.

    PubMed

    Zheng, Hong-Ying; Peng, Gang; Chen, Jian-Yong; Zhang, Caicai; Minett, James W; Wang, William S-Y

    2014-01-01

    This study investigates the effect of tone inventories on brain activities underlying pitch without focal attention. We find that the electrophysiological responses to across-category stimuli are larger than those to within-category stimuli when the pitch contours are superimposed on nonspeech stimuli; however, there is no electrophysiological response difference associated with category status in speech stimuli. Moreover, this category effect in nonspeech stimuli is stronger for Cantonese speakers. Results of previous and present studies lead us to conclude that brain activities to the same native lexical tone contrasts are modulated by speakers' language experiences not only in active phonological processing but also in automatic feature detection without focal attention. In contrast to the condition with focal attention, where phonological processing is stronger for speech stimuli, the feature detection (pitch contours in this study) without focal attention as shaped by language background is superior in relatively regular stimuli, that is, the nonspeech stimuli. The results suggest that Cantonese listeners outperform Mandarin listeners in automatic detection of pitch features because of the denser Cantonese tone system.

  18. Second-language learning effects on automaticity of speech processing of Japanese phonetic contrasts: An MEG study.

    PubMed

    Hisagi, Miwako; Shafer, Valerie L; Miyagawa, Shigeru; Kotek, Hadas; Sugawara, Ayaka; Pantazis, Dimitrios

    2016-12-01

    We examined discrimination of a second-language (L2) vowel duration contrast in English learners of Japanese (JP) with different amounts of experience using the magnetoencephalography mismatch field (MMF) component. Twelve L2 learners were tested before and after a second semester of college-level JP; half attended a regular rate course and half an accelerated course with more hours per week. Results showed no significant change in MMF for either the regular or accelerated learning group from beginning to end of the course. We also compared these groups against nine L2 learners who had completed four semesters of college-level JP. These 4-semester learners did not significantly differ from 2-semester learners, in that only a difference in hemisphere activation (interacting with time) between the two groups approached significance. These findings suggest that targeted training of L2 phonology may be necessary to allow for changes in processing of L2 speech contrasts at an early, automatic level. Copyright © 2016 Elsevier B.V. All rights reserved.

  19. A Spondee Recognition Test for Young Hearing-Impaired Children

    ERIC Educational Resources Information Center

    Cramer, Kathryn D.; Erber, Norman P.

    1974-01-01

    An auditory test of 10 spondaic words recorded on Language Master cards was presented monaurally, through insert receivers to 58 hearing-impaired young children to evaluate their ability to recognize familiar speech material. (MYS)

  20. Intra-oral pressure-based voicing control of electrolaryngeal speech with intra-oral vibrator.

    PubMed

    Takahashi, Hirokazu; Nakao, Masayuki; Kikuchi, Yataro; Kaga, Kimitaka

    2008-07-01

    In normal speech, coordinated activities of intrinsic laryngeal muscles suspend the glottal sound during the utterance of voiceless consonants, automatically realizing voicing control. In electrolaryngeal speech, however, the lack of voicing control is one cause of unclear voice, with voiceless consonants tending to be misheard as the corresponding voiced consonants. In the present work, we developed an intra-oral vibrator with an intra-oral pressure sensor that detected the utterance of voiceless phonemes during intra-oral electrolaryngeal speech, and demonstrated that intra-oral pressure-based voicing control could improve the intelligibility of the speech. The test voices were obtained from one electrolaryngeal speaker and one normal speaker. Using speech analysis software, we first investigated how the voice onset time (VOT) and first formant (F1) transition of the test consonant-vowel syllables contributed to voiceless/voiced contrasts, and developed an adequate voicing control strategy. We then compared the intelligibility of consonant-vowel syllables in intra-oral electrolaryngeal speech with and without online voicing control. An increase in intra-oral pressure, typically with a peak ranging from 10 to 50 gf/cm2, could reliably identify the utterance of voiceless consonants. The speech analysis and intelligibility test then demonstrated that a short VOT caused misidentification of the voiced consonants due to a clear F1 transition. Finally, taking these results together, the online voicing control, which suspended the prosthetic tone while the intra-oral pressure exceeded 2.5 gf/cm2 and during the 35 milliseconds that followed, proved effective in improving the voiceless/voiced contrast.
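
    A minimal sketch of the suspend-and-hold rule described above, using the 2.5 gf/cm2 threshold and 35 ms hold time reported in the abstract (an illustration, not the authors' implementation; the function name and signal layout are hypothetical):

    ```python
    import numpy as np

    def voicing_gate(pressure, fs, threshold=2.5, hold_ms=35):
        """Return a boolean mask: True where the prosthetic tone should sound.

        pressure  : intra-oral pressure samples (gf/cm^2)
        fs        : sampling rate of the pressure signal (Hz)
        threshold : pressure above which a voiceless consonant is assumed
        hold_ms   : extra suspension time after pressure falls below threshold
        """
        hold = int(round(fs * hold_ms / 1000.0))
        voiced = np.ones(len(pressure), dtype=bool)
        last_exceed = -hold - 1
        for i, p in enumerate(pressure):
            if p > threshold:
                last_exceed = i
            # Suspend the tone while pressure is high and for `hold` samples after.
            if i - last_exceed <= hold:
                voiced[i] = False
        return voiced
    ```

    Applied to a sampled pressure signal, the mask switches the prosthetic tone off during voiceless consonants and turns it back on 35 ms after the pressure falls below the threshold.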

  1. Implementation of the Intelligent Voice System for Kazakh

    NASA Astrophysics Data System (ADS)

    Yessenbayev, Zh; Saparkhojayev, N.; Tibeyev, T.

    2014-04-01

    Modern speech technologies are highly advanced and widely used in day-to-day applications. However, this is mostly concerned with the languages of well-developed countries such as English, German, Japan, Russian, etc. As for Kazakh, the situation is less prominent and research in this field is only starting to evolve. In this research and application-oriented project, we introduce an intelligent voice system for the fast deployment of call-centers and information desks supporting Kazakh speech. The demand on such a system is obvious if the country's large size and small population is considered. The landline and cell phones become the only means of communication for the distant villages and suburbs. The system features Kazakh speech recognition and synthesis modules as well as a web-GUI for efficient dialog management. For speech recognition we use CMU Sphinx engine and for speech synthesis- MaryTTS. The web-GUI is implemented in Java enabling operators to quickly create and manage the dialogs in user-friendly graphical environment. The call routines are handled by Asterisk PBX and JBoss Application Server. The system supports such technologies and protocols as VoIP, VoiceXML, FastAGI, Java SpeechAPI and J2EE. For the speech recognition experiments we compiled and used the first Kazakh speech corpus with the utterances from 169 native speakers. The performance of the speech recognizer is 4.1% WER on isolated word recognition and 6.9% WER on clean continuous speech recognition tasks. The speech synthesis experiments include the training of male and female voices.
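
    The 4.1% and 6.9% figures are word error rates. For reference, WER is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the recognizer output, divided by the reference length; a minimal sketch:

    ```python
    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / len(reference)."""
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("one two three", "one too three"))  # 0.333...
    ```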

  2. Sensory Intelligence for Extraction of an Abstract Auditory Rule: A Cross-Linguistic Study.

    PubMed

    Guo, Xiao-Tao; Wang, Xiao-Dong; Liang, Xiu-Yuan; Wang, Ming; Chen, Lin

    2018-02-21

    In a complex linguistic environment, while speech sounds can greatly vary, some shared features are often invariant. These invariant features constitute so-called abstract auditory rules. Our previous study has shown that with auditory sensory intelligence, the human brain can automatically extract the abstract auditory rules in the speech sound stream, presumably serving as the neural basis for speech comprehension. However, whether the sensory intelligence for extraction of abstract auditory rules in speech is inherent or experience-dependent remains unclear. To address this issue, we constructed a complex speech sound stream using auditory materials in Mandarin Chinese, in which syllables had a flat lexical tone but differed in other acoustic features to form an abstract auditory rule. This rule was occasionally and randomly violated by the syllables with the rising, dipping or falling tone. We found that both Chinese and foreign speakers detected the violations of the abstract auditory rule in the speech sound stream at a pre-attentive stage, as revealed by the whole-head recordings of mismatch negativity (MMN) in a passive paradigm. However, MMNs peaked earlier in Chinese speakers than in foreign speakers. Furthermore, Chinese speakers showed different MMN peak latencies for the three deviant types, which paralleled recognition points. These findings indicate that the sensory intelligence for extraction of abstract auditory rules in speech sounds is innate but shaped by language experience. Copyright © 2018 IBRO. Published by Elsevier Ltd. All rights reserved.

  3. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity

    NASA Astrophysics Data System (ADS)

    Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.

    2016-10-01

    Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
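
    The decoder described here combines per-frame phoneme likelihood estimates with n-gram transition probabilities. A generic log-domain Viterbi decoder of that kind might look like the following sketch (a textbook illustration, not the authors' code; a bigram phoneme language model is assumed):

    ```python
    import numpy as np

    def viterbi_decode(log_likes, log_trans, log_prior):
        """Most probable phoneme sequence given per-frame scores.

        log_likes : (T, P) log-likelihood of each phoneme at each frame
                    (e.g. from an LDA model over neural features)
        log_trans : (P, P) log bigram transition probabilities
        log_prior : (P,)   log initial-state probabilities
        """
        T, P = log_likes.shape
        delta = log_prior + log_likes[0]
        back = np.zeros((T, P), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans      # scores[i, j]: prev i -> next j
            back[t] = scores.argmax(axis=0)          # best predecessor per phoneme
            delta = scores.max(axis=0) + log_likes[t]
        path = [int(delta.argmax())]                 # backtrack from the best end state
        for t in range(T - 1, 0, -1):
            path.append(back[t][path[-1]])
        return path[::-1]
    ```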

  4. Dissecting auditory verbal hallucinations into two components: audibility (Gedankenlautwerden) and alienation (thought insertion).

    PubMed

    Sommer, Iris E; Selten, Jean-Paul; Diederen, Kelly M; Blom, Jan Dirk

    2010-01-01

    This study proposes a theoretical framework which dissects auditory verbal hallucinations (AVH) into 2 essential components: audibility and alienation. Audibility, the perceptual aspect of AVH, may result from a disinhibition of the auditory cortex in response to self-generated speech. In isolation, this aspect leads to audible thoughts: Gedankenlautwerden. The second component is alienation, which is the failure to recognize the content of AVH as self-generated. This failure may be related to the fact that cerebral activity associated with AVH is predominantly present in the speech production area of the right hemisphere. Since normal inner speech is derived from the left speech area, an aberrant source may lead to confusion about the origin of the language fragments. When alienation is not accompanied by audibility, it will result in the experience of thought insertion. The 2 hypothesized components are illustrated using case vignettes. Copyright 2010 S. Karger AG, Basel.

  5. Effective Prediction of Errors by Non-native Speakers Using Decision Tree for Speech Recognition-Based CALL System

    NASA Astrophysics Data System (ADS)

    Wang, Hongcui; Kawahara, Tatsuya

    CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or to the ASR grammar network. However, this approach quickly runs into a trade-off between coverage of errors and increased perplexity. To solve this problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, that achieves both better coverage of errors and smaller perplexity, resulting in a significant improvement in ASR accuracy.
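
    As a rough illustration of the idea (not the authors' system), a decision tree can be trained to predict which error patterns a learner is likely to produce, so that only those are added to the grammar network. The features and data below are entirely hypothetical:

    ```python
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical training rows: one per (learner, candidate error pattern).
    # Features could encode the learner's proficiency level, the length of
    # the target word, and whether a known-difficult phoneme is involved.
    X = [
        [1, 4, 1],
        [2, 6, 0],
        [1, 3, 1],
        [3, 5, 1],
        [2, 4, 0],
        [3, 6, 1],
    ]
    y = [1, 0, 1, 0, 0, 1]  # 1 = error pattern observed in the training data

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

    # Only error patterns the tree predicts as likely are added to the ASR
    # grammar network, improving error coverage without inflating perplexity.
    print(tree.predict([[1, 5, 1]]))
    ```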

  6. An Innovative Speech-Based User Interface for Smarthomes and IoT Solutions to Help People with Speech and Motor Disabilities.

    PubMed

    Malavasi, Massimiliano; Turri, Enrico; Atria, Jose Joaquin; Christensen, Heidi; Marxer, Ricard; Desideri, Lorenzo; Coy, Andre; Tamburini, Fabio; Green, Phil

    2017-01-01

    Better use of the increasing functional capabilities of home automation systems and Internet of Things (IoT) devices to support the needs of users with disabilities is the subject of a research project currently conducted by Area Ausili (Assistive Technology Area), a department of Polo Tecnologico Regionale Corte Roncati of the Local Health Trust of Bologna (Italy), in collaboration with the AIAS Ausilioteca Assistive Technology (AT) Team. The main aim of the project is to develop experimental low-cost systems for environmental control through simplified and accessible user interfaces. Many of the activities focus on automatic speech recognition and are developed in the framework of the CloudCAST project. In this paper we report on the first technical achievements of the project and discuss possible future developments and applications within and outside CloudCAST.

  7. Automatic mechanisms for measuring subjective unit of discomfort.

    PubMed

    Hartanto, D W I; Kang, Ni; Brinkman, Willem-Paul; Kampmann, Isabel L; Morina, Nexhmedin; Emmelkamp, Paul G M; Neerincx, Mark A

    2012-01-01

    Current practice in Virtual Reality Exposure Therapy (VRET) is that therapists ask patients about their anxiety level by means of the Subjective Unit of Discomfort (SUD) scale. With the aim of developing a home-based VRET system, this measurement should ideally be done using speech technology. In a VRET system for social phobia with scripted avatar-patient dialogues, the timing of asking patients to give their SUD score becomes relevant. This study examined three timing mechanisms: (1) dialogue-dependent (i.e., naturally in the flow of the dialogue); (2) speech-dependent (i.e., when both patient and avatar are silent); and (3) context-independent (i.e., randomly). Results of an experiment with non-patients (n = 24) showed a significant effect of the timing mechanism on perceived dialogue flow, user preference, reported presence, and user dialogue replies. Overall, the dialogue-dependent timing mechanism seems superior, followed by the speech-dependent and context-independent mechanisms.
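
    A sketch of how the three timing policies might be expressed in code (illustrative only; the function and flag names are hypothetical):

    ```python
    import random

    def should_ask_sud(timing, at_dialogue_break, patient_silent, avatar_silent,
                       p_random=0.05):
        """Decide whether to prompt the patient for a SUD score now.

        timing: 'dialogue' (at natural points in the scripted dialogue),
                'speech'   (only when both patient and avatar are silent),
                'random'   (context independent).
        """
        if timing == 'dialogue':
            return at_dialogue_break
        if timing == 'speech':
            return patient_silent and avatar_silent
        return random.random() < p_random  # context-independent prompting
    ```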

  8. Rapid tuning shifts in human auditory cortex enhance speech intelligibility

    PubMed Central

    Holdgraf, Christopher R.; de Heer, Wendy; Pasley, Brian; Rieger, Jochem; Crone, Nathan; Lin, Jack J.; Knight, Robert T.; Theunissen, Frédéric E.

    2016-01-01

    Experience shapes our perception of the world on a moment-to-moment basis. This robust perceptual effect of experience parallels a change in the neural representation of stimulus features, though the nature of this representation and its plasticity are not well-understood. Spectrotemporal receptive field (STRF) mapping describes the neural response to acoustic features, and has been used to study contextual effects on auditory receptive fields in animal models. We performed a STRF plasticity analysis on electrophysiological data from recordings obtained directly from the human auditory cortex. Here, we report rapid, automatic plasticity of the spectrotemporal response of recorded neural ensembles, driven by previous experience with acoustic and linguistic information, and with a neurophysiological effect in the sub-second range. This plasticity reflects increased sensitivity to spectrotemporal features, enhancing the extraction of more speech-like features from a degraded stimulus and providing the physiological basis for the observed 'perceptual enhancement' in understanding speech. PMID:27996965

  9. Emotion Recognition of Weblog Sentences Based on an Ensemble Algorithm of Multi-label Classification and Word Emotions

    NASA Astrophysics Data System (ADS)

    Li, Ji; Ren, Fuji

    Weblogs have greatly changed the ways mankind communicates. Affective analysis of blog posts has been found valuable for many applications such as text-to-speech synthesis and computer-assisted recommendation. Traditional emotion recognition in text based on single-label classification cannot satisfy the higher requirements of affective computing. In this paper, the automatic identification of sentence emotion in weblogs is modeled as a multi-label text categorization task. Experiments are carried out on 12273 blog sentences from the Chinese emotion corpus Ren_CECps with 8-dimension emotion annotation. An ensemble algorithm, RAKEL, is used to recognize dominant emotions from the writer's perspective. Our emotion feature, using a detailed intensity representation for word emotions, outperforms the other main features such as the word frequency feature and the traditional lexicon-based feature. In order to deal with relatively complex sentences, we integrate grammatical characteristics of punctuation, disjunctive connectives, modification relations, and negation into the features. This achieves increases of 13.51% and 12.49% in Micro-averaged F1 and Macro-averaged F1, respectively, compared to the traditional lexicon-based feature. The results show that multi-dimensional emotion representation with grammatical features can efficiently classify sentence emotion in a multi-label problem.
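
    For readers who want to experiment with the RAKEL idea (random k-labelsets, each handled as a label-powerset problem), the scikit-multilearn package provides an implementation; the sketch below assumes that package and uses random placeholder data, and is not the authors' setup:

    ```python
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from skmultilearn.ensemble import RakelD  # scikit-multilearn package

    # Toy placeholders: 100 sentences x 20 features (e.g. word-emotion
    # intensities), with 8 binary emotion labels per sentence.
    X = np.random.rand(100, 20)
    Y = (np.random.rand(100, 8) > 0.7).astype(int)

    # RAKEL with disjoint labelsets of size 3; each subset becomes a
    # multi-class (label powerset) problem for the base classifier.
    clf = RakelD(base_classifier=GaussianNB(),
                 base_classifier_require_dense=[True, True],
                 labelset_size=3)
    clf.fit(X, Y)
    print(clf.predict(X[:5]).toarray())  # multi-label predictions
    ```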

  10. Evaluation of an Expert System for the Generation of Speech and Language Therapy Plans.

    PubMed

    Robles-Bykbaev, Vladimir; López-Nores, Martín; García-Duque, Jorge; Pazos-Arias, José J; Arévalo-Lucero, Daysi

    2016-07-01

    Speech and language pathologists (SLPs) deal with a wide spectrum of disorders, arising from many different conditions, that affect voice, speech, language, and swallowing capabilities in different ways. Therefore, the outcomes of Speech and Language Therapy (SLT) are highly dependent on the accurate, consistent, and complete design of personalized therapy plans. However, SLPs often have very limited time to work with their patients and to browse the large (and growing) catalogue of activities and specific exercises that can be put into therapy plans. As a consequence, many plans are suboptimal and fail to address the specific needs of each patient. We aimed to evaluate an expert system that automatically generates plans for speech and language therapy, containing semiannual activities in the five areas of hearing, oral structure and function, linguistic formulation, expressive language and articulation, and receptive language. The goal was to assess whether the expert system speeds up the SLPs' work and leads to more accurate, consistent, and complete therapy plans for their patients. We examined the evaluation results of the SPELTA expert system in supporting the decision making of 4 SLPs treating children in three special education institutions in Ecuador. The expert system was first trained with data from 117 cases, including medical data; diagnosis for voice, speech, language and swallowing capabilities; and therapy plans created manually by the SLPs. It was then used to automatically generate new therapy plans for 13 new patients. The SLPs were finally asked to evaluate the accuracy, consistency, and completeness of those plans. A four-fold cross-validation experiment was also run on the original corpus of 117 cases in order to assess the significance of the results. The evaluation showed that 87% of the outputs provided by the SPELTA expert system were considered valid therapy plans for the different areas. The SLPs rated the overall accuracy, consistency, and completeness of the proposed activities with 4.65, 4.6, and 4.6 points (to a maximum of 5), respectively. The ratings for the subplans generated for the areas of hearing, oral structure and function, and linguistic formulation were nearly perfect, whereas the subplans for expressive language and articulation and for receptive language failed to deal properly with some of the subject cases. Overall, the SLPs indicated that over 90% of the subplans generated automatically were "better than" or "as good as" what the SLPs would have created manually if given the average time they can devote to the task. The cross-validation experiment yielded very similar results. The results show that the SPELTA expert system provides valuable input for SLPs to design proper therapy plans for their patients, in a shorter time and considering a larger set of activities than proceeding manually. The algorithms worked well even in the presence of a sparse corpus, and the evidence suggests that the system will become more reliable as it is trained with more subjects.

  11. Evaluation of an Expert System for the Generation of Speech and Language Therapy Plans

    PubMed Central

    Robles-Bykbaev, Vladimir; López-Nores, Martín; García-Duque, Jorge; Pazos-Arias, José J; Arévalo-Lucero, Daysi

    2016-01-01

    Background: Speech and language pathologists (SLPs) deal with a wide spectrum of disorders, arising from many different conditions, that affect voice, speech, language, and swallowing capabilities in different ways. Therefore, the outcomes of Speech and Language Therapy (SLT) are highly dependent on the accurate, consistent, and complete design of personalized therapy plans. However, SLPs often have very limited time to work with their patients and to browse the large (and growing) catalogue of activities and specific exercises that can be put into therapy plans. As a consequence, many plans are suboptimal and fail to address the specific needs of each patient. Objective: We aimed to evaluate an expert system that automatically generates plans for speech and language therapy, containing semiannual activities in the five areas of hearing, oral structure and function, linguistic formulation, expressive language and articulation, and receptive language. The goal was to assess whether the expert system speeds up the SLPs’ work and leads to more accurate, consistent, and complete therapy plans for their patients. Methods: We examined the evaluation results of the SPELTA expert system in supporting the decision making of 4 SLPs treating children in three special education institutions in Ecuador. The expert system was first trained with data from 117 cases, including medical data; diagnosis for voice, speech, language and swallowing capabilities; and therapy plans created manually by the SLPs. It was then used to automatically generate new therapy plans for 13 new patients. The SLPs were finally asked to evaluate the accuracy, consistency, and completeness of those plans. A four-fold cross-validation experiment was also run on the original corpus of 117 cases in order to assess the significance of the results. Results: The evaluation showed that 87% of the outputs provided by the SPELTA expert system were considered valid therapy plans for the different areas. The SLPs rated the overall accuracy, consistency, and completeness of the proposed activities with 4.65, 4.6, and 4.6 points (to a maximum of 5), respectively. The ratings for the subplans generated for the areas of hearing, oral structure and function, and linguistic formulation were nearly perfect, whereas the subplans for expressive language and articulation and for receptive language failed to deal properly with some of the subject cases. Overall, the SLPs indicated that over 90% of the subplans generated automatically were “better than” or “as good as” what the SLPs would have created manually if given the average time they can devote to the task. The cross-validation experiment yielded very similar results. Conclusions: The results show that the SPELTA expert system provides valuable input for SLPs to design proper therapy plans for their patients, in a shorter time and considering a larger set of activities than proceeding manually. The algorithms worked well even in the presence of a sparse corpus, and the evidence suggests that the system will become more reliable as it is trained with more subjects. PMID:27370070

  12. Flexible, rapid and automatic neocortical word form acquisition mechanism in children as revealed by neuromagnetic brain response dynamics.

    PubMed

    Partanen, Eino; Leminen, Alina; de Paoli, Stine; Bundgaard, Anette; Kingo, Osman Skjold; Krøjgaard, Peter; Shtyrov, Yury

    2017-07-15

    Children learn new words and word forms with ease, often acquiring a new word after very few repetitions. Recent neurophysiological research on word form acquisition in adults indicates that novel words can be acquired within minutes of repetitive exposure to them, regardless of the individual's focused attention on the speech input. Although it is well-known that children surpass adults in language acquisition, the developmental aspects of such rapid and automatic neural acquisition mechanisms remain unexplored. To address this open question, we used magnetoencephalography (MEG) to scrutinise brain dynamics elicited by spoken words and word-like sounds in healthy monolingual (Danish) children throughout a 20-min repetitive passive exposure session. We found rapid neural dynamics manifested as an enhancement of early (~100 ms) brain activity over the short exposure session, with distinct spatiotemporal patterns for different novel sounds. For novel Danish word forms, signs of such enhancement were seen in the left temporal regions only, suggesting reliance on pre-existing language circuits for acquisition of novel word forms with native phonology. In contrast, exposure both to novel word forms with non-native phonology and to novel non-speech sounds led to activity enhancement in both left and right hemispheres, suggesting that more widespread cortical networks contribute to the build-up of memory traces for non-native and non-speech sounds. Similar studies in adults have previously reported more sluggish (~15-25 min, as opposed to 4 min in the present study) or non-existent neural dynamics for non-native sound acquisition, which might be indicative of a higher degree of plasticity in the children's brain. Overall, the results indicate a rapid and highly plastic mechanism for a dynamic build-up of memory traces for novel acoustic information in the children's brain that operates automatically and recruits bilateral temporal cortical circuits. Copyright © 2017 Elsevier Inc. All rights reserved.

  13. Automatic conversational scene analysis in children with Asperger syndrome/high-functioning autism and typically developing peers.

    PubMed

    Tavano, Alessandro; Pesarin, Anna; Murino, Vittorio; Cristani, Marco

    2014-01-01

    Individuals with Asperger syndrome/High Functioning Autism fail to spontaneously attribute mental states to the self and others, a life-long phenotypic characteristic known as mindblindness. We hypothesized that mindblindness would affect the dynamics of conversational interaction. Using generative models, in particular Gaussian mixture models and observed influence models, conversations were coded as interacting Markov processes, operating on novel speech/silence patterns, termed Steady Conversational Periods (SCPs). SCPs assume that whenever an agent's process changes state (e.g., from silence to speech), it causes a general transition of the entire conversational process, forcing inter-actant synchronization. SCPs fed into observed influence models, which captured the conversational dynamics of children and adolescents with Asperger syndrome/High Functioning Autism, and age-matched typically developing participants. Analyzing the parameters of the models by means of discriminative classifiers, the dialogs of patients were successfully distinguished from those of control participants. We conclude that meaning-free speech/silence sequences, reflecting inter-actant synchronization, at least partially encode typical and atypical conversational dynamics. This suggests a direct influence of theory of mind abilities onto basic speech initiative behavior.
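
    A minimal sketch of how SCPs can be derived from per-speaker speech/silence tracks, following the definition above (illustrative; the names are hypothetical):

    ```python
    def steady_conversational_periods(tracks):
        """Segment a conversation into Steady Conversational Periods (SCPs).

        tracks: list of equal-length binary sequences, one per speaker
                (1 = speech, 0 = silence), sampled on a common frame grid.
        Returns (start, end, joint_state) triples; a new SCP begins whenever
        any speaker changes state, i.e. whenever the joint state changes.
        """
        joint = list(zip(*tracks))
        periods, start = [], 0
        for i in range(1, len(joint)):
            if joint[i] != joint[i - 1]:
                periods.append((start, i, joint[start]))
                start = i
        periods.append((start, len(joint), joint[start]))
        return periods

    # Two speakers, 1 = speech: SCP boundaries fall at every state change.
    print(steady_conversational_periods([[0, 1, 1, 1, 0], [0, 0, 1, 1, 1]]))
    ```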

  14. Sensory-motor relationships in speech production in post-lingually deaf cochlear-implanted adults and normal-hearing seniors: Evidence from phonetic convergence and speech imitation.

    PubMed

    Scarbel, Lucie; Beautemps, Denis; Schwartz, Jean-Luc; Sato, Marc

    2017-07-01

    Speech communication can be viewed as an interactive process involving a functional coupling between sensory and motor systems. One striking example comes from phonetic convergence, when speakers automatically tend to mimic their interlocutor's speech during communicative interaction. The goal of this study was to investigate sensory-motor linkage in speech production in post-lingually deaf cochlear-implanted participants and normal-hearing elderly adults through phonetic convergence and imitation. To this aim, two vowel production tasks, with or without instruction to imitate an acoustic vowel, were proposed to three groups: young adults with normal hearing, elderly adults with normal hearing, and post-lingually deaf cochlear-implanted patients. The deviation of each participant's f0 from their own mean f0 was measured to evaluate the ability to converge to each acoustic target. Results showed that cochlear-implanted participants are able to converge to an acoustic target, both intentionally and unintentionally, albeit to a lesser degree than young and elderly participants with normal hearing. By providing evidence for phonetic convergence and speech imitation, these results suggest that, as in young adults, perceptuo-motor relationships are efficient in elderly adults with normal hearing, and that cochlear-implanted adults recovered significant perceptuo-motor abilities following cochlear implantation. Copyright © 2017 Elsevier Ltd. All rights reserved.

  15. Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation

    PubMed Central

    Banks, Briony; Gowen, Emma; Munro, Kevin J.; Adank, Patti

    2015-01-01

    Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker’s facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants’ eye gaze was recorded to verify that they looked at the speaker’s face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation. PMID:26283946

  16. Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.

    PubMed

    Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti

    2015-01-01

    Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.

  17. Long Term Suboxone™ Emotional Reactivity As Measured by Automatic Detection in Speech

    PubMed Central

    Hill, Edward; Han, David; Dumouchel, Pierre; Dehak, Najim; Quatieri, Thomas; Moehs, Charles; Oscar-Berman, Marlene; Giordano, John; Simpatico, Thomas; Blum, Kenneth

    2013-01-01

    Addictions to illicit drugs are among the nation’s most critical public health and societal problems. The current opioid prescription epidemic and the need for buprenorphine/naloxone (Suboxone®; SUBX) as an opioid maintenance substance, and its growing street diversion provided impetus to determine affective states (“true ground emotionality”) in long-term SUBX patients. Toward the goal of effective monitoring, we utilized emotion-detection in speech as a measure of “true” emotionality in 36 SUBX patients compared to 44 individuals from the general population (GP) and 33 members of Alcoholics Anonymous (AA). Other less objective studies have investigated emotional reactivity of heroin, methadone and opioid abstinent patients. These studies indicate that current opioid users have abnormal emotional experience, characterized by heightened response to unpleasant stimuli and blunted response to pleasant stimuli. However, this is the first study to our knowledge to evaluate “true ground” emotionality in long-term buprenorphine/naloxone combination (Suboxone™). We found in long-term SUBX patients a significantly flat affect (p<0.01), and they had less self-awareness of being happy, sad, and anxious compared to both the GP and AA groups. We caution definitive interpretation of these seemingly important results until we compare the emotional reactivity of an opioid abstinent control using automatic detection in speech. These findings encourage continued research strategies in SUBX patients to target the specific brain regions responsible for relapse prevention of opioid addiction. PMID:23874860

  18. Effects of noise on speech recognition: Challenges for communication by service members.

    PubMed

    Le Prell, Colleen G; Clavier, Odile H

    2017-06-01

    Speech communication often takes place in noisy environments; this is an urgent issue for military personnel who must communicate in high-noise environments. The effects of noise on speech recognition vary significantly according to the sources of noise, the number and types of talkers, and the listener's hearing ability. In this review, speech communication is first described as it relates to current standards of hearing assessment for military and civilian populations. The next section categorizes types of noise (also called maskers) according to their temporal characteristics (steady or fluctuating) and perceptive effects (energetic or informational masking). Next, speech recognition difficulties experienced by listeners with hearing loss and by older listeners are summarized, and questions on the possible causes of speech-in-noise difficulty are discussed, including recent suggestions of "hidden hearing loss". The final section describes tests used by military and civilian researchers, audiologists, and hearing technicians to assess performance of an individual in recognizing speech in background noise, as well as metrics that predict performance based on a listener and background noise profile. This article provides readers with an overview of the challenges associated with speech communication in noisy backgrounds, as well as its assessment and potential impact on functional performance, and provides guidance for important new research directions relevant not only to military personnel, but also to employees who work in high noise environments. Copyright © 2016 Elsevier B.V. All rights reserved.

  19. Improved segregation of simultaneous talkers differentially affects perceptual and cognitive capacity demands for recognizing speech in competing speech.

    PubMed

    Francis, Alexander L

    2010-02-01

    Perception of speech in competing speech is facilitated by spatial separation of the target and distracting speech, but this benefit may arise at either a perceptual or a cognitive level of processing. Load theory predicts different effects of perceptual and cognitive (working memory) load on selective attention in flanker task contexts, suggesting that this paradigm may be used to distinguish levels of interference. Two experiments examined interference from competing speech during a word recognition task under different perceptual and working memory loads in a dual-task paradigm. Listeners identified words produced by a talker of one gender while ignoring a talker of the other gender. Perceptual load was manipulated using a nonspeech response cue, with response conditional upon either one or two acoustic features (pitch and modulation). Memory load was manipulated with a secondary task consisting of one or six visually presented digits. In the first experiment, the target and distractor were presented at different virtual locations (0° and 90°, respectively), whereas in the second, all the stimuli were presented from the same apparent location. Results suggest that spatial cues improve resistance to distraction in part by reducing working memory demand.

  20. Emerging technologies with potential for objectively evaluating speech recognition skills.

    PubMed

    Rawool, Vishakha Waman

    2016-01-01

    Work-related exposure to noise and other ototoxins can cause damage to the cochlea, synapses between the inner hair cells, the auditory nerve fibers, and higher auditory pathways, leading to difficulties in recognizing speech. Procedures designed to determine speech recognition scores (SRS) in an objective manner can be helpful in disability compensation cases where the worker claims to have poor speech perception due to exposure to noise or ototoxins. Such measures can also be helpful in determining SRS in individuals who cannot provide reliable responses to speech stimuli, including patients with Alzheimer's disease, traumatic brain injuries, and infants with and without hearing loss. Cost-effective neural monitoring hardware and software is being rapidly refined due to the high demand for neurogaming (games involving the use of brain-computer interfaces), health, and other applications. More specifically, two related advances in neuro-technology include relative ease in recording neural activity and availability of sophisticated analysing techniques. These techniques are reviewed in the current article and their applications for developing objective SRS procedures are proposed. Issues related to neuroaudioethics (ethics related to collection of neural data evoked by auditory stimuli including speech) and neurosecurity (preservation of a person's neural mechanisms and free will) are also discussed.

  1. The ability of cochlear implant users to use temporal envelope cues recovered from speech frequency modulation.

    PubMed

    Won, Jong Ho; Lorenzi, Christian; Nie, Kaibao; Li, Xing; Jameyson, Elyse M; Drennan, Ward R; Rubinstein, Jay T

    2012-08-01

    Previous studies have demonstrated that normal-hearing listeners can understand speech using the recovered "temporal envelopes," i.e., amplitude modulation (AM) cues from frequency modulation (FM). This study evaluated this mechanism in cochlear implant (CI) users for consonant identification. Stimuli containing only FM cues were created using 1, 2, 4, and 8-band FM-vocoders to determine if consonant identification performance would improve as the recovered AM cues become more available. A consistent improvement was observed as the band number decreased from 8 to 1, supporting the hypothesis that (1) the CI sound processor generates recovered AM cues from broadband FM, and (2) CI users can use the recovered AM cues to recognize speech. The correlation between the intact and the recovered AM components at the output of the sound processor was also generally higher when the band number was low, supporting the consonant identification results. Moreover, CI subjects who were better at using recovered AM cues from broadband FM cues showed better identification performance with intact (unprocessed) speech stimuli. This suggests that speech perception performance variability in CI users may be partly caused by differences in their ability to use AM cues recovered from FM speech cues.
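
    To illustrate how AM cues can be "recovered" from an FM-only signal: narrow-band filtering reintroduces envelope fluctuations as the instantaneous frequency sweeps in and out of each analysis band, and a Hilbert transform can extract them. A sketch under these assumptions (not the authors' vocoder code; the test signal is a synthetic FM tone):

    ```python
    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    def band_envelope(x, fs, lo, hi):
        """Temporal envelope of x within one analysis band (Hilbert magnitude)."""
        sos = butter(4, [lo, hi], btype='band', fs=fs, output='sos')
        return np.abs(hilbert(sosfiltfilt(sos, x)))

    # Synthetic FM-only tone: 1 kHz carrier, instantaneous frequency swinging
    # roughly 800-1200 Hz at a 4 Hz rate, with a flat (constant) amplitude.
    fs = 16000
    t = np.arange(0, 1.0, 1 / fs)
    fm_only = np.cos(2 * np.pi * (1000 * t + 50 * np.sin(2 * np.pi * 4 * t)))

    # Filtering into a 900-1100 Hz band "recovers" a 4 Hz AM fluctuation,
    # even though the original signal carried no amplitude modulation.
    env = band_envelope(fm_only, fs, 900, 1100)
    ```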

  2. Limb versus speech motor control: a conceptual review.

    PubMed

    Grimme, Britta; Fuchs, Susanne; Perrier, Pascal; Schöner, Gregor

    2011-01-01

    This paper presents a comparative conceptual review of speech and limb motor control. Speech is essentially cognitive in nature and constrained by the rules of language, while limb movement is often oriented to physical objects. We discuss the issue of intrinsic vs. extrinsic variables underlying the representations of motor goals as well as whether motor goals specify terminal postures or entire trajectories. Timing and coordination is recognized as an area of strong interchange between the two domains. Although coordination among different motor acts within a sequence and coarticulation are central to speech motor control, they have received only limited attention in manipulatory movements. The biomechanics of speech production is characterized by the presence of soft tissue, a variable number of degrees of freedom, and the challenges of high rates of production, while limb movements deal more typically with inertial constraints from manipulated objects. This comparative review thus leads us to identify many strands of thinking that are shared across the two domains, but also points us to issues on which approaches in the two domains differ. We conclude that conceptual interchange between the fields of limb and speech motor control has been useful in the past and promises continued benefit.

  3. Mandarin-Speaking Children's Speech Recognition: Developmental Changes in the Influences of Semantic Context and F0 Contours.

    PubMed

    Zhou, Hong; Li, Yu; Liang, Meng; Guan, Connie Qun; Zhang, Linjun; Shu, Hua; Zhang, Yang

    2017-01-01

    The goal of this developmental speech perception study was to assess whether and how age group modulated the influences of high-level semantic context and low-level fundamental frequency (F0) contours on the recognition of Mandarin speech by elementary and middle-school-aged children in quiet and interference backgrounds. The results revealed different patterns for semantic and F0 information. On the one hand, age group significantly modulated the use of F0 contours, indicating that elementary school children relied more on natural F0 contours than middle school children during Mandarin speech recognition. On the other hand, there was no significant modulation effect of age group on semantic context, indicating that children of both age groups used semantic context to assist speech recognition to a similar extent. Furthermore, the significant modulation effect of age group on the interaction between F0 contours and semantic context revealed that younger children could not make better use of semantic context in recognizing speech with flat F0 contours compared with natural F0 contours, while older children could benefit from semantic context even when natural F0 contours were altered, thus confirming the important role of F0 contours in Mandarin speech recognition by elementary school children. The developmental changes in the effects of high-level semantic and low-level F0 information on speech recognition might reflect the differences in auditory and cognitive resources associated with processing of the two types of information in speech perception.

  4. Community Health Workers perceptions in relation to speech and language disorders.

    PubMed

    Knochenhauer, Carla Cristina Lins Santos; Vianna, Karina Mary de Paiva

    2016-01-01

    To investigate the perceptions of Community Health Workers (CHW) regarding speech and language disorders, we conducted a cross-sectional study using a questionnaire on CHW knowledge of speech and language disorders. The research was carried out with CHW assigned to the Centro Sanitary District of Florianópolis. We interviewed 35 CHW, most of them (80%) female, with an average age of 47 years (standard deviation = 2.09 years). Of the professionals interviewed, 57% said they knew the work of the speech therapist, 57% believed there is no relationship between chronic diseases and speech therapy, and 97% considered the participation of Speech, Hearing and Language Sciences important in primary care. As for training, 88% of the CHW reported never having received training from a speech therapist; 75% had completed the Estratégia Amamenta e Alimenta Brasil training, 57% the Programa Capital Criança, and 41% the Programa Capital Idoso. CHW knowledge about the work of the speech therapist is still limited, but the importance of speech and language disorders is recognized in primary care. This lack of knowledge may be related to the absence of training actions and/or continuing education courses that could prepare these professionals to identify disorders and better educate the population during home visits. This study highlights the need for further research on training actions for these professionals.

  5. Cognitive Conflict in a Syllable Identification Task Causes Transient Activation of Speech Perception Area

    ERIC Educational Resources Information Center

    Saetrevik, Bjorn; Specht, Karsten

    2012-01-01

    It has previously been shown that task performance and frontal cortical activation increase after cognitive conflict. This has been argued to support a model of attention where the level of conflict automatically adjusts the amount of cognitive control applied. Conceivably, conflict could also modulate lower-level processing pathways, which would…

  6. Developing Swedish Spelling Exercises on the ICALL Platform Lärka

    ERIC Educational Resources Information Center

    Pijetlovic, Dijana; Volodina, Elena

    2013-01-01

    In this project we developed web services on the ICALL platform Lärka for automatic generation of Swedish spelling exercises using Text-To-Speech (TTS) technology which allows L2 learners to train their spelling and listening individually at home. The spelling exercises contain five different linguistic levels, whereby the language learner has the…

  7. The Game Embedded CALL System to Facilitate English Vocabulary Acquisition and Pronunciation

    ERIC Educational Resources Information Center

    Young, Shelley Shwu-Ching; Wang, Yi-Hsuan

    2014-01-01

    The aim of this study is to make a new attempt to explore the potential of integrating game strategies with automatic speech recognition technologies to provide learners with individual opportunities for English pronunciation learning. The study developed the Game Embedded CALL (GeCALL) system with two activities for on-line speaking practice. For…

  8. Frequency of Use Leads to Automaticity of Production: Evidence from Repair in Conversation

    ERIC Educational Resources Information Center

    Kapatsinski, Vsevolod

    2010-01-01

    In spontaneous speech, speakers sometimes replace a word they have just produced or started producing by another word. The present study reports that in these replacement repairs, low-frequency replaced words are more likely to be interrupted prior to completion than high-frequency words, providing support to the hypothesis that the production of…

  9. Arabic Language Modeling with Stem-Derived Morphemes for Automatic Speech Recognition

    ERIC Educational Resources Information Center

    Heintz, Ilana

    2010-01-01

    The goal of this dissertation is to introduce a method for deriving morphemes from Arabic words using stem patterns, a feature of Arabic morphology. The motivations are three-fold: modeling with morphemes rather than words should help address the out-of-vocabulary problem; working with stem patterns should prove to be a cross-dialectally valid…

  10. The Processing and Interpretation of Verb Phrase Ellipsis Constructions by Children at Normal and Slowed Speech Rates

    ERIC Educational Resources Information Center

    Callahan, Sarah M.; Walenski, Matthew; Love, Tracy

    2012-01-01

    Purpose: To examine children's comprehension of verb phrase (VP) ellipsis constructions in light of their automatic, online structural processing abilities and conscious, metalinguistic reflective skill. Method: Forty-two children ages 5 through 12 years listened to VP ellipsis constructions involving the strict/sloppy ambiguity (e.g., "The…

  11. Listening Difficulty Detection to Foster Second Language Listening with a Partial and Synchronized Caption System

    ERIC Educational Resources Information Center

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya

    2017-01-01

    This study proposes a method to detect problematic speech segments automatically for second language (L2) listeners, considering both lexical and acoustic aspects. It introduces a tool, Partial and Synchronized Caption (PSC), which provides assistance for language learners and fosters L2 listening skills. PSC presents purposively selected words…

  12. Histogram equalization with Bayesian estimation for noise robust speech recognition.

    PubMed

    Suh, Youngjoo; Kim, Hoirin

    2018-02-01

    The histogram equalization approach is an efficient feature normalization technique for noise-robust automatic speech recognition. However, it suffers from performance degradation when some fundamental conditions are not satisfied in the test environment. To remedy these limitations of the original histogram equalization methods, the class-based histogram equalization approach was proposed. Although this approach showed substantial performance improvement in noisy environments, it still suffers from performance degradation due to overfitting when test data are insufficient. To address this issue, the proposed histogram equalization technique employs Bayesian estimation when estimating the test cumulative distribution function. A previous study on the Aurora-4 task reported that the proposed approach provided substantial performance gains in speech recognition systems based on Gaussian mixture model-hidden Markov model acoustic modeling. In this work, the proposed approach was examined in speech recognition systems with deep neural network-hidden Markov models (DNN-HMM), the current mainstream speech recognition approach, where it also showed meaningful performance improvement over the conventional maximum likelihood estimation-based method. The fusion of the proposed features with the mel-frequency cepstral coefficients provided additional performance gains in DNN-HMM systems, which otherwise suffer from performance degradation in the clean test condition.
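
    For orientation, a minimal sketch of plain histogram equalization of features (without the class-based or Bayesian refinements discussed above): each test dimension is mapped through its empirical CDF and then through the inverse reference CDF. The Bayesian variant described in the abstract would additionally smooth the test CDF estimate, which helps when test data are scarce:

    ```python
    import numpy as np

    def histogram_equalize(test_feats, ref_feats):
        """Map each feature dimension of the test data onto the reference
        distribution: x -> F_ref^{-1}(F_test(x)).

        test_feats : (N, D) test features, e.g. noisy cepstral coefficients
        ref_feats  : (M, D) reference (training) features
        """
        out = np.empty_like(test_feats)
        for d in range(test_feats.shape[1]):
            ranks = test_feats[:, d].argsort().argsort()      # rank of each sample
            q = (ranks + 0.5) / len(test_feats)               # empirical test CDF
            out[:, d] = np.quantile(ref_feats[:, d], q)       # inverse reference CDF
        return out
    ```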

  13. Comparison of Classification Methods for Detecting Emotion from Mandarin Speech

    NASA Astrophysics Data System (ADS)

    Pao, Tsang-Long; Chen, Yu-Te; Yeh, Jun-Heng

    It is said that technology emerges from humanity. What is humanity? The very definition of humanity is emotion. Emotion is the basis for all human expression and the underlying theme behind everything that is done, said, thought or imagined. If computers can perceive and respond to human emotion, human-computer interaction will become more natural. Several classifiers have been adopted for automatically assigning an emotion category, such as anger, happiness or sadness, to a speech utterance. These classifiers were designed independently and tested on various emotional speech corpora, making it difficult to compare and evaluate their performance. In this paper, we first compared several popular classification methods and evaluated their performance by applying them to a Mandarin speech corpus consisting of five basic emotions: anger, happiness, boredom, sadness and neutral. The extracted feature streams contain MFCC, LPCC, and LPC. The experimental results show that the proposed WD-MKNN classifier achieves an accuracy of 81.4% for the 5-class emotion recognition task and outperforms other classification techniques, including KNN, MKNN, DW-KNN, LDA, QDA, GMM, HMM, SVM, and BPNN. Then, to verify the advantage of the proposed method, we compared these classifiers by applying them to another Mandarin expressive speech corpus consisting of two emotions. The experimental results still show that the proposed WD-MKNN outperforms the others.
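
    The abstract does not spell out the WD-MKNN formulation, so the sketch below shows only the generic idea behind such classifiers: a distance-weighted KNN vote over emotion labels, with each utterance assumed to be represented by a fixed-length feature vector (e.g., MFCC statistics).

    ```python
    import numpy as np

    def weighted_knn_predict(X_train, y_train, x, k=5):
        """Distance-weighted KNN: each of the k nearest training utterances
        votes for its emotion label with weight 1/distance. A generic
        stand-in for the paper's WD-MKNN, not its exact formulation."""
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        scores = {}
        for i in nearest:
            w = 1.0 / (dists[i] + 1e-9)
            scores[y_train[i]] = scores.get(y_train[i], 0.0) + w
        return max(scores, key=scores.get)

    # X_train: (n_utterances, n_features) per-utterance MFCC statistics;
    # y_train: labels such as "anger", "happiness", "boredom", "sadness",
    # "neutral".
    ```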

  14. Oral reading fluency analysis in patients with Alzheimer disease and asymptomatic control subjects.

    PubMed

    Martínez-Sánchez, F; Meilán, J J G; García-Sevilla, J; Carro, J; Arana, J M

    2013-01-01

    Many studies highlight that an impaired ability to communicate is one of the key clinical features of Alzheimer disease (AD). Purpose: To study the temporal organisation of speech in an oral reading task in patients with AD and in matched healthy controls using a semi-automatic method, and to evaluate that method's ability to discriminate between the 2 groups. Method: A test with an oral reading task was administered to 70 subjects, comprising 35 AD patients and 35 controls. Before speech samples were recorded, participants completed a battery of neuropsychological tests. There were no differences between groups with regard to age, sex, or educational level. Results: All of the study variables showed impairment in the AD group. AD patients' oral reading was marked by reduced speech and articulation rates, low effectiveness of phonation time, and increases in the number and proportion of pauses. Signal processing algorithms applied to reading fluency recordings were capable of differentiating between AD patients and controls with an accuracy of 80% (specificity 74.2%, sensitivity 77.1%) based on speech rate. Conclusions: Analysis of oral reading fluency may be useful as a tool for the objective study and quantification of speech deficits in AD.
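
    A minimal sketch of how such temporal measures can be computed appears below: an energy-based segmentation into speech and pause frames, yielding a phonation-time ratio and a pause proportion. The frame sizes and threshold are illustrative guesses, not the paper's algorithm.

    ```python
    import numpy as np

    def fluency_stats(signal, sr, frame_ms=25, hop_ms=10, thresh_db=-35.0):
        """Energy-based segmentation into speech and pause frames; a rough
        stand-in for the paper's semi-automatic temporal analysis."""
        frame = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        n = 1 + (len(signal) - frame) // hop
        energy = np.array([np.mean(signal[i*hop:i*hop+frame] ** 2)
                           for i in range(n)])
        db = 10 * np.log10(energy + 1e-12)
        voiced = db > db.max() + thresh_db     # threshold relative to peak
        phonation = voiced.sum() * hop / sr    # seconds judged as speech
        total = len(signal) / sr
        return {"phonation_ratio": phonation / total,
                "pause_proportion": 1.0 - phonation / total}
    ```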

  15. The influence of ambient speech on adult speech productions through unintentional imitation.

    PubMed

    Delvaux, Véronique; Soquet, Alain

    2007-01-01

    This paper deals with the influence of ambient speech on individual speech productions. A methodological framework is defined to gather the experimental data necessary to feed computer models simulating self-organisation in phonological systems. Two experiments were carried out. Experiment 1 was run on four native French speakers from two Belgian regiolects: two from Liège and two from Brussels. When exposed to the way of speaking of the other regiolect via loudspeakers, the speakers of one regiolect produced vowels that were significantly different from their typical realisations, and significantly closer to the way of speaking specific to the other regiolect. Experiment 2 replicated the results with 8 Mons speakers hearing a Liège speaker. A significant part of the imitative effect remained up to 10 min after the end of the exposure to the other regiolect's productions. As a whole, the results suggest that: (i) imitation occurs automatically and unintentionally; and (ii) the modified realisations leave a memory trace, in which case the mechanism may be better described as 'mimesis' than as 'imitation'. The potential effects of multiple imitative speech interactions on sound change are discussed, as well as the implications for a general theory of phonetic implementation and phonetic representation.

  16. Processing Electromyographic Signals to Recognize Words

    NASA Technical Reports Server (NTRS)

    Jorgensen, C. C.; Lee, D. D.

    2009-01-01

    A recently invented speech-recognition method applies to words that are articulated by means of the tongue and throat muscles but are otherwise not voiced or, at most, are spoken sotto voce. This method could satisfy a need for speech recognition under circumstances in which normal audible speech is difficult, poses a hazard, is disturbing to listeners, or compromises privacy. The method could also be used to augment traditional speech recognition by providing an additional source of information about articulator activity. The method can be characterized as intermediate between (1) conventional speech recognition through processing of voice sounds and (2) a method, not yet developed, of processing electroencephalographic signals to extract unspoken words directly from thoughts. This method involves computational processing of digitized electromyographic (EMG) signals from muscle innervation acquired by surface electrodes under a subject's chin near the tongue and on the side of the subject's throat near the larynx. After preprocessing, digitization, and feature extraction, EMG signals are processed by a neural-network pattern classifier, implemented in software, that performs the bulk of the recognition task.
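
    The summary does not specify NASA's preprocessing or network design, so the sketch below only illustrates the general pipeline shape under assumed choices: band-pass filtering, windowed RMS and zero-crossing features, and an off-the-shelf neural-network classifier.

    ```python
    import numpy as np
    from scipy.signal import butter, filtfilt
    from sklearn.neural_network import MLPClassifier

    def emg_features(x, sr, lo=20.0, hi=450.0, win_s=0.2):
        """Band-pass one raw EMG channel, then summarize fixed windows with
        RMS and zero-crossing rate -- generic surface-EMG features, not
        NASA's actual feature set."""
        b, a = butter(4, [lo / (sr / 2), hi / (sr / 2)], btype="band")
        x = filtfilt(b, a, x)
        n = int(sr * win_s)
        feats = []
        for i in range(0, len(x) - n + 1, n):
            seg = x[i:i + n]
            feats.append(np.sqrt(np.mean(seg ** 2)))   # RMS amplitude
            zc = np.abs(np.diff(np.signbit(seg).astype(np.int8)))
            feats.append(np.mean(zc))                  # zero-crossing rate
        return np.array(feats)

    # Hypothetical usage, assuming equal-length recordings and word labels:
    # X = np.stack([emg_features(rec, sr=2000) for rec in recordings])
    # clf = MLPClassifier(hidden_layer_sizes=(64,)).fit(X, word_labels)
    ```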

  17. Integrated approach for automatic target recognition using a network of collaborative sensors.

    PubMed

    Mahalanobis, Abhijit; Van Nevel, Alan

    2006-10-01

    We introduce what is believed to be a novel concept by which several sensors with automatic target recognition (ATR) capability collaborate to recognize objects. Such an approach would be suitable for netted systems in which the sensors and platforms can coordinate to optimize end-to-end performance. We use correlation filtering techniques to facilitate the development of the concept, although other ATR algorithms may be easily substituted. Essentially, a self-configuring geometry of netted platforms is proposed that positions the sensors optimally with respect to each other, and takes into account the interactions among the sensor, the recognition algorithms, and the classes of the objects to be recognized. We show how such a paradigm optimizes overall performance, and illustrate the collaborative ATR scheme for recognizing targets in synthetic aperture radar imagery by using viewing position as a sensor parameter.
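
    As a sketch of the correlation-filtering building block, the code below cross-correlates a scene with a template in the frequency domain and scores the match with a peak-to-sidelobe ratio. This is a plain matched filter under assumed parameters, not the trained filters or the netted-sensor optimization the paper develops.

    ```python
    import numpy as np

    def correlate(scene, template):
        """FFT-based cross-correlation of a 2-D scene with a template."""
        S = np.fft.fft2(scene)
        T = np.fft.fft2(template, s=scene.shape)
        return np.fft.ifft2(S * np.conj(T)).real

    def peak_to_sidelobe(plane, exclude=5):
        """Confidence score: correlation peak height relative to the
        statistics of the surrounding (sidelobe) region."""
        r, c = np.unravel_index(np.argmax(plane), plane.shape)
        mask = np.ones(plane.shape, dtype=bool)
        mask[max(0, r - exclude):r + exclude + 1,
             max(0, c - exclude):c + exclude + 1] = False
        side = plane[mask]
        return (plane[r, c] - side.mean()) / (side.std() + 1e-12)
    ```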

  18. A knowledge engineering approach to recognizing and extracting sequences of nucleic acids from scientific literature.

    PubMed

    García-Remesal, Miguel; Maojo, Victor; Crespo, José

    2010-01-01

    In this paper we present a knowledge engineering approach to automatically recognizing and extracting genetic sequences from scientific articles. To carry out this task, we use a preliminary recognizer based on a finite state machine to extract all candidate DNA/RNA sequences. These candidates are then fed into a knowledge-based system that automatically discards false positives and refines noisy and incorrectly merged sequences. We created the knowledge base by manually analyzing different manuscripts containing genetic sequences. Our approach was evaluated using a test set of 211 full-text articles in PDF format containing 3134 genetic sequences. For this set, we achieved 87.76% precision and 97.70% recall. This method can facilitate research tasks such as text mining, information extraction, and information retrieval over large collections of documents containing genetic sequences.
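
    Since a regular expression is one convenient realization of a finite state machine, the first-stage candidate recognizer can be sketched as below. The length threshold is an illustrative choice, and the knowledge-based second stage that discards false positives is omitted.

    ```python
    import re

    # Candidate DNA/RNA runs over the nucleotide alphabet, tolerating the
    # whitespace and position numbers that typeset sequences often contain.
    CANDIDATE = re.compile(r"[ACGTUacgtu][ACGTUacgtu\s\d]*[ACGTUacgtu]")

    def extract_candidates(text, min_bases=15):
        """First-stage recognizer: return cleaned candidate sequences of at
        least min_bases nucleotides (threshold is an illustrative choice)."""
        hits = []
        for m in CANDIDATE.finditer(text):
            seq = re.sub(r"[\s\d]", "", m.group()).upper()
            if len(seq) >= min_bases:
                hits.append(seq)
        return hits

    text = "flanked by 1 ATGCGT ACGTTA 13 GGCATC AAT 25 in the clone"
    print(extract_candidates(text))   # ['ATGCGTACGTTAGGCATCAAT']
    ```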

  19. Content-based TV sports video retrieval using multimodal analysis

    NASA Astrophysics Data System (ADS)

    Yu, Yiqing; Liu, Huayong; Wang, Hongbin; Zhou, Dongru

    2003-09-01

    In this paper, we propose content-based video retrieval, i.e., retrieval based on a video's semantic content. Because video data comprises multiple information streams (visual, auditory, and textual), we describe a strategy that uses multimodal analysis to parse sports video automatically. The paper first defines the basic structure of the sports video database system, and then introduces a new approach that integrates visual stream analysis, speech recognition, speech signal processing and text extraction to realize video retrieval. Experimental results on TV footage of football games indicate that multimodal analysis supports effective retrieval, whether by quickly browsing a tree of video clips or by entering keywords within a predefined domain.
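
    As one small example of the visual-stream analysis such a system needs, the sketch below flags shot boundaries by thresholding frame-to-frame histogram differences. The histogram size and threshold are illustrative; the paper's actual parsing pipeline is not specified in the abstract.

    ```python
    import numpy as np

    def shot_boundaries(frames, bins=16, threshold=0.4):
        """Flag frame indices where the normalized gray-level histogram
        changes sharply -- a basic shot-boundary detector."""
        cuts = []
        prev = None
        for i, frame in enumerate(frames):       # frame: 2-D gray array
            hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
            hist = hist / hist.sum()
            if prev is not None and 0.5 * np.abs(hist - prev).sum() > threshold:
                cuts.append(i)
            prev = hist
        return cuts
    ```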

  20. Temporal processing of speech in a time-feature space

    NASA Astrophysics Data System (ADS)

    Avendano, Carlos

    1997-09-01

    The performance of speech communication systems often degrades under realistic environmental conditions. Adverse environmental factors include additive noise sources, room reverberation, and transmission channel distortions. This work studies the processing of speech in the temporal-feature or modulation spectrum domain, aiming to alleviate the effects of such disturbances. Speech reflects the geometry of the vocal organs, and the linguistically dominant component lies in the shape of the vocal tract. At any given point in time, the shape of the vocal tract is reflected in the short-time spectral envelope of the speech signal. The rate of change of the vocal tract shape appears to be important for the identification of linguistic components. This rate of change, i.e. the rate of change of the short-time spectral envelope, can be described by the modulation spectrum: the spectrum of the time trajectories described by the short-time spectral envelope. For a wide range of frequency bands, the modulation spectrum of speech exhibits a maximum at about 4 Hz, the average syllabic rate. Disturbances often have modulation frequency components outside the speech range, and could in principle be attenuated without significantly affecting the range with relevant linguistic information. Early efforts to exploit the modulation spectrum domain (temporal processing), such as the dynamic cepstrum or RASTA processing, used ad hoc processing designs and appear to be suboptimal. As a major contribution, in this dissertation we aim for a systematic data-driven design of temporal processing. First we analytically derive and discuss some properties and merits of temporal processing for speech signals. We attempt to formalize the concept and provide a theoretical background which has been lacking in the field. In the experimental part we apply temporal processing to a number of problems including adaptive noise reduction in cellular telephone environments, reduction of reverberation for speech enhancement, and improvements in automatic recognition of speech degraded by linear distortions and reverberation.
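
    A minimal sketch of the modulation-spectrum computation described above: take the log-energy trajectory of one frequency band across short-time frames, then take the spectrum of that trajectory. The band and frame parameters are illustrative choices.

    ```python
    import numpy as np

    def modulation_spectrum(signal, sr, band=(1000.0, 2000.0),
                            frame_s=0.025, hop_s=0.010):
        """Spectrum of one band's log-energy trajectory; speech typically
        shows a peak near 4 Hz, the average syllabic rate."""
        nfft = int(sr * frame_s)
        hop = int(sr * hop_s)
        frames = np.array([signal[i:i + nfft]
                           for i in range(0, len(signal) - nfft, hop)])
        spec = np.abs(np.fft.rfft(frames * np.hanning(nfft), axis=1))
        freqs = np.fft.rfftfreq(nfft, 1.0 / sr)
        sel = (freqs >= band[0]) & (freqs < band[1])
        envelope = np.log(spec[:, sel].mean(axis=1) + 1e-9)
        # the envelope is sampled at the frame rate (1 / hop_s Hz)
        mod = np.abs(np.fft.rfft(envelope - envelope.mean()))
        mod_freqs = np.fft.rfftfreq(len(envelope), hop_s)
        return mod_freqs, mod
    ```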
