Masked Speech Recognition and Reading Ability in School-Age Children: Is There a Relationship?
ERIC Educational Resources Information Center
Miller, Gabrielle; Lewis, Barbara; Benchek, Penelope; Buss, Emily; Calandruccio, Lauren
2018-01-01
Purpose: The relationship between reading (decoding) skills, phonological processing abilities, and masked speech recognition in typically developing children was explored. This experiment was designed to evaluate the relationship between phonological processing and decoding abilities and 2 aspects of masked speech recognition in typically…
McCreery, Ryan W; Walker, Elizabeth A; Spratford, Meredith; Oleson, Jacob; Bentler, Ruth; Holte, Lenore; Roush, Patricia
2015-01-01
Progress has been made in recent years in the provision of amplification and early intervention for children who are hard of hearing. However, children who use hearing aids (HAs) may have inconsistent access to their auditory environment due to limitations in speech audibility through their HAs or limited HA use. The effects of variability in children's auditory experience on parent-reported auditory skills questionnaires and on speech recognition in quiet and in noise were examined for a large group of children who were followed as part of the Outcomes of Children with Hearing Loss study. Parent ratings on auditory development questionnaires and children's speech recognition were assessed for 306 children who are hard of hearing. Children ranged in age from 12 months to 9 years. Three questionnaires involving parent ratings of auditory skill development and behavior were used, including the LittlEARS Auditory Questionnaire, Parents Evaluation of Oral/Aural Performance in Children rating scale, and an adaptation of the Speech, Spatial, and Qualities of Hearing scale. Speech recognition in quiet was assessed using the Open- and Closed-Set Test, Early Speech Perception test, Lexical Neighborhood Test, and Phonetically Balanced Kindergarten word lists. Speech recognition in noise was assessed using the Computer-Assisted Speech Perception Assessment. Children who are hard of hearing were compared with peers with normal hearing matched for age, maternal educational level, and nonverbal intelligence. The effects of aided audibility, HA use, and language ability on parent responses to auditory development questionnaires and on children's speech recognition were also examined. Children who are hard of hearing had poorer performance than peers with normal hearing on parent ratings of auditory skills and had poorer speech recognition. Significant individual variability among children who are hard of hearing was observed. Children with greater aided audibility through their HAs, more hours of HA use, and better language abilities generally had higher parent ratings of auditory skills and better speech-recognition abilities in quiet and in noise than peers with less audibility, more limited HA use, or poorer language abilities. In addition to the auditory and language factors that were predictive for speech recognition in quiet, phonological working memory was also a positive predictor for word recognition abilities in noise. Children who are hard of hearing continue to experience delays in auditory skill development and speech-recognition abilities compared with peers with normal hearing. However, significant improvements in these domains have occurred in comparison to similar data reported before the adoption of universal newborn hearing screening and early intervention programs for children who are hard of hearing. Increasing the audibility of speech has a direct positive effect on auditory skill development and speech-recognition abilities and also may enhance these skills by improving language abilities in children who are hard of hearing. Greater number of hours of HA use also had a significant positive impact on parent ratings of auditory skills and children's speech recognition.
Individual differences in language and working memory affect children's speech recognition in noise.
McCreery, Ryan W; Spratford, Meredith; Kirby, Benjamin; Brennan, Marc
2017-05-01
We examined how cognitive and linguistic skills affect speech recognition in noise for children with normal hearing. Children with better working memory and language abilities were expected to have better speech recognition in noise than peers with poorer skills in these domains. As part of a prospective, cross-sectional study, children with normal hearing completed speech recognition in noise for three types of stimuli: (1) monosyllabic words, (2) syntactically correct but semantically anomalous sentences and (3) semantically and syntactically anomalous word sequences. Measures of vocabulary, syntax and working memory were used to predict individual differences in speech recognition in noise. Ninety-six children with normal hearing, who were between 5 and 12 years of age. Higher working memory was associated with better speech recognition in noise for all three stimulus types. Higher vocabulary abilities were associated with better recognition in noise for sentences and word sequences, but not for words. Working memory and language both influence children's speech recognition in noise, but the relationships vary across types of stimuli. These findings suggest that clinical assessment of speech recognition is likely to reflect underlying cognitive and linguistic abilities, in addition to a child's auditory skills, consistent with the Ease of Language Understanding model.
McCreery, Ryan W.; Walker, Elizabeth A.; Spratford, Meredith; Oleson, Jacob; Bentler, Ruth; Holte, Lenore; Roush, Patricia
2015-01-01
Objectives Progress has been made in recent years in the provision of amplification and early intervention for children who are hard of hearing. However, children who use hearing aids (HA) may have inconsistent access to their auditory environment due to limitations in speech audibility through their HAs or limited HA use. The effects of variability in children’s auditory experience on parent-report auditory skills questionnaires and on speech recognition in quiet and in noise were examined for a large group of children who were followed as part of the Outcomes of Children with Hearing Loss study. Design Parent ratings on auditory development questionnaires and children’s speech recognition were assessed for 306 children who are hard of hearing. Children ranged in age from 12 months to 9 years of age. Three questionnaires involving parent ratings of auditory skill development and behavior were used, including the LittlEARS Auditory Questionnaire, Parents Evaluation of Oral/Aural Performance in Children Rating Scale, and an adaptation of the Speech, Spatial and Qualities of Hearing scale. Speech recognition in quiet was assessed using the Open and Closed set task, Early Speech Perception Test, Lexical Neighborhood Test, and Phonetically-balanced Kindergarten word lists. Speech recognition in noise was assessed using the Computer-Assisted Speech Perception Assessment. Children who are hard of hearing were compared to peers with normal hearing matched for age, maternal educational level and nonverbal intelligence. The effects of aided audibility, HA use and language ability on parent responses to auditory development questionnaires and on children’s speech recognition were also examined. Results Children who are hard of hearing had poorer performance than peers with normal hearing on parent ratings of auditory skills and had poorer speech recognition. Significant individual variability among children who are hard of hearing was observed. Children with greater aided audibility through their HAs, more hours of HA use and better language abilities generally had higher parent ratings of auditory skills and better speech recognition abilities in quiet and in noise than peers with less audibility, more limited HA use or poorer language abilities. In addition to the auditory and language factors that were predictive for speech recognition in quiet, phonological working memory was also a positive predictor for word recognition abilities in noise. Conclusions Children who are hard of hearing continue to experience delays in auditory skill development and speech recognition abilities compared to peers with normal hearing. However, significant improvements in these domains have occurred in comparison to similar data reported prior to the adoption of universal newborn hearing screening and early intervention programs for children who are hard of hearing. Increasing the audibility of speech has a direct positive effect on auditory skill development and speech recognition abilities, and may also enhance these skills by improving language abilities in children who are hard of hearing. Greater number of hours of HA use also had a significant positive impact on parent ratings of auditory skills and children’s speech recognition. PMID:26731160
Relationship between listeners' nonnative speech recognition and categorization abilities
Atagi, Eriko; Bent, Tessa
2015-01-01
Enhancement of the perceptual encoding of talker characteristics (indexical information) in speech can facilitate listeners' recognition of linguistic content. The present study explored this indexical-linguistic relationship in nonnative speech processing by examining listeners' performance on two tasks: nonnative accent categorization and nonnative speech-in-noise recognition. Results indicated substantial variability across listeners in their performance on both the accent categorization and nonnative speech recognition tasks. Moreover, listeners' accent categorization performance correlated with their nonnative speech-in-noise recognition performance. These results suggest that having more robust indexical representations for nonnative accents may allow listeners to more accurately recognize the linguistic content of nonnative speech. PMID:25618098
Nittrouer, Susan; Caldwell-Tarr, Amanda; Tarr, Eric; Lowenstein, Joanna H.; Rice, Caitlin; Moberly, Aaron C.
2014-01-01
Objective: This study examined speech recognition in noise for children with hearing loss, compared it to recognition for children with normal hearing, and examined mechanisms that might explain variance in children’s abilities to recognize speech in noise. Design: Word recognition was measured in two levels of noise, both when the speech and noise were co-located in front and when the noise came separately from one side. Four mechanisms were examined as factors possibly explaining variance: vocabulary knowledge, sensitivity to phonological structure, binaural summation, and head shadow. Study sample: Participants were 113 eight-year-old children. Forty-eight had normal hearing (NH) and 65 had hearing loss: 18 with hearing aids (HAs), 19 with one cochlear implant (CI), and 28 with two CIs. Results: Phonological sensitivity explained a significant amount of between-groups variance in speech-in-noise recognition. Little evidence of binaural summation was found. Head shadow was similar in magnitude for children with NH and with CIs, regardless of whether they wore one or two CIs. Children with HAs showed reduced head shadow effects. Conclusion: These outcomes suggest that in order to improve speech-in-noise recognition for children with hearing loss, intervention needs to be comprehensive, focusing on both language abilities and auditory mechanisms. PMID:23834373
How linguistic closure and verbal working memory relate to speech recognition in noise--a review.
Besser, Jana; Koelewijn, Thomas; Zekveld, Adriana A; Kramer, Sophia E; Festen, Joost M
2013-06-01
The ability to recognize masked speech, commonly measured with a speech reception threshold (SRT) test, is associated with cognitive processing abilities. Two cognitive factors frequently assessed in speech recognition research are the capacity of working memory (WM), measured by means of a reading span (Rspan) or listening span (Lspan) test, and the ability to read masked text (linguistic closure), measured by the text reception threshold (TRT). The current article provides a review of recent hearing research that examined the relationship of TRT and WM span to SRTs in various maskers. Furthermore, modality differences in WM capacity assessed with the Rspan compared to the Lspan test were examined and related to speech recognition abilities in an experimental study with young adults with normal hearing (NH). Span scores were strongly associated with each other, but were higher in the auditory modality. The results of the reviewed studies suggest that TRT and WM span are related to each other, but differ in their relationships with SRT performance. In NH adults of middle age or older, both TRT and Rspan were associated with SRTs in speech maskers, whereas TRT better predicted speech recognition in fluctuating nonspeech maskers. The associations with SRTs in steady-state noise were inconclusive for both measures. WM span was positively related to benefit from contextual information in speech recognition, but better TRTs related to less interference from unrelated cues. Data for individuals with impaired hearing are limited, but larger WM span seems to give a general advantage in various listening situations.
How Linguistic Closure and Verbal Working Memory Relate to Speech Recognition in Noise—A Review
Koelewijn, Thomas; Zekveld, Adriana A.; Kramer, Sophia E.; Festen, Joost M.
2013-01-01
The ability to recognize masked speech, commonly measured with a speech reception threshold (SRT) test, is associated with cognitive processing abilities. Two cognitive factors frequently assessed in speech recognition research are the capacity of working memory (WM), measured by means of a reading span (Rspan) or listening span (Lspan) test, and the ability to read masked text (linguistic closure), measured by the text reception threshold (TRT). The current article provides a review of recent hearing research that examined the relationship of TRT and WM span to SRTs in various maskers. Furthermore, modality differences in WM capacity assessed with the Rspan compared to the Lspan test were examined and related to speech recognition abilities in an experimental study with young adults with normal hearing (NH). Span scores were strongly associated with each other, but were higher in the auditory modality. The results of the reviewed studies suggest that TRT and WM span are related to each other, but differ in their relationships with SRT performance. In NH adults of middle age or older, both TRT and Rspan were associated with SRTs in speech maskers, whereas TRT better predicted speech recognition in fluctuating nonspeech maskers. The associations with SRTs in steady-state noise were inconclusive for both measures. WM span was positively related to benefit from contextual information in speech recognition, but better TRTs related to less interference from unrelated cues. Data for individuals with impaired hearing are limited, but larger WM span seems to give a general advantage in various listening situations. PMID:23945955
The Effect of Lexical Content on Dichotic Speech Recognition in Older Adults.
Findlen, Ursula M; Roup, Christina M
2016-01-01
Age-related auditory processing deficits have been shown to negatively affect speech recognition for older adult listeners. In contrast, older adults gain benefit from their ability to make use of semantic and lexical content of the speech signal (i.e., top-down processing), particularly in complex listening situations. Assessment of auditory processing abilities among aging adults should take into consideration semantic and lexical content of the speech signal. The purpose of this study was to examine the effects of lexical and attentional factors on dichotic speech recognition performance characteristics for older adult listeners. A repeated measures design was used to examine differences in dichotic word recognition as a function of lexical and attentional factors. Thirty-five older adults (61-85 yr) with sensorineural hearing loss participated in this study. Dichotic speech recognition was evaluated using consonant-vowel-consonant (CVC) word and nonsense CVC syllable stimuli administered in the free recall, directed recall right, and directed recall left response conditions. Dichotic speech recognition performance for nonsense CVC syllables was significantly poorer than performance for CVC words. Dichotic recognition performance varied across response condition for both stimulus types, which is consistent with previous studies on dichotic speech recognition. Inspection of individual results revealed that five listeners demonstrated an auditory-based left ear deficit for one or both stimulus types. Lexical content of stimulus materials affects performance characteristics for dichotic speech recognition tasks in the older adult population. The use of nonsense CVC syllable material may provide a way to assess dichotic speech recognition performance while potentially lessening the effects of lexical content on performance (i.e., measuring bottom-up auditory function both with and without top-down processing). American Academy of Audiology.
Stenbäck, Victoria; Hällgren, Mathias; Lyxell, Björn; Larsby, Birgitta
2015-06-01
Cognitive functions and speech-recognition-in-noise were evaluated with a cognitive test battery, assessing response inhibition using the Hayling task, working memory capacity (WMC) and verbal information processing, and an auditory test of speech recognition. The cognitive tests were performed in silence whereas the speech recognition task was presented in noise. Thirty young normally-hearing individuals participated in the study. The aim of the study was to investigate one executive function, response inhibition, and whether it is related to individual working memory capacity (WMC), and how speech-recognition-in-noise relates to WMC and inhibitory control. The results showed a significant difference between initiation and response inhibition, suggesting that the Hayling task taps cognitive activity responsible for executive control. Our findings also suggest that high verbal ability was associated with better performance in the Hayling task. We also present findings suggesting that individuals who perform well on tasks involving response inhibition, and WMC, also perform well on a speech-in-noise task. Our findings indicate that capacity to resist semantic interference can be used to predict performance on speech-in-noise tasks. © 2015 Scandinavian Psychological Associations and John Wiley & Sons Ltd.
Speech Perception in Noise by Children With Cochlear Implants
Caldwell, Amanda; Nittrouer, Susan
2013-01-01
Purpose Common wisdom suggests that listening in noise poses disproportionately greater difficulty for listeners with cochlear implants (CIs) than for peers with normal hearing (NH). The purpose of this study was to examine phonological, language, and cognitive skills that might help explain speech-in-noise abilities for children with CIs. Method Three groups of kindergartners (NH, hearing aid wearers, and CI users) were tested on speech recognition in quiet and noise and on tasks thought to underlie the abilities that fit into the domains of phonological awareness, general language, and cognitive skills. These last measures were used as predictor variables in regression analyses with speech-in-noise scores as dependent variables. Results Compared to children with NH, children with CIs did not perform as well on speech recognition in noise or on most other measures, including recognition in quiet. Two surprising results were that (a) noise effects were consistent across groups and (b) scores on other measures did not explain any group differences in speech recognition. Conclusions Limitations of implant processing take their primary toll on recognition in quiet and account for poor speech recognition and language/phonological deficits in children with CIs. Implications are that teachers/clinicians need to teach language/phonology directly and maximize signal-to-noise levels in the classroom. PMID:22744138
NASA Astrophysics Data System (ADS)
Kayasith, Prakasith; Theeramunkong, Thanaruk
It is a tedious and subjective task to measure severity of a dysarthria by manually evaluating his/her speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess speech quality of a dysarthric speaker with cerebral palsy. With the consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce consistent speech signal for a certain word and distinguished speech signal for different words. As an application, it can be used to assess speech quality and forecast speech recognition rate of speech made by an individual dysarthric speaker before actual exhaustive implementation of an automatic speech recognition system for the speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations had been done by comparing its predicted recognition rates with ones predicted by the standard methods called the articulatory and intelligibility tests based on the two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting recognition rate of dysarthric speech. All experiments had been done on speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.
Using Automatic Speech Recognition Technology with Elicited Oral Response Testing
ERIC Educational Resources Information Center
Cox, Troy L.; Davies, Randall S.
2012-01-01
This study examined the use of automatic speech recognition (ASR) scored elicited oral response (EOR) tests to assess the speaking ability of English language learners. It also examined the relationship between ASR-scored EOR and other language proficiency measures and the ability of the ASR to rate speakers without bias to gender or native…
Smits, Cas; Merkus, Paul; Festen, Joost M.; Goverts, S. Theo
2017-01-01
Not all of the variance in speech-recognition performance of cochlear implant (CI) users can be explained by biographic and auditory factors. In normal-hearing listeners, linguistic and cognitive factors determine most of speech-in-noise performance. The current study explored specifically the influence of visually measured lexical-access ability compared with other cognitive factors on speech recognition of 24 postlingually deafened CI users. Speech-recognition performance was measured with monosyllables in quiet (consonant-vowel-consonant [CVC]), sentences-in-noise (SIN), and digit-triplets in noise (DIN). In addition to a composite variable of lexical-access ability (LA), measured with a lexical-decision test (LDT) and word-naming task, vocabulary size, working-memory capacity (Reading Span test [RSpan]), and a visual analogue of the SIN test (text reception threshold test) were measured. The DIN test was used to correct for auditory factors in SIN thresholds by taking the difference between SIN and DIN: SRTdiff. Correlation analyses revealed that duration of hearing loss (dHL) was related to SIN thresholds. Better working-memory capacity was related to SIN and SRTdiff scores. LDT reaction time was positively correlated with SRTdiff scores. No significant relationships were found for CVC or DIN scores with the predictor variables. Regression analyses showed that together with dHL, RSpan explained 55% of the variance in SIN thresholds. When controlling for auditory performance, LA, LDT, and RSpan separately explained, together with dHL, respectively 37%, 36%, and 46% of the variance in SRTdiff outcome. The results suggest that poor verbal working-memory capacity and to a lesser extent poor lexical-access ability limit speech-recognition ability in listeners with a CI. PMID:29205095
Nuesse, Theresa; Steenken, Rike; Neher, Tobias; Holube, Inga
2018-01-01
Elderly listeners are known to differ considerably in their ability to understand speech in noise. Several studies have addressed the underlying factors that contribute to these differences. These factors include audibility, and age-related changes in supra-threshold auditory processing abilities, and it has been suggested that differences in cognitive abilities may also be important. The objective of this study was to investigate associations between performance in cognitive tasks and speech recognition under different listening conditions in older adults with either age appropriate hearing or hearing-impairment. To that end, speech recognition threshold (SRT) measurements were performed under several masking conditions that varied along the perceptual dimensions of dip listening, spatial separation, and informational masking. In addition, a neuropsychological test battery was administered, which included measures of verbal working and short-term memory, executive functioning, selective and divided attention, and lexical and semantic abilities. Age-matched groups of older adults with either age-appropriate hearing (ENH, n = 20) or aided hearing impairment (EHI, n = 21) participated. In repeated linear regression analyses, composite scores of cognitive test outcomes (evaluated using PCA) were included to predict SRTs. These associations were different for the two groups. When hearing thresholds were controlled for, composed cognitive factors were significantly associated with the SRTs for the ENH listeners. Whereas better lexical and semantic abilities were associated with lower (better) SRTs in this group, there was a negative association between attentional abilities and speech recognition in the presence of spatially separated speech-like maskers. For the EHI group, the pure-tone thresholds (averaged across 0.5, 1, 2, and 4 kHz) were significantly associated with the SRTs, despite the fact that all signals were amplified and therefore in principle audible. PMID:29867654
Buss, Emily; Leibold, Lori J.; Porter, Heather L.; Grose, John H.
2017-01-01
Children perform more poorly than adults on a wide range of masked speech perception paradigms, but this effect is particularly pronounced when the masker itself is also composed of speech. The present study evaluated two factors that might contribute to this effect: the ability to perceptually isolate the target from masker speech, and the ability to recognize target speech based on sparse cues (glimpsing). Speech reception thresholds (SRTs) were estimated for closed-set, disyllabic word recognition in children (5–16 years) and adults in a one- or two-talker masker. Speech maskers were 60 dB sound pressure level (SPL), and they were either presented alone or in combination with a 50-dB-SPL speech-shaped noise masker. There was an age effect overall, but performance was adult-like at a younger age for the one-talker than the two-talker masker. Noise tended to elevate SRTs, particularly for older children and adults, and when summed with the one-talker masker. Removing time-frequency epochs associated with a poor target-to-masker ratio markedly improved SRTs, with larger effects for younger listeners; the age effect was not eliminated, however. Results were interpreted as indicating that development of speech-in-speech recognition is likely impacted by development of both perceptual masking and the ability recognize speech based on sparse cues. PMID:28464682
ERIC Educational Resources Information Center
Chen, Howard Hao-Jan
2011-01-01
Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…
Stam, Mariska; Smits, Cas; Twisk, Jos W R; Lemke, Ulrike; Festen, Joost M; Kramer, Sophia E
2015-01-01
The first aim of the present study was to determine the change in speech recognition in noise over a period of 5 years in participants ages 18 to 70 years at baseline. The second aim was to investigate whether age, gender, educational level, the level of initial speech recognition in noise, and reported chronic conditions were associated with a change in speech recognition in noise. The baseline and 5-year follow-up data of 427 participants with and without hearing impairment participating in the National Longitudinal Study on Hearing (NL-SH) were analyzed. The ability to recognize speech in noise was measured twice with the online National Hearing Test, a digit-triplet speech-in-noise test. Speech-reception-threshold in noise (SRTn) scores were calculated, corresponding to 50% speech intelligibility. Unaided SRTn scores obtained with the same transducer (headphones or loudspeakers) at both test moments were included. Changes in SRTn were calculated as a raw shift (T1 - T0) and an adjusted shift for regression towards the mean. Paired t tests and multivariable linear regression analyses were applied. The mean increase (i.e., deterioration) in SRTn was 0.38-dB signal-to-noise ratio (SNR) over 5 years (p < 0.001). Results of the multivariable regression analyses showed that the age group of 50 to 59 years had a significantly larger deterioration in SRTn compared with the age group of 18 to 39 years (raw shift: beta: 0.64-dB SNR; 95% confidence interval: 0.07-1.22; p = 0.028, adjusted for initial speech recognition level - adjusted shift: beta: 0.82-dB SNR; 95% confidence interval: 0.27-1.34; p = 0.004). Gender, educational level, and the number of chronic conditions were not associated with a change in SRTn over time. No significant differences in increase of SRTn were found between the initial levels of speech recognition (i.e., good, insufficient, or poor) when taking into account the phenomenon regression towards the mean. The study results indicate that hearing deterioration of speech recognition in noise over 5 years can also be detected in adults ages 18 to 70 years. This rather small numeric change might represent a relevant impact on an individual's ability to understand speech in everyday life.
Schelinski, Stefanie; Riedel, Philipp; von Kriegstein, Katharina
2014-12-01
In auditory-only conditions, for example when we listen to someone on the phone, it is essential to fast and accurately recognize what is said (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned by a video showing their face and three others were learned in a matched control condition without face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, performance in speech recognition was improved for speakers known by face in comparison to speakers learned in the matched control condition without face. The ASD group lacked such a performance benefit. For the ASD group auditory-only speech recognition was even worse for speakers known by face compared to speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group independent of whether the speakers were learned with or without face. Two additional visual experiments showed that the ASD group performed worse in lip-reading whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms. Further, they indicate that in ASD, speaker-specific dynamic visual information is not available to optimize auditory-only speech recognition. Copyright © 2014 Elsevier Ltd. All rights reserved.
Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise.
Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther
2016-01-01
Vocabulary size has been suggested as a useful measure of "verbal abilities" that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18-35 years) and 22 older (60-78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults' poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access; with an average vocabulary size similar to that of younger adults, they were still slower in lexical access.
Erb, Julia; Ludwig, Alexandra Annemarie; Kunke, Dunja; Fuchs, Michael; Obleser, Jonas
2018-04-24
Psychoacoustic tests assessed shortly after cochlear implantation are useful predictors of the rehabilitative speech outcome. While largely independent, both spectral and temporal resolution tests are important to provide an accurate prediction of speech recognition. However, rapid tests of temporal sensitivity are currently lacking. Here, we propose a simple amplitude modulation rate discrimination (AMRD) paradigm that is validated by predicting future speech recognition in adult cochlear implant (CI) patients. In 34 newly implanted patients, we used an adaptive AMRD paradigm, where broadband noise was modulated at the speech-relevant rate of ~4 Hz. In a longitudinal study, speech recognition in quiet was assessed using the closed-set Freiburger number test shortly after cochlear implantation (t0) as well as the open-set Freiburger monosyllabic word test 6 months later (t6). Both AMRD thresholds at t0 (r = -0.51) and speech recognition scores at t0 (r = 0.56) predicted speech recognition scores at t6. However, AMRD and speech recognition at t0 were uncorrelated, suggesting that those measures capture partially distinct perceptual abilities. A multiple regression model predicting 6-month speech recognition outcome with deafness duration and speech recognition at t0 improved from adjusted R = 0.30 to adjusted R = 0.44 when AMRD threshold was added as a predictor. These findings identify AMRD thresholds as a reliable, nonredundant predictor above and beyond established speech tests for CI outcome. This AMRD test could potentially be developed into a rapid clinical temporal-resolution test to be integrated into the postoperative test battery to improve the reliability of speech outcome prognosis.
ERIC Educational Resources Information Center
Harris, Richard W.; And Others
1988-01-01
A two-microphone adaptive digital noise cancellation technique improved word-recognition ability for 20 normal and 12 hearing-impaired adults by reducing multitalker speech babble and speech spectrum noise 18-22 dB. Word recognition improvements averaged 37-50 percent for normal and 27-40 percent for hearing-impaired subjects. Improvement was best…
Sequential Bilateral Cochlear Implantation in a Patient with Bilateral Meniere’s Disease
Holden, Laura K.; Neely, J. Gail; Gotter, Brenda D.; Mispagel, Karen M.; Firszt, Jill B.
2012-01-01
This case study describes a 45 year old female with bilateral, profound sensorineural hearing loss due to Meniere’s disease. She received her first cochlear implant in the right ear in 2008 and the second cochlear implant in the left ear in 2010. The case study examines the enhancement to speech recognition, particularly in noise, provided by bilateral cochlear implants. Speech recognition tests were administered prior to obtaining the second implant and at a number of test intervals following activation of the second device. Speech recognition in quiet and noise as well as localization abilities were assessed in several conditions to determine bilateral benefit and performance differences between ears. The results of the speech recognition testing indicated a substantial improvement in the patient’s ability to understand speech in noise and her ability to localize sound when using bilateral cochlear implants compared to using a unilateral implant or an implant and a hearing aid. In addition, the patient reported considerable improvement in her ability to communicate in daily life when using bilateral implants versus a unilateral implant. This case suggests that cochlear implantation is a viable option for patients who have lost their hearing to Meniere’s disease even when a number of medical treatments and surgical interventions have been performed to control vertigo. In the case presented, bilateral cochlear implantation was necessary for this patient to communicate successfully at home and at work. PMID:22463939
An articulatorily constrained, maximum entropy approach to speech recognition and speech coding
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hogden, J.
Hidden Markov models (HMM`s) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMM`s typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMM`s better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values aremore » constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMM`s, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMM`s. This will allow him to highlight the similarities and differences between HMM`s and the proposed technique.« less
Second Language Ability and Emotional Prosody Perception
Bhatara, Anjali; Laukka, Petri; Boll-Avetisyan, Natalie; Granjon, Lionel; Anger Elfenbein, Hillary; Bänziger, Tanja
2016-01-01
The present study examines the effect of language experience on vocal emotion perception in a second language. Native speakers of French with varying levels of self-reported English ability were asked to identify emotions from vocal expressions produced by American actors in a forced-choice task, and to rate their pleasantness, power, alertness and intensity on continuous scales. Stimuli included emotionally expressive English speech (emotional prosody) and non-linguistic vocalizations (affect bursts), and a baseline condition with Swiss-French pseudo-speech. Results revealed effects of English ability on the recognition of emotions in English speech but not in non-linguistic vocalizations. Specifically, higher English ability was associated with less accurate identification of positive emotions, but not with the interpretation of negative emotions. Moreover, higher English ability was associated with lower ratings of pleasantness and power, again only for emotional prosody. This suggests that second language skills may sometimes interfere with emotion recognition from speech prosody, particularly for positive emotions. PMID:27253326
ERIC Educational Resources Information Center
McMurray, Bob; Munson, Cheyenne; Tomblin, J. Bruce
2014-01-01
Purpose: The authors examined speech perception deficits associated with individual differences in language ability, contrasting auditory, phonological, or lexical accounts by asking whether lexical competition is differentially sensitive to fine-grained acoustic variation. Method: Adolescents with a range of language abilities (N = 74, including…
ERIC Educational Resources Information Center
Besser, Jana; Zekveld, Adriana A.; Kramer, Sophia E.; Ronnberg, Jerker; Festen, Joost M.
2012-01-01
Purpose: In this research, the authors aimed to increase the analogy between Text Reception Threshold (TRT; Zekveld, George, Kramer, Goverts, & Houtgast, 2007) and Speech Reception Threshold (SRT; Plomp & Mimpen, 1979) and to examine the TRT's value in estimating cognitive abilities that are important for speech comprehension in noise. Method: The…
Roman, Adrienne S; Pisoni, David B; Kronenberger, William G; Faulkner, Kathleen F
Noise-vocoded speech is a valuable research tool for testing experimental hypotheses about the effects of spectral degradation on speech recognition in adults with normal hearing (NH). However, very little research has utilized noise-vocoded speech with children with NH. Earlier studies with children with NH focused primarily on the amount of spectral information needed for speech recognition without assessing the contribution of neurocognitive processes to speech perception and spoken word recognition. In this study, we first replicated the seminal findings reported by ) who investigated effects of lexical density and word frequency on noise-vocoded speech perception in a small group of children with NH. We then extended the research to investigate relations between noise-vocoded speech recognition abilities and five neurocognitive measures: auditory attention (AA) and response set, talker discrimination, and verbal and nonverbal short-term working memory. Thirty-one children with NH between 5 and 13 years of age were assessed on their ability to perceive lexically controlled words in isolation and in sentences that were noise-vocoded to four spectral channels. Children were also administered vocabulary assessments (Peabody Picture Vocabulary test-4th Edition and Expressive Vocabulary test-2nd Edition) and measures of AA (NEPSY AA and response set and a talker discrimination task) and short-term memory (visual digit and symbol spans). Consistent with the findings reported in the original ) study, we found that children perceived noise-vocoded lexically easy words better than lexically hard words. Words in sentences were also recognized better than the same words presented in isolation. No significant correlations were observed between noise-vocoded speech recognition scores and the Peabody Picture Vocabulary test-4th Edition using language quotients to control for age effects. However, children who scored higher on the Expressive Vocabulary test-2nd Edition recognized lexically easy words better than lexically hard words in sentences. Older children perceived noise-vocoded speech better than younger children. Finally, we found that measures of AA and short-term memory capacity were significantly correlated with a child's ability to perceive noise-vocoded isolated words and sentences. First, we successfully replicated the major findings from the ) study. Because familiarity, phonological distinctiveness and lexical competition affect word recognition, these findings provide additional support for the proposal that several foundational elementary neurocognitive processes underlie the perception of spectrally degraded speech. Second, we found strong and significant correlations between performance on neurocognitive measures and children's ability to recognize words and sentences noise-vocoded to four spectral channels. These findings extend earlier research suggesting that perception of spectrally degraded speech reflects early peripheral auditory processes, as well as additional contributions of executive function, specifically, selective attention and short-term memory processes in spoken word recognition. The present findings suggest that AA and short-term memory support robust spoken word recognition in children with NH even under compromised and challenging listening conditions. These results are relevant to research carried out with listeners who have hearing loss, because they are routinely required to encode, process, and understand spectrally degraded acoustic signals.
Roman, Adrienne S.; Pisoni, David B.; Kronenberger, William G.; Faulkner, Kathleen F.
2016-01-01
Objectives Noise-vocoded speech is a valuable research tool for testing experimental hypotheses about the effects of spectral-degradation on speech recognition in adults with normal hearing (NH). However, very little research has utilized noise-vocoded speech with children with NH. Earlier studies with children with NH focused primarily on the amount of spectral information needed for speech recognition without assessing the contribution of neurocognitive processes to speech perception and spoken word recognition. In this study, we first replicated the seminal findings reported by Eisenberg et al. (2002) who investigated effects of lexical density and word frequency on noise-vocoded speech perception in a small group of children with NH. We then extended the research to investigate relations between noise-vocoded speech recognition abilities and five neurocognitive measures: auditory attention and response set, talker discrimination and verbal and nonverbal short-term working memory. Design Thirty-one children with NH between 5 and 13 years of age were assessed on their ability to perceive lexically controlled words in isolation and in sentences that were noise-vocoded to four spectral channels. Children were also administered vocabulary assessments (PPVT-4 and EVT-2) and measures of auditory attention (NEPSY Auditory Attention (AA) and Response Set (RS) and a talker discrimination task (TD)) and short-term memory (visual digit and symbol spans). Results Consistent with the findings reported in the original Eisenberg et al. (2002) study, we found that children perceived noise-vocoded lexically easy words better than lexically hard words. Words in sentences were also recognized better than the same words presented in isolation. No significant correlations were observed between noise-vocoded speech recognition scores and the PPVT-4 using language quotients to control for age effects. However, children who scored higher on the EVT-2 recognized lexically easy words better than lexically hard words in sentences. Older children perceived noise-vocoded speech better than younger children. Finally, we found that measures of auditory attention and short-term memory capacity were significantly correlated with a child’s ability to perceive noise-vocoded isolated words and sentences. Conclusions First, we successfully replicated the major findings from the Eisenberg et al. (2002) study. Because familiarity, phonological distinctiveness and lexical competition affect word recognition, these findings provide additional support for the proposal that several foundational elementary neurocognitive processes underlie the perception of spectrally-degraded speech. Second, we found strong and significant correlations between performance on neurocognitive measures and children’s ability to recognize words and sentences noise-vocoded to four spectral channels. These findings extend earlier research suggesting that perception of spectrally-degraded speech reflects early peripheral auditory processes as well as additional contributions of executive function, specifically, selective attention and short-term memory processes in spoken word recognition. The present findings suggest that auditory attention and short-term memory support robust spoken word recognition in children with NH even under compromised and challenging listening conditions. These results are relevant to research carried out with listeners who have hearing loss, since they are routinely required to encode, process and understand spectrally-degraded acoustic signals. PMID:28045787
Davies-Venn, Evelyn; Nelson, Peggy; Souza, Pamela
2015-01-01
Some listeners with hearing loss show poor speech recognition scores in spite of using amplification that optimizes audibility. Beyond audibility, studies have suggested that suprathreshold abilities such as spectral and temporal processing may explain differences in amplified speech recognition scores. A variety of different methods has been used to measure spectral processing. However, the relationship between spectral processing and speech recognition is still inconclusive. This study evaluated the relationship between spectral processing and speech recognition in listeners with normal hearing and with hearing loss. Narrowband spectral resolution was assessed using auditory filter bandwidths estimated from simultaneous notched-noise masking. Broadband spectral processing was measured using the spectral ripple discrimination (SRD) task and the spectral ripple depth detection (SMD) task. Three different measures were used to assess unamplified and amplified speech recognition in quiet and noise. Stepwise multiple linear regression revealed that SMD at 2.0 cycles per octave (cpo) significantly predicted speech scores for amplified and unamplified speech in quiet and noise. Commonality analyses revealed that SMD at 2.0 cpo combined with SRD and equivalent rectangular bandwidth measures to explain most of the variance captured by the regression model. Results suggest that SMD and SRD may be promising clinical tools for diagnostic evaluation and predicting amplification outcomes. PMID:26233047
Davies-Venn, Evelyn; Nelson, Peggy; Souza, Pamela
2015-07-01
Some listeners with hearing loss show poor speech recognition scores in spite of using amplification that optimizes audibility. Beyond audibility, studies have suggested that suprathreshold abilities such as spectral and temporal processing may explain differences in amplified speech recognition scores. A variety of different methods has been used to measure spectral processing. However, the relationship between spectral processing and speech recognition is still inconclusive. This study evaluated the relationship between spectral processing and speech recognition in listeners with normal hearing and with hearing loss. Narrowband spectral resolution was assessed using auditory filter bandwidths estimated from simultaneous notched-noise masking. Broadband spectral processing was measured using the spectral ripple discrimination (SRD) task and the spectral ripple depth detection (SMD) task. Three different measures were used to assess unamplified and amplified speech recognition in quiet and noise. Stepwise multiple linear regression revealed that SMD at 2.0 cycles per octave (cpo) significantly predicted speech scores for amplified and unamplified speech in quiet and noise. Commonality analyses revealed that SMD at 2.0 cpo combined with SRD and equivalent rectangular bandwidth measures to explain most of the variance captured by the regression model. Results suggest that SMD and SRD may be promising clinical tools for diagnostic evaluation and predicting amplification outcomes.
Auditory Training with Multiple Talkers and Passage-Based Semantic Cohesion
ERIC Educational Resources Information Center
Casserly, Elizabeth D.; Barney, Erin C.
2017-01-01
Purpose: Current auditory training methods typically result in improvements to speech recognition abilities in quiet, but learner gains may not extend to other domains in speech (e.g., recognition in noise) or self-assessed benefit. This study examined the potential of training involving multiple talkers and training emphasizing discourse-level…
Preschoolers Benefit From Visually Salient Speech Cues
Holt, Rachael Frush
2015-01-01
Purpose This study explored visual speech influence in preschoolers using 3 developmentally appropriate tasks that vary in perceptual difficulty and task demands. They also examined developmental differences in the ability to use visually salient speech cues and visual phonological knowledge. Method Twelve adults and 27 typically developing 3- and 4-year-old children completed 3 audiovisual (AV) speech integration tasks: matching, discrimination, and recognition. The authors compared AV benefit for visually salient and less visually salient speech discrimination contrasts and assessed the visual saliency of consonant confusions in auditory-only and AV word recognition. Results Four-year-olds and adults demonstrated visual influence on all measures. Three-year-olds demonstrated visual influence on speech discrimination and recognition measures. All groups demonstrated greater AV benefit for the visually salient discrimination contrasts. AV recognition benefit in 4-year-olds and adults depended on the visual saliency of speech sounds. Conclusions Preschoolers can demonstrate AV speech integration. Their AV benefit results from efficient use of visually salient speech cues. Four-year-olds, but not 3-year-olds, used visual phonological knowledge to take advantage of visually salient speech cues, suggesting possible developmental differences in the mechanisms of AV benefit. PMID:25322336
Automatic Speech Recognition from Neural Signals: A Focused Review.
Herff, Christian; Schultz, Tanja
2016-01-01
Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to either loud environments, bothering bystanders or incapabilities to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable to not speak but to simply envision oneself to say words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefor better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques used from neural signals, we discuss the Brain-to-text system.
Qi, Beier; Liu, Bo; Liu, Sha; Liu, Haihong; Dong, Ruijuan; Zhang, Ning; Gong, Shusheng
2011-05-01
To study the effect of cochlear electrode coverage and different insertion region on speech recognition, especially tone perception of cochlear implant users whose native language is Mandarin Chinese. Setting seven test conditions by fitting software. All conditions were created by switching on/off respective channels in order to simulate different insertion position. Then Mandarin CI users received 4 Speech tests, including Vowel Identification test, Consonant Identification test, Tone Identification test-male speaker, Mandarin HINT test (SRS) in quiet and noise. To all test conditions: the average score of vowel identification was significantly different, from 56% to 91% (Rank sum test, P < 0.05). The average score of consonant identification was significantly different, from 72% to 85% (ANOVNA, P < 0.05). The average score of Tone identification was not significantly different (ANOVNA, P > 0.05). However the more channels activated, the higher scores obtained, from 68% to 81%. This study shows that there is a correlation between insertion depth and speech recognition. Because all parts of the basement membrane can help CI users to improve their speech recognition ability, it is very important to enhance verbal communication ability and social interaction ability of CI users by increasing insertion depth and actively stimulating the top region of cochlear.
The NTID speech recognition test: NSRT(®).
Bochner, Joseph H; Garrison, Wayne M; Doherty, Karen A
2015-07-01
The purpose of this study was to collect and analyse data necessary for expansion of the NSRT item pool and to evaluate the NSRT adaptive testing software. Participants were administered pure-tone and speech recognition tests including W-22 and QuickSIN, as well as a set of 323 new NSRT items and NSRT adaptive tests in quiet and background noise. Performance on the adaptive tests was compared to pure-tone thresholds and performance on other speech recognition measures. The 323 new items were subjected to Rasch scaling analysis. Seventy adults with mild to moderately severe hearing loss participated in this study. Their mean age was 62.4 years (sd = 20.8). The 323 new NSRT items fit very well with the original item bank, enabling the item pool to be more than doubled in size. Data indicate high reliability coefficients for the NSRT and moderate correlations with pure-tone thresholds (PTA and HFPTA) and other speech recognition measures (W-22, QuickSIN, and SRT). The adaptive NSRT is an efficient and effective measure of speech recognition, providing valid and reliable information concerning respondents' speech perception abilities.
Investigating an Innovative Computer Application to Improve L2 Word Recognition from Speech
ERIC Educational Resources Information Center
Matthews, Joshua; O'Toole, John Mitchell
2015-01-01
The ability to recognise words from the aural modality is a critical aspect of successful second language (L2) listening comprehension. However, little research has been reported on computer-mediated development of L2 word recognition from speech in L2 learning contexts. This report describes the development of an innovative computer application…
Speech Recognition in Noise by Children with and without Dyslexia: How is it Related to Reading?
Nittrouer, Susan; Krieg, Letitia M; Lowenstein, Joanna H
2018-06-01
Developmental dyslexia is commonly viewed as a phonological deficit that makes it difficult to decode written language. But children with dyslexia typically exhibit other problems, as well, including poor speech recognition in noise. The purpose of this study was to examine whether the speech-in-noise problems of children with dyslexia are related to their reading problems, and if so, if a common underlying factor might explain both. The specific hypothesis examined was that a spectral processing disorder results in these children receiving smeared signals, which could explain both the diminished sensitivity to phonological structure - leading to reading problems - and the speech recognition in noise difficulties. The alternative hypothesis tested in this study was that children with dyslexia simply have broadly based language deficits. Ninety-seven children between the ages of 7 years; 10 months and 12 years; 9 months participated: 46 with dyslexia and 51 without dyslexia. Children were tested on two dependent measures: word reading and recognition in noise with two types of sentence materials: as unprocessed (UP) signals, and as spectrally smeared (SM) signals. Data were collected for four predictor variables: phonological awareness, vocabulary, grammatical knowledge, and digit span. Children with dyslexia showed deficits on both dependent and all predictor variables. Their scores for speech recognition in noise were poorer than those of children without dyslexia for both the UP and SM signals, but by equivalent amounts across signal conditions indicating that they were not disproportionately hindered by spectral distortion. Correlation analyses on scores from children with dyslexia showed that reading ability and speech-in-noise recognition were only mildly correlated, and each skill was related to different underlying abilities. No substantial evidence was found to support the suggestion that the reading and speech recognition in noise problems of children with dyslexia arise from a single factor that could be defined as a spectral processing disorder. The reading and speech recognition in noise deficits of these children appeared to be largely independent. Copyright © 2018 Elsevier Ltd. All rights reserved.
Random Deep Belief Networks for Recognizing Emotions from Speech Signals.
Wen, Guihua; Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang
2017-01-01
Now the human emotions can be recognized from speech signals using machine learning methods; however, they are challenged by the lower recognition accuracies in real applications due to lack of the rich representation ability. Deep belief networks (DBN) can automatically discover the multiple levels of representations in speech signals. To make full of its advantages, this paper presents an ensemble of random deep belief networks (RDBN) method for speech emotion recognition. It firstly extracts the low level features of the input speech signal and then applies them to construct lots of random subspaces. Each random subspace is then provided for DBN to yield the higher level features as the input of the classifier to output an emotion label. All outputted emotion labels are then fused through the majority voting to decide the final emotion label for the input speech signal. The conducted experimental results on benchmark speech emotion databases show that RDBN has better accuracy than the compared methods for speech emotion recognition.
Random Deep Belief Networks for Recognizing Emotions from Speech Signals
Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang
2017-01-01
Now the human emotions can be recognized from speech signals using machine learning methods; however, they are challenged by the lower recognition accuracies in real applications due to lack of the rich representation ability. Deep belief networks (DBN) can automatically discover the multiple levels of representations in speech signals. To make full of its advantages, this paper presents an ensemble of random deep belief networks (RDBN) method for speech emotion recognition. It firstly extracts the low level features of the input speech signal and then applies them to construct lots of random subspaces. Each random subspace is then provided for DBN to yield the higher level features as the input of the classifier to output an emotion label. All outputted emotion labels are then fused through the majority voting to decide the final emotion label for the input speech signal. The conducted experimental results on benchmark speech emotion databases show that RDBN has better accuracy than the compared methods for speech emotion recognition. PMID:28356908
Speech recognition for embedded automatic positioner for laparoscope
NASA Astrophysics Data System (ADS)
Chen, Xiaodong; Yin, Qingyun; Wang, Yi; Yu, Daoyin
2014-07-01
In this paper a novel speech recognition methodology based on Hidden Markov Model (HMM) is proposed for embedded Automatic Positioner for Laparoscope (APL), which includes a fixed point ARM processor as the core. The APL system is designed to assist the doctor in laparoscopic surgery, by implementing the specific doctor's vocal control to the laparoscope. Real-time respond to the voice commands asks for more efficient speech recognition algorithm for the APL. In order to reduce computation cost without significant loss in recognition accuracy, both arithmetic and algorithmic optimizations are applied in the method presented. First, depending on arithmetic optimizations most, a fixed point frontend for speech feature analysis is built according to the ARM processor's character. Then the fast likelihood computation algorithm is used to reduce computational complexity of the HMM-based recognition algorithm. The experimental results show that, the method shortens the recognition time within 0.5s, while the accuracy higher than 99%, demonstrating its ability to achieve real-time vocal control to the APL.
Does quality of life depend on speech recognition performance for adult cochlear implant users?
Capretta, Natalie R; Moberly, Aaron C
2016-03-01
Current postoperative clinical outcome measures for adults receiving cochlear implants (CIs) consist of testing speech recognition, primarily under quiet conditions. However, it is strongly suspected that results on these measures may not adequately reflect patients' quality of life (QOL) using their implants. This study aimed to evaluate whether QOL for CI users depends on speech recognition performance. Twenty-three postlingually deafened adults with CIs were assessed. Participants were tested for speech recognition (Central Institute for the Deaf word and AzBio sentence recognition in quiet) and completed three QOL measures-the Nijmegen Cochlear Implant Questionnaire; either the Hearing Handicap Inventory for Adults or the Hearing Handicap Inventory for the Elderly; and the Speech, Spatial and Qualities of Hearing Scale questionnaires-to assess a variety of QOL factors. Correlations were sought between speech recognition and QOL scores. Demographics, audiologic history, language, and cognitive skills were also examined as potential predictors of QOL. Only a few QOL scores significantly correlated with postoperative sentence or word recognition in quiet, and correlations were primarily isolated to speech-related subscales on QOL measures. Poorer pre- and postoperative unaided hearing predicted better QOL. Socioeconomic status, duration of deafness, age at implantation, duration of CI use, reading ability, vocabulary size, and cognitive status did not consistently predict QOL scores. For adult, postlingually deafened CI users, clinical speech recognition measures in quiet do not correlate broadly with QOL. Results suggest the need for additional outcome measures of the benefits and limitations of cochlear implantation. 4. Laryngoscope, 126:699-706, 2016. © 2015 The American Laryngological, Rhinological and Otological Society, Inc.
Yoon, Yang-soo; Li, Yongxin; Kang, Hou-Yong; Fu, Qian-Jie
2011-01-01
Objective The full benefit of bilateral cochlear implants may depend on the unilateral performance with each device, the speech materials, processing ability of the user, and/or the listening environment. In this study, bilateral and unilateral speech performances were evaluated in terms of recognition of phonemes and sentences presented in quiet or in noise. Design Speech recognition was measured for unilateral left, unilateral right, and bilateral listening conditions; speech and noise were presented at 0° azimuth. The “binaural benefit” was defined as the difference between bilateral performance and unilateral performance with the better ear. Study Sample 9 adults with bilateral cochlear implants participated. Results On average, results showed a greater binaural benefit in noise than in quiet for all speech tests. More importantly, the binaural benefit was greater when unilateral performance was similar across ears. As the difference in unilateral performance between ears increased, the binaural advantage decreased; this functional relationship was observed across the different speech materials and noise levels even though there was substantial intra- and inter-subject variability. Conclusions The results indicate that subjects who show symmetry in speech recognition performance between implanted ears in general show a large binaural benefit. PMID:21696329
Multitasking During Degraded Speech Recognition in School-Age Children
Ward, Kristina M.; Brehm, Laurel
2017-01-01
Multitasking requires individuals to allocate their cognitive resources across different tasks. The purpose of the current study was to assess school-age children’s multitasking abilities during degraded speech recognition. Children (8 to 12 years old) completed a dual-task paradigm including a sentence recognition (primary) task containing speech that was either unprocessed or noise-band vocoded with 8, 6, or 4 spectral channels and a visual monitoring (secondary) task. Children’s accuracy and reaction time on the visual monitoring task was quantified during the dual-task paradigm in each condition of the primary task and compared with single-task performance. Children experienced dual-task costs in the 6- and 4-channel conditions of the primary speech recognition task with decreased accuracy on the visual monitoring task relative to baseline performance. In all conditions, children’s dual-task performance on the visual monitoring task was strongly predicted by their single-task (baseline) performance on the task. Results suggest that children’s proficiency with the secondary task contributes to the magnitude of dual-task costs while multitasking during degraded speech recognition. PMID:28105890
Multitasking During Degraded Speech Recognition in School-Age Children.
Grieco-Calub, Tina M; Ward, Kristina M; Brehm, Laurel
2017-01-01
Multitasking requires individuals to allocate their cognitive resources across different tasks. The purpose of the current study was to assess school-age children's multitasking abilities during degraded speech recognition. Children (8 to 12 years old) completed a dual-task paradigm including a sentence recognition (primary) task containing speech that was either unprocessed or noise-band vocoded with 8, 6, or 4 spectral channels and a visual monitoring (secondary) task. Children's accuracy and reaction time on the visual monitoring task was quantified during the dual-task paradigm in each condition of the primary task and compared with single-task performance. Children experienced dual-task costs in the 6- and 4-channel conditions of the primary speech recognition task with decreased accuracy on the visual monitoring task relative to baseline performance. In all conditions, children's dual-task performance on the visual monitoring task was strongly predicted by their single-task (baseline) performance on the task. Results suggest that children's proficiency with the secondary task contributes to the magnitude of dual-task costs while multitasking during degraded speech recognition.
Polur, Prasad D; Miller, Gerald E
2006-10-01
Computer speech recognition of individuals with dysarthria, such as cerebral palsy patients requires a robust technique that can handle conditions of very high variability and limited training data. In this study, application of a 10 state ergodic hidden Markov model (HMM)/artificial neural network (ANN) hybrid structure for a dysarthric speech (isolated word) recognition system, intended to act as an assistive tool, was investigated. A small size vocabulary spoken by three cerebral palsy subjects was chosen. The effect of such a structure on the recognition rate of the system was investigated by comparing it with an ergodic hidden Markov model as a control tool. This was done in order to determine if this modified technique contributed to enhanced recognition of dysarthric speech. The speech was sampled at 11 kHz. Mel frequency cepstral coefficients were extracted from them using 15 ms frames and served as training input to the hybrid model setup. The subsequent results demonstrated that the hybrid model structure was quite robust in its ability to handle the large variability and non-conformity of dysarthric speech. The level of variability in input dysarthric speech patterns sometimes limits the reliability of the system. However, its application as a rehabilitation/control tool to assist dysarthric motor impaired individuals holds sufficient promise.
Do Older Listeners With Hearing Loss Benefit From Dynamic Pitch for Speech Recognition in Noise?
Shen, Jing; Souza, Pamela E
2017-10-12
Dynamic pitch, the variation in the fundamental frequency of speech, aids older listeners' speech perception in noise. It is unclear, however, whether some older listeners with hearing loss benefit from strengthened dynamic pitch cues for recognizing speech in certain noise scenarios and how this relative benefit may be associated with individual factors. We first examined older individuals' relative benefit between natural and strong dynamic pitches for better speech recognition in noise. Further, we reported the individual factors of the 2 groups of listeners who benefit differently from natural and strong dynamic pitches. Speech reception thresholds of 13 older listeners with mild-moderate hearing loss were measured using target speech with 3 levels of dynamic pitch strength. Individuals' ability to benefit from dynamic pitch was defined as the speech reception threshold difference between speeches with and without dynamic pitch cues. The relative benefit of natural versus strong dynamic pitch varied across individuals. However, this relative benefit remained consistent for the same individuals across those background noises with temporal modulation. Those listeners who benefited more from strong dynamic pitch reported better subjective speech perception abilities. Strong dynamic pitch may be more beneficial than natural dynamic pitch for some older listeners to recognize speech better in noise, particularly when the noise has temporal modulation.
Robust relationship between reading span and speech recognition in noise
Souza, Pamela; Arehart, Kathryn
2015-01-01
Objective Working memory refers to a cognitive system that manages information processing and temporary storage. Recent work has demonstrated that individual differences in working memory capacity measured using a reading span task are related to ability to recognize speech in noise. In this project, we investigated whether the specific implementation of the reading span task influenced the strength of the relationship between working memory capacity and speech recognition. Design The relationship between speech recognition and working memory capacity was examined for two different working memory tests that varied in approach, using a within-subject design. Data consisted of audiometric results along with the two different working memory tests; one speech-in-noise test; and a reading comprehension test. Study sample The test group included 94 older adults with varying hearing loss and 30 younger adults with normal hearing. Results Listeners with poorer working memory capacity had more difficulty understanding speech in noise after accounting for age and degree of hearing loss. That relationship did not differ significantly between the two different implementations of reading span. Conclusions Our findings suggest that different implementations of a verbal reading span task do not affect the strength of the relationship between working memory capacity and speech recognition. PMID:25975360
Robust relationship between reading span and speech recognition in noise.
Souza, Pamela; Arehart, Kathryn
2015-01-01
Working memory refers to a cognitive system that manages information processing and temporary storage. Recent work has demonstrated that individual differences in working memory capacity measured using a reading span task are related to ability to recognize speech in noise. In this project, we investigated whether the specific implementation of the reading span task influenced the strength of the relationship between working memory capacity and speech recognition. The relationship between speech recognition and working memory capacity was examined for two different working memory tests that varied in approach, using a within-subject design. Data consisted of audiometric results along with the two different working memory tests; one speech-in-noise test; and a reading comprehension test. The test group included 94 older adults with varying hearing loss and 30 younger adults with normal hearing. Listeners with poorer working memory capacity had more difficulty understanding speech in noise after accounting for age and degree of hearing loss. That relationship did not differ significantly between the two different implementations of reading span. Our findings suggest that different implementations of a verbal reading span task do not affect the strength of the relationship between working memory capacity and speech recognition.
Moberly, Aaron C; Harris, Michael S; Boyce, Lauren; Nittrouer, Susan
2017-04-14
Models of speech recognition suggest that "top-down" linguistic and cognitive functions, such as use of phonotactic constraints and working memory, facilitate recognition under conditions of degradation, such as in noise. The question addressed in this study was what happens to these functions when a listener who has experienced years of hearing loss obtains a cochlear implant. Thirty adults with cochlear implants and 30 age-matched controls with age-normal hearing underwent testing of verbal working memory using digit span and serial recall of words. Phonological capacities were assessed using a lexical decision task and nonword repetition. Recognition of words in sentences in speech-shaped noise was measured. Implant users had only slightly poorer working memory accuracy than did controls and only on serial recall of words; however, phonological sensitivity was highly impaired. Working memory did not facilitate speech recognition in noise for either group. Phonological sensitivity predicted sentence recognition for implant users but not for listeners with normal hearing. Clinical speech recognition outcomes for adult implant users relate to the ability of these users to process phonological information. Results suggest that phonological capacities may serve as potential clinical targets through rehabilitative training. Such novel interventions may be particularly helpful for older adult implant users.
Harris, Michael S.; Boyce, Lauren; Nittrouer, Susan
2017-01-01
Purpose Models of speech recognition suggest that “top-down” linguistic and cognitive functions, such as use of phonotactic constraints and working memory, facilitate recognition under conditions of degradation, such as in noise. The question addressed in this study was what happens to these functions when a listener who has experienced years of hearing loss obtains a cochlear implant. Method Thirty adults with cochlear implants and 30 age-matched controls with age-normal hearing underwent testing of verbal working memory using digit span and serial recall of words. Phonological capacities were assessed using a lexical decision task and nonword repetition. Recognition of words in sentences in speech-shaped noise was measured. Results Implant users had only slightly poorer working memory accuracy than did controls and only on serial recall of words; however, phonological sensitivity was highly impaired. Working memory did not facilitate speech recognition in noise for either group. Phonological sensitivity predicted sentence recognition for implant users but not for listeners with normal hearing. Conclusion Clinical speech recognition outcomes for adult implant users relate to the ability of these users to process phonological information. Results suggest that phonological capacities may serve as potential clinical targets through rehabilitative training. Such novel interventions may be particularly helpful for older adult implant users. PMID:28384805
Non-native Listeners’ Recognition of High-Variability Speech Using PRESTO
Tamati, Terrin N.; Pisoni, David B.
2015-01-01
Background Natural variability in speech is a significant challenge to robust successful spoken word recognition. In everyday listening environments, listeners must quickly adapt and adjust to multiple sources of variability in both the signal and listening environments. High-variability speech may be particularly difficult to understand for non-native listeners, who have less experience with the second language (L2) phonological system and less detailed knowledge of sociolinguistic variation of the L2. Purpose The purpose of this study was to investigate the effects of high-variability sentences on non-native speech recognition and to explore the underlying sources of individual differences in speech recognition abilities of non-native listeners. Research Design Participants completed two sentence recognition tasks involving high-variability and low-variability sentences. They also completed a battery of behavioral tasks and self-report questionnaires designed to assess their indexical processing skills, vocabulary knowledge, and several core neurocognitive abilities. Study Sample Native speakers of Mandarin (n = 25) living in the United States recruited from the Indiana University community participated in the current study. A native comparison group consisted of scores obtained from native speakers of English (n = 21) in the Indiana University community taken from an earlier study. Data Collection and Analysis Speech recognition in high-variability listening conditions was assessed with a sentence recognition task using sentences from PRESTO (Perceptually Robust English Sentence Test Open-Set) mixed in 6-talker multitalker babble. Speech recognition in low-variability listening conditions was assessed using sentences from HINT (Hearing In Noise Test) mixed in 6-talker multitalker babble. Indexical processing skills were measured using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Vocabulary knowledge was assessed with the WordFam word familiarity test, and executive functioning was assessed with the BRIEF-A (Behavioral Rating Inventory of Executive Function – Adult Version) self-report questionnaire. Scores from the non-native listeners on behavioral tasks and self-report questionnaires were compared with scores obtained from native listeners tested in a previous study and were examined for individual differences. Results Non-native keyword recognition scores were significantly lower on PRESTO sentences than on HINT sentences. Non-native listeners’ keyword recognition scores were also lower than native listeners’ scores on both sentence recognition tasks. Differences in performance on the sentence recognition tasks between non-native and native listeners were larger on PRESTO than on HINT, although group differences varied by signal-to-noise ratio. The non-native and native groups also differed in the ability to categorize talkers by region of origin and in vocabulary knowledge. Individual non-native word recognition accuracy on PRESTO sentences in multitalker babble at more favorable signal-to-noise ratios was found to be related to several BRIEF-A subscales and composite scores. However, non-native performance on PRESTO was not related to regional dialect categorization, talker and gender discrimination, or vocabulary knowledge. Conclusions High-variability sentences in multitalker babble were particularly challenging for non-native listeners. Difficulty under high-variability testing conditions was related to lack of experience with the L2, especially L2 sociolinguistic information, compared with native listeners. Individual differences among the non-native listeners were related to weaknesses in core neurocognitive abilities affecting behavioral control in everyday life. PMID:25405842
Lima, César F; Garrett, Carolina; Castro, São Luís
2013-01-01
Does emotion processing in music and speech prosody recruit common neurocognitive mechanisms? To examine this question, we implemented a cross-domain comparative design in Parkinson's disease (PD). Twenty-four patients and 25 controls performed emotion recognition tasks for music and spoken sentences. In music, patients had impaired recognition of happiness and peacefulness, and intact recognition of sadness and fear; this pattern was independent of general cognitive and perceptual abilities. In speech, patients had a small global impairment, which was significantly mediated by executive dysfunction. Hence, PD affected differently musical and prosodic emotions. This dissociation indicates that the mechanisms underlying the two domains are partly independent.
Speech Perception in Complex Acoustic Environments: Developmental Effects
ERIC Educational Resources Information Center
Leibold, Lori J.
2017-01-01
Purpose: The ability to hear and understand speech in complex acoustic environments follows a prolonged time course of development. The purpose of this article is to provide a general overview of the literature describing age effects in susceptibility to auditory masking in the context of speech recognition, including a summary of findings related…
The Effect of Remote Masking on the Reception of Speech by Young School-Age Children.
Youngdahl, Carla L; Healy, Eric W; Yoho, Sarah E; Apoux, Frédéric; Holt, Rachael Frush
2018-02-15
Psychoacoustic data indicate that infants and children are less likely than adults to focus on a spectral region containing an anticipated signal and are more susceptible to remote masking of a signal. These detection tasks suggest that infants and children, unlike adults, do not listen selectively. However, less is known about children's ability to listen selectively during speech recognition. Accordingly, the current study examines remote masking during speech recognition in children and adults. Adults and 7- and 5-year-old children performed sentence recognition in the presence of various spectrally remote maskers. Intelligibility was determined for each remote-masker condition, and performance was compared across age groups. It was found that speech recognition for 5-year-olds was reduced in the presence of spectrally remote noise, whereas the maskers had no effect on the 7-year-olds or adults. Maskers of different bandwidth and remoteness had similar effects. In accord with psychoacoustic data, young children do not appear to focus on a spectral region of interest and ignore other regions during speech recognition. This tendency may help account for their typically poorer speech perception in noise. This study also appears to capture an important developmental stage, during which a substantial refinement in spectral listening occurs.
Moberly, Aaron C; Patel, Tirth R; Castellanos, Irina
2018-02-01
As a result of their hearing loss, adults with cochlear implants (CIs) would self-report poorer executive functioning (EF) skills than normal-hearing (NH) peers, and these EF skills would be associated with performance on speech recognition tasks. EF refers to a group of high order neurocognitive skills responsible for behavioral and emotional regulation during goal-directed activity, and EF has been found to be poorer in children with CIs than their NH age-matched peers. Moreover, there is increasing evidence that neurocognitive skills, including some EF skills, contribute to the ability to recognize speech through a CI. Thirty postlingually deafened adults with CIs and 42 age-matched NH adults were enrolled. Participants and their spouses or significant others (informants) completed well-validated self-reports or informant-reports of EF, the Behavior Rating Inventory of Executive Function - Adult (BRIEF-A). CI users' speech recognition skills were assessed in quiet using several measures of sentence recognition. NH peers were tested for recognition of noise-vocoded versions of the same speech stimuli. CI users self-reported difficulty on EF tasks of shifting and task monitoring. In CI users, measures of speech recognition correlated with several self-reported EF skills. The present findings provide further evidence that neurocognitive factors, including specific EF skills, may decline in association with hearing loss, and that some of these EF skills contribute to speech processing under degraded listening conditions.
Development of coffee maker service robot using speech and face recognition systems using POMDP
NASA Astrophysics Data System (ADS)
Budiharto, Widodo; Meiliana; Santoso Gunawan, Alexander Agung
2016-07-01
There are many development of intelligent service robot in order to interact with user naturally. This purpose can be done by embedding speech and face recognition ability on specific tasks to the robot. In this research, we would like to propose Intelligent Coffee Maker Robot which the speech recognition is based on Indonesian language and powered by statistical dialogue systems. This kind of robot can be used in the office, supermarket or restaurant. In our scenario, robot will recognize user's face and then accept commands from the user to do an action, specifically in making a coffee. Based on our previous work, the accuracy for speech recognition is about 86% and face recognition is about 93% in laboratory experiments. The main problem in here is to know the intention of user about how sweetness of the coffee. The intelligent coffee maker robot should conclude the user intention through conversation under unreliable automatic speech in noisy environment. In this paper, this spoken dialog problem is treated as a partially observable Markov decision process (POMDP). We describe how this formulation establish a promising framework by empirical results. The dialog simulations are presented which demonstrate significant quantitative outcome.
Text as a Supplement to Speech in Young and Older Adults a)
Krull, Vidya; Humes, Larry E.
2015-01-01
Objective The purpose of this experiment was to quantify the contribution of visual text to auditory speech recognition in background noise. Specifically, we tested the hypothesis that partially accurate visual text from an automatic speech recognizer could be used successfully to supplement speech understanding in difficult listening conditions in older adults, with normal or impaired hearing. Our working hypotheses were based on what is known regarding audiovisual speech perception in the elderly from speechreading literature. We hypothesized that: 1) combining auditory and visual text information will result in improved recognition accuracy compared to auditory or visual text information alone; 2) benefit from supplementing speech with visual text (auditory and visual enhancement) in young adults will be greater than that in older adults; and 3) individual differences in performance on perceptual measures would be associated with cognitive abilities. Design Fifteen young adults with normal hearing, fifteen older adults with normal hearing, and fifteen older adults with hearing loss participated in this study. All participants completed sentence recognition tasks in auditory-only, text-only, and combined auditory-text conditions. The auditory sentence stimuli were spectrally shaped to restore audibility for the older participants with impaired hearing. All participants also completed various cognitive measures, including measures of working memory, processing speed, verbal comprehension, perceptual and cognitive speed, processing efficiency, inhibition, and the ability to form wholes from parts. Group effects were examined for each of the perceptual and cognitive measures. Audiovisual benefit was calculated relative to performance on auditory-only and visual-text only conditions. Finally, the relationship between perceptual measures and other independent measures were examined using principal-component factor analyses, followed by regression analyses. Results Both young and older adults performed similarly on nine out of ten perceptual measures (auditory, visual, and combined measures). Combining degraded speech with partially correct text from an automatic speech recognizer improved the understanding of speech in both young and older adults, relative to both auditory- and text-only performance. In all subjects, cognition emerged as a key predictor for a general speech-text integration ability. Conclusions These results suggest that neither age nor hearing loss affected the ability of subjects to benefit from text when used to support speech, after ensuring audibility through spectral shaping. These results also suggest that the benefit obtained by supplementing auditory input with partially accurate text is modulated by cognitive ability, specifically lexical and verbal skills. PMID:26458131
Firszt, Jill B; Reeder, Ruth M; Holden, Laura K
At a minimum, unilateral hearing loss (UHL) impairs sound localization ability and understanding speech in noisy environments, particularly if the loss is severe to profound. Accompanying the numerous negative consequences of UHL is considerable unexplained individual variability in the magnitude of its effects. Identification of covariables that affect outcome and contribute to variability in UHLs could augment counseling, treatment options, and rehabilitation. Cochlear implantation as a treatment for UHL is on the rise yet little is known about factors that could impact performance or whether there is a group at risk for poor cochlear implant outcomes when hearing is near-normal in one ear. The overall goal of our research is to investigate the range and source of variability in speech recognition in noise and localization among individuals with severe to profound UHL and thereby help determine factors relevant to decisions regarding cochlear implantation in this population. The present study evaluated adults with severe to profound UHL and adults with bilateral normal hearing. Measures included adaptive sentence understanding in diffuse restaurant noise, localization, roving-source speech recognition (words from 1 of 15 speakers in a 140° arc), and an adaptive speech-reception threshold psychoacoustic task with varied noise types and noise-source locations. There were three age-sex-matched groups: UHL (severe to profound hearing loss in one ear and normal hearing in the contralateral ear), normal hearing listening bilaterally, and normal hearing listening unilaterally. Although the normal-hearing-bilateral group scored significantly better and had less performance variability than UHLs on all measures, some UHL participants scored within the range of the normal-hearing-bilateral group on all measures. The normal-hearing participants listening unilaterally had better monosyllabic word understanding than UHLs for words presented on the blocked/deaf side but not the open/hearing side. In contrast, UHLs localized better than the normal-hearing unilateral listeners for stimuli on the open/hearing side but not the blocked/deaf side. This suggests that UHLs had learned strategies for improved localization on the side of the intact ear. The UHL and unilateral normal-hearing participant groups were not significantly different for speech in noise measures. UHL participants with childhood rather than recent hearing loss onset localized significantly better; however, these two groups did not differ for speech recognition in noise. Age at onset in UHL adults appears to affect localization ability differently than understanding speech in noise. Hearing thresholds were significantly correlated with speech recognition for UHL participants but not the other two groups. Auditory abilities of UHLs varied widely and could be explained only in part by hearing threshold levels. Age at onset and length of hearing loss influenced performance on some, but not all measures. Results support the need for a revised and diverse set of clinical measures, including sound localization, understanding speech in varied environments, and careful consideration of functional abilities as individuals with severe to profound UHL are being considered potential cochlear implant candidates.
Muthusamy, Hariharan; Polat, Kemal; Yaacob, Sazali
2015-01-01
In the recent years, many research works have been published using speech related features for speech emotion recognition, however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstralcoefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features and relative wavelet packet energy and entropy features were extracted from the emotional speech (ES) signals and its glottal waveforms(GW). Particle swarm optimization based clustering (PSOC) and wrapper based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features respectively. Three different emotional speech databases were utilized to gauge the proposed method. Extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted and the results show that the proposed method significantly improves the speech emotion recognition performance compared to previous works published in the literature. PMID:25799141
Ng, Elaine H N; Classon, Elisabet; Larsby, Birgitta; Arlinger, Stig; Lunner, Thomas; Rudner, Mary; Rönnberg, Jerker
2014-11-23
The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced. © The Author(s) 2014.
Relation between speech-in-noise threshold, hearing loss and cognition from 40-69 years of age.
Moore, David R; Edmondson-Jones, Mark; Dawes, Piers; Fortnum, Heather; McCormack, Abby; Pierzycki, Robert H; Munro, Kevin J
2014-01-01
Healthy hearing depends on sensitive ears and adequate brain processing. Essential aspects of both hearing and cognition decline with advancing age, but it is largely unknown how one influences the other. The current standard measure of hearing, the pure-tone audiogram is not very cognitively demanding and does not predict well the most important yet challenging use of hearing, listening to speech in noisy environments. We analysed data from UK Biobank that asked 40-69 year olds about their hearing, and assessed their ability on tests of speech-in-noise hearing and cognition. About half a million volunteers were recruited through NHS registers. Respondents completed 'whole-body' testing in purpose-designed, community-based test centres across the UK. Objective hearing (spoken digit recognition in noise) and cognitive (reasoning, memory, processing speed) data were analysed using logistic and multiple regression methods. Speech hearing in noise declined exponentially with age for both sexes from about 50 years, differing from previous audiogram data that showed a more linear decline from <40 years for men, and consistently less hearing loss for women. The decline in speech-in-noise hearing was especially dramatic among those with lower cognitive scores. Decreasing cognitive ability and increasing age were both independently associated with decreasing ability to hear speech-in-noise (0.70 and 0.89 dB, respectively) among the population studied. Men subjectively reported up to 60% higher rates of difficulty hearing than women. Workplace noise history associated with difficulty in both subjective hearing and objective speech hearing in noise. Leisure noise history was associated with subjective, but not with objective difficulty hearing. Older people have declining cognitive processing ability associated with reduced ability to hear speech in noise, measured by recognition of recorded spoken digits. Subjective reports of hearing difficulty generally show a higher prevalence than objective measures, suggesting that current objective methods could be extended further.
Relation between Speech-in-Noise Threshold, Hearing Loss and Cognition from 40–69 Years of Age
Moore, David R.; Edmondson-Jones, Mark; Dawes, Piers; Fortnum, Heather; McCormack, Abby; Pierzycki, Robert H.; Munro, Kevin J.
2014-01-01
Background Healthy hearing depends on sensitive ears and adequate brain processing. Essential aspects of both hearing and cognition decline with advancing age, but it is largely unknown how one influences the other. The current standard measure of hearing, the pure-tone audiogram is not very cognitively demanding and does not predict well the most important yet challenging use of hearing, listening to speech in noisy environments. We analysed data from UK Biobank that asked 40–69 year olds about their hearing, and assessed their ability on tests of speech-in-noise hearing and cognition. Methods and Findings About half a million volunteers were recruited through NHS registers. Respondents completed ‘whole-body’ testing in purpose-designed, community-based test centres across the UK. Objective hearing (spoken digit recognition in noise) and cognitive (reasoning, memory, processing speed) data were analysed using logistic and multiple regression methods. Speech hearing in noise declined exponentially with age for both sexes from about 50 years, differing from previous audiogram data that showed a more linear decline from <40 years for men, and consistently less hearing loss for women. The decline in speech-in-noise hearing was especially dramatic among those with lower cognitive scores. Decreasing cognitive ability and increasing age were both independently associated with decreasing ability to hear speech-in-noise (0.70 and 0.89 dB, respectively) among the population studied. Men subjectively reported up to 60% higher rates of difficulty hearing than women. Workplace noise history associated with difficulty in both subjective hearing and objective speech hearing in noise. Leisure noise history was associated with subjective, but not with objective difficulty hearing. Conclusions Older people have declining cognitive processing ability associated with reduced ability to hear speech in noise, measured by recognition of recorded spoken digits. Subjective reports of hearing difficulty generally show a higher prevalence than objective measures, suggesting that current objective methods could be extended further. PMID:25229622
Speech recognition: Acoustic-phonetic knowledge acquisition and representation
NASA Astrophysics Data System (ADS)
Zue, Victor W.
1988-09-01
The long-term research goal is to develop and implement speaker-independent continuous speech recognition systems. It is believed that the proper utilization of speech-specific knowledge is essential for such advanced systems. This research is thus directed toward the acquisition, quantification, and representation, of acoustic-phonetic and lexical knowledge, and the application of this knowledge to speech recognition algorithms. In addition, we are exploring new speech recognition alternatives based on artificial intelligence and connectionist techniques. We developed a statistical model for predicting the acoustic realization of stop consonants in various positions in the syllable template. A unification-based grammatical formalism was developed for incorporating this model into the lexical access algorithm. We provided an information-theoretic justification for the hierarchical structure of the syllable template. We analyzed segmented duration for vowels and fricatives in continuous speech. Based on contextual information, we developed durational models for vowels and fricatives that account for over 70 percent of the variance, using data from multiple, unknown speakers. We rigorously evaluated the ability of human spectrogram readers to identify stop consonants spoken by many talkers and in a variety of phonetic contexts. Incorporating the declarative knowledge used by the readers, we developed a knowledge-based system for stop identification. We achieved comparable system performance to that to the readers.
Polur, Prasad D; Miller, Gerald E
2005-01-01
Computer speech recognition of individuals with dysarthria, such as cerebral palsy patients, requires a robust technique that can handle conditions of very high variability and limited training data. In this study, a hidden Markov model (HMM) was constructed and conditions investigated that would provide improved performance for a dysarthric speech (isolated word) recognition system intended to act as an assistive/control tool. In particular, we investigated the effect of high-frequency spectral components on the recognition rate of the system to determine if they contributed useful additional information to the system. A small-size vocabulary spoken by three cerebral palsy subjects was chosen. Mel-frequency cepstral coefficients extracted with the use of 15 ms frames served as training input to an ergodic HMM setup. Subsequent results demonstrated that no significant useful information was available to the system for enhancing its ability to discriminate dysarthric speech above 5.5 kHz in the current set of dysarthric data. The level of variability in input dysarthric speech patterns limits the reliability of the system. However, its application as a rehabilitation/control tool to assist dysarthric motor-impaired individuals such as cerebral palsy subjects holds sufficient promise.
2015-10-01
Scoring, Gaussian Backend , etc.) as shown in Fig. 39. The methods in this domain also emphasized the ability to perform data purification for both...investigation using the same infrastructure was undertaken to explore Lombard effect “flavor” detection for improved speaker ID. The study The presence of...dimension selection and compared to a common N-gram frequency based selection. 2.1.2: Exploration on NN/DBN backend : Since Deep Neural Networks (DNN) have
On Older Listeners' Ability to Perceive Dynamic Pitch
ERIC Educational Resources Information Center
Shen, Jing; Wright, Richard; Souza, Pamela E.
2016-01-01
Purpose: Natural speech comes with variation in pitch, which serves as an important cue for speech recognition. The present study investigated older listeners' dynamic pitch perception with a focus on interindividual variability. In particular, we asked whether some of the older listeners' inability to perceive dynamic pitch stems from the higher…
Is talking to an automated teller machine natural and fun?
Chan, F Y; Khalid, H M
Usability and affective issues of using automatic speech recognition technology to interact with an automated teller machine (ATM) are investigated in two experiments. The first uncovered dialogue patterns of ATM users for the purpose of designing the user interface for a simulated speech ATM system. Applying the Wizard-of-Oz methodology, multiple mapping and word spotting techniques, the speech driven ATM accommodates bilingual users of Bahasa Melayu and English. The second experiment evaluates the usability of a hybrid speech ATM, comparing it with a simulated manual ATM. The aim is to investigate how natural and fun can talking to a speech ATM be for these first-time users. Subjects performed the withdrawal and balance enquiry tasks. The ANOVA was performed on the usability and affective data. The results showed significant differences between systems in the ability to complete the tasks as well as in transaction errors. Performance was measured on the time taken by subjects to complete the task and the number of speech recognition errors that occurred. On the basis of user emotions, it can be said that the hybrid speech system enabled pleasurable interaction. Despite the limitations of speech recognition technology, users are set to talk to the ATM when it becomes available for public use.
Firszt, Jill B.; Reeder, Ruth M.; Holden, Laura K.
2016-01-01
Objectives At a minimum, unilateral hearing loss (UHL) impairs sound localization ability and understanding speech in noisy environments, particularly if the loss is severe to profound. Accompanying the numerous negative consequences of UHL is considerable unexplained individual variability in the magnitude of its effects. Identification of co-variables that affect outcome and contribute to variability in UHLs could augment counseling, treatment options, and rehabilitation. Cochlear implantation as a treatment for UHL is on the rise yet little is known about factors that could impact performance or whether there is a group at risk for poor cochlear implant outcomes when hearing is near-normal in one ear. The overall goal of our research is to investigate the range and source of variability in speech recognition in noise and localization among individuals with severe to profound UHL and thereby help determine factors relevant to decisions regarding cochlear implantation in this population. Design The present study evaluated adults with severe to profound UHL and adults with bilateral normal hearing. Measures included adaptive sentence understanding in diffuse restaurant noise, localization, roving-source speech recognition (words from 1 of 15 speakers in a 140° arc) and an adaptive speech-reception threshold psychoacoustic task with varied noise types and noise-source locations. There were three age-gender-matched groups: UHL (severe to profound hearing loss in one ear and normal hearing in the contralateral ear), normal hearing listening bilaterally, and normal hearing listening unilaterally. Results Although the normal-hearing-bilateral group scored significantly better and had less performance variability than UHLs on all measures, some UHL participants scored within the range of the normal-hearing-bilateral group on all measures. The normal-hearing participants listening unilaterally had better monosyllabic word understanding than UHLs for words presented on the blocked/deaf side but not the open/hearing side. In contrast, UHLs localized better than the normal hearing unilateral listeners for stimuli on the open/hearing side but not the blocked/deaf side. This suggests that UHLs had learned strategies for improved localization on the side of the intact ear. The UHL and unilateral normal hearing participant groups were not significantly different for speech-in-noise measures. UHL participants with childhood rather than recent hearing loss onset localized significantly better; however, these two groups did not differ for speech recognition in noise. Age at onset in UHL adults appears to affect localization ability differently than understanding speech in noise. Hearing thresholds were significantly correlated with speech recognition for UHL participants but not the other two groups. Conclusions Auditory abilities of UHLs varied widely and could be explained only in part by hearing threshold levels. Age at onset and length of hearing loss influenced performance on some, but not all measures. Results support the need for a revised and diverse set of clinical measures, including sound localization, understanding speech in varied environments and careful consideration of functional abilities as individuals with severe to profound UHL are being considered potential cochlear implant candidates. PMID:28067750
Evaluation of auditory functions for Royal Canadian Mounted Police officers.
Vaillancourt, Véronique; Laroche, Chantal; Giguère, Christian; Beaulieu, Marc-André; Legault, Jean-Pierre
2011-06-01
Auditory fitness for duty (AFFD) testing is an important element in an assessment of workers' ability to perform job tasks safely and effectively. Functional hearing is particularly critical to job performance in law enforcement. Most often, assessment is based on pure-tone detection thresholds; however, its validity can be questioned and challenged in court. In an attempt to move beyond the pure-tone audiogram, some organizations like the Royal Canadian Mounted Police (RCMP) are incorporating additional testing to supplement audiometric data in their AFFD protocols, such as measurements of speech recognition in quiet and/or in noise, and sound localization. This article reports on the assessment of RCMP officers wearing hearing aids in speech recognition and sound localization tasks. The purpose was to quantify individual performance in different domains of hearing identified as necessary components of fitness for duty, and to document the type of hearing aids prescribed in the field and their benefit for functional hearing. The data are to help RCMP in making more informed decisions regarding AFFD in officers wearing hearing aids. The proposed new AFFD protocol included unaided and aided measures of speech recognition in quiet and in noise using the Hearing in Noise Test (HINT) and sound localization in the left/right (L/R) and front/back (F/B) horizontal planes. Sixty-four officers were identified and selected by the RCMP to take part in this study on the basis of hearing thresholds exceeding current audiometrically based criteria. This article reports the results of 57 officers wearing hearing aids. Based on individual results, 49% of officers were reclassified from nonoperational status to operational with limitations on fine hearing duties, given their unaided and/or aided performance. Group data revealed that hearing aids (1) improved speech recognition thresholds on the HINT, the effects being most prominent in Quiet and in conditions of spatial separation between target and noise (Noise Right and Noise Left) and least considerable in Noise Front; (2) neither significantly improved nor impeded L/R localization; and (3) substantially increased F/B errors in localization in a number of cases. Additional analyses also pointed to the poor ability of threshold data to predict functional abilities for speech in noise (r² = 0.26 to 0.33) and sound localization (r² = 0.03 to 0.28). Only speech in quiet (r² = 0.68 to 0.85) is predicted adequately from threshold data. Combined with previous findings, results indicate that the use of hearing aids can considerably affect F/B localization abilities in a number of individuals. Moreover, speech understanding in noise and sound localization abilities were poorly predicted from pure-tone thresholds, demonstrating the need to specifically test these abilities, both unaided and aided, when assessing AFFD. Finally, further work is needed to develop empirically based hearing criteria for the RCMP and identify best practices in hearing aid fittings for optimal functional hearing abilities. American Academy of Audiology.
Gordon-Salant, Sandra; Cole, Stacey Samuels
2016-01-01
This study aimed to determine if younger and older listeners with normal hearing who differ on working memory span perform differently on speech recognition tests in noise. Older adults typically exhibit poorer speech recognition scores in noise than younger adults, which is attributed primarily to poorer hearing sensitivity and more limited working memory capacity in older than younger adults. Previous studies typically tested older listeners with poorer hearing sensitivity and shorter working memory spans than younger listeners, making it difficult to discern the importance of working memory capacity on speech recognition. This investigation controlled for hearing sensitivity and compared speech recognition performance in noise by younger and older listeners who were subdivided into high and low working memory groups. Performance patterns were compared for different speech materials to assess whether or not the effect of working memory capacity varies with the demands of the specific speech test. The authors hypothesized that (1) normal-hearing listeners with low working memory span would exhibit poorer speech recognition performance in noise than those with high working memory span; (2) older listeners with normal hearing would show poorer speech recognition scores than younger listeners with normal hearing, when the two age groups were matched for working memory span; and (3) an interaction between age and working memory would be observed for speech materials that provide contextual cues. Twenty-eight older (61 to 75 years) and 25 younger (18 to 25 years) normal-hearing listeners were assigned to groups based on age and working memory status. Northwestern University Auditory Test No. 6 words and Institute of Electrical and Electronics Engineers sentences were presented in noise using an adaptive procedure to measure the signal-to-noise ratio corresponding to 50% correct performance. Cognitive ability was evaluated with two tests of working memory (Listening Span Test and Reading Span Test) and two tests of processing speed (Paced Auditory Serial Addition Test and The Letter Digit Substitution Test). Significant effects of age and working memory capacity were observed on the speech recognition measures in noise, but these effects were mediated somewhat by the speech signal. Specifically, main effects of age and working memory were revealed for both words and sentences, but the interaction between the two was significant for sentences only. For these materials, effects of age were observed for listeners in the low working memory groups only. Although all cognitive measures were significantly correlated with speech recognition in noise, working memory span was the most important variable accounting for speech recognition performance. The results indicate that older adults with high working memory capacity are able to capitalize on contextual cues and perform as well as young listeners with high working memory capacity for sentence recognition. The data also suggest that listeners with normal hearing and low working memory capacity are less able to adapt to distortion of speech signals caused by background noise, which requires the allocation of more processing resources to earlier processing stages. These results indicate that both younger and older adults with low working memory capacity and normal hearing are at a disadvantage for recognizing speech in noise.
[Improvement in Phoneme Discrimination in Noise in Normal Hearing Adults].
Schumann, A; Garea Garcia, L; Hoppe, U
2017-02-01
Objective: The study's aim was to examine the possibility to train phoneme-discrimination in noise with normal hearing adults, and its effectivity on speech recognition in noise. A specific computerised training program was used, consisting of special nonsense-syllables with background noise, to train participants' discrimination ability. Material and Methods: 46 normal hearing subjects took part in this study, 28 as training group participants, 18 as control group participants. Only the training group subjects were asked to train over a period of 3 weeks, twice a week for an hour with a computer-based training program. Speech recognition in noise were measured pre- to posttraining for the training group subjects with the Freiburger Einsilber Test. The control group subjects obtained test and restest measures within a 2-3 week break. For the training group follow-up speech recognition was measured 2-3 months after the end of the training. Results: The majority of training group subjects improved their phoneme discrimination significantly. Besides, their speech recognition in noise improved significantly during the training compared to the control group, and remained stable for a period of time. Conclusions: Phonem-Discrimination in noise can be trained by normal hearing adults. The improvements have got a positiv effect on speech recognition in noise, also for a longer period of time. © Georg Thieme Verlag KG Stuttgart · New York.
NASA Astrophysics Data System (ADS)
Collison, Elizabeth A.; Munson, Benjamin; Carney, Arlene E.
2002-05-01
Recent research has attempted to identify the factors that predict speech perception performance among users of cochlear implants (CIs). Studies have found that approximately 20%-60% of the variance in speech perception scores can be accounted for by factors including duration of deafness, etiology, type of device, and length of implant use, leaving approximately 50% of the variance unaccounted for. The current study examines the extent to which vocabulary size and nonverbal cognitive ability predict CI listeners' spoken word recognition. Fifteen postlingually deafened adults with nucleus or clarion CIs were given standardized assessments of nonverbal cognitive ability and expressive vocabulary size: the Expressive Vocabulary Test, the Test of Nonverbal Intelligence-III, and the Woodcock-Johnson-III Test of Cognitive Ability, Verbal Comprehension subtest. Two spoken word recognition tasks were administered. In the first, listeners identified isophonemic CVC words. In the second, listeners identified gated words varying in lexical frequency and neighborhood density. Analyses will examine the influence of lexical frequency and neighborhood density on the uniqueness point in the gating task, as well as relationships among nonverbal cognitive ability, vocabulary size, and the two spoken word recognition measures. [Work supported by NIH Grant P01 DC00110 and by the Lions 3M Hearing Foundation.
Validation of Automated Scoring of Oral Reading
ERIC Educational Resources Information Center
Balogh, Jennifer; Bernstein, Jared; Cheng, Jian; Van Moere, Alistair; Townshend, Brent; Suzuki, Masanori
2012-01-01
A two-part experiment is presented that validates a new measurement tool for scoring oral reading ability. Data collected by the U.S. government in a large-scale literacy assessment of adults were analyzed by a system called VersaReader that uses automatic speech recognition and speech processing technologies to score oral reading fluency. In the…
Effects of noise on speech recognition: Challenges for communication by service members.
Le Prell, Colleen G; Clavier, Odile H
2017-06-01
Speech communication often takes place in noisy environments; this is an urgent issue for military personnel who must communicate in high-noise environments. The effects of noise on speech recognition vary significantly according to the sources of noise, the number and types of talkers, and the listener's hearing ability. In this review, speech communication is first described as it relates to current standards of hearing assessment for military and civilian populations. The next section categorizes types of noise (also called maskers) according to their temporal characteristics (steady or fluctuating) and perceptive effects (energetic or informational masking). Next, speech recognition difficulties experienced by listeners with hearing loss and by older listeners are summarized, and questions on the possible causes of speech-in-noise difficulty are discussed, including recent suggestions of "hidden hearing loss". The final section describes tests used by military and civilian researchers, audiologists, and hearing technicians to assess performance of an individual in recognizing speech in background noise, as well as metrics that predict performance based on a listener and background noise profile. This article provides readers with an overview of the challenges associated with speech communication in noisy backgrounds, as well as its assessment and potential impact on functional performance, and provides guidance for important new research directions relevant not only to military personnel, but also to employees who work in high noise environments. Copyright © 2016 Elsevier B.V. All rights reserved.
The Impact of Strong Assimilation on the Perception of Connected Speech
ERIC Educational Resources Information Center
Gaskell, M. Gareth; Snoeren, Natalie D.
2008-01-01
Models of compensation for phonological variation in spoken word recognition differ in their ability to accommodate complete assimilatory alternations (such as run assimilating fully to rum in the context of a quick run picks you up). Two experiments addressed whether such complete changes can be observed in casual speech, and if so, whether they…
NASA Astrophysics Data System (ADS)
Poock, G. K.; Martin, B. J.
1984-02-01
This was an applied investigation examining the ability of a speech recognition system to recognize speakers' inputs when the speakers were under different stress levels. Subjects were asked to speak to a voice recognition system under three conditions: (1) normal office environment, (2) emotional stress, and (3) perceptual-motor stress. Results indicate a definite relationship between voice recognition system performance and the type of low stress reference patterns used to achieve recognition.
Open source OCR framework using mobile devices
NASA Astrophysics Data System (ADS)
Zhou, Steven Zhiying; Gilani, Syed Omer; Winkler, Stefan
2008-02-01
Mobile phones have evolved from passive one-to-one communication device to powerful handheld computing device. Today most new mobile phones are capable of capturing images, recording video, and browsing internet and do much more. Exciting new social applications are emerging on mobile landscape, like, business card readers, sing detectors and translators. These applications help people quickly gather the information in digital format and interpret them without the need of carrying laptops or tablet PCs. However with all these advancements we find very few open source software available for mobile phones. For instance currently there are many open source OCR engines for desktop platform but, to our knowledge, none are available on mobile platform. Keeping this in perspective we propose a complete text detection and recognition system with speech synthesis ability, using existing desktop technology. In this work we developed a complete OCR framework with subsystems from open source desktop community. This includes a popular open source OCR engine named Tesseract for text detection & recognition and Flite speech synthesis module, for adding text-to-speech ability.
Koeritzer, Margaret A; Rogers, Chad S; Van Engen, Kristin J; Peelle, Jonathan E
2018-03-15
The goal of this study was to determine how background noise, linguistic properties of spoken sentences, and listener abilities (hearing sensitivity and verbal working memory) affect cognitive demand during auditory sentence comprehension. We tested 30 young adults and 30 older adults. Participants heard lists of sentences in quiet and in 8-talker babble at signal-to-noise ratios of +15 dB and +5 dB, which increased acoustic challenge but left the speech largely intelligible. Half of the sentences contained semantically ambiguous words to additionally manipulate cognitive challenge. Following each list, participants performed a visual recognition memory task in which they viewed written sentences and indicated whether they remembered hearing the sentence previously. Recognition memory (indexed by d') was poorer for acoustically challenging sentences, poorer for sentences containing ambiguous words, and differentially poorer for noisy high-ambiguity sentences. Similar patterns were observed for Z-transformed response time data. There were no main effects of age, but age interacted with both acoustic clarity and semantic ambiguity such that older adults' recognition memory was poorer for acoustically degraded high-ambiguity sentences than the young adults'. Within the older adult group, exploratory correlation analyses suggested that poorer hearing ability was associated with poorer recognition memory for sentences in noise, and better verbal working memory was associated with better recognition memory for sentences in noise. Our results demonstrate listeners' reliance on domain-general cognitive processes when listening to acoustically challenging speech, even when speech is highly intelligible. Acoustic challenge and semantic ambiguity both reduce the accuracy of listeners' recognition memory for spoken sentences. https://doi.org/10.23641/asha.5848059.
Computational validation of the motor contribution to speech perception.
Badino, Leonardo; D'Ausilio, Alessandro; Fadiga, Luciano; Metta, Giorgio
2014-07-01
Action perception and recognition are core abilities fundamental for human social interaction. A parieto-frontal network (the mirror neuron system) matches visually presented biological motion information onto observers' motor representations. This process of matching the actions of others onto our own sensorimotor repertoire is thought to be important for action recognition, providing a non-mediated "motor perception" based on a bidirectional flow of information along the mirror parieto-frontal circuits. State-of-the-art machine learning strategies for hand action identification have shown better performances when sensorimotor data, as opposed to visual information only, are available during learning. As speech is a particular type of action (with acoustic targets), it is expected to activate a mirror neuron mechanism. Indeed, in speech perception, motor centers have been shown to be causally involved in the discrimination of speech sounds. In this paper, we review recent neurophysiological and machine learning-based studies showing (a) the specific contribution of the motor system to speech perception and (b) that automatic phone recognition is significantly improved when motor data are used during training of classifiers (as opposed to learning from purely auditory data). Copyright © 2014 Cognitive Science Society, Inc.
Some factors underlying individual differences in speech recognition on PRESTO: a first report.
Tamati, Terrin N; Gilbert, Jaimie L; Pisoni, David B
2013-01-01
Previous studies investigating speech recognition in adverse listening conditions have found extensive variability among individual listeners. However, little is currently known about the core underlying factors that influence speech recognition abilities. To investigate sensory, perceptual, and neurocognitive differences between good and poor listeners on the Perceptually Robust English Sentence Test Open-set (PRESTO), a new high-variability sentence recognition test under adverse listening conditions. Participants who fell in the upper quartile (HiPRESTO listeners) or lower quartile (LoPRESTO listeners) on key word recognition on sentences from PRESTO in multitalker babble completed a battery of behavioral tasks and self-report questionnaires designed to investigate real-world hearing difficulties, indexical processing skills, and neurocognitive abilities. Young, normal-hearing adults (N = 40) from the Indiana University community participated in the current study. Participants' assessment of their own real-world hearing difficulties was measured with a self-report questionnaire on situational hearing and hearing health history. Indexical processing skills were assessed using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Neurocognitive abilities were measured with the Auditory Digit Span Forward (verbal short-term memory) and Digit Span Backward (verbal working memory) tests, the Stroop Color and Word Test (attention/inhibition), the WordFam word familiarity test (vocabulary size), the Behavioral Rating Inventory of Executive Function-Adult Version (BRIEF-A) self-report questionnaire on executive function, and two performance subtests of the Wechsler Abbreviated Scale of Intelligence (WASI) Performance Intelligence Quotient (IQ; nonverbal intelligence). Scores on self-report questionnaires and behavioral tasks were tallied and analyzed by listener group (HiPRESTO and LoPRESTO). The extreme groups did not differ overall on self-reported hearing difficulties in real-world listening environments. However, an item-by-item analysis of questions revealed that LoPRESTO listeners reported significantly greater difficulty understanding speakers in a public place. HiPRESTO listeners were significantly more accurate than LoPRESTO listeners at gender discrimination and regional dialect categorization, but they did not differ on talker discrimination accuracy or response time, or gender discrimination response time. HiPRESTO listeners also had longer forward and backward digit spans, higher word familiarity ratings on the WordFam test, and lower (better) scores for three individual items on the BRIEF-A questionnaire related to cognitive load. The two groups did not differ on the Stroop Color and Word Test or either of the WASI performance IQ subtests. HiPRESTO listeners and LoPRESTO listeners differed in indexical processing abilities, short-term and working memory capacity, vocabulary size, and some domains of executive functioning. These findings suggest that individual differences in the ability to encode and maintain highly detailed episodic information in speech may underlie the variability observed in speech recognition performance in adverse listening conditions using high-variability PRESTO sentences in multitalker babble. American Academy of Audiology.
Some Factors Underlying Individual Differences in Speech Recognition on PRESTO: A First Report
Tamati, Terrin N.; Gilbert, Jaimie L.; Pisoni, David B.
2013-01-01
Background Previous studies investigating speech recognition in adverse listening conditions have found extensive variability among individual listeners. However, little is currently known about the core, underlying factors that influence speech recognition abilities. Purpose To investigate sensory, perceptual, and neurocognitive differences between good and poor listeners on PRESTO, a new high-variability sentence recognition test under adverse listening conditions. Research Design Participants who fell in the upper quartile (HiPRESTO listeners) or lower quartile (LoPRESTO listeners) on key word recognition on sentences from PRESTO in multitalker babble completed a battery of behavioral tasks and self-report questionnaires designed to investigate real-world hearing difficulties, indexical processing skills, and neurocognitive abilities. Study Sample Young, normal-hearing adults (N = 40) from the Indiana University community participated in the current study. Data Collection and Analysis Participants’ assessment of their own real-world hearing difficulties was measured with a self-report questionnaire on situational hearing and hearing health history. Indexical processing skills were assessed using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Neurocognitive abilities were measured with the Auditory Digit Span Forward (verbal short-term memory) and Digit Span Backward (verbal working memory) tests, the Stroop Color and Word Test (attention/inhibition), the WordFam word familiarity test (vocabulary size), the BRIEF-A self-report questionnaire on executive function, and two performance subtests of the WASI Performance IQ (non-verbal intelligence). Scores on self-report questionnaires and behavioral tasks were tallied and analyzed by listener group (HiPRESTO and LoPRESTO). Results The extreme groups did not differ overall on self-reported hearing difficulties in real-world listening environments. However, an item-by-item analysis of questions revealed that LoPRESTO listeners reported significantly greater difficulty understanding speakers in a public place. HiPRESTO listeners were significantly more accurate than LoPRESTO listeners at gender discrimination and regional dialect categorization, but they did not differ on talker discrimination accuracy or response time, or gender discrimination response time. HiPRESTO listeners also had longer forward and backward digit spans, higher word familiarity ratings on the WordFam test, and lower (better) scores for three individual items on the BRIEF-A questionnaire related to cognitive load. The two groups did not differ on the Stroop Color and Word Test or either of the WASI performance IQ subtests. Conclusions HiPRESTO listeners and LoPRESTO listeners differed in indexical processing abilities, short-term and working memory capacity, vocabulary size, and some domains of executive functioning. These findings suggest that individual differences in the ability to encode and maintain highly detailed episodic information in speech may underlie the variability observed in speech recognition performance in adverse listening conditions using high-variability PRESTO sentences in multitalker babble. PMID:24047949
Comparing Monotic and Diotic Selective Auditory Attention Abilities in Children
ERIC Educational Resources Information Center
Cherry, Rochelle; Rubinstein, Adrienne
2006-01-01
Purpose: Some researchers have assessed ear-specific performance of auditory processing ability using speech recognition tasks with normative data based on diotic administration. The present study investigated whether monotic and diotic administrations yield similar results using the Selective Auditory Attention Test. Method: Seventy-two typically…
Bimodal benefits on objective and subjective outcomes for adult cochlear implant users.
Heo, Ji-Hye; Lee, Jae-Hee; Lee, Won-Sang
2013-09-01
Given that only a few studies have focused on the bimodal benefits on objective and subjective outcomes and emphasized the importance of individual data, the present study aimed to measure the bimodal benefits on the objective and subjective outcomes for adults with cochlear implant. Fourteen listeners with bimodal devices were tested on the localization and recognition abilities using environmental sounds, 1-talker, and 2-talker speech materials. The localization ability was measured through an 8-loudspeaker array. For the recognition measures, listeners were asked to repeat the sentences or say the environmental sounds the listeners heard. As a subjective questionnaire, three domains of Korean-version of Speech, Spatial, Qualities of Hearing scale (K-SSQ) were used to explore any relationships between objective and subjective outcomes. Based on the group-mean data, the bimodal hearing enhanced both localization and recognition regardless of test material. However, the inter- and intra-subject variability appeared to be large across test materials for both localization and recognition abilities. Correlation analyses revealed that the relationships were not always consistent between the objective outcomes and the subjective self-reports with bimodal devices. Overall, this study supports significant bimodal advantages on localization and recognition measures, yet the large individual variability in bimodal benefits should be considered carefully for the clinical assessment as well as counseling. The discrepant relations between objective and subjective results suggest that the bimodal benefits in traditional localization or recognition measures might not necessarily correspond to the self-reported subjective advantages in everyday listening environments.
Processing load induced by informational masking is related to linguistic abilities.
Koelewijn, Thomas; Zekveld, Adriana A; Festen, Joost M; Rönnberg, Jerker; Kramer, Sophia E
2012-01-01
It is often assumed that the benefit of hearing aids is not primarily reflected in better speech performance, but that it is reflected in less effortful listening in the aided than in the unaided condition. Before being able to assess such a hearing aid benefit the present study examined how processing load while listening to masked speech relates to inter-individual differences in cognitive abilities relevant for language processing. Pupil dilation was measured in thirty-two normal hearing participants while listening to sentences masked by fluctuating noise or interfering speech at either 50% and 84% intelligibility. Additionally, working memory capacity, inhibition of irrelevant information, and written text reception was tested. Pupil responses were larger during interfering speech as compared to fluctuating noise. This effect was independent of intelligibility level. Regression analysis revealed that high working memory capacity, better inhibition, and better text reception were related to better speech reception thresholds. Apart from a positive relation to speech recognition, better inhibition and better text reception are also positively related to larger pupil dilation in the single-talker masker conditions. We conclude that better cognitive abilities not only relate to better speech perception, but also partly explain higher processing load in complex listening conditions.
Accuracy of cochlear implant recipients in speech reception in the presence of background music.
Gfeller, Kate; Turner, Christopher; Oleson, Jacob; Kliethermes, Stephanie; Driscoll, Virginia
2012-12-01
This study examined speech recognition abilities of cochlear implant (CI) recipients in the spectrally complex listening condition of 3 contrasting types of background music, and compared performance based upon listener groups: CI recipients using conventional long-electrode devices, Hybrid CI recipients (acoustic plus electric stimulation), and normal-hearing adults. We tested 154 long-electrode CI recipients using varied devices and strategies, 21 Hybrid CI recipients, and 49 normal-hearing adults on closed-set recognition of spondees presented in 3 contrasting forms of background music (piano solo, large symphony orchestra, vocal solo with small combo accompaniment) in an adaptive test. Signal-to-noise ratio thresholds for speech in music were examined in relation to measures of speech recognition in background noise and multitalker babble, pitch perception, and music experience. The signal-to-noise ratio thresholds for speech in music varied as a function of category of background music, group membership (long-electrode, Hybrid, normal-hearing), and age. The thresholds for speech in background music were significantly correlated with measures of pitch perception and thresholds for speech in background noise; auditory status was an important predictor. Evidence suggests that speech reception thresholds in background music change as a function of listener age (with more advanced age being detrimental), structural characteristics of different types of music, and hearing status (residual hearing). These findings have implications for everyday listening conditions such as communicating in social or commercial situations in which there is background music.
Multisensory speech perception in autism spectrum disorder: From phoneme to whole-word perception.
Stevenson, Ryan A; Baum, Sarah H; Segers, Magali; Ferber, Susanne; Barense, Morgan D; Wallace, Mark T
2017-07-01
Speech perception in noisy environments is boosted when a listener can see the speaker's mouth and integrate the auditory and visual speech information. Autistic children have a diminished capacity to integrate sensory information across modalities, which contributes to core symptoms of autism, such as impairments in social communication. We investigated the abilities of autistic and typically-developing (TD) children to integrate auditory and visual speech stimuli in various signal-to-noise ratios (SNR). Measurements of both whole-word and phoneme recognition were recorded. At the level of whole-word recognition, autistic children exhibited reduced performance in both the auditory and audiovisual modalities. Importantly, autistic children showed reduced behavioral benefit from multisensory integration with whole-word recognition, specifically at low SNRs. At the level of phoneme recognition, autistic children exhibited reduced performance relative to their TD peers in auditory, visual, and audiovisual modalities. However, and in contrast to their performance at the level of whole-word recognition, both autistic and TD children showed benefits from multisensory integration for phoneme recognition. In accordance with the principle of inverse effectiveness, both groups exhibited greater benefit at low SNRs relative to high SNRs. Thus, while autistic children showed typical multisensory benefits during phoneme recognition, these benefits did not translate to typical multisensory benefit of whole-word recognition in noisy environments. We hypothesize that sensory impairments in autistic children raise the SNR threshold needed to extract meaningful information from a given sensory input, resulting in subsequent failure to exhibit behavioral benefits from additional sensory information at the level of whole-word recognition. Autism Res 2017. © 2017 International Society for Autism Research, Wiley Periodicals, Inc. Autism Res 2017, 10: 1280-1290. © 2017 International Society for Autism Research, Wiley Periodicals, Inc. © 2017 International Society for Autism Research, Wiley Periodicals, Inc.
Normative Data on Audiovisual Speech Integration Using Sentence Recognition and Capacity Measures
Altieri, Nicholas; Hudock, Daniel
2016-01-01
Objective The ability to use visual speech cues and integrate them with auditory information is important, especially in noisy environments and for hearing-impaired (HI) listeners. Providing data on measures of integration skills that encompass accuracy and processing speed will benefit researchers and clinicians. Design The study consisted of two experiments: First, accuracy scores were obtained using CUNY sentences, and capacity measures that assessed reaction-time distributions were obtained from a monosyllabic word recognition task. Study Sample We report data on two measures of integration obtained from a sample comprised of 86 young and middle-age adult listeners: Results To summarize our results, capacity showed a positive correlation with accuracy measures of audiovisual benefit obtained from sentence recognition. More relevant, factor analysis indicated that a single-factor model captured audiovisual speech integration better than models containing more factors. Capacity exhibited strong loadings on the factor, while the accuracy-based measures from sentence recognition exhibited weaker loadings. Conclusions Results suggest that a listener’s integration skills may be assessed optimally using a measure that incorporates both processing speed and accuracy. PMID:26853446
Normative data on audiovisual speech integration using sentence recognition and capacity measures.
Altieri, Nicholas; Hudock, Daniel
2016-01-01
The ability to use visual speech cues and integrate them with auditory information is important, especially in noisy environments and for hearing-impaired (HI) listeners. Providing data on measures of integration skills that encompass accuracy and processing speed will benefit researchers and clinicians. The study consisted of two experiments: First, accuracy scores were obtained using City University of New York (CUNY) sentences, and capacity measures that assessed reaction-time distributions were obtained from a monosyllabic word recognition task. We report data on two measures of integration obtained from a sample comprised of 86 young and middle-age adult listeners: To summarize our results, capacity showed a positive correlation with accuracy measures of audiovisual benefit obtained from sentence recognition. More relevant, factor analysis indicated that a single-factor model captured audiovisual speech integration better than models containing more factors. Capacity exhibited strong loadings on the factor, while the accuracy-based measures from sentence recognition exhibited weaker loadings. Results suggest that a listener's integration skills may be assessed optimally using a measure that incorporates both processing speed and accuracy.
Li, Tianhao; Fu, Qian-Jie
2011-08-01
(1) To investigate whether voice gender discrimination (VGD) could be a useful indicator of the spectral and temporal processing abilities of individual cochlear implant (CI) users; (2) To examine the relationship between VGD and speech recognition with CI when comparable acoustic cues are used for both perception processes. VGD was measured using two talker sets with different inter-gender fundamental frequencies (F(0)), as well as different acoustic CI simulations. Vowel and consonant recognition in quiet and noise were also measured and compared with VGD performance. Eleven postlingually deaf CI users. The results showed that (1) mean VGD performance differed for different stimulus sets, (2) VGD and speech recognition performance varied among individual CI users, and (3) individual VGD performance was significantly correlated with speech recognition performance under certain conditions. VGD measured with selected stimulus sets might be useful for assessing not only pitch-related perception, but also spectral and temporal processing by individual CI users. In addition to improvements in spectral resolution and modulation detection, the improvement in higher modulation frequency discrimination might be particularly important for CI users in noisy environments.
An Introduction to Neural Networks for Hearing Aid Noise Recognition.
ERIC Educational Resources Information Center
Kim, Jun W.; Tyler, Richard S.
1995-01-01
This article introduces the use of multilayered artificial neural networks in hearing aid noise recognition. It reviews basic principles of neural networks, and offers an example of an application in which a neural network is used to identify the presence or absence of noise in speech. The ability of neural networks to "learn" the…
Automatic speech recognition technology development at ITT Defense Communications Division
NASA Technical Reports Server (NTRS)
White, George M.
1977-01-01
An assessment of the applications of automatic speech recognition to defense communication systems is presented. Future research efforts include investigations into the following areas: (1) dynamic programming; (2) recognition of speech degraded by noise; (3) speaker independent recognition; (4) large vocabulary recognition; (5) word spotting and continuous speech recognition; and (6) isolated word recognition.
Li, Huahui; Kong, Lingzhi; Wu, Xihong; Li, Liang
2013-01-01
In reverberant rooms with multiple-people talking, spatial separation between speech sources improves recognition of attended speech, even though both the head-shadowing and interaural-interaction unmasking cues are limited by numerous reflections. It is the perceptual integration between the direct wave and its reflections that bridges the direct-reflection temporal gaps and results in the spatial unmasking under reverberant conditions. This study further investigated (1) the temporal dynamic of the direct-reflection-integration-based spatial unmasking as a function of the reflection delay, and (2) whether this temporal dynamic is correlated with the listeners’ auditory ability to temporally retain raw acoustic signals (i.e., the fast decaying primitive auditory memory, PAM). The results showed that recognition of the target speech against the speech-masker background is a descending exponential function of the delay of the simulated target reflection. In addition, the temporal extent of PAM is frequency dependent and markedly longer than that for perceptual fusion. More importantly, the temporal dynamic of the speech-recognition function is significantly correlated with the temporal extent of the PAM of low-frequency raw signals. Thus, we propose that a chain process, which links the earlier-stage PAM with the later-stage correlation computation, perceptual integration, and attention facilitation, plays a role in spatially unmasking target speech under reverberant conditions. PMID:23658664
Accuracy of Cochlear Implant Recipients on Speech Reception in Background Music
Gfeller, Kate; Turner, Christopher; Oleson, Jacob; Kliethermes, Stephanie; Driscoll, Virginia
2012-01-01
Objectives This study (a) examined speech recognition abilities of cochlear implant (CI) recipients in the spectrally complex listening condition of three contrasting types of background music, and (b) compared performance based upon listener groups: CI recipients using conventional long-electrode (LE) devices, Hybrid CI recipients (acoustic plus electric stimulation), and normal-hearing (NH) adults. Methods We tested 154 LE CI recipients using varied devices and strategies, 21 Hybrid CI recipients, and 49 NH adults on closed-set recognition of spondees presented in three contrasting forms of background music (piano solo, large symphony orchestra, vocal solo with small combo accompaniment) in an adaptive test. Outcomes Signal-to-noise thresholds for speech in music (SRTM) were examined in relation to measures of speech recognition in background noise and multi-talker babble, pitch perception, and music experience. Results SRTM thresholds varied as a function of category of background music, group membership (LE, Hybrid, NH), and age. Thresholds for speech in background music were significantly correlated with measures of pitch perception and speech in background noise thresholds; auditory status was an important predictor. Conclusions Evidence suggests that speech reception thresholds in background music change as a function of listener age (with more advanced age being detrimental), structural characteristics of different types of music, and hearing status (residual hearing). These findings have implications for everyday listening conditions such as communicating in social or commercial situations in which there is background music. PMID:23342550
Fifty years of progress in speech and speaker recognition
NASA Astrophysics Data System (ADS)
Furui, Sadaoki
2004-10-01
Speech and speaker recognition technology has made very significant progress in the past 50 years. The progress can be summarized by the following changes: (1) from template matching to corpus-base statistical modeling, e.g., HMM and n-grams, (2) from filter bank/spectral resonance to Cepstral features (Cepstrum + DCepstrum + DDCepstrum), (3) from heuristic time-normalization to DTW/DP matching, (4) from gdistanceh-based to likelihood-based methods, (5) from maximum likelihood to discriminative approach, e.g., MCE/GPD and MMI, (6) from isolated word to continuous speech recognition, (7) from small vocabulary to large vocabulary recognition, (8) from context-independent units to context-dependent units for recognition, (9) from clean speech to noisy/telephone speech recognition, (10) from single speaker to speaker-independent/adaptive recognition, (11) from monologue to dialogue/conversation recognition, (12) from read speech to spontaneous speech recognition, (13) from recognition to understanding, (14) from single-modality (audio signal only) to multi-modal (audio/visual) speech recognition, (15) from hardware recognizer to software recognizer, and (16) from no commercial application to many practical commercial applications. Most of these advances have taken place in both the fields of speech recognition and speaker recognition. The majority of technological changes have been directed toward the purpose of increasing robustness of recognition, including many other additional important techniques not noted above.
A Spondee Recognition Test for Young Hearing-Impaired Children
ERIC Educational Resources Information Center
Cramer, Kathryn D.; Erber, Norman P.
1974-01-01
An auditory test of 10 spondaic words recorded on Language Master cards was presented monaurally, through insert receivers to 58 hearing-impaired young children to evaluate their ability to recognize familiar speech material. (MYS)
Free Field Word recognition test in the presence of noise in normal hearing adults.
Almeida, Gleide Viviani Maciel; Ribas, Angela; Calleros, Jorge
In ideal listening situations, subjects with normal hearing can easily understand speech, as can many subjects who have a hearing loss. To present the validation of the Word Recognition Test in a Free Field in the Presence of Noise in normal-hearing adults. Sample consisted of 100 healthy adults over 18 years of age with normal hearing. After pure tone audiometry, a speech recognition test was applied in free field condition with monosyllables and disyllables, with standardized material in three listening situations: optimal listening condition (no noise), with a signal to noise ratio of 0dB and a signal to noise ratio of -10dB. For these tests, an environment in calibrated free field was arranged where speech was presented to the subject being tested from two speakers located at 45°, and noise from a third speaker, located at 180°. All participants had speech audiometry results in the free field between 88% and 100% in the three listening situations. Word Recognition Test in Free Field in the Presence of Noise proved to be easy to be organized and applied. The results of the test validation suggest that individuals with normal hearing should get between 88% and 100% of the stimuli correct. The test can be an important tool in measuring noise interference on the speech perception abilities. Copyright © 2016 Associação Brasileira de Otorrinolaringologia e Cirurgia Cérvico-Facial. Published by Elsevier Editora Ltda. All rights reserved.
Speech and gesture interfaces for squad-level human-robot teaming
NASA Astrophysics Data System (ADS)
Harris, Jonathan; Barber, Daniel
2014-06-01
As the military increasingly adopts semi-autonomous unmanned systems for military operations, utilizing redundant and intuitive interfaces for communication between Soldiers and robots is vital to mission success. Currently, Soldiers use a common lexicon to verbally and visually communicate maneuvers between teammates. In order for robots to be seamlessly integrated within mixed-initiative teams, they must be able to understand this lexicon. Recent innovations in gaming platforms have led to advancements in speech and gesture recognition technologies, but the reliability of these technologies for enabling communication in human robot teaming is unclear. The purpose for the present study is to investigate the performance of Commercial-Off-The-Shelf (COTS) speech and gesture recognition tools in classifying a Squad Level Vocabulary (SLV) for a spatial navigation reconnaissance and surveillance task. The SLV for this study was based on findings from a survey conducted with Soldiers at Fort Benning, GA. The items of the survey focused on the communication between the Soldier and the robot, specifically in regards to verbally instructing them to execute reconnaissance and surveillance tasks. Resulting commands, identified from the survey, were then converted to equivalent arm and hand gestures, leveraging existing visual signals (e.g. U.S. Army Field Manual for Visual Signaling). A study was then run to test the ability of commercially available automated speech recognition technologies and a gesture recognition glove to classify these commands in a simulated intelligence, surveillance, and reconnaissance task. This paper presents classification accuracy of these devices for both speech and gesture modalities independently.
Bimodal Benefits on Objective and Subjective Outcomes for Adult Cochlear Implant Users
Heo, Ji-Hye; Lee, Won-Sang
2013-01-01
Background and Objectives Given that only a few studies have focused on the bimodal benefits on objective and subjective outcomes and emphasized the importance of individual data, the present study aimed to measure the bimodal benefits on the objective and subjective outcomes for adults with cochlear implant. Subjects and Methods Fourteen listeners with bimodal devices were tested on the localization and recognition abilities using environmental sounds, 1-talker, and 2-talker speech materials. The localization ability was measured through an 8-loudspeaker array. For the recognition measures, listeners were asked to repeat the sentences or say the environmental sounds the listeners heard. As a subjective questionnaire, three domains of Korean-version of Speech, Spatial, Qualities of Hearing scale (K-SSQ) were used to explore any relationships between objective and subjective outcomes. Results Based on the group-mean data, the bimodal hearing enhanced both localization and recognition regardless of test material. However, the inter- and intra-subject variability appeared to be large across test materials for both localization and recognition abilities. Correlation analyses revealed that the relationships were not always consistent between the objective outcomes and the subjective self-reports with bimodal devices. Conclusions Overall, this study supports significant bimodal advantages on localization and recognition measures, yet the large individual variability in bimodal benefits should be considered carefully for the clinical assessment as well as counseling. The discrepant relations between objective and subjective results suggest that the bimodal benefits in traditional localization or recognition measures might not necessarily correspond to the self-reported subjective advantages in everyday listening environments. PMID:24653909
Influences of High and Low Variability on Infant Word Recognition
ERIC Educational Resources Information Center
Singh, Leher
2008-01-01
Although infants begin to encode and track novel words in fluent speech by 7.5 months, their ability to recognize words is somewhat limited at this stage. In particular, when the surface form of a word is altered, by changing the gender or affective prosody of the speaker, infants begin to falter at spoken word recognition. Given that natural…
Effect of signal to noise ratio on the speech perception ability of older adults
Shojaei, Elahe; Ashayeri, Hassan; Jafari, Zahra; Zarrin Dast, Mohammad Reza; Kamali, Koorosh
2016-01-01
Background: Speech perception ability depends on auditory and extra-auditory elements. The signal- to-noise ratio (SNR) is an extra-auditory element that has an effect on the ability to normally follow speech and maintain a conversation. Speech in noise perception difficulty is a common complaint of the elderly. In this study, the importance of SNR magnitude as an extra-auditory effect on speech perception in noise was examined in the elderly. Methods: The speech perception in noise test (SPIN) was conducted on 25 elderly participants who had bilateral low–mid frequency normal hearing thresholds at three SNRs in the presence of ipsilateral white noise. These participants were selected by available sampling method. Cognitive screening was done using the Persian Mini Mental State Examination (MMSE) test. Results: Independent T- test, ANNOVA and Pearson Correlation Index were used for statistical analysis. There was a significant difference in word discrimination scores at silence and at three SNRs in both ears (p≤0.047). Moreover, there was a significant difference in word discrimination scores for paired SNRs (0 and +5, 0 and +10, and +5 and +10 (p≤0.04)). No significant correlation was found between age and word recognition scores at silence and at three SNRs in both ears (p≥0.386). Conclusion: Our results revealed that decreasing the signal level and increasing the competing noise considerably reduced the speech perception ability in normal hearing at low–mid thresholds in the elderly. These results support the critical role of SNRs for speech perception ability in the elderly. Furthermore, our results revealed that normal hearing elderly participants required compensatory strategies to maintain normal speech perception in challenging acoustic situations. PMID:27390712
Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise
Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther
2016-01-01
Vocabulary size has been suggested as a useful measure of “verbal abilities” that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18–35 years) and 22 older (60–78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults’ poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access; with an average vocabulary size similar to that of younger adults, they were still slower in lexical access. PMID:27458400
Should visual speech cues (speechreading) be considered when fitting hearing aids?
NASA Astrophysics Data System (ADS)
Grant, Ken
2002-05-01
When talker and listener are face-to-face, visual speech cues become an important part of the communication environment, and yet, these cues are seldom considered when designing hearing aids. Models of auditory-visual speech recognition highlight the importance of complementary versus redundant speech information for predicting auditory-visual recognition performance. Thus, for hearing aids to work optimally when visual speech cues are present, it is important to know whether the cues provided by amplification and the cues provided by speechreading complement each other. In this talk, data will be reviewed that show nonmonotonicity between auditory-alone speech recognition and auditory-visual speech recognition, suggesting that efforts designed solely to improve auditory-alone recognition may not always result in improved auditory-visual recognition. Data will also be presented showing that one of the most important speech cues for enhancing auditory-visual speech recognition performance, voicing, is often the cue that benefits least from amplification.
Lexical influences on competing speech perception in younger, middle-aged, and older adults
Helfer, Karen S.; Jesse, Alexandra
2015-01-01
The influence of lexical characteristics of words in to-be-attended and to-be-ignored speech streams was examined in a competing speech task. Older, middle-aged, and younger adults heard pairs of low-cloze probability sentences in which the frequency or neighborhood density of words was manipulated in either the target speech stream or the masking speech stream. All participants also completed a battery of cognitive measures. As expected, for all groups, target words that occur frequently or that are from sparse lexical neighborhoods were easier to recognize than words that are infrequent or from dense neighborhoods. Compared to other groups, these neighborhood density effects were largest for older adults; the frequency effect was largest for middle-aged adults. Lexical characteristics of words in the to-be-ignored speech stream also affected recognition of to-be-attended words, but only when overall performance was relatively good (that is, when younger participants listened to the speech streams at a more advantageous signal-to-noise ratio). For these listeners, to-be-ignored masker words from sparse neighborhoods interfered with recognition of target speech more than masker words from dense neighborhoods. Amount of hearing loss and cognitive abilities relating to attentional control modulated overall performance as well as the strength of lexical influences. PMID:26233036
NASA Astrophysics Data System (ADS)
Saweikis, Meghan; Surprenant, Aimée M.; Davies, Patricia; Gallant, Don
2003-10-01
While young and old subjects with comparable audiograms tend to perform comparably on speech recognition tasks in quiet environments, the older subjects have more difficulty than the younger subjects with recognition tasks in degraded listening conditions. This suggests that factors other than an absolute threshold may account for some of the difficulty older listeners have on recognition tasks in noisy environments. Many metrics, including the Speech Intelligibility Index (SII), used to measure speech intelligibility, only consider an absolute threshold when accounting for age related hearing loss. Therefore these metrics tend to overestimate the performance for elderly listeners in noisy environments [Tobias et al., J. Acoust. Soc. Am. 83, 859-895 (1988)]. The present studies examine the predictive capabilities of the SII in an environment with automobile noise present. This is of interest because people's evaluation of the automobile interior sound is closely linked to their ability to carry on conversations with their fellow passengers. The four studies examine whether, for subjects with age related hearing loss, the accuracy of the SII can be improved by incorporating factors other than an absolute threshold into the model. [Work supported by Ford Motor Company.
How Spoken Language Comprehension is Achieved by Older Listeners in Difficult Listening Situations.
Schneider, Bruce A; Avivi-Reich, Meital; Daneman, Meredyth
2016-01-01
Comprehending spoken discourse in noisy situations is likely to be more challenging to older adults than to younger adults due to potential declines in the auditory, cognitive, or linguistic processes supporting speech comprehension. These challenges might force older listeners to reorganize the ways in which they perceive and process speech, thereby altering the balance between the contributions of bottom-up versus top-down processes to speech comprehension. The authors review studies that investigated the effect of age on listeners' ability to follow and comprehend lectures (monologues), and two-talker conversations (dialogues), and the extent to which individual differences in lexical knowledge and reading comprehension skill relate to individual differences in speech comprehension. Comprehension was evaluated after each lecture or conversation by asking listeners to answer multiple-choice questions regarding its content. Once individual differences in speech recognition for words presented in babble were compensated for, age differences in speech comprehension were minimized if not eliminated. However, younger listeners benefited more from spatial separation than did older listeners. Vocabulary knowledge predicted the comprehension scores of both younger and older listeners when listening was difficult, but not when it was easy. However, the contribution of reading comprehension to listening comprehension appeared to be independent of listening difficulty in younger adults but not in older adults. The evidence suggests (1) that most of the difficulties experienced by older adults are due to age-related auditory declines, and (2) that these declines, along with listening difficulty, modulate the degree to which selective linguistic and cognitive abilities are engaged to support listening comprehension in difficult listening situations. When older listeners experience speech recognition difficulties, their attentional resources are more likely to be deployed to facilitate lexical access, making it difficult for them to fully engage higher-order cognitive abilities in support of listening comprehension.
SAM: speech-aware applications in medicine to support structured data entry.
Wormek, A. K.; Ingenerf, J.; Orthner, H. F.
1997-01-01
In the last two years, improvement in speech recognition technology has directed the medical community's interest to porting and using such innovations in clinical systems. The acceptance of speech recognition systems in clinical domains increases with recognition speed, large medical vocabulary, high accuracy, continuous speech recognition, and speaker independence. Although some commercial speech engines approach these requirements, the greatest benefit can be achieved in adapting a speech recognizer to a specific medical application. The goals of our work are first, to develop a speech-aware core component which is able to establish connections to speech recognition engines of different vendors. This is realized in SAM. Second, with applications based on SAM we want to support the physician in his/her routine clinical care activities. Within the STAMP project (STAndardized Multimedia report generator in Pathology), we extend SAM by combining a structured data entry approach with speech recognition technology. Another speech-aware application in the field of Diabetes care is connected to a terminology server. The server delivers a controlled vocabulary which can be used for speech recognition. PMID:9357730
Li, Tianhao; Fu, Qian-Jie
2013-01-01
Objectives (1) To investigate whether voice gender discrimination (VGD) could be a useful indicator of the spectral and temporal processing abilities of individual cochlear implant (CI) users; (2) To examine the relationship between VGD and speech recognition with CI when comparable acoustic cues are used for both perception processes. Design VGD was measured using two talker sets with different inter-gender fundamental frequencies (F0), as well as different acoustic CI simulations. Vowel and consonant recognition in quiet and noise were also measured and compared with VGD performance. Study sample Eleven postlingually deaf CI users. Results The results showed that (1) mean VGD performance differed for different stimulus sets, (2) VGD and speech recognition performance varied among individual CI users, and (3) individual VGD performance was significantly correlated with speech recognition performance under certain conditions. Conclusions VGD measured with selected stimulus sets might be useful for assessing not only pitch-related perception, but also spectral and temporal processing by individual CI users. In addition to improvements in spectral resolution and modulation detection, the improvement in higher modulation frequency discrimination might be particularly important for CI users in noisy environments. PMID:21696330
Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor
NASA Astrophysics Data System (ADS)
Heracleous, Panikos; Kaino, Tomomi; Saruwatari, Hiroshi; Shikano, Kiyohiro
2006-12-01
We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20 k dictation task a[InlineEquation not available: see fulltext.] word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.
Speech recognition technology: an outlook for human-to-machine interaction.
Erdel, T; Crooks, S
2000-01-01
Speech recognition, as an enabling technology in healthcare-systems computing, is a topic that has been discussed for quite some time, but is just now coming to fruition. Traditionally, speech-recognition software has been constrained by hardware, but improved processors and increased memory capacities are starting to remove some of these limitations. With these barriers removed, companies that create software for the healthcare setting have the opportunity to write more successful applications. Among the criticisms of speech-recognition applications are the high rates of error and steep training curves. However, even in the face of such negative perceptions, there remains significant opportunities for speech recognition to allow healthcare providers and, more specifically, physicians, to work more efficiently and ultimately spend more time with their patients and less time completing necessary documentation. This article will identify opportunities for inclusion of speech-recognition technology in the healthcare setting and examine major categories of speech-recognition software--continuous speech recognition, command and control, and text-to-speech. We will discuss the advantages and disadvantages of each area, the limitations of the software today, and how future trends might affect them.
Method and apparatus for obtaining complete speech signals for speech recognition applications
NASA Technical Reports Server (NTRS)
Abrash, Victor (Inventor); Cesari, Federico (Inventor); Franco, Horacio (Inventor); George, Christopher (Inventor); Zheng, Jing (Inventor)
2009-01-01
The present invention relates to a method and apparatus for obtaining complete speech signals for speech recognition applications. In one embodiment, the method continuously records an audio stream comprising a sequence of frames to a circular buffer. When a user command to commence or terminate speech recognition is received, the method obtains a number of frames of the audio stream occurring before or after the user command in order to identify an augmented audio signal for speech recognition processing. In further embodiments, the method analyzes the augmented audio signal in order to locate starting and ending speech endpoints that bound at least a portion of speech to be processed for recognition. At least one of the speech endpoints is located using a Hidden Markov Model.
Gradient language dominance affects talker learning.
Bregman, Micah R; Creel, Sarah C
2014-01-01
Traditional conceptions of spoken language assume that speech recognition and talker identification are computed separately. Neuropsychological and neuroimaging studies imply some separation between the two faculties, but recent perceptual studies suggest better talker recognition in familiar languages than unfamiliar languages. A familiar-language benefit in talker recognition potentially implies strong ties between the two domains. However, little is known about the nature of this language familiarity effect. The current study investigated the relationship between speech and talker processing by assessing bilingual and monolingual listeners' ability to learn voices as a function of language familiarity and age of acquisition. Two effects emerged. First, bilinguals learned to recognize talkers in their first language (Korean) more rapidly than they learned to recognize talkers in their second language (English), while English-speaking participants showed the opposite pattern (learning English talkers faster than Korean talkers). Second, bilinguals' learning rate for talkers in their second language (English) correlated with age of English acquisition. Taken together, these results suggest that language background materially affects talker encoding, implying a tight relationship between speech and talker representations. Copyright © 2013 Elsevier B.V. All rights reserved.
Audibility-based predictions of speech recognition for children and adults with normal hearing.
McCreery, Ryan W; Stelmachowicz, Patricia G
2011-12-01
This study investigated the relationship between audibility and predictions of speech recognition for children and adults with normal hearing. The Speech Intelligibility Index (SII) is used to quantify the audibility of speech signals and can be applied to transfer functions to predict speech recognition scores. Although the SII is used clinically with children, relatively few studies have evaluated SII predictions of children's speech recognition directly. Children have required more audibility than adults to reach maximum levels of speech understanding in previous studies. Furthermore, children may require greater bandwidth than adults for optimal speech understanding, which could influence frequency-importance functions used to calculate the SII. Speech recognition was measured for 116 children and 19 adults with normal hearing. Stimulus bandwidth and background noise level were varied systematically in order to evaluate speech recognition as predicted by the SII and derive frequency-importance functions for children and adults. Results suggested that children required greater audibility to reach the same level of speech understanding as adults. However, differences in performance between adults and children did not vary across frequency bands. © 2011 Acoustical Society of America
Corthals, Paul
2008-01-01
The aim of the present study is to construct a simple method for visualizing and quantifying the audibility of speech on the audiogram and to predict speech intelligibility. The proposed method involves a series of indices on the audiogram form reflecting the sound pressure level distribution of running speech. The indices that coincide with a patient's pure tone thresholds reflect speech audibility and give evidence of residual functional hearing capacity. Two validation studies were conducted among sensorineurally hearing-impaired participants (n = 56 and n = 37, respectively) to investigate the relation with speech recognition ability and hearing disability. The potential of the new audibility indices as predictors for speech reception thresholds is comparable to the predictive potential of the ANSI 1968 articulation index and the ANSI 1997 speech intelligibility index. The sum of indices or a weighted combination can explain considerable proportions of variance in speech reception results for sentences in quiet free field conditions. The proportions of variance that can be explained in questionnaire results on hearing disability are less, presumably because the threshold indices almost exclusively reflect message audibility and much less the psychosocial consequences of hearing deficits. The outcomes underpin the validity of the new audibility indexing system, even though the proposed method may be better suited for predicting relative performance across a set of conditions than for predicting absolute speech recognition performance. (c) 2007 S. Karger AG, Basel
The Suitability of Cloud-Based Speech Recognition Engines for Language Learning
ERIC Educational Resources Information Center
Daniels, Paul; Iwago, Koji
2017-01-01
As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with call software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines--Apple's…
Visual-auditory integration during speech imitation in autism.
Williams, Justin H G; Massaro, Dominic W; Peel, Natalie J; Bosseler, Alexis; Suddendorf, Thomas
2004-01-01
Children with autistic spectrum disorder (ASD) may have poor audio-visual integration, possibly reflecting dysfunctional 'mirror neuron' systems which have been hypothesised to be at the core of the condition. In the present study, a computer program, utilizing speech synthesizer software and a 'virtual' head (Baldi), delivered speech stimuli for identification in auditory, visual or bimodal conditions. Children with ASD were poorer than controls at recognizing stimuli in the unimodal conditions, but once performance on this measure was controlled for, no group difference was found in the bimodal condition. A group of participants with ASD were also trained to develop their speech-reading ability. Training improved visual accuracy and this also improved the children's ability to utilize visual information in their processing of speech. Overall results were compared to predictions from mathematical models based on integration and non-integration, and were most consistent with the integration model. We conclude that, whilst they are less accurate in recognizing stimuli in the unimodal condition, children with ASD show normal integration of visual and auditory speech stimuli. Given that training in recognition of visual speech was effective, children with ASD may benefit from multi-modal approaches in imitative therapy and language training.
Shin, Young Hoon; Seo, Jiwon
2016-01-01
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing. PMID:27801867
Shin, Young Hoon; Seo, Jiwon
2016-10-29
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.
Effects and modeling of phonetic and acoustic confusions in accented speech.
Fung, Pascale; Liu, Yi
2005-11-01
Accented speech recognition is more challenging than standard speech recognition due to the effects of phonetic and acoustic confusions. Phonetic confusion in accented speech occurs when an expected phone is pronounced as a different one, which leads to erroneous recognition. Acoustic confusion occurs when the pronounced phone is found to lie acoustically between two baseform models and can be equally recognized as either one. We propose that it is necessary to analyze and model these confusions separately in order to improve accented speech recognition without degrading standard speech recognition. Since low phonetic confusion units in accented speech do not give rise to automatic speech recognition errors, we focus on analyzing and reducing phonetic and acoustic confusability under high phonetic confusion conditions. We propose using likelihood ratio test to measure phonetic confusion, and asymmetric acoustic distance to measure acoustic confusion. Only accent-specific phonetic units with low acoustic confusion are used in an augmented pronunciation dictionary, while phonetic units with high acoustic confusion are reconstructed using decision tree merging. Experimental results show that our approach is effective and superior to methods modeling phonetic confusion or acoustic confusion alone in accented speech, with a significant 5.7% absolute WER reduction, without degrading standard speech recognition.
Investigation on the music perception skills of Italian children with cochlear implants.
Scorpecci, Alessandro; Zagari, Felicia; Mari, Giorgia; Giannantonio, Sara; D'Alatri, Lucia; Di Nardo, Walter; Paludetti, Gaetano
2012-10-01
To compare the music perception skills of a group of Italian-speaking children with cochlear implants to those of a group of normal hearing children; to analyze possible correlations between implanted children's musical skills and their demographics, clinical characteristics, phonological perception, and speech recognition and production abilities. 18 implanted children aged 5-12 years and a reference group of 23 normal-hearing subjects with typical language development were enrolled. Both groups received a melody identification test and a song (i.e. original version) identification test. The implanted children also received a test battery aimed at assessing speech recognition, speech production and phoneme discrimination. The implanted children scored significantly worse than the normal hearing subjects in both musical tests. In the cochlear implant group, phoneme discrimination abilities were significantly correlated with both melody and song identification skills, and length of device use was significantly correlated with song identification skills. Experience with device use and phonological perception had a moderate-to-strong correlation to implanted children's music perception abilities. In the light of these findings, it is reasonable to assume that a rehabilitation program specifically aimed at improving phonological perception could help pediatric cochlear implant recipients better understand the basic elements of music; moreover, a training aimed at improving the comprehension of the spectral elements of music could enhance implanted children's phonological skills. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
Zeremdini, Jihen; Ben Messaoud, Mohamed Anouar; Bouzid, Aicha
2015-09-01
Humans have the ability to easily separate a composed speech and to form perceptual representations of the constituent sources in an acoustic mixture thanks to their ears. Until recently, researchers attempt to build computer models of high-level functions of the auditory system. The problem of the composed speech segregation is still a very challenging problem for these researchers. In our case, we are interested in approaches that are addressed to the monaural speech segregation. For this purpose, we study in this paper the computational auditory scene analysis (CASA) to segregate speech from monaural mixtures. CASA is the reproduction of the source organization achieved by listeners. It is based on two main stages: segmentation and grouping. In this work, we have presented, and compared several studies that have used CASA for speech separation and recognition.
Kirk, Karen Iler; Prusick, Lindsay; French, Brian; Gotch, Chad; Eisenberg, Laurie S; Young, Nancy
2012-06-01
Under natural conditions, listeners use both auditory and visual speech cues to extract meaning from speech signals containing many sources of variability. However, traditional clinical tests of spoken word recognition routinely employ isolated words or sentences produced by a single talker in an auditory-only presentation format. The more central cognitive processes used during multimodal integration, perceptual normalization, and lexical discrimination that may contribute to individual variation in spoken word recognition performance are not assessed in conventional tests of this kind. In this article, we review our past and current research activities aimed at developing a series of new assessment tools designed to evaluate spoken word recognition in children who are deaf or hard of hearing. These measures are theoretically motivated by a current model of spoken word recognition and also incorporate "real-world" stimulus variability in the form of multiple talkers and presentation formats. The goal of this research is to enhance our ability to estimate real-world listening skills and to predict benefit from sensory aid use in children with varying degrees of hearing loss. American Academy of Audiology.
Speech and language development in cognitively delayed children with cochlear implants.
Holt, Rachael Frush; Kirk, Karen Iler
2005-04-01
The primary goals of this investigation were to examine the speech and language development of deaf children with cochlear implants and mild cognitive delay and to compare their gains with those of children with cochlear implants who do not have this additional impairment. We retrospectively examined the speech and language development of 69 children with pre-lingual deafness. The experimental group consisted of 19 children with cognitive delays and no other disabilities (mean age at implantation = 38 months). The control group consisted of 50 children who did not have cognitive delays or any other identified disability. The control group was stratified by primary communication mode: half used total communication (mean age at implantation = 32 months) and the other half used oral communication (mean age at implantation = 26 months). Children were tested on a variety of standard speech and language measures and one test of auditory skill development at 6-month intervals. The results from each test were collapsed from blocks of two consecutive 6-month intervals to calculate group mean scores before implantation and at 1-year intervals after implantation. The children with cognitive delays and those without such delays demonstrated significant improvement in their speech and language skills over time on every test administered. Children with cognitive delays had significantly lower scores than typically developing children on two of the three measures of receptive and expressive language and had significantly slower rates of auditory-only sentence recognition development. Finally, there were no significant group differences in auditory skill development based on parental reports or in auditory-only or multimodal word recognition. The results suggest that deaf children with mild cognitive impairments benefit from cochlear implantation. Specifically, improvements are evident in their ability to perceive speech and in their reception and use of language. However, it may be reduced relative to their typically developing peers with cochlear implants, particularly in domains that require higher level skills, such as sentence recognition and receptive and expressive language. These findings suggest that children with mild cognitive deficits be considered for cochlear implantation with less trepidation than has been the case in the past. Although their speech and language gains may be tempered by their cognitive abilities, these limitations do not appear to preclude benefit from cochlear implant stimulation, as assessed by traditional measures of speech and language development.
Temporal and speech processing skills in normal hearing individuals exposed to occupational noise.
Kumar, U Ajith; Ameenudin, Syed; Sangamanatha, A V
2012-01-01
Prolonged exposure to high levels of occupational noise can cause damage to hair cells in the cochlea and result in permanent noise-induced cochlear hearing loss. Consequences of cochlear hearing loss on speech perception and psychophysical abilities have been well documented. Primary goal of this research was to explore temporal processing and speech perception Skills in individuals who are exposed to occupational noise of more than 80 dBA and not yet incurred clinically significant threshold shifts. Contribution of temporal processing skills to speech perception in adverse listening situation was also evaluated. A total of 118 participants took part in this research. Participants comprised three groups of train drivers in the age range of 30-40 (n= 13), 41 50 ( = 13), 41-50 (n = 9), and 51-60 (n = 6) years and their non-noise-exposed counterparts (n = 30 in each age group). Participants of all the groups including the train drivers had hearing sensitivity within 25 dB HL in the octave frequencies between 250 and 8 kHz. Temporal processing was evaluated using gap detection, modulation detection, and duration pattern tests. Speech recognition was tested in presence multi-talker babble at -5dB SNR. Differences between experimental and control groups were analyzed using ANOVA and independent sample t-tests. Results showed a trend of reduced temporal processing skills in individuals with noise exposure. These deficits were observed despite normal peripheral hearing sensitivity. Speech recognition scores in the presence of noise were also significantly poor in noise-exposed group. Furthermore, poor temporal processing skills partially accounted for the speech recognition difficulties exhibited by the noise-exposed individuals. These results suggest that noise can cause significant distortions in the processing of suprathreshold temporal cues which may add to difficulties in hearing in adverse listening conditions.
ERIC Educational Resources Information Center
Young, Victoria; Mihailidis, Alex
2010-01-01
Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…
The Effect of Dynamic Pitch on Speech Recognition in Temporally Modulated Noise.
Shen, Jing; Souza, Pamela E
2017-09-18
This study investigated the effect of dynamic pitch in target speech on older and younger listeners' speech recognition in temporally modulated noise. First, we examined whether the benefit from dynamic-pitch cues depends on the temporal modulation of noise. Second, we tested whether older listeners can benefit from dynamic-pitch cues for speech recognition in noise. Last, we explored the individual factors that predict the amount of dynamic-pitch benefit for speech recognition in noise. Younger listeners with normal hearing and older listeners with varying levels of hearing sensitivity participated in the study, in which speech reception thresholds were measured with sentences in nonspeech noise. The younger listeners benefited more from dynamic pitch for speech recognition in temporally modulated noise than unmodulated noise. Older listeners were able to benefit from the dynamic-pitch cues but received less benefit from noise modulation than the younger listeners. For those older listeners with hearing loss, the amount of hearing loss strongly predicted the dynamic-pitch benefit for speech recognition in noise. Dynamic-pitch cues aid speech recognition in noise, particularly when noise has temporal modulation. Hearing loss negatively affects the dynamic-pitch benefit to older listeners with significant hearing loss.
The Effect of Dynamic Pitch on Speech Recognition in Temporally Modulated Noise
Souza, Pamela E.
2017-01-01
Purpose This study investigated the effect of dynamic pitch in target speech on older and younger listeners' speech recognition in temporally modulated noise. First, we examined whether the benefit from dynamic-pitch cues depends on the temporal modulation of noise. Second, we tested whether older listeners can benefit from dynamic-pitch cues for speech recognition in noise. Last, we explored the individual factors that predict the amount of dynamic-pitch benefit for speech recognition in noise. Method Younger listeners with normal hearing and older listeners with varying levels of hearing sensitivity participated in the study, in which speech reception thresholds were measured with sentences in nonspeech noise. Results The younger listeners benefited more from dynamic pitch for speech recognition in temporally modulated noise than unmodulated noise. Older listeners were able to benefit from the dynamic-pitch cues but received less benefit from noise modulation than the younger listeners. For those older listeners with hearing loss, the amount of hearing loss strongly predicted the dynamic-pitch benefit for speech recognition in noise. Conclusions Dynamic-pitch cues aid speech recognition in noise, particularly when noise has temporal modulation. Hearing loss negatively affects the dynamic-pitch benefit to older listeners with significant hearing loss. PMID:28800370
Niijima, H; Ito, N; Ogino, S; Takatori, T; Iwase, H; Kobayashi, M
2000-11-01
For the purpose of practical use of speech recognition technology for recording of forensic autopsy, a language model of the speech recording system, specialized for the forensic autopsy, was developed. The language model for the forensic autopsy by applying 3-gram model was created, and an acoustic model for Japanese speech recognition by Hidden Markov Model in addition to the above were utilized to customize the speech recognition engine for forensic autopsy. A forensic vocabulary set of over 10,000 words was compiled and some 300,000 sentence patterns were made to create the forensic language model, then properly mixing with a general language model to attain high exactitude. When tried by dictating autopsy findings, this speech recognition system was proved to be about 95% of recognition rate that seems to have reached to the practical usability in view of speech recognition software, though there remains rooms for improving its hardware and application-layer software.
Severity-Based Adaptation with Limited Data for ASR to Aid Dysarthric Speakers
Mustafa, Mumtaz Begum; Salim, Siti Salwah; Mohamed, Noraini; Al-Qatab, Bassam; Siong, Chng Eng
2014-01-01
Automatic speech recognition (ASR) is currently used in many assistive technologies, such as helping individuals with speech impairment in their communication ability. One challenge in ASR for speech-impaired individuals is the difficulty in obtaining a good speech database of impaired speakers for building an effective speech acoustic model. Because there are very few existing databases of impaired speech, which are also limited in size, the obvious solution to build a speech acoustic model of impaired speech is by employing adaptation techniques. However, issues that have not been addressed in existing studies in the area of adaptation for speech impairment are as follows: (1) identifying the most effective adaptation technique for impaired speech; and (2) the use of suitable source models to build an effective impaired-speech acoustic model. This research investigates the above-mentioned two issues on dysarthria, a type of speech impairment affecting millions of people. We applied both unimpaired and impaired speech as the source model with well-known adaptation techniques like the maximum likelihood linear regression (MLLR) and the constrained-MLLR(C-MLLR). The recognition accuracy of each impaired speech acoustic model is measured in terms of word error rate (WER), with further assessments, including phoneme insertion, substitution and deletion rates. Unimpaired speech when combined with limited high-quality speech-impaired data improves performance of ASR systems in recognising severely impaired dysarthric speech. The C-MLLR adaptation technique was also found to be better than MLLR in recognising mildly and moderately impaired speech based on the statistical analysis of the WER. It was found that phoneme substitution was the biggest contributing factor in WER in dysarthric speech for all levels of severity. The results show that the speech acoustic models derived from suitable adaptation techniques improve the performance of ASR systems in recognising impaired speech with limited adaptation data. PMID:24466004
The Impact of Early Bilingualism on Face Recognition Processes.
Kandel, Sonia; Burfin, Sabine; Méary, David; Ruiz-Tada, Elisa; Costa, Albert; Pascalis, Olivier
2016-01-01
Early linguistic experience has an impact on the way we decode audiovisual speech in face-to-face communication. The present study examined whether differences in visual speech decoding could be linked to a broader difference in face processing. To identify a phoneme we have to do an analysis of the speaker's face to focus on the relevant cues for speech decoding (e.g., locating the mouth with respect to the eyes). Face recognition processes were investigated through two classic effects in face recognition studies: the Other-Race Effect (ORE) and the Inversion Effect. Bilingual and monolingual participants did a face recognition task with Caucasian faces (own race), Chinese faces (other race), and cars that were presented in an Upright or Inverted position. The results revealed that monolinguals exhibited the classic ORE. Bilinguals did not. Overall, bilinguals were slower than monolinguals. These results suggest that bilinguals' face processing abilities differ from monolinguals'. Early exposure to more than one language may lead to a perceptual organization that goes beyond language processing and could extend to face analysis. We hypothesize that these differences could be due to the fact that bilinguals focus on different parts of the face than monolinguals, making them more efficient in other race face processing but slower. However, more studies using eye-tracking techniques are necessary to confirm this explanation.
Spoken Word Recognition in Toddlers Who Use Cochlear Implants
Grieco-Calub, Tina M.; Saffran, Jenny R.; Litovsky, Ruth Y.
2010-01-01
Purpose The purpose of this study was to assess the time course of spoken word recognition in 2-year-old children who use cochlear implants (CIs) in quiet and in the presence of speech competitors. Method Children who use CIs and age-matched peers with normal acoustic hearing listened to familiar auditory labels, in quiet or in the presence of speech competitors, while their eye movements to target objects were digitally recorded. Word recognition performance was quantified by measuring each child’s reaction time (i.e., the latency between the spoken auditory label and the first look at the target object) and accuracy (i.e., the amount of time that children looked at target objects within 367 ms to 2,000 ms after the label onset). Results Children with CIs were less accurate and took longer to fixate target objects than did age-matched children without hearing loss. Both groups of children showed reduced performance in the presence of the speech competitors, although many children continued to recognize labels at above-chance levels. Conclusion The results suggest that the unique auditory experience of young CI users slows the time course of spoken word recognition abilities. In addition, real-world listening environments may slow language processing in young language learners, regardless of their hearing status. PMID:19951921
Zekveld, Adriana A; Rudner, Mary; Kramer, Sophia E; Lyzenga, Johannes; Rönnberg, Jerker
2014-01-01
We investigated changes in speech recognition and cognitive processing load due to the masking release attributable to decreasing similarity between target and masker speech. This was achieved by using masker voices with either the same (female) gender as the target speech or different gender (male) and/or by spatially separating the target and masker speech using HRTFs. We assessed the relation between the signal-to-noise ratio required for 50% sentence intelligibility, the pupil response and cognitive abilities. We hypothesized that the pupil response, a measure of cognitive processing load, would be larger for co-located maskers and for same-gender compared to different-gender maskers. We further expected that better cognitive abilities would be associated with better speech perception and larger pupil responses as the allocation of larger capacity may result in more intense mental processing. In line with previous studies, the performance benefit from different-gender compared to same-gender maskers was larger for co-located masker signals. The performance benefit of spatially-separated maskers was larger for same-gender maskers. The pupil response was larger for same-gender than for different-gender maskers, but was not reduced by spatial separation. We observed associations between better perception performance and better working memory, better information updating, and better executive abilities when applying no corrections for multiple comparisons. The pupil response was not associated with cognitive abilities. Thus, although both gender and location differences between target and masker facilitate speech perception, only gender differences lower cognitive processing load. Presenting a more dissimilar masker may facilitate target-masker separation at a later (cognitive) processing stage than increasing the spatial separation between the target and masker. The pupil response provides information about speech perception that complements intelligibility data.
Zekveld, Adriana A.; Rudner, Mary; Kramer, Sophia E.; Lyzenga, Johannes; Rönnberg, Jerker
2014-01-01
We investigated changes in speech recognition and cognitive processing load due to the masking release attributable to decreasing similarity between target and masker speech. This was achieved by using masker voices with either the same (female) gender as the target speech or different gender (male) and/or by spatially separating the target and masker speech using HRTFs. We assessed the relation between the signal-to-noise ratio required for 50% sentence intelligibility, the pupil response and cognitive abilities. We hypothesized that the pupil response, a measure of cognitive processing load, would be larger for co-located maskers and for same-gender compared to different-gender maskers. We further expected that better cognitive abilities would be associated with better speech perception and larger pupil responses as the allocation of larger capacity may result in more intense mental processing. In line with previous studies, the performance benefit from different-gender compared to same-gender maskers was larger for co-located masker signals. The performance benefit of spatially-separated maskers was larger for same-gender maskers. The pupil response was larger for same-gender than for different-gender maskers, but was not reduced by spatial separation. We observed associations between better perception performance and better working memory, better information updating, and better executive abilities when applying no corrections for multiple comparisons. The pupil response was not associated with cognitive abilities. Thus, although both gender and location differences between target and masker facilitate speech perception, only gender differences lower cognitive processing load. Presenting a more dissimilar masker may facilitate target-masker separation at a later (cognitive) processing stage than increasing the spatial separation between the target and masker. The pupil response provides information about speech perception that complements intelligibility data. PMID:24808818
Task-dependent modulation of the visual sensory thalamus assists visual-speech recognition.
Díaz, Begoña; Blank, Helen; von Kriegstein, Katharina
2018-05-14
The cerebral cortex modulates early sensory processing via feed-back connections to sensory pathway nuclei. The functions of this top-down modulation for human behavior are poorly understood. Here, we show that top-down modulation of the visual sensory thalamus (the lateral geniculate body, LGN) is involved in visual-speech recognition. In two independent functional magnetic resonance imaging (fMRI) studies, LGN response increased when participants processed fast-varying features of articulatory movements required for visual-speech recognition, as compared to temporally more stable features required for face identification with the same stimulus material. The LGN response during the visual-speech task correlated positively with the visual-speech recognition scores across participants. In addition, the task-dependent modulation was present for speech movements and did not occur for control conditions involving non-speech biological movements. In face-to-face communication, visual speech recognition is used to enhance or even enable understanding what is said. Speech recognition is commonly explained in frameworks focusing on cerebral cortex areas. Our findings suggest that task-dependent modulation at subcortical sensory stages has an important role for communication: Together with similar findings in the auditory modality the findings imply that task-dependent modulation of the sensory thalami is a general mechanism to optimize speech recognition. Copyright © 2018. Published by Elsevier Inc.
NASA Technical Reports Server (NTRS)
Olorenshaw, Lex; Trawick, David
1991-01-01
The purpose was to develop a speech recognition system to be able to detect speech which is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Better mechanisms are provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was provided. In continuous speech, the system was tested to be able to provide above 80 pct. correct acceptance of words, while correctly rejecting over 80 pct. of incorrectly pronounced words.
NASA Astrophysics Data System (ADS)
Scharenborg, Odette; ten Bosch, Louis; Boves, Lou; Norris, Dennis
2003-12-01
This letter evaluates potential benefits of combining human speech recognition (HSR) and automatic speech recognition by building a joint model of an automatic phone recognizer (APR) and a computational model of HSR, viz., Shortlist [Norris, Cognition 52, 189-234 (1994)]. Experiments based on ``real-life'' speech highlight critical limitations posed by some of the simplifying assumptions made in models of human speech recognition. These limitations could be overcome by avoiding hard phone decisions at the output side of the APR, and by using a match between the input and the internal lexicon that flexibly copes with deviations from canonical phonemic representations.
[Characteristics, advantages, and limits of matrix tests].
Brand, T; Wagener, K C
2017-03-01
Deterioration of communication abilities due to hearing problems is particularly relevant in listening situations with noise. Therefore, speech intelligibility tests in noise are required for audiological diagnostics and evaluation of hearing rehabilitation. This study analyzed the characteristics of matrix tests assessing the 50 % speech recognition threshold in noise. What are their advantages and limitations? Matrix tests are based on a matrix of 50 words (10 five-word sentences with same grammatical structure). In the standard setting, 20 sentences are presented using an adaptive procedure estimating the individual 50 % speech recognition threshold in noise. At present, matrix tests in 17 different languages are available. A high international comparability of matrix tests exists. The German language matrix test (OLSA, male speaker) has a reference 50 % speech recognition threshold of -7.1 (± 1.1) dB SNR. Before using a matrix test for the first time, the test person has to become familiar with the basic speech material using two training lists. Hereafter, matrix tests produce constant results even if repeated many times. Matrix tests are suitable for users of hearing aids and cochlear implants, particularly for assessment of benefit during the fitting process. Matrix tests can be performed in closed form and consequently with non-native listeners, even if the experimenter does not speak the test person's native language. Short versions of matrix tests are available for listeners with a shorter memory span, e.g., children.
Calandruccio, Lauren; Zhou, Haibo
2014-01-01
Purpose To examine whether improved speech recognition during linguistically mismatched target–masker experiments is due to linguistic unfamiliarity of the masker speech or linguistic dissimilarity between the target and masker speech. Method Monolingual English speakers (n = 20) and English–Greek simultaneous bilinguals (n = 20) listened to English sentences in the presence of competing English and Greek speech. Data were analyzed using mixed-effects regression models to determine differences in English recogition performance between the 2 groups and 2 masker conditions. Results Results indicated that English sentence recognition for monolinguals and simultaneous English–Greek bilinguals improved when the masker speech changed from competing English to competing Greek speech. Conclusion The improvement in speech recognition that has been observed for linguistically mismatched target–masker experiments cannot be simply explained by the masker language being linguistically unknown or unfamiliar to the listeners. Listeners can improve their speech recognition in linguistically mismatched target–masker experiments even when the listener is able to obtain meaningful linguistic information from the masker speech. PMID:24167230
Real-time classification of auditory sentences using evoked cortical activity in humans
NASA Astrophysics Data System (ADS)
Moses, David A.; Leonard, Matthew K.; Chang, Edward F.
2018-06-01
Objective. Recent research has characterized the anatomical and functional basis of speech perception in the human auditory cortex. These advances have made it possible to decode speech information from activity in brain regions like the superior temporal gyrus, but no published work has demonstrated this ability in real-time, which is necessary for neuroprosthetic brain-computer interfaces. Approach. Here, we introduce a real-time neural speech recognition (rtNSR) software package, which was used to classify spoken input from high-resolution electrocorticography signals in real-time. We tested the system with two human subjects implanted with electrode arrays over the lateral brain surface. Subjects listened to multiple repetitions of ten sentences, and rtNSR classified what was heard in real-time from neural activity patterns using direct sentence-level and HMM-based phoneme-level classification schemes. Main results. We observed single-trial sentence classification accuracies of 90% or higher for each subject with less than 7 minutes of training data, demonstrating the ability of rtNSR to use cortical recordings to perform accurate real-time speech decoding in a limited vocabulary setting. Significance. Further development and testing of the package with different speech paradigms could influence the design of future speech neuroprosthetic applications.
Fogerty, Daniel; Ahlstrom, Jayne B.; Bologna, William J.; Dubno, Judy R.
2015-01-01
This study investigated how single-talker modulated noise impacts consonant and vowel cues to sentence intelligibility. Younger normal-hearing, older normal-hearing, and older hearing-impaired listeners completed speech recognition tests. All listeners received spectrally shaped speech matched to their individual audiometric thresholds to ensure sufficient audibility with the exception of a second younger listener group who received spectral shaping that matched the mean audiogram of the hearing-impaired listeners. Results demonstrated minimal declines in intelligibility for older listeners with normal hearing and more evident declines for older hearing-impaired listeners, possibly related to impaired temporal processing. A correlational analysis suggests a common underlying ability to process information during vowels that is predictive of speech-in-modulated noise abilities. Whereas, the ability to use consonant cues appears specific to the particular characteristics of the noise and interruption. Performance declines for older listeners were mostly confined to consonant conditions. Spectral shaping accounted for the primary contributions of audibility. However, comparison with the young spectral controls who received identical spectral shaping suggests that this procedure may reduce wideband temporal modulation cues due to frequency-specific amplification that affected high-frequency consonants more than low-frequency vowels. These spectral changes may impact speech intelligibility in certain modulation masking conditions. PMID:26093436
Speech Perception Deficits in Mandarin-Speaking School-Aged Children with Poor Reading Comprehension
Liu, Huei-Mei; Tsao, Feng-Ming
2017-01-01
Previous studies have shown that children learning alphabetic writing systems who have language impairment or dyslexia exhibit speech perception deficits. However, whether such deficits exist in children learning logographic writing systems who have poor reading comprehension remains uncertain. To further explore this issue, the present study examined speech perception deficits in Mandarin-speaking children with poor reading comprehension. Two self-designed tasks, consonant categorical perception task and lexical tone discrimination task were used to compare speech perception performance in children (n = 31, age range = 7;4–10;2) with poor reading comprehension and an age-matched typically developing group (n = 31, age range = 7;7–9;10). Results showed that the children with poor reading comprehension were less accurate in consonant and lexical tone discrimination tasks and perceived speech contrasts less categorically than the matched group. The correlations between speech perception skills (i.e., consonant and lexical tone discrimination sensitivities and slope of consonant identification curve) and individuals’ oral language and reading comprehension were stronger than the correlations between speech perception ability and word recognition ability. In conclusion, the results revealed that Mandarin-speaking children with poor reading comprehension exhibit less-categorized speech perception, suggesting that imprecise speech perception, especially lexical tone perception, is essential to account for reading learning difficulties in Mandarin-speaking children. PMID:29312031
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hogden, J.
The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation maymore » decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.« less
Eckert, Mark A; Teubner-Rhodes, Susan; Vaden, Kenneth I
2016-01-01
This review examines findings from functional neuroimaging studies of speech recognition in noise to provide a neural systems level explanation for the effort and fatigue that can be experienced during speech recognition in challenging listening conditions. Neuroimaging studies of speech recognition consistently demonstrate that challenging listening conditions engage neural systems that are used to monitor and optimize performance across a wide range of tasks. These systems appear to improve speech recognition in younger and older adults, but sustained engagement of these systems also appears to produce an experience of effort and fatigue that may affect the value of communication. When considered in the broader context of the neuroimaging and decision making literature, the speech recognition findings from functional imaging studies indicate that the expected value, or expected level of speech recognition given the difficulty of listening conditions, should be considered when measuring effort and fatigue. The authors propose that the behavioral economics or neuroeconomics of listening can provide a conceptual and experimental framework for understanding effort and fatigue that may have clinical significance.
Eckert, Mark A.; Teubner-Rhodes, Susan; Vaden, Kenneth I.
2016-01-01
This review examines findings from functional neuroimaging studies of speech recognition in noise to provide a neural systems level explanation for the effort and fatigue that can be experienced during speech recognition in challenging listening conditions. Neuroimaging studies of speech recognition consistently demonstrate that challenging listening conditions engage neural systems that are used to monitor and optimize performance across a wide range of tasks. These systems appear to improve speech recognition in younger and older adults, but sustained engagement of these systems also appears to produce an experience of effort and fatigue that may affect the value of communication. When considered in the broader context of the neuroimaging and decision making literature, the speech recognition findings from functional imaging studies indicate that the expected value, or expected level of speech recognition given the difficulty of listening conditions, should be considered when measuring effort and fatigue. We propose that the behavioral economics and/or neuroeconomics of listening can provide a conceptual and experimental framework for understanding effort and fatigue that may have clinical significance. PMID:27355759
Wu, Che-Ming; Liu, Tien-Chen; Wang, Nan-Mai; Chao, Wei-Chieh
2013-08-01
(1) To understand speech perception and communication ability through real telephone calls by Mandarin-speaking children with cochlear implants and compare them to live-voice perception, (2) to report the general condition of telephone use of this population, and (3) to investigate the factors that correlate with telephone speech perception performance. Fifty-six children with over 4 years of implant use (aged 6.8-13.6 years, mean duration 8.0 years) took three speech perception tests administered using telephone and live voice to examine sentence, monosyllabic-word and Mandarin tone perception. The children also filled out a questionnaire survey investigating everyday telephone use. Wilcoxon signed-rank test was used to compare the scores between live-voice and telephone tests, and Pearson's test to examine the correlation between them. The mean scores were 86.4%, 69.8% and 70.5% respectively for sentence, word and tone recognition over the telephone. The corresponding live-voice mean scores were 94.3%, 84.0% and 70.8%. Wilcoxon signed-rank test showed the sentence and word scores were significantly different between telephone and live voice test, while the tone recognition scores were not, indicating tone perception was less worsened by telephone transmission than words and sentences. Spearman's test showed that chronological age and duration of implant use were weakly correlated with the perception test scores. The questionnaire survey showed 78% of the children could initiate phone calls and 59% could use the telephone 2 years after implantation. Implanted children are potentially capable of using the telephone 2 years after implantation, and communication ability over the telephone becomes satisfactory 4 years after implantation. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Larraza, Saioa; Samuel, Arthur G; Oñederra, Miren Lourdes
2016-07-20
Accented speech has been seen as an additional impediment for speech processing; it usually adds linguistic and cognitive load to the listener's task. In the current study we analyse where the processing costs of regional dialects come from, a question that has not been answered yet. We quantify the proficiency of Basque-Spanish bilinguals who have different native dialects of Basque on many dimensions and test for costs at each of three levels of processing-phonemic discrimination, word recognition, and semantic processing. The ability to discriminate a dialect-specific contrast is affected by a bilingual's linguistic background less than lexical access is, and an individual's difficulty in lexical access is correlated with basic discrimination problems. Once lexical access is achieved, dialectal variation has little impact on semantic processing. The results are discussed in terms of the presence or absence of correlations between different processing levels. The implications of the results are considered for how models of spoken word recognition handle dialectal variation.
Auditory Word Serial Recall Benefits from Orthographic Dissimilarity
ERIC Educational Resources Information Center
Pattamadilok, Chotiga; Lafontaine, Helene; Morais, Jose; Kolinsky, Regine
2010-01-01
The influence of orthographic knowledge has been consistently observed in speech recognition and metaphonological tasks. The present study provides data suggesting that such influence also pervades other cognitive domains related to language abilities, such as verbal working memory. Using serial recall of auditory seven-word lists, we observed…
Automatic concept extraction from spoken medical reports.
Happe, André; Pouliquen, Bruno; Burgun, Anita; Cuggia, Marc; Le Beux, Pierre
2003-07-01
The objective of this project is to investigate methods whereby a combination of speech recognition and automated indexing methods substitute for current transcription and indexing practices. We based our study on existing speech recognition software programs and on NOMINDEX, a tool that extracts MeSH concepts from medical text in natural language and that is mainly based on a French medical lexicon and on the UMLS. For each document, the process consists of three steps: (1) dictation and digital audio recording, (2) speech recognition, (3) automatic indexing. The evaluation consisted of a comparison between the set of concepts extracted by NOMINDEX after the speech recognition phase and the set of keywords manually extracted from the initial document. The method was evaluated on a set of 28 patient discharge summaries extracted from the MENELAS corpus in French, corresponding to in-patients admitted for coronarography. The overall precision was 73% and the overall recall was 90%. Indexing errors were mainly due to word sense ambiguity and abbreviations. A specific issue was the fact that the standard French translation of MeSH terms lacks diacritics. A preliminary evaluation of speech recognition tools showed that the rate of accurate recognition was higher than 98%. Only 3% of the indexing errors were generated by inadequate speech recognition. We discuss several areas to focus on to improve this prototype. However, the very low rate of indexing errors due to speech recognition errors highlights the potential benefits of combining speech recognition techniques and automatic indexing.
Speech Recognition as a Transcription Aid: A Randomized Comparison With Standard Transcription
Mohr, David N.; Turner, David W.; Pond, Gregory R.; Kamath, Joseph S.; De Vos, Cathy B.; Carpenter, Paul C.
2003-01-01
Objective. Speech recognition promises to reduce information entry costs for clinical information systems. It is most likely to be accepted across an organization if physicians can dictate without concerning themselves with real-time recognition and editing; assistants can then edit and process the computer-generated document. Our objective was to evaluate the use of speech-recognition technology in a randomized controlled trial using our institutional infrastructure. Design. Clinical note dictations from physicians in two specialty divisions were randomized to either a standard transcription process or a speech-recognition process. Secretaries and transcriptionists also were assigned randomly to each of these processes. Measurements. The duration of each dictation was measured. The amount of time spent processing a dictation to yield a finished document also was measured. Secretarial and transcriptionist productivity, defined as hours of secretary work per minute of dictation processed, was determined for speech recognition and standard transcription. Results. Secretaries in the endocrinology division were 87.3% (confidence interval, 83.3%, 92.3%) as productive with the speech-recognition technology as implemented in this study as they were using standard transcription. Psychiatry transcriptionists and secretaries were similarly less productive. Author, secretary, and type of clinical note were significant (p < 0.05) predictors of productivity. Conclusion. When implemented in an organization with an existing document-processing infrastructure (which included training and interfaces of the speech-recognition editor with the existing document entry application), speech recognition did not improve the productivity of secretaries or transcriptionists. PMID:12509359
Speech emotion recognition methods: A literature review
NASA Astrophysics Data System (ADS)
Basharirad, Babak; Moradhaseli, Mohammadreza
2017-10-01
Recently, attention of the emotional speech signals research has been boosted in human machine interfaces due to availability of high computation capability. There are many systems proposed in the literature to identify the emotional state through speech. Selection of suitable feature sets, design of a proper classifications methods and prepare an appropriate dataset are the main key issues of speech emotion recognition systems. This paper critically analyzed the current available approaches of speech emotion recognition methods based on the three evaluating parameters (feature set, classification of features, accurately usage). In addition, this paper also evaluates the performance and limitations of available methods. Furthermore, it highlights the current promising direction for improvement of speech emotion recognition systems.
Lu, Lingxi; Bao, Xiaohan; Chen, Jing; Qu, Tianshu; Wu, Xihong; Li, Liang
2018-05-01
Under a noisy "cocktail-party" listening condition with multiple people talking, listeners can use various perceptual/cognitive unmasking cues to improve recognition of the target speech against informational speech-on-speech masking. One potential unmasking cue is the emotion expressed in a speech voice, by means of certain acoustical features. However, it was unclear whether emotionally conditioning a target-speech voice that has none of the typical acoustical features of emotions (i.e., an emotionally neutral voice) can be used by listeners for enhancing target-speech recognition under speech-on-speech masking conditions. In this study we examined the recognition of target speech against a two-talker speech masker both before and after the emotionally neutral target voice was paired with a loud female screaming sound that has a marked negative emotional valence. The results showed that recognition of the target speech (especially the first keyword in a target sentence) was significantly improved by emotionally conditioning the target speaker's voice. Moreover, the emotional unmasking effect was independent of the unmasking effect of the perceived spatial separation between the target speech and the masker. Also, (skin conductance) electrodermal responses became stronger after emotional learning when the target speech and masker were perceptually co-located, suggesting an increase of listening efforts when the target speech was informationally masked. These results indicate that emotionally conditioning the target speaker's voice does not change the acoustical parameters of the target-speech stimuli, but the emotionally conditioned vocal features can be used as cues for unmasking target speech.
Onojima, Takayuki; Kitajo, Keiichi; Mizuhara, Hiroaki
2017-01-01
Neural oscillation is attracting attention as an underlying mechanism for speech recognition. Speech intelligibility is enhanced by the synchronization of speech rhythms and slow neural oscillation, which is typically observed as human scalp electroencephalography (EEG). In addition to the effect of neural oscillation, it has been proposed that speech recognition is enhanced by the identification of a speaker's motor signals, which are used for speech production. To verify the relationship between the effect of neural oscillation and motor cortical activity, we measured scalp EEG, and simultaneous EEG and functional magnetic resonance imaging (fMRI) during a speech recognition task in which participants were required to recognize spoken words embedded in noise sound. We proposed an index to quantitatively evaluate the EEG phase effect on behavioral performance. The results showed that the delta and theta EEG phase before speech inputs modulated the participant's response time when conducting speech recognition tasks. The simultaneous EEG-fMRI experiment showed that slow EEG activity was correlated with motor cortical activity. These results suggested that the effect of the slow oscillatory phase was associated with the activity of the motor cortex during speech recognition.
Address entry while driving: speech recognition versus a touch-screen keyboard.
Tsimhoni, Omer; Smith, Daniel; Green, Paul
2004-01-01
A driving simulator experiment was conducted to determine the effects of entering addresses into a navigation system during driving. Participants drove on roads of varying visual demand while entering addresses. Three address entry methods were explored: word-based speech recognition, character-based speech recognition, and typing on a touch-screen keyboard. For each method, vehicle control and task measures, glance timing, and subjective ratings were examined. During driving, word-based speech recognition yielded the shortest total task time (15.3 s), followed by character-based speech recognition (41.0 s) and touch-screen keyboard (86.0 s). The standard deviation of lateral position when performing keyboard entry (0.21 m) was 60% higher than that for all other address entry methods (0.13 m). Degradation of vehicle control associated with address entry using a touch screen suggests that the use of speech recognition is favorable. Speech recognition systems with visual feedback, however, even with excellent accuracy, are not without performance consequences. Applications of this research include the design of in-vehicle navigation systems as well as other systems requiring significant driver input, such as E-mail, the Internet, and text messaging.
Geers, Ann E; Davidson, Lisa S; Uchanski, Rosalie M; Nicholas, Johanna G
2013-09-01
This study documented the ability of experienced pediatric cochlear implant (CI) users to perceive linguistic properties (what is said) and indexical attributes (emotional intent and talker identity) of speech, and examined the extent to which linguistic (LSP) and indexical (ISP) perception skills are related. Preimplant-aided hearing, age at implantation, speech processor technology, CI-aided thresholds, sequential bilateral cochlear implantation, and academic integration with hearing age-mates were examined for their possible relationships to both LSP and ISP skills. Sixty 9- to 12-year olds, first implanted at an early age (12 to 38 months), participated in a comprehensive test battery that included the following LSP skills: (1) recognition of monosyllabic words at loud and soft levels, (2) repetition of phonemes and suprasegmental features from nonwords, and (3) recognition of key words from sentences presented within a noise background, and the following ISP skills: (1) discrimination of across-gender and within-gender (female) talkers and (2) identification and discrimination of emotional content from spoken sentences. A group of 30 age-matched children without hearing loss completed the nonword repetition, and talker- and emotion-perception tasks for comparison. Word-recognition scores decreased with signal level from a mean of 77% correct at 70 dB SPL to 52% at 50 dB SPL. On average, CI users recognized 50% of key words presented in sentences that were 9.8 dB above background noise. Phonetic properties were repeated from nonword stimuli at about the same level of accuracy as suprasegmental attributes (70 and 75%, respectively). The majority of CI users identified emotional content and differentiated talkers significantly above chance levels. Scores on LSP and ISP measures were combined into separate principal component scores and these components were highly correlated (r = 0.76). Both LSP and ISP component scores were higher for children who received a CI at the youngest ages, upgraded to more recent CI technology and had lower CI-aided thresholds. Higher scores, for both LSP and ISP components, were also associated with higher language levels and mainstreaming at younger ages. Higher ISP scores were associated with better social skills. Results strongly support a link between indexical and linguistic properties in perceptual analysis of speech. These two channels of information appear to be processed together in parallel by the auditory system and are inseparable in perception. Better speech performance, for both linguistic and indexical perception, is associated with younger age at implantation and use of more recent speech processor technology. Children with better speech perception demonstrated better spoken language, earlier academic mainstreaming, and placement in more typically sized classrooms (i.e., >20 students). Well-developed social skills were more highly associated with the ability to discriminate the nuances of talker identity and emotion than with the ability to recognize words and sentences through listening. The extent to which early cochlear implantation enabled these early-implanted children to make use of both linguistic and indexical properties of speech influenced not only their development of spoken language, but also their ability to function successfully in a hearing world.
Geers, Ann; Davidson, Lisa; Uchanski, Rosalie; Nicholas, Johanna
2013-01-01
Objectives This study documented the ability of experienced pediatric cochlear implant (CI) users to perceive linguistic properties (what is said) and indexical attributes (emotional intent and talker identity) of speech, and examined the extent to which linguistic (LSP) and indexical (ISP) perception skills are related. Pre-implant aided hearing, age at implantation, speech processor technology, CI-aided thresholds, sequential bilateral cochlear implantation, and academic integration with hearing age-mates were examined for their possible relationships to both LSP and ISP skills. Design Sixty 9–12 year olds, first implanted at an early age (12–38 months), participated in a comprehensive test battery that included the following LSP skills: 1) recognition of monosyllabic words at loud and soft levels, 2) repetition of phonemes and suprasegmental features from non-words, and 3) recognition of keywords from sentences presented within a noise background, and the following ISP skills: 1) discrimination of male from female and female from female talkers and 2) identification and discrimination of emotional content from spoken sentences. A group of 30 age-matched children without hearing loss completed the non-word repetition, and talker- and emotion-perception tasks for comparison. Results Word recognition scores decreased with signal level from a mean of 77% correct at 70 dB SPL to 52% at 50 dB SPL. On average, CI users recognized 50% of keywords presented in sentences that were 9.8 dB above background noise. Phonetic properties were repeated from non-word stimuli at about the same level of accuracy as suprasegmental attributes (70% and 75%, respectively). The majority of CI users identified emotional content and differentiated talkers significantly above chance levels. Scores on LSP and ISP measures were combined into separate principal component scores and these components were highly correlated (r = .76). Both LSP and ISP component scores were higher for children who received a CI at the youngest ages, upgraded to more recent CI technology and had lower CI-aided thresholds. Higher scores, for both LSP and ISP components, were also associated with higher language levels and mainstreaming at younger ages. Higher ISP scores were associated with better social skills. Conclusions Results strongly support a link between indexical and linguistic properties in perceptual analysis of speech. These two channels of information appear to be processed together in parallel by the auditory system and are inseparable in perception. Better speech performance, for both linguistic and indexical perception, is associated with younger age at implantation and use of more recent speech processor technology. Children with better speech perception demonstrated better spoken language, earlier academic mainstreaming, and placement in more typically-sized classrooms (i.e., >20 students). Well-developed social skills were more highly associated with the ability to discriminate the nuances of talker identity and emotion than with the ability to recognize words and sentences through listening. The extent to which early cochlear implantation enabled these early-implanted children to make use of both linguistic and indexical properties of speech influenced not only their development of spoken language, but also their ability to function successfully in a hearing world. PMID:23652814
NASA Astrophysics Data System (ADS)
Liberman, A. M.
1982-03-01
This report is one of a regular series on the status and progress of studies on the nature of speech, instrumentation for its investigation and practical applications. Manuscripts cover the following topics: Speech perception and memory coding in relation to reading ability; The use of orthographic structure by deaf adults: Recognition of finger-spelled letters; Exploring the information support for speech; The stream of speech; Using the acoustic signal to make inferences about place and duration of tongue-palate contact. Patterns of human interlimb coordination emerge from the the properties of nonlinear limit cycle oscillatory processes: Theory and data; Motor control: Which themes do we orchestrate? Exploring the nature of motor control in Down's syndrome; Periodicity and auditory memory: A pilot study; Reading skill and language skill: On the role of sign order and morphological structure in memory for American Sign Language sentences; Perception of nasal consonants with special reference to Catalan; and Speech production Characteristics of the hearing impaired.
Speech Processing and Recognition (SPaRe)
2011-01-01
results in the areas of automatic speech recognition (ASR), speech processing, machine translation (MT), natural language processing ( NLP ), and...Processing ( NLP ), Information Retrieval (IR) 16. SECURITY CLASSIFICATION OF: UNCLASSIFED 17. LIMITATION OF ABSTRACT 18. NUMBER OF PAGES 19a. NAME...Figure 9, the IOC was only expected to provide document submission and search; automatic speech recognition (ASR) for English, Spanish, Arabic , and
ERIC Educational Resources Information Center
Wigmore, Angela; Hunter, Gordon; Pflugel, Eckhard; Denholm-Price, James; Binelli, Vincent
2009-01-01
Speech technology--especially automatic speech recognition--has now advanced to a level where it can be of great benefit both to able-bodied people and those with various disabilities. In this paper we describe an application "TalkMaths" which, using the output from a commonly-used conventional automatic speech recognition system,…
Performing speech recognition research with hypercard
NASA Technical Reports Server (NTRS)
Shepherd, Chip
1993-01-01
The purpose of this paper is to describe a HyperCard-based system for performing speech recognition research and to instruct Human Factors professionals on how to use the system to obtain detailed data about the user interface of a prototype speech recognition application.
Perceptual learning for speech in noise after application of binary time-frequency masks
Ahmadi, Mahnaz; Gross, Vauna L.; Sinex, Donal G.
2013-01-01
Ideal time-frequency (TF) masks can reject noise and improve the recognition of speech-noise mixtures. An ideal TF mask is constructed with prior knowledge of the target speech signal. The intelligibility of a processed speech-noise mixture depends upon the threshold criterion used to define the TF mask. The study reported here assessed the effect of training on the recognition of speech in noise after processing by ideal TF masks that did not restore perfect speech intelligibility. Two groups of listeners with normal hearing listened to speech-noise mixtures processed by TF masks calculated with different threshold criteria. For each group, a threshold criterion that initially produced word recognition scores between 0.56–0.69 was chosen for training. Listeners practiced with one set of TF-masked sentences until their word recognition performance approached asymptote. Perceptual learning was quantified by comparing word-recognition scores in the first and last training sessions. Word recognition scores improved with practice for all listeners with the greatest improvement observed for the same materials used in training. PMID:23464038
Desmond, Jill M; Collins, Leslie M; Throckmorton, Chandra S
2014-06-01
Many cochlear implant (CI) listeners experience decreased speech recognition in reverberant environments [Kokkinakis et al., J. Acoust. Soc. Am. 129(5), 3221-3232 (2011)], which may be caused by a combination of self- and overlap-masking [Bolt and MacDonald, J. Acoust. Soc. Am. 21(6), 577-580 (1949)]. Determining the extent to which these effects decrease speech recognition for CI listeners may influence reverberation mitigation algorithms. This study compared speech recognition with ideal self-masking mitigation, with ideal overlap-masking mitigation, and with no mitigation. Under these conditions, mitigating either self- or overlap-masking resulted in significant improvements in speech recognition for both normal hearing subjects utilizing an acoustic model and for CI listeners using their own devices.
Wie, Ona Bø; Falkenberg, Eva-Signe; Tvete, Ole; Tomblin, Bruce
2007-05-01
The objectives of the study were to describe the characteristics of the first 79 prelingually deaf cochlear implant users in Norway and to investigate to what degree the variation in speech recognition, speech- recognition growth rate, and speech production could be explained by the characteristics of the child, the cochlear implant, the family, and the educational setting. Data gathered longitudinally were analysed using descriptive statistics, multiple regression, and growth-curve analysis. The results show that more than 50% of the variation could be explained by these characteristics. Daily user-time, non-verbal intelligence, mode of communication, length of CI experience, and educational placement had the highest effect on the outcome. The results also indicate that children educated in a bilingual approach to education have better speech perception and faster speech perception growth rate with increased focus on spoken language.
Davidson, Lisa S; Skinner, Margaret W; Holstad, Beth A; Fears, Beverly T; Richter, Marie K; Matusofsky, Margaret; Brenner, Christine; Holden, Timothy; Birath, Amy; Kettel, Jerrica L; Scollie, Susan
2009-06-01
The purpose of this study was to examine the effects of a wider instantaneous input dynamic range (IIDR) setting on speech perception and comfort in quiet and noise for children wearing the Nucleus 24 implant system and the Freedom speech processor. In addition, children's ability to understand soft and conversational level speech in relation to aided sound-field thresholds was examined. Thirty children (age, 7 to 17 years) with the Nucleus 24 cochlear implant system and the Freedom speech processor with two different IIDR settings (30 versus 40 dB) were tested on the Consonant Nucleus Consonant (CNC) word test at 50 and 60 dB SPL, the Bamford-Kowal-Bench Speech in Noise Test, and a loudness rating task for four-talker speech noise. Aided thresholds for frequency-modulated tones, narrowband noise, and recorded Ling sounds were obtained with the two IIDRs and examined in relation to CNC scores at 50 dB SPL. Speech Intelligibility Indices were calculated using the long-term average speech spectrum of the CNC words at 50 dB SPL measured at each test site and aided thresholds. Group mean CNC scores at 50 dB SPL with the 40 IIDR were significantly higher (p < 0.001) than with the 30 IIDR. Group mean CNC scores at 60 dB SPL, loudness ratings, and the signal to noise ratios-50 for Bamford-Kowal-Bench Speech in Noise Test were not significantly different for the two IIDRs. Significantly improved aided thresholds at 250 to 6000 Hz as well as higher Speech Intelligibility Indices afforded improved audibility for speech presented at soft levels (50 dB SPL). These results indicate that an increased IIDR provides improved word recognition for soft levels of speech without compromising comfort of higher levels of speech sounds or sentence recognition in noise.
Ellis, Rachel J; Rönnberg, Jerker
2015-01-01
Proactive interference (PI) is the capacity to resist interference to the acquisition of new memories from information stored in the long-term memory. Previous research has shown that PI correlates significantly with the speech-in-noise recognition scores of younger adults with normal hearing. In this study, we report the results of an experiment designed to investigate the extent to which tests of visual PI relate to the speech-in-noise recognition scores of older adults with hearing loss, in aided and unaided conditions. The results suggest that measures of PI correlate significantly with speech-in-noise recognition only in the unaided condition. Furthermore the relation between PI and speech-in-noise recognition differs to that observed in younger listeners without hearing loss. The findings suggest that the relation between PI tests and the speech-in-noise recognition scores of older adults with hearing loss relates to capability of the test to index cognitive flexibility.
Ellis, Rachel J.; Rönnberg, Jerker
2015-01-01
Proactive interference (PI) is the capacity to resist interference to the acquisition of new memories from information stored in the long-term memory. Previous research has shown that PI correlates significantly with the speech-in-noise recognition scores of younger adults with normal hearing. In this study, we report the results of an experiment designed to investigate the extent to which tests of visual PI relate to the speech-in-noise recognition scores of older adults with hearing loss, in aided and unaided conditions. The results suggest that measures of PI correlate significantly with speech-in-noise recognition only in the unaided condition. Furthermore the relation between PI and speech-in-noise recognition differs to that observed in younger listeners without hearing loss. The findings suggest that the relation between PI tests and the speech-in-noise recognition scores of older adults with hearing loss relates to capability of the test to index cognitive flexibility. PMID:26283981
Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.
Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina
2015-07-01
It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. These results are in line with the 'auditory-visual view' of auditory speech perception, which assumes that auditory speech recognition is optimized by using predictions from previously encoded speaker-specific audio-visual internal models. Copyright © 2015 Elsevier Ltd. All rights reserved.
Valente, Daniel L.; Plevinsky, Hallie M.; Franco, John M.; Heinrichs-Graham, Elizabeth C.; Lewis, Dawna E.
2012-01-01
The potential effects of acoustical environment on speech understanding are especially important as children enter school where students’ ability to hear and understand complex verbal information is critical to learning. However, this ability is compromised because of widely varied and unfavorable classroom acoustics. The extent to which unfavorable classroom acoustics affect children’s performance on longer learning tasks is largely unknown as most research has focused on testing children using words, syllables, or sentences as stimuli. In the current study, a simulated classroom environment was used to measure comprehension performance of two classroom learning activities: a discussion and lecture. Comprehension performance was measured for groups of elementary-aged students in one of four environments with varied reverberation times and background noise levels. The reverberation time was either 0.6 or 1.5 s, and the signal-to-noise level was either +10 or +7 dB. Performance is compared to adult subjects as well as to sentence-recognition in the same condition. Significant differences were seen in comprehension scores as a function of age and condition; both increasing background noise and reverberation degraded performance in comprehension tasks compared to minimal differences in measures of sentence-recognition. PMID:22280587
Audiovisual sentence recognition not predicted by susceptibility to the McGurk effect.
Van Engen, Kristin J; Xie, Zilong; Chandrasekaran, Bharath
2017-02-01
In noisy situations, visual information plays a critical role in the success of speech communication: listeners are better able to understand speech when they can see the speaker. Visual influence on auditory speech perception is also observed in the McGurk effect, in which discrepant visual information alters listeners' auditory perception of a spoken syllable. When hearing /ba/ while seeing a person saying /ga/, for example, listeners may report hearing /da/. Because these two phenomena have been assumed to arise from a common integration mechanism, the McGurk effect has often been used as a measure of audiovisual integration in speech perception. In this study, we test whether this assumed relationship exists within individual listeners. We measured participants' susceptibility to the McGurk illusion as well as their ability to identify sentences in noise across a range of signal-to-noise ratios in audio-only and audiovisual modalities. Our results do not show a relationship between listeners' McGurk susceptibility and their ability to use visual cues to understand spoken sentences in noise, suggesting that McGurk susceptibility may not be a valid measure of audiovisual integration in everyday speech processing.
NASA Astrophysics Data System (ADS)
Studdert-Kennedy, M.; Obrien, N.
1983-05-01
This report is one of a regular series on the status and progress of studies on the nature of speech, instrumentation for its investigation, and practical applications. Manuscripts cover the following topics: The influence of subcategorical mismatches on lexical access; The Serbo-Croatian orthography constraints the reader to a phonologically analytic strategy; Grammatical priming effects between pronouns and inflected verb forms; Misreadings by beginning readers of Serrbo-Croatian; Bi-alphabetism and work recognition; Orthographic and phonemic coding for word identification: Evidence for Hebrew; Stress and vowel duration effects on syllable recognition; Phonetic and auditory trading relations between acoustic cues in speech perception: Further results; Linguistic coding by deaf children in relation beginning reading success; Determinants of spelling ability in deaf and hearing adults: Access to linguistic structures; A dynamical basis for action systems; On the space-time structure of human interlimb coordination; Some acoustic and physiological observations on diphthongs; Relationship between pitch control and vowel articulation; Laryngeal vibrations: A comparison between high-speed filming and glottographic techniques; Compensatory articulation in hearing impaired speakers: A cinefluorographic study; and Review (Pierre Delattre: Studies in comparative phonetics.)
Distributed Fusion in Sensor Networks with Information Genealogy
2011-06-28
image processing [2], acoustic and speech recognition [3], multitarget tracking [4], distributed fusion [5], and Bayesian inference [6-7]. For...Adaptation for Distant-Talking Speech Recognition." in Proc Acoustics. Speech , and Signal Processing, 2004 |4| Y Bar-Shalom and T 1-. Fortmann...used in speech recognition and other classification applications [8]. But their use in underwater mine classification is limited. In this paper, we
Jürgens, Tim; Ewert, Stephan D; Kollmeier, Birger; Brand, Thomas
2014-03-01
Consonant recognition was assessed in normal-hearing (NH) and hearing-impaired (HI) listeners in quiet as a function of speech level using a nonsense logatome test. Average recognition scores were analyzed and compared to recognition scores of a speech recognition model. In contrast to commonly used spectral speech recognition models operating on long-term spectra, a "microscopic" model operating in the time domain was used. Variations of the model (accounting for hearing impairment) and different model parameters (reflecting cochlear compression) were tested. Using these model variations this study examined whether speech recognition performance in quiet is affected by changes in cochlear compression, namely, a linearization, which is often observed in HI listeners. Consonant recognition scores for HI listeners were poorer than for NH listeners. The model accurately predicted the speech reception thresholds of the NH and most HI listeners. A partial linearization of the cochlear compression in the auditory model, while keeping audibility constant, produced higher recognition scores and improved the prediction accuracy. However, including listener-specific information about the exact form of the cochlear compression did not improve the prediction further.
Heimbauer, Lisa A; Beran, Michael J; Owren, Michael J
2011-07-26
A long-standing debate concerns whether humans are specialized for speech perception, which some researchers argue is demonstrated by the ability to understand synthetic speech with significantly reduced acoustic cues to phonetic content. We tested a chimpanzee (Pan troglodytes) that recognizes 128 spoken words, asking whether she could understand such speech. Three experiments presented 48 individual words, with the animal selecting a corresponding visuographic symbol from among four alternatives. Experiment 1 tested spectrally reduced, noise-vocoded (NV) synthesis, originally developed to simulate input received by human cochlear-implant users. Experiment 2 tested "impossibly unspeechlike" sine-wave (SW) synthesis, which reduces speech to just three moving tones. Although receiving only intermittent and noncontingent reward, the chimpanzee performed well above chance level, including when hearing synthetic versions for the first time. Recognition of SW words was least accurate but improved in experiment 3 when natural words in the same session were rewarded. The chimpanzee was more accurate with NV than SW versions, as were 32 human participants hearing these items. The chimpanzee's ability to spontaneously recognize acoustically reduced synthetic words suggests that experience rather than specialization is critical for speech-perception capabilities that some have suggested are uniquely human. Copyright © 2011 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Oxenham, Andrew J.; Rosengard, Peninah S.; Braida, Louis D.
2004-05-01
Cochlear damage can lead to a reduction in the overall amount of peripheral auditory compression, presumably due to outer hair cell (OHC) loss or dysfunction. The perceptual consequences of functional OHC loss include loudness recruitment and reduced dynamic range, poorer frequency selectivity, and poorer effective temporal resolution. These in turn may lead to a reduced ability to make use of spectral and temporal fluctuations in background noise when listening to a target sound, such as speech. We tested the effect of OHC function on speech reception in hearing-impaired listeners by comparing psychoacoustic measures of cochlear compression and sentence recognition in a variety of noise backgrounds. In line with earlier studies, we found weak (nonsignificant) correlations between the psychoacoustic tasks and speech reception thresholds in quiet or in steady-state noise. However, when spectral and temporal fluctuations were introduced in the masker, speech reception improved to an extent that was well predicted by the psychoacoustic measures. Thus, our initial results suggest a strong relationship between measures of cochlear compression and the ability of listeners to take advantage of spectral and temporal masker fluctuations in recognizing speech. [Work supported by NIH Grants Nos. R01DC03909, T32DC00038, and R01DC00117.
Neger, Thordis M.; Rietveld, Toni; Janse, Esther
2014-01-01
Within a few sentences, listeners learn to understand severely degraded speech such as noise-vocoded speech. However, individuals vary in the amount of such perceptual learning and it is unclear what underlies these differences. The present study investigates whether perceptual learning in speech relates to statistical learning, as sensitivity to probabilistic information may aid identification of relevant cues in novel speech input. If statistical learning and perceptual learning (partly) draw on the same general mechanisms, then statistical learning in a non-auditory modality using non-linguistic sequences should predict adaptation to degraded speech. In the present study, 73 older adults (aged over 60 years) and 60 younger adults (aged between 18 and 30 years) performed a visual artificial grammar learning task and were presented with 60 meaningful noise-vocoded sentences in an auditory recall task. Within age groups, sentence recognition performance over exposure was analyzed as a function of statistical learning performance, and other variables that may predict learning (i.e., hearing, vocabulary, attention switching control, working memory, and processing speed). Younger and older adults showed similar amounts of perceptual learning, but only younger adults showed significant statistical learning. In older adults, improvement in understanding noise-vocoded speech was constrained by age. In younger adults, amount of adaptation was associated with lexical knowledge and with statistical learning ability. Thus, individual differences in general cognitive abilities explain listeners' variability in adapting to noise-vocoded speech. Results suggest that perceptual and statistical learning share mechanisms of implicit regularity detection, but that the ability to detect statistical regularities is impaired in older adults if visual sequences are presented quickly. PMID:25225475
Neger, Thordis M; Rietveld, Toni; Janse, Esther
2014-01-01
Within a few sentences, listeners learn to understand severely degraded speech such as noise-vocoded speech. However, individuals vary in the amount of such perceptual learning and it is unclear what underlies these differences. The present study investigates whether perceptual learning in speech relates to statistical learning, as sensitivity to probabilistic information may aid identification of relevant cues in novel speech input. If statistical learning and perceptual learning (partly) draw on the same general mechanisms, then statistical learning in a non-auditory modality using non-linguistic sequences should predict adaptation to degraded speech. In the present study, 73 older adults (aged over 60 years) and 60 younger adults (aged between 18 and 30 years) performed a visual artificial grammar learning task and were presented with 60 meaningful noise-vocoded sentences in an auditory recall task. Within age groups, sentence recognition performance over exposure was analyzed as a function of statistical learning performance, and other variables that may predict learning (i.e., hearing, vocabulary, attention switching control, working memory, and processing speed). Younger and older adults showed similar amounts of perceptual learning, but only younger adults showed significant statistical learning. In older adults, improvement in understanding noise-vocoded speech was constrained by age. In younger adults, amount of adaptation was associated with lexical knowledge and with statistical learning ability. Thus, individual differences in general cognitive abilities explain listeners' variability in adapting to noise-vocoded speech. Results suggest that perceptual and statistical learning share mechanisms of implicit regularity detection, but that the ability to detect statistical regularities is impaired in older adults if visual sequences are presented quickly.
Mispronunciation Detection for Language Learning and Speech Recognition Adaptation
ERIC Educational Resources Information Center
Ge, Zhenhao
2013-01-01
The areas of "mispronunciation detection" (or "accent detection" more specifically) within the speech recognition community are receiving increased attention now. Two application areas, namely language learning and speech recognition adaptation, are largely driving this research interest and are the focal points of this work.…
Longitudinal changes in speech recognition in older persons.
Dubno, Judy R; Lee, Fu-Shing; Matthews, Lois J; Ahlstrom, Jayne B; Horwitz, Amy R; Mills, John H
2008-01-01
Recognition of isolated monosyllabic words in quiet and recognition of key words in low- and high-context sentences in babble were measured in a large sample of older persons enrolled in a longitudinal study of age-related hearing loss. Repeated measures were obtained yearly or every 2 to 3 years. To control for concurrent changes in pure-tone thresholds and speech levels, speech-recognition scores were adjusted using an importance-weighted speech-audibility metric (AI). Linear-regression slope estimated the rate of change in adjusted speech-recognition scores. Recognition of words in quiet declined significantly faster with age than predicted by declines in speech audibility. As subjects aged, observed scores deviated increasingly from AI-predicted scores, but this effect did not accelerate with age. Rate of decline in word recognition was significantly faster for females than males and for females with high serum progesterone levels, whereas noise history had no effect. Rate of decline did not accelerate with age but increased with degree of hearing loss, suggesting that with more severe injury to the auditory system, impairments to auditory function other than reduced audibility resulted in faster declines in word recognition as subjects aged. Recognition of key words in low- and high-context sentences in babble did not decline significantly with age.
Yildiz, Izzet B.; von Kriegstein, Katharina; Kiebel, Stefan J.
2013-01-01
Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents—an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments. PMID:24068902
Yildiz, Izzet B; von Kriegstein, Katharina; Kiebel, Stefan J
2013-01-01
Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents-an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments.
Statistical assessment of speech system performance
NASA Technical Reports Server (NTRS)
Moshier, Stephen L.
1977-01-01
Methods for the normalization of performance tests results of speech recognition systems are presented. Technological accomplishments in speech recognition systems, as well as planned research activities are described.
Measures of Working Memory, Sequence Learning, and Speech Recognition in the Elderly.
ERIC Educational Resources Information Center
Humes, Larry E.; Floyd, Shari S.
2005-01-01
This study describes the measurement of 2 cognitive functions, working-memory capacity and sequence learning, in 2 groups of listeners: young adults with normal hearing and elderly adults with impaired hearing. The measurement of these 2 cognitive abilities with a unique, nonverbal technique capable of auditory, visual, and auditory-visual…
ERIC Educational Resources Information Center
Constantino, John N.; Yang, Dan; Gray, Teddi L.; Gross, Maggie M.; Abbacchi, Anna M.; Smith, Sarah C.; Kohn, Catherine E.; Kuhl, Patricia K.
2007-01-01
Autism spectrum disorders (ASDs) are characterized by correlated deficiencies in social and language development. This study explored a fundamental aspect of auditory information processing (AIP) that is dependent on social experience and critical to early language development: the ability to compartmentalize close-sounding speech sounds into…
He Said, She Said: Effects of Bilingualism on Cross-Talker Word Recognition in Infancy
ERIC Educational Resources Information Center
Singh, Leher
2018-01-01
The purpose of the current study was to examine effects of bilingual language input on infant word segmentation and on talker generalization. In the present study, monolingually and bilingually exposed infants were compared on their abilities to recognize familiarized words in speech and to maintain generalizable representations of familiarized…
A Picture-Identification Test for Hearing-Impaired Children. Final Report.
ERIC Educational Resources Information Center
Ross, Mark; Lerman, Jay
The Word Intelligibility by Picture Identification Test (WIPI) was developed to measure speech discrimination ability in hearing impaired children. In the first phase of development, the word stimuli were evaluated to determine whether they were within the recognition vocabulary of 15 hearing impaired children (aged 6 to 12) and whether the…
Building Searchable Collections of Enterprise Speech Data.
ERIC Educational Resources Information Center
Cooper, James W.; Viswanathan, Mahesh; Byron, Donna; Chan, Margaret
The study has applied speech recognition and text-mining technologies to a set of recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, a number of post-processing…
Liu, David; Zucherman, Mark; Tulloss, William B
2006-03-01
The reporting of radiological images is undergoing dramatic changes due to the introduction of two new technologies: structured reporting and speech recognition. Each technology has its own unique advantages. The highly organized content of structured reporting facilitates data mining and billing, whereas speech recognition offers a natural succession from the traditional dictation-transcription process. This article clarifies the distinction between the process and outcome of structured reporting, describes fundamental requirements for any effective structured reporting system, and describes the potential development of a novel, easy-to-use, customizable structured reporting system that incorporates speech recognition. This system should have all the advantages derived from structured reporting, accommodate a wide variety of user needs, and incorporate speech recognition as a natural component and extension of the overall reporting process.
Military applications of automatic speech recognition and future requirements
NASA Technical Reports Server (NTRS)
Beek, Bruno; Cupples, Edward J.
1977-01-01
An updated summary of the state-of-the-art of automatic speech recognition and its relevance to military applications is provided. A number of potential systems for military applications are under development. These include: (1) digital narrowband communication systems; (2) automatic speech verification; (3) on-line cartographic processing unit; (4) word recognition for militarized tactical data system; and (5) voice recognition and synthesis for aircraft cockpit.
ERIC Educational Resources Information Center
Stinson, Michael; Elliot, Lisa; McKee, Barbara; Coyne, Gina
This report discusses a project that adapted new automatic speech recognition (ASR) technology to provide real-time speech-to-text transcription as a support service for students who are deaf and hard of hearing (D/HH). In this system, as the teacher speaks, a hearing intermediary, or captionist, dictates into the speech recognition system in a…
Jürgens, Tim; Brand, Thomas
2009-11-01
This study compares the phoneme recognition performance in speech-shaped noise of a microscopic model for speech recognition with the performance of normal-hearing listeners. "Microscopic" is defined in terms of this model twofold. First, the speech recognition rate is predicted on a phoneme-by-phoneme basis. Second, microscopic modeling means that the signal waveforms to be recognized are processed by mimicking elementary parts of human's auditory processing. The model is based on an approach by Holube and Kollmeier [J. Acoust. Soc. Am. 100, 1703-1716 (1996)] and consists of a psychoacoustically and physiologically motivated preprocessing and a simple dynamic-time-warp speech recognizer. The model is evaluated while presenting nonsense speech in a closed-set paradigm. Averaged phoneme recognition rates, specific phoneme recognition rates, and phoneme confusions are analyzed. The influence of different perceptual distance measures and of the model's a-priori knowledge is investigated. The results show that human performance can be predicted by this model using an optimal detector, i.e., identical speech waveforms for both training of the recognizer and testing. The best model performance is yielded by distance measures which focus mainly on small perceptual distances and neglect outliers.
The influence of speech rate and accent on access and use of semantic information.
Sajin, Stanislav M; Connine, Cynthia M
2017-04-01
Circumstances in which the speech input is presented in sub-optimal conditions generally lead to processing costs affecting spoken word recognition. The current study indicates that some processing demands imposed by listening to difficult speech can be mitigated by feedback from semantic knowledge. A set of lexical decision experiments examined how foreign accented speech and word duration impact access to semantic knowledge in spoken word recognition. Results indicate that when listeners process accented speech, the reliance on semantic information increases. Speech rate was not observed to influence semantic access, except in the setting in which unusually slow accented speech was presented. These findings support interactive activation models of spoken word recognition in which attention is modulated based on speech demands.
Voice emotion recognition by cochlear-implanted children and their normally-hearing peers
Chatterjee, Monita; Zion, Danielle; Deroche, Mickael L.; Burianek, Brooke; Limb, Charles; Goren, Alison; Kulkarni, Aditya M.; Christensen, Julie A.
2014-01-01
Despite their remarkable success in bringing spoken language to hearing impaired listeners, the signal transmitted through cochlear implants (CIs) remains impoverished in spectro-temporal fine structure. As a consequence, pitch-dominant information such as voice emotion, is diminished. For young children, the ability to correctly identify the mood/intent of the speaker (which may not always be visible in their facial expression) is an important aspect of social and linguistic development. Previous work in the field has shown that children with cochlear implants (cCI) have significant deficits in voice emotion recognition relative to their normally hearing peers (cNH). Here, we report on voice emotion recognition by a cohort of 36 school-aged cCI. Additionally, we provide for the first time, a comparison of their performance to that of cNH and NH adults (aNH) listening to CI simulations of the same stimuli. We also provide comparisons to the performance of adult listeners with CIs (aCI), most of whom learned language primarily through normal acoustic hearing. Results indicate that, despite strong variability, on average, cCI perform similarly to their adult counterparts; that both groups’ mean performance is similar to aNHs’ performance with 8-channel noise-vocoded speech; that cNH achieve excellent scores in voice emotion recognition with full-spectrum speech, but on average, show significantly poorer scores than aNH with 8-channel noise-vocoded speech. A strong developmental effect was observed in the cNH with noise-vocoded speech in this task. These results point to the considerable benefit obtained by cochlear-implanted children from their devices, but also underscore the need for further research and development in this important and neglected area. PMID:25448167
Asynchronous glimpsing of speech: Spread of masking and task set-size
Ozmeral, Erol J.; Buss, Emily; Hall, Joseph W.
2012-01-01
Howard-Jones and Rosen [(1993). J. Acoust. Soc. Am. 93, 2915–2922] investigated the ability to integrate glimpses of speech that are separated in time and frequency using a “checkerboard” masker, with asynchronous amplitude modulation (AM) across frequency. Asynchronous glimpsing was demonstrated only for spectrally wide frequency bands. It is possible that the reduced evidence of spectro-temporal integration with narrower bands was due to spread of masking at the periphery. The present study tested this hypothesis with a dichotic condition, in which the even- and odd-numbered bands of the target speech and asynchronous AM masker were presented to opposite ears, minimizing the deleterious effects of masking spread. For closed-set consonant recognition, thresholds were 5.1–8.5 dB better for dichotic than for monotic asynchronous AM conditions. Results were similar for closed-set word recognition, but for open-set word recognition the benefit of dichotic presentation was more modest and level dependent, consistent with the effects of spread of masking being level dependent. There was greater evidence of asynchronous glimpsing in the open-set than closed-set tasks. Presenting stimuli dichotically supported asynchronous glimpsing with narrower frequency bands than previously shown, though the magnitude of glimpsing was reduced for narrower bandwidths even in some dichotic conditions. PMID:22894234
NASA Astrophysics Data System (ADS)
Kardava, Irakli; Tadyszak, Krzysztof; Gulua, Nana; Jurga, Stefan
2017-02-01
For more flexibility of environmental perception by artificial intelligence it is needed to exist the supporting software modules, which will be able to automate the creation of specific language syntax and to make a further analysis for relevant decisions based on semantic functions. According of our proposed approach, of which implementation it is possible to create the couples of formal rules of given sentences (in case of natural languages) or statements (in case of special languages) by helping of computer vision, speech recognition or editable text conversion system for further automatic improvement. In other words, we have developed an approach, by which it can be achieved to significantly improve the training process automation of artificial intelligence, which as a result will give us a higher level of self-developing skills independently from us (from users). At the base of our approach we have developed a software demo version, which includes the algorithm and software code for the entire above mentioned component's implementation (computer vision, speech recognition and editable text conversion system). The program has the ability to work in a multi - stream mode and simultaneously create a syntax based on receiving information from several sources.
Use of intonation contours for speech recognition in noise by cochlear implant recipients.
Meister, Hartmut; Landwehr, Markus; Pyschny, Verena; Grugel, Linda; Walger, Martin
2011-05-01
The corruption of intonation contours has detrimental effects on sentence-based speech recognition in normal-hearing listeners Binns and Culling [(2007). J. Acoust. Soc. Am. 122, 1765-1776]. This paper examines whether this finding also applies to cochlear implant (CI) recipients. The subjects' F0-discrimination and speech perception in the presence of noise were measured, using sentences with regular and inverted F0-contours. The results revealed that speech recognition for regular contours was significantly better than for inverted contours. This difference was related to the subjects' F0-discrimination providing further evidence that the perception of intonation patterns is important for the CI-mediated speech recognition in noise.
Hoth, S
2016-08-01
The Freiburg speech intelligibility test according to DIN 45621 was introduced around 60 years ago. For decades, and still today, the Freiburg test has been a standard whose relevance extends far beyond pure audiometry. It is used primarily to determine the speech perception threshold (based on two-digit numbers) and the ability to discriminate speech at suprathreshold presentation levels (based on monosyllabic nouns). Moreover, it is a measure of the degree of disability, the requirement for and success of technical hearing aids (auxiliaries directives), and the compensation for disability and handicap (Königstein recommendation). In differential audiological diagnostics, the Freiburg test contributes to the distinction between low- and high-frequency hearing loss, as well as to identification of conductive, sensory, neural, and central disorders. Currently, the phonemic and perceptual balance of the monosyllabic test lists is subject to critical discussions. Obvious deficiencies exist for testing speech recognition in noise. In this respect, alternatives such as sentence or rhyme tests in closed-answer inventories are discussed.
Schierholz, Irina; Finke, Mareike; Kral, Andrej; Büchner, Andreas; Rach, Stefan; Lenarz, Thomas; Dengler, Reinhard; Sandmann, Pascale
2017-04-01
There is substantial variability in speech recognition ability across patients with cochlear implants (CIs), auditory brainstem implants (ABIs), and auditory midbrain implants (AMIs). To better understand how this variability is related to central processing differences, the current electroencephalography (EEG) study compared hearing abilities and auditory-cortex activation in patients with electrical stimulation at different sites of the auditory pathway. Three different groups of patients with auditory implants (Hannover Medical School; ABI: n = 6, CI: n = 6; AMI: n = 2) performed a speeded response task and a speech recognition test with auditory, visual, and audio-visual stimuli. Behavioral performance and cortical processing of auditory and audio-visual stimuli were compared between groups. ABI and AMI patients showed prolonged response times on auditory and audio-visual stimuli compared with NH listeners and CI patients. This was confirmed by prolonged N1 latencies and reduced N1 amplitudes in ABI and AMI patients. However, patients with central auditory implants showed a remarkable gain in performance when visual and auditory input was combined, in both speech and non-speech conditions, which was reflected by a strong visual modulation of auditory-cortex activation in these individuals. In sum, the results suggest that the behavioral improvement for audio-visual conditions in central auditory implant patients is based on enhanced audio-visual interactions in the auditory cortex. Their findings may provide important implications for the optimization of electrical stimulation and rehabilitation strategies in patients with central auditory prostheses. Hum Brain Mapp 38:2206-2225, 2017. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
Speech processing using maximum likelihood continuity mapping
Hogden, John E.
2000-01-01
Speech processing is obtained that, given a probabilistic mapping between static speech sounds and pseudo-articulator positions, allows sequences of speech sounds to be mapped to smooth sequences of pseudo-articulator positions. In addition, a method for learning a probabilistic mapping between static speech sounds and pseudo-articulator position is described. The method for learning the mapping between static speech sounds and pseudo-articulator position uses a set of training data composed only of speech sounds. The said speech processing can be applied to various speech analysis tasks, including speech recognition, speaker recognition, speech coding, speech synthesis, and voice mimicry.
Speech processing using maximum likelihood continuity mapping
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hogden, J.E.
Speech processing is obtained that, given a probabilistic mapping between static speech sounds and pseudo-articulator positions, allows sequences of speech sounds to be mapped to smooth sequences of pseudo-articulator positions. In addition, a method for learning a probabilistic mapping between static speech sounds and pseudo-articulator position is described. The method for learning the mapping between static speech sounds and pseudo-articulator position uses a set of training data composed only of speech sounds. The said speech processing can be applied to various speech analysis tasks, including speech recognition, speaker recognition, speech coding, speech synthesis, and voice mimicry.
Supporting Dictation Speech Recognition Error Correction: The Impact of External Information
ERIC Educational Resources Information Center
Shi, Yongmei; Zhou, Lina
2011-01-01
Although speech recognition technology has made remarkable progress, its wide adoption is still restricted by notable effort made and frustration experienced by users while correcting speech recognition errors. One of the promising ways to improve error correction is by providing user support. Although support mechanisms have been proposed for…
Application of an auditory model to speech recognition.
Cohen, J R
1989-06-01
Some aspects of auditory processing are incorporated in a front end for the IBM speech-recognition system [F. Jelinek, "Continuous speech recognition by statistical methods," Proc. IEEE 64 (4), 532-556 (1976)]. This new process includes adaptation, loudness scaling, and mel warping. Tests show that the design is an improvement over previous algorithms.
Adaptation to spectrally-rotated speech.
Green, Tim; Rosen, Stuart; Faulkner, Andrew; Paterson, Ruth
2013-08-01
Much recent interest surrounds listeners' abilities to adapt to various transformations that distort speech. An extreme example is spectral rotation, in which the spectrum of low-pass filtered speech is inverted around a center frequency (2 kHz here). Spectral shape and its dynamics are completely altered, rendering speech virtually unintelligible initially. However, intonation, rhythm, and contrasts in periodicity and aperiodicity are largely unaffected. Four normal hearing adults underwent 6 h of training with spectrally-rotated speech using Continuous Discourse Tracking. They and an untrained control group completed pre- and post-training speech perception tests, for which talkers differed from the training talker. Significantly improved recognition of spectrally-rotated sentences was observed for trained, but not untrained, participants. However, there were no significant improvements in the identification of medial vowels in /bVd/ syllables or intervocalic consonants. Additional tests were performed with speech materials manipulated so as to isolate the contribution of various speech features. These showed that preserving intonational contrasts did not contribute to the comprehension of spectrally-rotated speech after training, and suggested that improvements involved adaptation to altered spectral shape and dynamics, rather than just learning to focus on speech features relatively unaffected by the transformation.
Specific acoustic models for spontaneous and dictated style in indonesian speech recognition
NASA Astrophysics Data System (ADS)
Vista, C. B.; Satriawan, C. H.; Lestari, D. P.; Widyantoro, D. H.
2018-03-01
The performance of an automatic speech recognition system is affected by differences in speech style between the data the model is originally trained upon and incoming speech to be recognized. In this paper, the usage of GMM-HMM acoustic models for specific speech styles is investigated. We develop two systems for the experiments; the first employs a speech style classifier to predict the speech style of incoming speech, either spontaneous or dictated, then decodes this speech using an acoustic model specifically trained for that speech style. The second system uses both acoustic models to recognise incoming speech and decides upon a final result by calculating a confidence score of decoding. Results show that training specific acoustic models for spontaneous and dictated speech styles confers a slight recognition advantage as compared to a baseline model trained on a mixture of spontaneous and dictated training data. In addition, the speech style classifier approach of the first system produced slightly more accurate results than the confidence scoring employed in the second system.
ERIC Educational Resources Information Center
Calandruccio, Lauren; Zhou, Haibo
2014-01-01
Purpose: To examine whether improved speech recognition during linguistically mismatched target-masker experiments is due to linguistic unfamiliarity of the masker speech or linguistic dissimilarity between the target and masker speech. Method: Monolingual English speakers (n = 20) and English-Greek simultaneous bilinguals (n = 20) listened to…
McCreery, Ryan W.; Alexander, Joshua; Brennan, Marc A.; Hoover, Brenda; Kopun, Judy; Stelmachowicz, Patricia G.
2014-01-01
Objective The primary goal of nonlinear frequency compression (NFC) and other frequency lowering strategies is to increase the audibility of high-frequency sounds that are not otherwise audible with conventional hearing-aid processing due to the degree of hearing loss, limited hearing aid bandwidth or a combination of both factors. The aim of the current study was to compare estimates of speech audibility processed by NFC to improvements in speech recognition for a group of children and adults with high-frequency hearing loss. Design Monosyllabic word recognition was measured in noise for twenty-four adults and twelve children with mild to severe sensorineural hearing loss. Stimuli were amplified based on each listener’s audiogram with conventional processing (CP) with amplitude compression or with NFC and presented under headphones using a software-based hearing aid simulator. A modification of the speech intelligibility index (SII) was used to estimate audibility of information in frequency-lowered bands. The mean improvement in SII was compared to the mean improvement in speech recognition. Results All but two listeners experienced improvements in speech recognition with NFC compared to CP, consistent with the small increase in audibility that was estimated using the modification of the SII. Children and adults had similar improvements in speech recognition with NFC. Conclusion Word recognition with NFC was higher than CP for children and adults with mild to severe hearing loss. The average improvement in speech recognition with NFC (7%) was consistent with the modified SII, which indicated that listeners experienced an increase in audibility with NFC compared to CP. Further studies are necessary to determine if changes in audibility with NFC are related to speech recognition with NFC for listeners with greater degrees of hearing loss, with a greater variety of compression settings, and using auditory training. PMID:24535558
Warzybok, Anna; Brand, Thomas; Wagener, Kirsten C; Kollmeier, Birger
2015-01-01
The current study investigates the extent to which the linguistic complexity of three commonly employed speech recognition tests and second language proficiency influence speech recognition thresholds (SRTs) in noise in non-native listeners. SRTs were measured for non-natives and natives using three German speech recognition tests: the digit triplet test (DTT), the Oldenburg sentence test (OLSA), and the Göttingen sentence test (GÖSA). Sixty-four non-native and eight native listeners participated. Non-natives can show native-like SRTs in noise only for the linguistically easy speech material (DTT). Furthermore, the limitation of phonemic-acoustical cues in digit triplets affects speech recognition to the same extent in non-natives and natives. For more complex and less familiar speech materials, non-natives, ranging from basic to advanced proficiency in German, require on average 3-dB better signal-to-noise ratio for the OLSA and 6-dB for the GÖSA to obtain 50% speech recognition compared to native listeners. In clinical audiology, SRT measurements with a closed-set speech test (i.e. DTT for screening or OLSA test for clinical purposes) should be used with non-native listeners rather than open-set speech tests (such as the GÖSA or HINT), especially if a closed-set version in the patient's own native language is available.
A language-familiarity effect for speaker discrimination without comprehension.
Fleming, David; Giordano, Bruno L; Caldara, Roberto; Belin, Pascal
2014-09-23
The influence of language familiarity upon speaker identification is well established, to such an extent that it has been argued that "Human voice recognition depends on language ability" [Perrachione TK, Del Tufo SN, Gabrieli JDE (2011) Science 333(6042):595]. However, 7-mo-old infants discriminate speakers of their mother tongue better than they do foreign speakers [Johnson EK, Westrek E, Nazzi T, Cutler A (2011) Dev Sci 14(5):1002-1011] despite their limited speech comprehension abilities, suggesting that speaker discrimination may rely on familiarity with the sound structure of one's native language rather than the ability to comprehend speech. To test this hypothesis, we asked Chinese and English adult participants to rate speaker dissimilarity in pairs of sentences in English or Mandarin that were first time-reversed to render them unintelligible. Even in these conditions a language-familiarity effect was observed: Both Chinese and English listeners rated pairs of native-language speakers as more dissimilar than foreign-language speakers, despite their inability to understand the material. Our data indicate that the language familiarity effect is not based on comprehension but rather on familiarity with the phonology of one's native language. This effect may stem from a mechanism analogous to the "other-race" effect in face recognition.
Zhou, Hong; Li, Yu; Liang, Meng; Guan, Connie Qun; Zhang, Linjun; Shu, Hua; Zhang, Yang
2017-01-01
The goal of this developmental speech perception study was to assess whether and how age group modulated the influences of high-level semantic context and low-level fundamental frequency ( F 0 ) contours on the recognition of Mandarin speech by elementary and middle-school-aged children in quiet and interference backgrounds. The results revealed different patterns for semantic and F 0 information. One the one hand, age group modulated significantly the use of F 0 contours, indicating that elementary school children relied more on natural F 0 contours than middle school children during Mandarin speech recognition. On the other hand, there was no significant modulation effect of age group on semantic context, indicating that children of both age groups used semantic context to assist speech recognition to a similar extent. Furthermore, the significant modulation effect of age group on the interaction between F 0 contours and semantic context revealed that younger children could not make better use of semantic context in recognizing speech with flat F 0 contours compared with natural F 0 contours, while older children could benefit from semantic context even when natural F 0 contours were altered, thus confirming the important role of F 0 contours in Mandarin speech recognition by elementary school children. The developmental changes in the effects of high-level semantic and low-level F 0 information on speech recognition might reflect the differences in auditory and cognitive resources associated with processing of the two types of information in speech perception.
Talker variability in audio-visual speech perception
Heald, Shannon L. M.; Nusbaum, Howard C.
2014-01-01
A change in talker is a change in the context for the phonetic interpretation of acoustic patterns of speech. Different talkers have different mappings between acoustic patterns and phonetic categories and listeners need to adapt to these differences. Despite this complexity, listeners are adept at comprehending speech in multiple-talker contexts, albeit at a slight but measurable performance cost (e.g., slower recognition). So far, this talker variability cost has been demonstrated only in audio-only speech. Other research in single-talker contexts have shown, however, that when listeners are able to see a talker’s face, speech recognition is improved under adverse listening (e.g., noise or distortion) conditions that can increase uncertainty in the mapping between acoustic patterns and phonetic categories. Does seeing a talker’s face reduce the cost of word recognition in multiple-talker contexts? We used a speeded word-monitoring task in which listeners make quick judgments about target word recognition in single- and multiple-talker contexts. Results show faster recognition performance in single-talker conditions compared to multiple-talker conditions for both audio-only and audio-visual speech. However, recognition time in a multiple-talker context was slower in the audio-visual condition compared to audio-only condition. These results suggest that seeing a talker’s face during speech perception may slow recognition by increasing the importance of talker identification, signaling to the listener a change in talker has occurred. PMID:25076919
Talker variability in audio-visual speech perception.
Heald, Shannon L M; Nusbaum, Howard C
2014-01-01
A change in talker is a change in the context for the phonetic interpretation of acoustic patterns of speech. Different talkers have different mappings between acoustic patterns and phonetic categories and listeners need to adapt to these differences. Despite this complexity, listeners are adept at comprehending speech in multiple-talker contexts, albeit at a slight but measurable performance cost (e.g., slower recognition). So far, this talker variability cost has been demonstrated only in audio-only speech. Other research in single-talker contexts have shown, however, that when listeners are able to see a talker's face, speech recognition is improved under adverse listening (e.g., noise or distortion) conditions that can increase uncertainty in the mapping between acoustic patterns and phonetic categories. Does seeing a talker's face reduce the cost of word recognition in multiple-talker contexts? We used a speeded word-monitoring task in which listeners make quick judgments about target word recognition in single- and multiple-talker contexts. Results show faster recognition performance in single-talker conditions compared to multiple-talker conditions for both audio-only and audio-visual speech. However, recognition time in a multiple-talker context was slower in the audio-visual condition compared to audio-only condition. These results suggest that seeing a talker's face during speech perception may slow recognition by increasing the importance of talker identification, signaling to the listener a change in talker has occurred.
Methods and apparatus for non-acoustic speech characterization and recognition
Holzrichter, John F.
1999-01-01
By simultaneously recording EM wave reflections and acoustic speech information, the positions and velocities of the speech organs as speech is articulated can be defined for each acoustic speech unit. Well defined time frames and feature vectors describing the speech, to the degree required, can be formed. Such feature vectors can uniquely characterize the speech unit being articulated each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit recognition, and organ mechanical parameters can be determined.
Methods and apparatus for non-acoustic speech characterization and recognition
DOE Office of Scientific and Technical Information (OSTI.GOV)
Holzrichter, J.F.
By simultaneously recording EM wave reflections and acoustic speech information, the positions and velocities of the speech organs as speech is articulated can be defined for each acoustic speech unit. Well defined time frames and feature vectors describing the speech, to the degree required, can be formed. Such feature vectors can uniquely characterize the speech unit being articulated each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit recognition, and organ mechanical parameters can be determined.
NASA Technical Reports Server (NTRS)
Wolf, Jared J.
1977-01-01
The following research was discussed: (1) speech signal processing; (2) automatic speech recognition; (3) continuous speech understanding; (4) speaker recognition; (5) speech compression; (6) subjective and objective evaluation of speech communication system; (7) measurement of the intelligibility and quality of speech when degraded by noise or other masking stimuli; (8) speech synthesis; (9) instructional aids for second-language learning and for training of the deaf; and (10) investigation of speech correlates of psychological stress. Experimental psychology, control systems, and human factors engineering, which are often relevant to the proper design and operation of speech systems are described.
A measure for assessing the effects of audiovisual speech integration.
Altieri, Nicholas; Townsend, James T; Wenger, Michael J
2014-06-01
We propose a measure of audiovisual speech integration that takes into account accuracy and response times. This measure should prove beneficial for researchers investigating multisensory speech recognition, since it relates to normal-hearing and aging populations. As an example, age-related sensory decline influences both the rate at which one processes information and the ability to utilize cues from different sensory modalities. Our function assesses integration when both auditory and visual information are available, by comparing performance on these audiovisual trials with theoretical predictions for performance under the assumptions of parallel, independent self-terminating processing of single-modality inputs. We provide example data from an audiovisual identification experiment and discuss applications for measuring audiovisual integration skills across the life span.
Speech Recognition and Cognitive Skills in Bimodal Cochlear Implant Users
ERIC Educational Resources Information Center
Hua, Håkan; Johansson, Björn; Magnusson, Lennart; Lyxell, Björn; Ellis, Rachel J.
2017-01-01
Purpose: To examine the relation between speech recognition and cognitive skills in bimodal cochlear implant (CI) and hearing aid users. Method: Seventeen bimodal CI users (28-74 years) were recruited to the study. Speech recognition tests were carried out in quiet and in noise. The cognitive tests employed included the Reading Span Test and the…
Leveraging Automatic Speech Recognition Errors to Detect Challenging Speech Segments in TED Talks
ERIC Educational Resources Information Center
Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya
2016-01-01
This study investigates the use of Automatic Speech Recognition (ASR) systems to epitomize second language (L2) listeners' problems in perception of TED talks. ASR-generated transcripts of videos often involve recognition errors, which may indicate difficult segments for L2 listeners. This paper aims to discover the root-causes of the ASR errors…
Visual Speech Primes Open-Set Recognition of Spoken Words
ERIC Educational Resources Information Center
Buchwald, Adam B.; Winters, Stephen J.; Pisoni, David B.
2009-01-01
Visual speech perception has become a topic of considerable interest to speech researchers. Previous research has demonstrated that perceivers neurally encode and use speech information from the visual modality, and this information has been found to facilitate spoken word recognition in tasks such as lexical decision (Kim, Davis, & Krins,…
Tai, Yihsin; Husain, Fatima T
2018-04-01
Despite having normal hearing sensitivity, patients with chronic tinnitus may experience more difficulty recognizing speech in adverse listening conditions as compared to controls. However, the association between the characteristics of tinnitus (severity and loudness) and speech recognition remains unclear. In this study, the Quick Speech-in-Noise test (QuickSIN) was conducted monaurally on 14 patients with bilateral tinnitus and 14 age- and hearing-matched adults to determine the relation between tinnitus characteristics and speech understanding. Further, Tinnitus Handicap Inventory (THI), tinnitus loudness magnitude estimation, and loudness matching were obtained to better characterize the perceptual and psychological aspects of tinnitus. The patients reported low THI scores, with most participants in the slight handicap category. Significant between-group differences in speech-in-noise performance were only found at the 5-dB signal-to-noise ratio (SNR) condition. The tinnitus group performed significantly worse in the left ear than in the right ear, even though bilateral tinnitus percept and symmetrical thresholds were reported in all patients. This between-ear difference is likely influenced by a right-ear advantage for speech sounds, as factors related to testing order and fatigue were ruled out. Additionally, significant correlations found between SNR loss in the left ear and tinnitus loudness matching suggest that perceptual factors related to tinnitus had an effect on speech-in-noise performance, pointing to a possible interaction between peripheral and cognitive factors in chronic tinnitus. Further studies, that take into account both hearing and cognitive abilities of patients, are needed to better parse out the effect of tinnitus in the absence of hearing impairment.
Significance of parametric spectral ratio methods in detection and recognition of whispered speech
NASA Astrophysics Data System (ADS)
Mathur, Arpit; Reddy, Shankar M.; Hegde, Rajesh M.
2012-12-01
In this article the significance of a new parametric spectral ratio method that can be used to detect whispered speech segments within normally phonated speech is described. Adaptation methods based on the maximum likelihood linear regression (MLLR) are then used to realize a mismatched train-test style speech recognition system. This proposed parametric spectral ratio method computes a ratio spectrum of the linear prediction (LP) and the minimum variance distortion-less response (MVDR) methods. The smoothed ratio spectrum is then used to detect whispered segments of speech within neutral speech segments effectively. The proposed LP-MVDR ratio method exhibits robustness at different SNRs as indicated by the whisper diarization experiments conducted on the CHAINS and the cell phone whispered speech corpus. The proposed method also performs reasonably better than the conventional methods for whisper detection. In order to integrate the proposed whisper detection method into a conventional speech recognition engine with minimal changes, adaptation methods based on the MLLR are used herein. The hidden Markov models corresponding to neutral mode speech are adapted to the whispered mode speech data in the whispered regions as detected by the proposed ratio method. The performance of this method is first evaluated on whispered speech data from the CHAINS corpus. The second set of experiments are conducted on the cell phone corpus of whispered speech. This corpus is collected using a set up that is used commercially for handling public transactions. The proposed whisper speech recognition system exhibits reasonably better performance when compared to several conventional methods. The results shown indicate the possibility of a whispered speech recognition system for cell phone based transactions.
Current trends in small vocabulary speech recognition for equipment control
NASA Astrophysics Data System (ADS)
Doukas, Nikolaos; Bardis, Nikolaos G.
2017-09-01
Speech recognition systems allow human - machine communication to acquire an intuitive nature that approaches the simplicity of inter - human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit by the use of robust voice operated control components, as they would facilitate the interaction with their users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
Effects of age and hearing loss on recognition of unaccented and accented multisyllabic words.
Gordon-Salant, Sandra; Yeni-Komshian, Grace H; Fitzgibbons, Peter J; Cohen, Julie I
2015-02-01
The effects of age and hearing loss on recognition of unaccented and accented words of varying syllable length were investigated. It was hypothesized that with increments in length of syllables, there would be atypical alterations in syllable stress in accented compared to native English, and that these altered stress patterns would be sensitive to auditory temporal processing deficits with aging. Sets of one-, two-, three-, and four-syllable words with the same initial syllable were recorded by one native English and two Spanish-accented talkers. Lists of these words were presented in isolation and in sentence contexts to younger and older normal-hearing listeners and to older hearing-impaired listeners. Hearing loss effects were apparent for unaccented and accented monosyllabic words, whereas age effects were observed for recognition of accented multisyllabic words, consistent with the notion that altered syllable stress patterns with accent are sensitive for revealing effects of age. Older listeners also exhibited lower recognition scores for moderately accented words in sentence contexts than in isolation, suggesting that the added demands on working memory for words in sentence contexts impact recognition of accented speech. The general pattern of results suggests that hearing loss, age, and cognitive factors limit the ability to recognize Spanish-accented speech.
Effects of age and hearing loss on recognition of unaccented and accented multisyllabic words
Gordon-Salant, Sandra; Yeni-Komshian, Grace H.; Fitzgibbons, Peter J.; Cohen, Julie I.
2015-01-01
The effects of age and hearing loss on recognition of unaccented and accented words of varying syllable length were investigated. It was hypothesized that with increments in length of syllables, there would be atypical alterations in syllable stress in accented compared to native English, and that these altered stress patterns would be sensitive to auditory temporal processing deficits with aging. Sets of one-, two-, three-, and four-syllable words with the same initial syllable were recorded by one native English and two Spanish-accented talkers. Lists of these words were presented in isolation and in sentence contexts to younger and older normal-hearing listeners and to older hearing-impaired listeners. Hearing loss effects were apparent for unaccented and accented monosyllabic words, whereas age effects were observed for recognition of accented multisyllabic words, consistent with the notion that altered syllable stress patterns with accent are sensitive for revealing effects of age. Older listeners also exhibited lower recognition scores for moderately accented words in sentence contexts than in isolation, suggesting that the added demands on working memory for words in sentence contexts impact recognition of accented speech. The general pattern of results suggests that hearing loss, age, and cognitive factors limit the ability to recognize Spanish-accented speech. PMID:25698021
ERIC Educational Resources Information Center
Ylinen, Sari; Bosseler, Alexis; Junttila, Katja; Huotilainen, Minna
2017-01-01
The ability to predict future events in the environment and learn from them is a fundamental component of adaptive behavior across species. Here we propose that inferring predictions facilitates speech processing and word learning in the early stages of language development. Twelve- and 24-month olds' electrophysiological brain responses to heard…
American or British? L2 Speakers' Recognition and Evaluations of Accent Features in English
ERIC Educational Resources Information Center
Carrie, Erin; McKenzie, Robert M.
2018-01-01
Recent language attitude research has attended to the processes involved in identifying and evaluating spoken language varieties. This article investigates the ability of second-language learners of English in Spain (N = 71) to identify Received Pronunciation (RP) and General American (GenAm) speech and their perceptions of linguistic variation…
Simulation of talking faces in the human brain improves auditory speech recognition
von Kriegstein, Katharina; Dogan, Özgür; Grüter, Martina; Giraud, Anne-Lise; Kell, Christian A.; Grüter, Thomas; Kleinschmidt, Andreas; Kiebel, Stefan J.
2008-01-01
Human face-to-face communication is essentially audiovisual. Typically, people talk to us face-to-face, providing concurrent auditory and visual input. Understanding someone is easier when there is visual input, because visual cues like mouth and tongue movements provide complementary information about speech content. Here, we hypothesized that, even in the absence of visual input, the brain optimizes both auditory-only speech and speaker recognition by harvesting speaker-specific predictions and constraints from distinct visual face-processing areas. To test this hypothesis, we performed behavioral and neuroimaging experiments in two groups: subjects with a face recognition deficit (prosopagnosia) and matched controls. The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for this person. In both prosopagnosics and controls, behavioral improvement in auditory-only speech recognition was based on an area typically involved in face-movement processing. Improvement in speaker recognition was only present in controls and was based on an area involved in face-identity processing. These findings challenge current unisensory models of speech processing, because they show that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. We suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face. PMID:18436648
Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.
Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger
2011-01-01
The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.
Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
Holzrichter, J.F.; Ng, L.C.
1998-03-17
The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching. 35 figs.
Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
Holzrichter, John F.; Ng, Lawrence C.
1998-01-01
The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching.
Characteristics of speaking style and implications for speech recognition.
Shinozaki, Takahiro; Ostendorf, Mari; Atlas, Les
2009-09-01
Differences in speaking style are associated with more or less spectral variability, as well as different modulation characteristics. The greater variation in some styles (e.g., spontaneous speech and infant-directed speech) poses challenges for recognition but possibly also opportunities for learning more robust models, as evidenced by prior work and motivated by child language acquisition studies. In order to investigate this possibility, this work proposes a new method for characterizing speaking style (the modulation spectrum), examines spontaneous, read, adult-directed, and infant-directed styles in this space, and conducts pilot experiments in style detection and sampling for improved speech recognizer training. Speaking style classification is improved by using the modulation spectrum in combination with standard pitch and energy variation. Speech recognition experiments on a small vocabulary conversational speech recognition task show that sampling methods for training with a small amount of data benefit from the new features.
How should a speech recognizer work?
Scharenborg, Odette; Norris, Dennis; Bosch, Louis; McQueen, James M
2005-11-12
Although researchers studying human speech recognition (HSR) and automatic speech recognition (ASR) share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of communication follows largely from the fact that research in these related fields has focused on the mechanics of how speech can be recognized. In Marr's (1982) terms, emphasis has been on the algorithmic and implementational levels rather than on the computational level. In this article, we provide a computational-level analysis of the task of speech recognition, which reveals the close parallels between research concerned with HSR and ASR. We illustrate this relation by presenting a new computational model of human spoken-word recognition, built using techniques from the field of ASR that, in contrast to current existing models of HSR, recognizes words from real speech input. 2005 Lawrence Erlbaum Associates, Inc.
Wolfe, Jace; Morais, Mila; Schafer, Erin; Agrawal, Smita; Koch, Dawn
2015-05-01
Cochlear implant recipients often experience difficulty with understanding speech in the presence of noise. Cochlear implant manufacturers have developed sound processing algorithms designed to improve speech recognition in noise, and research has shown these technologies to be effective. Remote microphone technology utilizing adaptive, digital wireless radio transmission has also been shown to provide significant improvement in speech recognition in noise. There are no studies examining the potential improvement in speech recognition in noise when these two technologies are used simultaneously. The goal of this study was to evaluate the potential benefits and limitations associated with the simultaneous use of a sound processing algorithm designed to improve performance in noise (Advanced Bionics ClearVoice) and a remote microphone system that incorporates adaptive, digital wireless radio transmission (Phonak Roger). A two-by-two way repeated measures design was used to examine performance differences obtained without these technologies compared to the use of each technology separately as well as the simultaneous use of both technologies. Eleven Advanced Bionics (AB) cochlear implant recipients, ages 11 to 68 yr. AzBio sentence recognition was measured in quiet and in the presence of classroom noise ranging in level from 50 to 80 dBA in 5-dB steps. Performance was evaluated in four conditions: (1) No ClearVoice and no Roger, (2) ClearVoice enabled without the use of Roger, (3) ClearVoice disabled with Roger enabled, and (4) simultaneous use of ClearVoice and Roger. Speech recognition in quiet was better than speech recognition in noise for all conditions. Use of ClearVoice and Roger each provided significant improvement in speech recognition in noise. The best performance in noise was obtained with the simultaneous use of ClearVoice and Roger. ClearVoice and Roger technology each improves speech recognition in noise, particularly when used at the same time. Because ClearVoice does not degrade performance in quiet settings, clinicians should consider recommending ClearVoice for routine, full-time use for AB implant recipients. Roger should be used in all instances in which remote microphone technology may assist the user in understanding speech in the presence of noise. American Academy of Audiology.
Speech recognition: how good is good enough?
Krohn, Richard
2002-03-01
Since its infancy in the early 1990s, the technology of speech recognition has undergone a rapid evolution. Not only has the reliability of the programming improved dramatically, the return on investment has become increasingly compelling. The author describes some of the latest health care applications of speech-recognition technology, and how the next advances will be made in this area.
Internship Abstract and Final Reflection
NASA Technical Reports Server (NTRS)
Sandor, Edward
2016-01-01
The primary objective for this internship is the evaluation of an embedded natural language processor (NLP) as a way to introduce voice control into future space suits. An embedded natural language processor would provide an astronaut hands-free control for making adjustments to the environment of the space suit and checking status of consumables procedures and navigation. Additionally, the use of an embedded NLP could potentially reduce crew fatigue, increase the crewmember's situational awareness during extravehicular activity (EVA) and improve the ability to focus on mission critical details. The use of an embedded NLP may be valuable for other human spaceflight applications desiring hands-free control as well. An embedded NLP is unique because it is a small device that performs language tasks, including speech recognition, which normally require powerful processors. The dedicated device could perform speech recognition locally with a smaller form-factor and lower power consumption than traditional methods.
Basirat, Anahita
2017-01-01
Cochlear implant (CI) users frequently achieve good speech understanding based on phoneme and word recognition. However, there is a significant variability between CI users in processing prosody. The aim of this study was to examine the abilities of an excellent CI user to segment continuous speech using intonational cues. A post-lingually deafened adult CI user and 22 normal hearing (NH) subjects segmented phonemically identical and prosodically different sequences in French such as 'l'affiche' (the poster) versus 'la fiche' (the sheet), both [lafiʃ]. All participants also completed a minimal pair discrimination task. Stimuli were presented in auditory-only and audiovisual presentation modalities. The performance of the CI user in the minimal pair discrimination task was 97% in the auditory-only and 100% in the audiovisual condition. In the segmentation task, contrary to the NH participants, the performance of the CI user did not differ from the chance level. Visual speech did not improve word segmentation. This result suggests that word segmentation based on intonational cues is challenging when using CIs even when phoneme/word recognition is very well rehabilitated. This finding points to the importance of the assessment of CI users' skills in prosody processing and the need for specific interventions focusing on this aspect of speech communication.
Speech recognition systems on the Cell Broadband Engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Y; Jones, H; Vaidya, S
In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine{trademark} (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousandsmore » of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.« less
Cao, Beiming; Kim, Myungjong; Mau, Ted; Wang, Jun
2017-01-01
Individuals with larynx (vocal folds) impaired have problems in controlling their glottal vibration, producing whispered speech with extreme hoarseness. Standard automatic speech recognition using only acoustic cues is typically ineffective for whispered speech because the corresponding spectral characteristics are distorted. Articulatory cues such as the tongue and lip motion may help in recognizing whispered speech since articulatory motion patterns are generally not affected. In this paper, we investigated whispered speech recognition for patients with reconstructed larynx using articulatory movement data. A data set with both acoustic and articulatory motion data was collected from a patient with surgically reconstructed larynx using an electromagnetic articulograph. Two speech recognition systems, Gaussian mixture model-hidden Markov model (GMM-HMM) and deep neural network-HMM (DNN-HMM), were used in the experiments. Experimental results showed adding either tongue or lip motion data to acoustic features such as mel-frequency cepstral coefficient (MFCC) significantly reduced the phone error rates on both speech recognition systems. Adding both tongue and lip data achieved the best performance. PMID:29423453
Tao, Duoduo; Deng, Rui; Jiang, Ye; Galvin, John J; Fu, Qian-Jie; Chen, Bing
2014-01-01
To investigate how auditory working memory relates to speech perception performance by Mandarin-speaking cochlear implant (CI) users. Auditory working memory and speech perception was measured in Mandarin-speaking CI and normal-hearing (NH) participants. Working memory capacity was measured using forward digit span and backward digit span; working memory efficiency was measured using articulation rate. Speech perception was assessed with: (a) word-in-sentence recognition in quiet, (b) word-in-sentence recognition in speech-shaped steady noise at +5 dB signal-to-noise ratio, (c) Chinese disyllable recognition in quiet, (d) Chinese lexical tone recognition in quiet. Self-reported school rank was also collected regarding performance in schoolwork. There was large inter-subject variability in auditory working memory and speech performance for CI participants. Working memory and speech performance were significantly poorer for CI than for NH participants. All three working memory measures were strongly correlated with each other for both CI and NH participants. Partial correlation analyses were performed on the CI data while controlling for demographic variables. Working memory efficiency was significantly correlated only with sentence recognition in quiet when working memory capacity was partialled out. Working memory capacity was correlated with disyllable recognition and school rank when efficiency was partialled out. There was no correlation between working memory and lexical tone recognition in the present CI participants. Mandarin-speaking CI users experience significant deficits in auditory working memory and speech performance compared with NH listeners. The present data suggest that auditory working memory may contribute to CI users' difficulties in speech understanding. The present pattern of results with Mandarin-speaking CI users is consistent with previous auditory working memory studies with English-speaking CI users, suggesting that the lexical importance of voice pitch cues (albeit poorly coded by the CI) did not influence the relationship between working memory and speech perception.
Implementation of the Intelligent Voice System for Kazakh
NASA Astrophysics Data System (ADS)
Yessenbayev, Zh; Saparkhojayev, N.; Tibeyev, T.
2014-04-01
Modern speech technologies are highly advanced and widely used in day-to-day applications. However, this is mostly concerned with the languages of well-developed countries such as English, German, Japan, Russian, etc. As for Kazakh, the situation is less prominent and research in this field is only starting to evolve. In this research and application-oriented project, we introduce an intelligent voice system for the fast deployment of call-centers and information desks supporting Kazakh speech. The demand on such a system is obvious if the country's large size and small population is considered. The landline and cell phones become the only means of communication for the distant villages and suburbs. The system features Kazakh speech recognition and synthesis modules as well as a web-GUI for efficient dialog management. For speech recognition we use CMU Sphinx engine and for speech synthesis- MaryTTS. The web-GUI is implemented in Java enabling operators to quickly create and manage the dialogs in user-friendly graphical environment. The call routines are handled by Asterisk PBX and JBoss Application Server. The system supports such technologies and protocols as VoIP, VoiceXML, FastAGI, Java SpeechAPI and J2EE. For the speech recognition experiments we compiled and used the first Kazakh speech corpus with the utterances from 169 native speakers. The performance of the speech recognizer is 4.1% WER on isolated word recognition and 6.9% WER on clean continuous speech recognition tasks. The speech synthesis experiments include the training of male and female voices.
Lexical and sublexical units in speech perception.
Giroux, Ibrahima; Rey, Arnaud
2009-03-01
Saffran, Newport, and Aslin (1996a) found that human infants are sensitive to statistical regularities corresponding to lexical units when hearing an artificial spoken language. Two sorts of segmentation strategies have been proposed to account for this early word-segmentation ability: bracketing strategies, in which infants are assumed to insert boundaries into continuous speech, and clustering strategies, in which infants are assumed to group certain speech sequences together into units (Swingley, 2005). In the present study, we test the predictions of two computational models instantiating each of these strategies i.e., Serial Recurrent Networks: Elman, 1990; and Parser: Perruchet & Vinter, 1998 in an experiment where we compare the lexical and sublexical recognition performance of adults after hearing 2 or 10 min of an artificial spoken language. The results are consistent with Parser's predictions and the clustering approach, showing that performance on words is better than performance on part-words only after 10 min. This result suggests that word segmentation abilities are not merely due to stronger associations between sublexical units but to the emergence of stronger lexical representations during the development of speech perception processes. Copyright © 2009, Cognitive Science Society, Inc.
Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
DOE Office of Scientific and Technical Information (OSTI.GOV)
Holzrichter, J.F.; Ng, L.C.
The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used formore » purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching. 35 figs.« less
Nixon, C; Anderson, T; Morris, L; McCavitt, A; McKinley, R; Yeager, D; McDaniel, M
1998-11-01
The intelligibility of female and male speech is equivalent under most ordinary living conditions. However, due to small differences between their acoustic speech signals, called speech spectra, one can be more or less intelligible than the other in certain situations such as high levels of noise. Anecdotal information, supported by some empirical observations, suggests that some of the high intensity noise spectra of military aircraft cockpits may degrade the intelligibility of female speech more than that of male speech. In an applied research study, the intelligibility of female and male speech was measured in several high level aircraft cockpit noise conditions experienced in military aviation. In Part I, (Nixon CW, et al. Aviat Space Environ Med 1998; 69:675-83) female speech intelligibility measured in the spectra and levels of aircraft cockpit noises and with noise-canceling microphones was lower than that of the male speech in all conditions. However, the differences were small and only those at some of the highest noise levels were significant. Although speech intelligibility of both genders was acceptable during normal cruise noises, improvements are required in most of the highest levels of noise created during maximum aircraft operating conditions. These results are discussed in a Part I technical report. This Part II report examines the intelligibility in the same aircraft cockpit noises of vocoded female and male speech and the accuracy with which female and male speech in some of the cockpit noises were understood by automatic speech recognition systems. The intelligibility of vocoded female speech was generally the same as that of vocoded male speech. No significant differences were measured between the recognition accuracy of male and female speech by the automatic speech recognition systems. The intelligibility of female and male speech was equivalent for these conditions.
Schall, Sonja; von Kriegstein, Katharina
2014-01-01
It has been proposed that internal simulation of the talking face of visually-known speakers facilitates auditory speech recognition. One prediction of this view is that brain areas involved in auditory-only speech comprehension interact with visual face-movement sensitive areas, even under auditory-only listening conditions. Here, we test this hypothesis using connectivity analyses of functional magnetic resonance imaging (fMRI) data. Participants (17 normal participants, 17 developmental prosopagnosics) first learned six speakers via brief voice-face or voice-occupation training (<2 min/speaker). This was followed by an auditory-only speech recognition task and a control task (voice recognition) involving the learned speakers' voices in the MRI scanner. As hypothesized, we found that, during speech recognition, familiarity with the speaker's face increased the functional connectivity between the face-movement sensitive posterior superior temporal sulcus (STS) and an anterior STS region that supports auditory speech intelligibility. There was no difference between normal participants and prosopagnosics. This was expected because previous findings have shown that both groups use the face-movement sensitive STS to optimize auditory-only speech comprehension. Overall, the present findings indicate that learned visual information is integrated into the analysis of auditory-only speech and that this integration results from the interaction of task-relevant face-movement and auditory speech-sensitive areas.
ERIC Educational Resources Information Center
Gordon-Salant, Sandra; Fitzgibbons, Peter J.; Friedman, Sarah A.
2007-01-01
Purpose: The goal of this experiment was to determine whether selective slowing of speech segments improves recognition performance by young and elderly listeners. The hypotheses were (a) the benefits of time expansion occur for rapid speech but not for natural-rate speech, (b) selective time expansion of consonants produces greater score…
Hierarchical singleton-type recurrent neural fuzzy networks for noisy speech recognition.
Juang, Chia-Feng; Chiou, Chyi-Tian; Lai, Chun-Lung
2007-05-01
This paper proposes noisy speech recognition using hierarchical singleton-type recurrent neural fuzzy networks (HSRNFNs). The proposed HSRNFN is a hierarchical connection of two singleton-type recurrent neural fuzzy networks (SRNFNs), where one is used for noise filtering and the other for recognition. The SRNFN is constructed by recurrent fuzzy if-then rules with fuzzy singletons in the consequences, and their recurrent properties make them suitable for processing speech patterns with temporal characteristics. In n words recognition, n SRNFNs are created for modeling n words, where each SRNFN receives the current frame feature and predicts the next one of its modeling word. The prediction error of each SRNFN is used as recognition criterion. In filtering, one SRNFN is created, and each SRNFN recognizer is connected to the same SRNFN filter, which filters noisy speech patterns in the feature domain before feeding them to the SRNFN recognizer. Experiments with Mandarin word recognition under different types of noise are performed. Other recognizers, including multilayer perceptron (MLP), time-delay neural networks (TDNNs), and hidden Markov models (HMMs), are also tested and compared. These experiments and comparisons demonstrate good results with HSRNFN for noisy speech recognition tasks.
Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review
NASA Astrophysics Data System (ADS)
Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH
2017-09-01
This paper reviews the state-of-the-art an automatic speech recognition (ASR) based approach for speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR based solutions are increasingly being researched for speech and language therapy. ASR is a technology that transfers human speech into transcript text by matching with the system's library. This is particularly useful in speech rehabilitation therapies as they provide accurate, real-time evaluation for speech input from an individual with speech disorder. ASR based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback response to their mistakes. However, the accuracy of ASR is dependent on many factors such as, phoneme recognition, speech continuity, speaker and environmental differences as well as our depth of knowledge on human language understanding. Hence, the review examines recent development of ASR technologies and its performance for individuals with speech and language disorders.
Using Speech Recognition Software to Improve Writing Skills
ERIC Educational Resources Information Center
Diaz, Felix
2014-01-01
Orthopedically impaired (OI) students face a formidable challenge during the writing process due to their limited or non-existing ability to use their hands to hold a pen or pencil or even to press the keys on a keyboard. While they may have a clear mental picture of what they want to write, the biggest hurdle comes well before having to tackle…
Kurzweil Reading Machine: A Partial Evaluation of Its Optical Character Recognition Error Rate.
ERIC Educational Resources Information Center
Goodrich, Gregory L.; And Others
1979-01-01
A study designed to assess the ability of the Kurzweil reading machine (a speech reading device for the visually handicapped) to read three different type styles produced by five different means indicated that the machines tested had different error rates depending upon the means of producing the copy and upon the type style used. (Author/CL)
Working Memory Load Affects Processing Time in Spoken Word Recognition: Evidence from Eye-Movements
Hadar, Britt; Skrzypek, Joshua E.; Wingfield, Arthur; Ben-David, Boaz M.
2016-01-01
In daily life, speech perception is usually accompanied by other tasks that tap into working memory capacity. However, the role of working memory on speech processing is not clear. The goal of this study was to examine how working memory load affects the timeline for spoken word recognition in ideal listening conditions. We used the “visual world” eye-tracking paradigm. The task consisted of spoken instructions referring to one of four objects depicted on a computer monitor (e.g., “point at the candle”). Half of the trials presented a phonological competitor to the target word that either overlapped in the initial syllable (onset) or at the last syllable (offset). Eye movements captured listeners' ability to differentiate the target noun from its depicted phonological competitor (e.g., candy or sandal). We manipulated working memory load by using a digit pre-load task, where participants had to retain either one (low-load) or four (high-load) spoken digits for the duration of a spoken word recognition trial. The data show that the high-load condition delayed real-time target discrimination. Specifically, a four-digit load was sufficient to delay the point of discrimination between the spoken target word and its phonological competitor. Our results emphasize the important role working memory plays in speech perception, even when performed by young adults in ideal listening conditions. PMID:27242424
Robot Command Interface Using an Audio-Visual Speech Recognition System
NASA Astrophysics Data System (ADS)
Ceballos, Alexánder; Gómez, Juan; Prieto, Flavio; Redarce, Tanneguy
In recent years audio-visual speech recognition has emerged as an active field of research thanks to advances in pattern recognition, signal processing and machine vision. Its ultimate goal is to allow human-computer communication using voice, taking into account the visual information contained in the audio-visual speech signal. This document presents a command's automatic recognition system using audio-visual information. The system is expected to control the laparoscopic robot da Vinci. The audio signal is treated using the Mel Frequency Cepstral Coefficients parametrization method. Besides, features based on the points that define the mouth's outer contour according to the MPEG-4 standard are used in order to extract the visual speech information.
Lawler, Marshall; Yu, Jeffrey; Aronoff, Justin M
Although speech perception is the gold standard for measuring cochlear implant (CI) users' performance, speech perception tests often require extensive adaptation to obtain accurate results, particularly after large changes in maps. Spectral ripple tests, which measure spectral resolution, are an alternate measure that has been shown to correlate with speech perception. A modified spectral ripple test, the spectral-temporally modulated ripple test (SMRT) has recently been developed, and the objective of this study was to compare speech perception and performance on the SMRT for a heterogeneous population of unilateral CI users, bilateral CI users, and bimodal users. Twenty-five CI users (eight using unilateral CIs, nine using bilateral CIs, and eight using a CI and a hearing aid) were tested on the Arizona Biomedical Institute Sentence Test (AzBio) with a +8 dB signal to noise ratio, and on the SMRT. All participants were tested with their clinical programs. There was a significant correlation between SMRT and AzBio performance. After a practice block, an improvement of one ripple per octave for SMRT corresponded to an improvement of 12.1% for AzBio. Additionally, there was no significant difference in slope or intercept between any of the CI populations. The results indicate that performance on the SMRT correlates with speech recognition in noise when measured across unilateral, bilateral, and bimodal CI populations. These results suggest that SMRT scores are strongly associated with speech recognition in noise ability in experienced CI users. Further studies should focus on increasing both the size and diversity of the tested participants, and on determining whether the SMRT technique can be used for early predictions of long-term speech scores, or for evaluating differences among different stimulation strategies or parameter settings.
Anderson, Karen L; Goldstein, Howard
2004-04-01
Children typically learn in classroom environments that have background noise and reverberation that interfere with accurate speech perception. Amplification technology can enhance the speech perception of students who are hard of hearing. This study used a single-subject alternating treatments design to compare the speech recognition abilities of children who are, hard of hearing when they were using hearing aids with each of three frequency modulated (FM) or infrared devices. Eight 9-12-year-olds with mild to severe hearing loss repeated Hearing in Noise Test (HINT) sentence lists under controlled conditions in a typical kindergarten classroom with a background noise level of +10 dB signal-to-noise (S/N) ratio and 1.1 s reverberation time. Participants listened to HINT lists using hearing aids alone and hearing aids in combination with three types of S/N-enhancing devices that are currently used in mainstream classrooms: (a) FM systems linked to personal hearing aids, (b) infrared sound field systems with speakers placed throughout the classroom, and (c) desktop personal sound field FM systems. The infrared ceiling sound field system did not provide benefit beyond that provided by hearing aids alone. Desktop and personal FM systems in combination with personal hearing aids provided substantial improvements in speech recognition. This information can assist in making S/N-enhancing device decisions for students using hearing aids. In a reverberant and noisy classroom setting, classroom sound field devices are not beneficial to speech perception for students with hearing aids, whereas either personal FM or desktop sound field systems provide listening benefits.
Caballero-Morales, Santiago-Omar
2013-01-01
An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of emotional state of Mexican Spanish speech. PMID:23935410
Enhancing speech recognition using improved particle swarm optimization based hidden Markov model.
Selvaraj, Lokesh; Ganesan, Balakrishnan
2014-01-01
Enhancing speech recognition is the primary intention of this work. In this paper a novel speech recognition method based on vector quantization and improved particle swarm optimization (IPSO) is suggested. The suggested methodology contains four stages, namely, (i) denoising, (ii) feature mining (iii), vector quantization, and (iv) IPSO based hidden Markov model (HMM) technique (IP-HMM). At first, the speech signals are denoised using median filter. Next, characteristics such as peak, pitch spectrum, Mel frequency Cepstral coefficients (MFCC), mean, standard deviation, and minimum and maximum of the signal are extorted from the denoised signal. Following that, to accomplish the training process, the extracted characteristics are given to genetic algorithm based codebook generation in vector quantization. The initial populations are created by selecting random code vectors from the training set for the codebooks for the genetic algorithm process and IP-HMM helps in doing the recognition. At this point the creativeness will be done in terms of one of the genetic operation crossovers. The proposed speech recognition technique offers 97.14% accuracy.
Liu, Yuying; Dong, Ruijuan; Li, Yuling; Xu, Tianqiu; Li, Yongxin; Chen, Xueqing; Gong, Shusheng
2014-12-01
To evaluate the auditory and speech abilities in children with auditory neuropathy spectrum disorder (ANSD) after cochlear implantation (CI) and determine the role of age at implantation. Ten children participated in this retrospective case series study. All children had evidence of ANSD. All subjects had no cochlear nerve deficiency on magnetic resonance imaging and had used the cochlear implants for a period of 12-84 months. We divided our children into two groups: children who underwent implantation before 24 months of age and children who underwent implantation after 24 months of age. Their auditory and speech abilities were evaluated using the following: behavioral audiometry, the Categories of Auditory Performance (CAP), the Meaningful Auditory Integration Scale (MAIS), the Infant-Toddler Meaningful Auditory Integration Scale (IT-MAIS), the Standard-Chinese version of the Monosyllabic Lexical Neighborhood Test (LNT), the Multisyllabic Lexical Neighborhood Test (MLNT), the Speech Intelligibility Rating (SIR) and the Meaningful Use of Speech Scale (MUSS). All children showed progress in their auditory and language abilities. The 4-frequency average hearing level (HL) (500Hz, 1000Hz, 2000Hz and 4000Hz) of aided hearing thresholds ranged from 17.5 to 57.5dB HL. All children developed time-related auditory perception and speech skills. Scores of children with ANSD who received cochlear implants before 24 months tended to be better than those of children who received cochlear implants after 24 months. Seven children completed the Mandarin Lexical Neighborhood Test. Approximately half of the children showed improved open-set speech recognition. Cochlear implantation is helpful for children with ANSD and may be a good optional treatment for many ANSD children. In addition, children with ANSD fitted with cochlear implants before 24 months tended to acquire auditory and speech skills better than children fitted with cochlear implants after 24 months. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Bidelman, Gavin M; Dexter, Lauren
2015-04-01
We examined a consistent deficit observed in bilinguals: poorer speech-in-noise (SIN) comprehension for their nonnative language. We recorded neuroelectric mismatch potentials in mono- and bi-lingual listeners in response to contrastive speech sounds in noise. Behaviorally, late bilinguals required ∼10dB more favorable signal-to-noise ratios to match monolinguals' SIN abilities. Source analysis of cortical activity demonstrated monotonic increase in response latency with noise in superior temporal gyrus (STG) for both groups, suggesting parallel degradation of speech representations in auditory cortex. Contrastively, we found differential speech encoding between groups within inferior frontal gyrus (IFG)-adjacent to Broca's area-where noise delays observed in nonnative listeners were offset in monolinguals. Notably, brain-behavior correspondences double dissociated between language groups: STG activation predicted bilinguals' SIN, whereas IFG activation predicted monolinguals' performance. We infer higher-order brain areas act compensatorily to enhance impoverished sensory representations but only when degraded speech recruits linguistic brain mechanisms downstream from initial auditory-sensory inputs. Copyright © 2015 Elsevier Inc. All rights reserved.
Speaker recognition with temporal cues in acoustic and electric hearing
NASA Astrophysics Data System (ADS)
Vongphoe, Michael; Zeng, Fan-Gang
2005-08-01
Natural spoken language processing includes not only speech recognition but also identification of the speaker's gender, age, emotional, and social status. Our purpose in this study is to evaluate whether temporal cues are sufficient to support both speech and speaker recognition. Ten cochlear-implant and six normal-hearing subjects were presented with vowel tokens spoken by three men, three women, two boys, and two girls. In one condition, the subject was asked to recognize the vowel. In the other condition, the subject was asked to identify the speaker. Extensive training was provided for the speaker recognition task. Normal-hearing subjects achieved nearly perfect performance in both tasks. Cochlear-implant subjects achieved good performance in vowel recognition but poor performance in speaker recognition. The level of the cochlear implant performance was functionally equivalent to normal performance with eight spectral bands for vowel recognition but only to one band for speaker recognition. These results show a disassociation between speech and speaker recognition with primarily temporal cues, highlighting the limitation of current speech processing strategies in cochlear implants. Several methods, including explicit encoding of fundamental frequency and frequency modulation, are proposed to improve speaker recognition for current cochlear implant users.
Leong, Victoria; Goswami, Usha
2014-01-01
Dyslexia is associated with impaired neural representation of the sound structure of words (phonology). The “phonological deficit” in dyslexia may arise in part from impaired speech rhythm perception, thought to depend on neural oscillatory phase-locking to slow amplitude modulation (AM) patterns in the speech envelope. Speech contains AM patterns at multiple temporal rates, and these different AM rates are associated with phonological units of different grain sizes, e.g., related to stress, syllables or phonemes. Here, we assess the ability of adults with dyslexia to use speech AMs to identify rhythm patterns (RPs). We study 3 important temporal rates: “Stress” (~2 Hz), “Syllable” (~4 Hz) and “Sub-beat” (reduced syllables, ~14 Hz). 21 dyslexics and 21 controls listened to nursery rhyme sentences that had been tone-vocoded using either single AM rates from the speech envelope (Stress only, Syllable only, Sub-beat only) or pairs of AM rates (Stress + Syllable, Syllable + Sub-beat). They were asked to use the acoustic rhythm of the stimulus to identity the original nursery rhyme sentence. The data showed that dyslexics were significantly poorer at detecting rhythm compared to controls when they had to utilize multi-rate temporal information from pairs of AMs (Stress + Syllable or Syllable + Sub-beat). These data suggest that dyslexia is associated with a reduced ability to utilize AMs <20 Hz for rhythm recognition. This perceptual deficit in utilizing AM patterns in speech could be underpinned by less efficient neuronal phase alignment and cross-frequency neuronal oscillatory synchronization in dyslexia. Dyslexics' perceptual difficulties in capturing the full spectro-temporal complexity of speech over multiple timescales could contribute to the development of impaired phonological representations for words, the cognitive hallmark of dyslexia across languages. PMID:24605099
A speech-controlled environmental control system for people with severe dysarthria.
Hawley, Mark S; Enderby, Pam; Green, Phil; Cunningham, Stuart; Brownsell, Simon; Carmichael, James; Parker, Mark; Hatzis, Athanassios; O'Neill, Peter; Palmer, Rebecca
2007-06-01
Automatic speech recognition (ASR) can provide a rapid means of controlling electronic assistive technology. Off-the-shelf ASR systems function poorly for users with severe dysarthria because of the increased variability of their articulations. We have developed a limited vocabulary speaker dependent speech recognition application which has greater tolerance to variability of speech, coupled with a computerised training package which assists dysarthric speakers to improve the consistency of their vocalisations and provides more data for recogniser training. These applications, and their implementation as the interface for a speech-controlled environmental control system (ECS), are described. The results of field trials to evaluate the training program and the speech-controlled ECS are presented. The user-training phase increased the recognition rate from 88.5% to 95.4% (p<0.001). Recognition rates were good for people with even the most severe dysarthria in everyday usage in the home (mean word recognition rate 86.9%). Speech-controlled ECS were less accurate (mean task completion accuracy 78.6% versus 94.8%) but were faster to use than switch-scanning systems, even taking into account the need to repeat unsuccessful operations (mean task completion time 7.7s versus 16.9s, p<0.001). It is concluded that a speech-controlled ECS is a viable alternative to switch-scanning systems for some people with severe dysarthria and would lead, in many cases, to more efficient control of the home.
Sullivan, Jessica R.; Thibodeau, Linda M.; Assmann, Peter F.
2013-01-01
Previous studies have indicated that individuals with normal hearing (NH) experience a perceptual advantage for speech recognition in interrupted noise compared to continuous noise. In contrast, adults with hearing impairment (HI) and younger children with NH receive a minimal benefit. The objective of this investigation was to assess whether auditory training in interrupted noise would improve speech recognition in noise for children with HI and perhaps enhance their utilization of glimpsing skills. A partially-repeated measures design was used to evaluate the effectiveness of seven 1-h sessions of auditory training in interrupted and continuous noise. Speech recognition scores in interrupted and continuous noise were obtained from pre-, post-, and 3 months post-training from 24 children with moderate-to-severe hearing loss. Children who participated in auditory training in interrupted noise demonstrated a significantly greater improvement in speech recognition compared to those who trained in continuous noise. Those who trained in interrupted noise demonstrated similar improvements in both noise conditions while those who trained in continuous noise only showed modest improvements in the interrupted noise condition. This study presents direct evidence that auditory training in interrupted noise can be beneficial in improving speech recognition in noise for children with HI. PMID:23297921
ERIC Educational Resources Information Center
Fontan, Lionel; Ferrané, Isabelle; Farinas, Jérôme; Pinquier, Julien; Tardieu, Julien; Magnen, Cynthia; Gaillard, Pascal; Aumont, Xavier; Füllgrabe, Christian
2017-01-01
Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist…
Effects of intelligibility on working memory demand for speech perception.
Francis, Alexander L; Nusbaum, Howard C
2009-08-01
Understanding low-intelligibility speech is effortful. In three experiments, we examined the effects of intelligibility on working memory (WM) demands imposed by perception of synthetic speech. In all three experiments, a primary speeded word recognition task was paired with a secondary WM-load task designed to vary the availability of WM capacity during speech perception. Speech intelligibility was varied either by training listeners to use available acoustic cues in a more diagnostic manner (as in Experiment 1) or by providing listeners with more informative acoustic cues (i.e., better speech quality, as in Experiments 2 and 3). In the first experiment, training significantly improved intelligibility and recognition speed; increasing WM load significantly slowed recognition. A significant interaction between training and load indicated that the benefit of training on recognition speed was observed only under low memory load. In subsequent experiments, listeners received no training; intelligibility was manipulated by changing synthesizers. Improving intelligibility without training improved recognition accuracy, and increasing memory load still decreased it, but more intelligible speech did not produce more efficient use of available WM capacity. This suggests that perceptual learning modifies the way available capacity is used, perhaps by increasing the use of more phonetically informative features and/or by decreasing use of less informative ones.
Eckert, Mark A; Matthews, Lois J; Dubno, Judy R
2017-01-01
Even older adults with relatively mild hearing loss report hearing handicap, suggesting that hearing handicap is not completely explained by reduced speech audibility. We examined the extent to which self-assessed ratings of hearing handicap using the Hearing Handicap Inventory for the Elderly (HHIE; Ventry & Weinstein, 1982) were significantly associated with measures of speech recognition in noise that controlled for differences in speech audibility. One hundred sixty-two middle-aged and older adults had HHIE total scores that were significantly associated with audibility-adjusted measures of speech recognition for low-context but not high-context sentences. These findings were driven by HHIE items involving negative feelings related to communication difficulties that also captured variance in subjective ratings of effort and frustration that predicted speech recognition. The average pure-tone threshold accounted for some of the variance in the association between the HHIE and audibility-adjusted speech recognition, suggesting an effect of central and peripheral auditory system decline related to elevated thresholds. The accumulation of difficult listening experiences appears to produce a self-assessment of hearing handicap resulting from (a) reduced audibility of stimuli, (b) declines in the central and peripheral auditory system function, and (c) additional individual variation in central nervous system function.
Matthews, Lois J.; Dubno, Judy R.
2017-01-01
Purpose Even older adults with relatively mild hearing loss report hearing handicap, suggesting that hearing handicap is not completely explained by reduced speech audibility. Method We examined the extent to which self-assessed ratings of hearing handicap using the Hearing Handicap Inventory for the Elderly (HHIE; Ventry & Weinstein, 1982) were significantly associated with measures of speech recognition in noise that controlled for differences in speech audibility. Results One hundred sixty-two middle-aged and older adults had HHIE total scores that were significantly associated with audibility-adjusted measures of speech recognition for low-context but not high-context sentences. These findings were driven by HHIE items involving negative feelings related to communication difficulties that also captured variance in subjective ratings of effort and frustration that predicted speech recognition. The average pure-tone threshold accounted for some of the variance in the association between the HHIE and audibility-adjusted speech recognition, suggesting an effect of central and peripheral auditory system decline related to elevated thresholds. Conclusion The accumulation of difficult listening experiences appears to produce a self-assessment of hearing handicap resulting from (a) reduced audibility of stimuli, (b) declines in the central and peripheral auditory system function, and (c) additional individual variation in central nervous system function. PMID:28060993
Continuous multiword recognition performance of young and elderly listeners in ambient noise
NASA Astrophysics Data System (ADS)
Sato, Hiroshi
2005-09-01
Hearing threshold shift due to aging is known as a dominant factor to degrade speech recognition performance in noisy conditions. On the other hand, cognitive factors of aging-relating speech recognition performance in various speech-to-noise conditions are not well established. In this study, two kinds of speech test were performed to examine how working memory load relates to speech recognition performance. One is word recognition test with high-familiarity, four-syllable Japanese words (single-word test). In this test, each word was presented to listeners; the listeners were asked to write the word down on paper with enough time to answer. In the other test, five continuous word were presented to listeners and listeners were asked to write the word down after just five words were presented (multiword test). Both tests were done in various speech-to-noise ratios under 50-dBA Hoth spectrum noise with more than 50 young and elderly subjects. The results of two experiments suggest that (1) Hearing level is related to scores of both tests. (2) Scores of single-word test are well correlated with those of multiword test. (3) Scores of multiword test are not improved as speech-to-noise ratio improves in the condition where scores of single-word test reach their ceiling.
Speech recognition: Acoustic phonetic and lexical knowledge representation
NASA Astrophysics Data System (ADS)
Zue, V. W.
1983-02-01
The purpose of this program is to develop a speech data base facility under which the acoustic characteristics of speech sounds in various contexts can be studied conveniently; investigate the phonological properties of a large lexicon of, say 10,000 words, and determine to what extent the phontactic constraints can be utilized in speech recognition; study the acoustic cues that are used to mark work boundaries; develop a test bed in the form of a large-vocabulary, IWR system to study the interactions of acoustic, phonetic and lexical knowledge; and develop a limited continuous speech recognition system with the goal of recognizing any English word from its spelling in order to assess the interactions of higher-level knowledge sources.
NASA Astrophysics Data System (ADS)
Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.
2016-10-01
Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
Relation between measures of speech-in-noise performance and measures of efferent activity
NASA Astrophysics Data System (ADS)
Smith, Brad; Harkrider, Ashley; Burchfield, Samuel; Nabelek, Anna
2003-04-01
Individual differences in auditory perceptual abilities in noise are well documented but the factors causing such variability are unclear. The purpose of this study was to determine if individual differences in responses measured from the auditory efferent system were correlated to individual variations in speech-in-noise performance. The relation between behavioral performance on three speech-in-noise tasks and two objective measures of the efferent auditory system were examined in thirty normal-hearing, young adults. Two of the speech-in-noise tasks measured an acceptable noise level, the maximum level of speech-babble noise that a subject is willing to accept while listening to a story. For these, the acceptable noise level was evaluated using both an ipsilateral (story and noise in same ear) and a contralateral (story and noise in opposite ears) paradigm. The third speech-in-noise task evaluated speech recognition using monosyllabic words presented in competing speech babble. Auditory efferent activity was assessed by examining the resulting suppression of click-evoked otoacoustic emissions following the introduction of a contralateral, broad-band stimulus and the activity of the ipsilateral and contralateral acoustic reflex arc was evaluated using tones and broad-band noise. Results will be discussed relative to current theories of speech in noise performance and auditory inhibitory processes.
Schall, Sonja; von Kriegstein, Katharina
2014-01-01
It has been proposed that internal simulation of the talking face of visually-known speakers facilitates auditory speech recognition. One prediction of this view is that brain areas involved in auditory-only speech comprehension interact with visual face-movement sensitive areas, even under auditory-only listening conditions. Here, we test this hypothesis using connectivity analyses of functional magnetic resonance imaging (fMRI) data. Participants (17 normal participants, 17 developmental prosopagnosics) first learned six speakers via brief voice-face or voice-occupation training (<2 min/speaker). This was followed by an auditory-only speech recognition task and a control task (voice recognition) involving the learned speakers’ voices in the MRI scanner. As hypothesized, we found that, during speech recognition, familiarity with the speaker’s face increased the functional connectivity between the face-movement sensitive posterior superior temporal sulcus (STS) and an anterior STS region that supports auditory speech intelligibility. There was no difference between normal participants and prosopagnosics. This was expected because previous findings have shown that both groups use the face-movement sensitive STS to optimize auditory-only speech comprehension. Overall, the present findings indicate that learned visual information is integrated into the analysis of auditory-only speech and that this integration results from the interaction of task-relevant face-movement and auditory speech-sensitive areas. PMID:24466026
A preliminary comparison of speech recognition functionality in dental practice management systems.
Irwin, Jeannie Y; Schleyer, Titus
2008-11-06
In this study, we examined speech recognition functionality in four leading dental practice management systems. Twenty dental students used voice to chart a simulated patient with 18 findings in each system. Results show it can take over a minute to chart one finding and that users frequently have to repeat commands. Limited functionality, poor usability and a high error rate appear to retard adoption of speech recognition in dentistry.
Two Stage Data Augmentation for Low Resourced Speech Recognition (Author’s Manuscript)
2016-09-12
speech recognition, deep neural networks, data augmentation 1. Introduction When training data is limited—whether it be audio or text—the obvious...Schwartz, and S. Tsakalidis, “Enhancing low resource keyword spotting with au- tomatically retrieved web documents,” in Interspeech, 2015, pp. 839–843. [2...and F. Seide, “Feature learning in deep neural networks - a study on speech recognition tasks,” in International Conference on Learning Representations
Getting What You Want: Accurate Document Filtering in a Terabyte World
2002-11-01
models are used widely in speech recognition and have shown promise for ad-hoc information retrieval (Ponte and Croft, 1998; Lafferty and Zhai, 2001...tasks is focused on developing techniques similar to those used in speech recognition. However the differing requirements of speech recognition and...Conference on Research and Development in Information Retrieval. ACM. 6. T.Ault, and Y. Yang. (2001.) kNN at TREC-9: A failure analysis. In
V2S: Voice to Sign Language Translation System for Malaysian Deaf People
NASA Astrophysics Data System (ADS)
Mean Foong, Oi; Low, Tang Jung; La, Wai Wan
The process of learning and understand the sign language may be cumbersome to some, and therefore, this paper proposes a solution to this problem by providing a voice (English Language) to sign language translation system using Speech and Image processing technique. Speech processing which includes Speech Recognition is the study of recognizing the words being spoken, regardless of whom the speaker is. This project uses template-based recognition as the main approach in which the V2S system first needs to be trained with speech pattern based on some generic spectral parameter set. These spectral parameter set will then be stored as template in a database. The system will perform the recognition process through matching the parameter set of the input speech with the stored templates to finally display the sign language in video format. Empirical results show that the system has 80.3% recognition rate.
Kreitewolf, Jens; Friederici, Angela D; von Kriegstein, Katharina
2014-11-15
Hemispheric specialization for linguistic prosody is a controversial issue. While it is commonly assumed that linguistic prosody and emotional prosody are preferentially processed in the right hemisphere, neuropsychological work directly comparing processes of linguistic prosody and emotional prosody suggests a predominant role of the left hemisphere for linguistic prosody processing. Here, we used two functional magnetic resonance imaging (fMRI) experiments to clarify the role of left and right hemispheres in the neural processing of linguistic prosody. In the first experiment, we sought to confirm previous findings showing that linguistic prosody processing compared to other speech-related processes predominantly involves the right hemisphere. Unlike previous studies, we controlled for stimulus influences by employing a prosody and speech task using the same speech material. The second experiment was designed to investigate whether a left-hemispheric involvement in linguistic prosody processing is specific to contrasts between linguistic prosody and emotional prosody or whether it also occurs when linguistic prosody is contrasted against other non-linguistic processes (i.e., speaker recognition). Prosody and speaker tasks were performed on the same stimulus material. In both experiments, linguistic prosody processing was associated with activity in temporal, frontal, parietal and cerebellar regions. Activation in temporo-frontal regions showed differential lateralization depending on whether the control task required recognition of speech or speaker: recognition of linguistic prosody predominantly involved right temporo-frontal areas when it was contrasted against speech recognition; when contrasted against speaker recognition, recognition of linguistic prosody predominantly involved left temporo-frontal areas. The results show that linguistic prosody processing involves functions of both hemispheres and suggest that recognition of linguistic prosody is based on an inter-hemispheric mechanism which exploits both a right-hemispheric sensitivity to pitch information and a left-hemispheric dominance in speech processing. Copyright © 2014 Elsevier Inc. All rights reserved.
Stanley, Nicholas; Davis, Tara; Estis, Julie
2017-03-01
Aging effects on speech understanding in noise have primarily been assessed through speech recognition tasks. Recognition tasks, which focus on bottom-up, perceptual aspects of speech understanding, intentionally limit linguistic and cognitive factors by asking participants to only repeat what they have heard. On the other hand, linguistic processing tasks require bottom-up and top-down (linguistic, cognitive) processing skills and are, therefore, more reflective of speech understanding abilities used in everyday communication. The effect of signal-to-noise ratio (SNR) on linguistic processing ability is relatively unknown for either young (YAs) or older adults (OAs). To determine if reduced SNRs would be more deleterious to the linguistic processing of OAs than YAs, as measured by accuracy and reaction time in a semantic judgment task in competing speech. In the semantic judgment task, participants indicated via button press whether word pairs were a semantic Match or No Match. This task was performed in quiet, as well as, +3, 0, -3, and -6 dB SNR with two-talker speech competition. Seventeen YAs (20-30 yr) with normal hearing sensitivity and 17 OAs (60-68 yr) with normal hearing sensitivity or mild-to-moderate sensorineural hearing loss within age-appropriate norms. Accuracy, reaction time, and false alarm rate were measured and analyzed using a mixed design analysis of variance. A decrease in SNR level significantly reduced accuracy and increased reaction time in both YAs and OAs. However, poor SNRs affected accuracy and reaction time of Match and No Match word pairs differently. Accuracy for Match pairs declined at a steeper rate than No Match pairs in both groups as SNR decreased. In addition, reaction time for No Match pairs increased at a greater rate than Match pairs in more difficult SNRs, particularly at -3 and -6 dB SNR. False-alarm rates indicated that participants had a response bias to No Match pairs as the SNR decreased. Age-related differences were limited to No Match pair accuracies at -6 dB SNR. The ability to correctly identify semantically matched word pairs was more susceptible to disruption by a poor SNR than semantically unrelated words in both YAs and OAs. The effect of SNR on this semantic judgment task implies that speech competition differentially affected the facilitation of semantically related words and the inhibition of semantically incompatible words, although processing speed, as measured by reaction time, remained faster for semantically matched pairs. Overall, the semantic judgment task in competing speech elucidated the effect of a poor listening environment on the higher order processing of words. American Academy of Audiology
Messaoud-Galusi, Souhila; Hazan, Valerie; Rosen, Stuart
2012-01-01
Purpose The claim that speech perception abilities are impaired in dyslexia was investigated in a group of 62 dyslexic children and 51 average readers matched in age. Method To test whether there was robust evidence of speech perception deficits in children with dyslexia, speech perception in noise and quiet was measured using eight different tasks involving the identification and discrimination of a complex and highly natural synthetic ‘pea’-‘bee’ contrast (copy synthesised from natural models) and the perception of naturally-produced words. Results Children with dyslexia, on average, performed more poorly than average readers in the synthetic syllables identification task in quiet and in across-category discrimination (but not when tested using an adaptive procedure). They did not differ from average readers on two tasks of word recognition in noise or identification of synthetic syllables in noise. For all tasks, a majority of individual children with dyslexia performed within norms. Finally, speech perception generally did not correlate with pseudo-word reading or phonological processing, the core skills related to dyslexia. Conclusions On the tasks and speech stimuli we used, most children with dyslexia do not appear to show a consistent deficit in speech perception. PMID:21930615
Analytic study of the Tadoma method: background and preliminary results.
Norton, S J; Schultz, M C; Reed, C M; Braida, L D; Durlach, N I; Rabinowitz, W M; Chomsky, C
1977-09-01
Certain deaf-blind persons have been taught, through the Tadoma method of speechreading, to use vibrotactile cues from the face and neck to understand speech. This paper reports the results of preliminary tests of the speechreading ability of one adult Tadoma user. The tests were of four major types: (1) discrimination of speech stimuli; (2) recognition of words in isolation and in sentences; (3) interpretation of prosodic and syntactic features in sentences; and (4) comprehension of written (Braille) and oral speech. Words in highly contextual environments were much better perceived than were words in low-context environments. Many of the word errors involved phonemic substitutions which shared articulatory features with the target phonemes, with a higher error rate for vowels than consonants. Relative to performance on word-recognition tests, performance on some of the discrimination tests was worse than expected. Perception of sentences appeared to be mildly sensitive to rate of talking and to speaker differences. Results of the tests on perception of prosodic and syntactic features, while inconclusive, indicate that many of the features tested were not used in interpreting sentences. On an English comprehension test, a higher score was obtained for items administered in Braille than through oral presentation.
Mehraei, Golbarg; Gallun, Frederick J; Leek, Marjorie R; Bernstein, Joshua G W
2014-07-01
Poor speech understanding in noise by hearing-impaired (HI) listeners is only partly explained by elevated audiometric thresholds. Suprathreshold-processing impairments such as reduced temporal or spectral resolution or temporal fine-structure (TFS) processing ability might also contribute. Although speech contains dynamic combinations of temporal and spectral modulation and TFS content, these capabilities are often treated separately. Modulation-depth detection thresholds for spectrotemporal modulation (STM) applied to octave-band noise were measured for normal-hearing and HI listeners as a function of temporal modulation rate (4-32 Hz), spectral ripple density [0.5-4 cycles/octave (c/o)] and carrier center frequency (500-4000 Hz). STM sensitivity was worse than normal for HI listeners only for a low-frequency carrier (1000 Hz) at low temporal modulation rates (4-12 Hz) and a spectral ripple density of 2 c/o, and for a high-frequency carrier (4000 Hz) at a high spectral ripple density (4 c/o). STM sensitivity for the 4-Hz, 4-c/o condition for a 4000-Hz carrier and for the 4-Hz, 2-c/o condition for a 1000-Hz carrier were correlated with speech-recognition performance in noise after partialling out the audiogram-based speech-intelligibility index. Poor speech-reception and STM-detection performance for HI listeners may be related to a combination of reduced frequency selectivity and a TFS-processing deficit limiting the ability to track spectral-peak movements.
Brouwer, Susanne; Van Engen, Kristin J; Calandruccio, Lauren; Bradlow, Ann R
2012-02-01
This study examined whether speech-on-speech masking is sensitive to variation in the degree of similarity between the target and the masker speech. Three experiments investigated whether speech-in-speech recognition varies across different background speech languages (English vs Dutch) for both English and Dutch targets, as well as across variation in the semantic content of the background speech (meaningful vs semantically anomalous sentences), and across variation in listener status vis-à-vis the target and masker languages (native, non-native, or unfamiliar). The results showed that the more similar the target speech is to the masker speech (e.g., same vs different language, same vs different levels of semantic content), the greater the interference on speech recognition accuracy. Moreover, the listener's knowledge of the target and the background language modulate the size of the release from masking. These factors had an especially strong effect on masking effectiveness in highly unfavorable listening conditions. Overall this research provided evidence that that the degree of target-masker similarity plays a significant role in speech-in-speech recognition. The results also give insight into how listeners assign their resources differently depending on whether they are listening to their first or second language. © 2012 Acoustical Society of America
Brouwer, Susanne; Van Engen, Kristin J.; Calandruccio, Lauren; Bradlow, Ann R.
2012-01-01
This study examined whether speech-on-speech masking is sensitive to variation in the degree of similarity between the target and the masker speech. Three experiments investigated whether speech-in-speech recognition varies across different background speech languages (English vs Dutch) for both English and Dutch targets, as well as across variation in the semantic content of the background speech (meaningful vs semantically anomalous sentences), and across variation in listener status vis-à-vis the target and masker languages (native, non-native, or unfamiliar). The results showed that the more similar the target speech is to the masker speech (e.g., same vs different language, same vs different levels of semantic content), the greater the interference on speech recognition accuracy. Moreover, the listener’s knowledge of the target and the background language modulate the size of the release from masking. These factors had an especially strong effect on masking effectiveness in highly unfavorable listening conditions. Overall this research provided evidence that that the degree of target-masker similarity plays a significant role in speech-in-speech recognition. The results also give insight into how listeners assign their resources differently depending on whether they are listening to their first or second language. PMID:22352516
Zhang, Linjun; Li, Yu; Wu, Han; Li, Xin; Shu, Hua; Zhang, Yang; Li, Ping
2016-01-01
Speech recognition by second language (L2) learners in optimal and suboptimal conditions has been examined extensively with English as the target language in most previous studies. This study extended existing experimental protocols (Wang et al., 2013) to investigate Mandarin speech recognition by Japanese learners of Mandarin at two different levels (elementary vs. intermediate) of proficiency. The overall results showed that in addition to L2 proficiency, semantic context, F0 contours, and listening condition all affected the recognition performance on the Mandarin sentences. However, the effects of semantic context and F0 contours on L2 speech recognition diverged to some extent. Specifically, there was significant modulation effect of listening condition on semantic context, indicating that L2 learners made use of semantic context less efficiently in the interfering background than in quiet. In contrast, no significant modulation effect of listening condition on F0 contours was found. Furthermore, there was significant interaction between semantic context and F0 contours, indicating that semantic context becomes more important for L2 speech recognition when F0 information is degraded. None of these effects were found to be modulated by L2 proficiency. The discrepancy in the effects of semantic context and F0 contours on L2 speech recognition in the interfering background might be related to differences in processing capacities required by the two types of information in adverse listening conditions.
Voice emotion recognition by cochlear-implanted children and their normally-hearing peers.
Chatterjee, Monita; Zion, Danielle J; Deroche, Mickael L; Burianek, Brooke A; Limb, Charles J; Goren, Alison P; Kulkarni, Aditya M; Christensen, Julie A
2015-04-01
Despite their remarkable success in bringing spoken language to hearing impaired listeners, the signal transmitted through cochlear implants (CIs) remains impoverished in spectro-temporal fine structure. As a consequence, pitch-dominant information such as voice emotion, is diminished. For young children, the ability to correctly identify the mood/intent of the speaker (which may not always be visible in their facial expression) is an important aspect of social and linguistic development. Previous work in the field has shown that children with cochlear implants (cCI) have significant deficits in voice emotion recognition relative to their normally hearing peers (cNH). Here, we report on voice emotion recognition by a cohort of 36 school-aged cCI. Additionally, we provide for the first time, a comparison of their performance to that of cNH and NH adults (aNH) listening to CI simulations of the same stimuli. We also provide comparisons to the performance of adult listeners with CIs (aCI), most of whom learned language primarily through normal acoustic hearing. Results indicate that, despite strong variability, on average, cCI perform similarly to their adult counterparts; that both groups' mean performance is similar to aNHs' performance with 8-channel noise-vocoded speech; that cNH achieve excellent scores in voice emotion recognition with full-spectrum speech, but on average, show significantly poorer scores than aNH with 8-channel noise-vocoded speech. A strong developmental effect was observed in the cNH with noise-vocoded speech in this task. These results point to the considerable benefit obtained by cochlear-implanted children from their devices, but also underscore the need for further research and development in this important and neglected area. This article is part of a Special Issue entitled
Speech Recognition Thresholds for Multilingual Populations.
ERIC Educational Resources Information Center
Ramkissoon, Ishara
2001-01-01
This article traces the development of speech audiometry in the United States and reports on the current status, focusing on the needs of a multilingual population in terms of measuring speech recognition threshold (SRT). It also discusses sociolinguistic considerations, alternative SRT stimuli for second language learners, and research on using…
ERIC Educational Resources Information Center
Sharp, Kathryn M; Gathercole, Virginia C. Mueller
2013-01-01
In recent years, there has been growing recognition of a need for a general, non-language-specific assessment tool that could be used to evaluate general speech and language abilities in children, especially to assist in identifying atypical development in bilingual children who speak a language unfamiliar to the assessor. It has been suggested…
Investigation of potential cognitive tests for use with older adults in audiology clinics.
Vaughan, Nancy; Storzbach, Daniel; Furukawa, Izumi
2008-01-01
Cognitive declines in working memory and processing speed are hallmarks of aging. Deficits in speech understanding also are seen in aging individuals. A clinical test to determine whether the cognitive aging changes contribute to aging speech understanding difficulties would be helpful for determining rehabilitation strategies in audiology clinics. To identify a clinical neurocognitive test or battery of tests that could be used in audiology clinics to help explain deficits in speech recognition in some older listeners. A correlational study examining the association between certain cognitive test scores and speech recognition performance. Speeded (time-compressed) speech was used to increase the cognitive processing load. Two hundred twenty-five adults aged 50 through 75 years were participants in this study. Both batteries of tests were administered to all participants in two separate sessions. A selected battery of neurocognitive tests and a time-compressed speech recognition test battery using various rates of speech were administered. Principal component analysis was used to extract the important component factors from each set of tests, and regression models were constructed to examine the association between tests and to identify the neurocognitive test most strongly associated with speech recognition performance. A sequencing working memory test (Letter-Number Sequencing [LNS]) was most strongly associated with rapid speech understanding. The association between the LNS test results and the compressed sentence recognition scores (CSRS) was strong even when age and hearing loss were controlled. The LNS is a sequencing test that provides information about temporal processing at the cognitive level and may prove useful in diagnosis of speech understanding problems, and in the development of aural rehabilitation and training strategies.
NASA Astrophysics Data System (ADS)
Maskeliunas, Rytis; Rudzionis, Vytautas
2011-06-01
In recent years various commercial speech recognizers have become available. These recognizers provide the possibility to develop applications incorporating various speech recognition techniques easily and quickly. All of these commercial recognizers are typically targeted to widely spoken languages having large market potential; however, it may be possible to adapt available commercial recognizers for use in environments where less widely spoken languages are used. Since most commercial recognition engines are closed systems the single avenue for the adaptation is to try set ways for the selection of proper phonetic transcription methods between the two languages. This paper deals with the methods to find the phonetic transcriptions for Lithuanian voice commands to be recognized using English speech engines. The experimental evaluation showed that it is possible to find phonetic transcriptions that will enable the recognition of Lithuanian voice commands with recognition accuracy of over 90%.
Liu, Xunying; Zhang, Chao; Woodland, Phil; Fonteneau, Elisabeth
2017-01-01
There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental ‘machine states’, generated as the ASR analysis progresses over time, to the incremental ‘brain states’, measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain. PMID:28945744
Speech Recognition for Medical Dictation: Overview in Quebec and Systematic Review.
Poder, Thomas G; Fisette, Jean-François; Déry, Véronique
2018-04-03
Speech recognition is increasingly used in medical reporting. The aim of this article is to identify in the literature the strengths and weaknesses of this technology, as well as barriers to and facilitators of its implementation. A systematic review of systematic reviews was performed using PubMed, Scopus, the Cochrane Library and the Center for Reviews and Dissemination through August 2017. The gray literature has also been consulted. The quality of systematic reviews has been assessed with the AMSTAR checklist. The main inclusion criterion was use of speech recognition for medical reporting (front-end or back-end). A survey has also been conducted in Quebec, Canada, to identify the dissemination of this technology in this province, as well as the factors leading to the success or failure of its implementation. Five systematic reviews were identified. These reviews indicated a high level of heterogeneity across studies. The quality of the studies reported was generally poor. Speech recognition is not as accurate as human transcription, but it can dramatically reduce turnaround times for reporting. In front-end use, medical doctors need to spend more time on dictation and correction than required with human transcription. With speech recognition, major errors occur up to three times more frequently. In back-end use, a potential increase in productivity of transcriptionists was noted. In conclusion, speech recognition offers several advantages for medical reporting. However, these advantages are countered by an increased burden on medical doctors and by risks of additional errors in medical reports. It is also hard to identify for which medical specialties and which clinical activities the use of speech recognition will be the most beneficial.
Söderlund, Göran B. W.; Jobs, Elisabeth Nilsson
2016-01-01
The most common neuropsychiatric condition in the in children is attention deficit hyperactivity disorder (ADHD), affecting ∼6–9% of the population. ADHD is distinguished by inattention and hyperactive, impulsive behaviors as well as poor performance in various cognitive tasks often leading to failures at school. Sensory and perceptual dysfunctions have also been noticed. Prior research has mainly focused on limitations in executive functioning where differences are often explained by deficits in pre-frontal cortex activation. Less notice has been given to sensory perception and subcortical functioning in ADHD. Recent research has shown that children with ADHD diagnosis have a deviant auditory brain stem response compared to healthy controls. The aim of the present study was to investigate if the speech recognition threshold differs between attentive and children with ADHD symptoms in two environmental sound conditions, with and without external noise. Previous research has namely shown that children with attention deficits can benefit from white noise exposure during cognitive tasks and here we investigate if noise benefit is present during an auditory perceptual task. For this purpose we used a modified Hagerman’s speech recognition test where children with and without attention deficits performed a binaural speech recognition task to assess the speech recognition threshold in no noise and noise conditions (65 dB). Results showed that the inattentive group displayed a higher speech recognition threshold than typically developed children and that the difference in speech recognition threshold disappeared when exposed to noise at supra threshold level. From this we conclude that inattention can partly be explained by sensory perceptual limitations that can possibly be ameliorated through noise exposure. PMID:26858679
Burk, Matthew H; Humes, Larry E; Amos, Nathan E; Strauser, Lauren E
2006-06-01
The objective of this study was to evaluate the effectiveness of a training program for hearing-impaired listeners to improve their speech-recognition performance within a background noise when listening to amplified speech. Both noise-masked young normal-hearing listeners, used to model the performance of elderly hearing-impaired listeners, and a group of elderly hearing-impaired listeners participated in the study. Of particular interest was whether training on an isolated word list presented by a standardized talker can generalize to everyday speech communication across novel talkers. Word-recognition performance was measured for both young normal-hearing (n = 16) and older hearing-impaired (n = 7) adults. Listeners were trained on a set of 75 monosyllabic words spoken by a single female talker over a 9- to 14-day period. Performance for the familiar (trained) talker was measured before and after training in both open-set and closed-set response conditions. Performance on the trained words of the familiar talker were then compared with those same words spoken by three novel talkers and to performance on a second set of untrained words presented by both the familiar and unfamiliar talkers. The hearing-impaired listeners returned 6 mo after their initial training to examine retention of the trained words as well as their ability to transfer any knowledge gained from word training to sentences containing both trained and untrained words. Both young normal-hearing and older hearing-impaired listeners performed significantly better on the word list in which they were trained versus a second untrained list presented by the same talker. Improvements on the untrained words were small but significant, indicating some generalization to novel words. The large increase in performance on the trained words, however, was maintained across novel talkers, pointing to the listener's greater focus on lexical memorization of the words rather than a focus on talker-specific acoustic characteristics. On return in 6 mo, listeners performed significantly better on the trained words relative to their initial baseline performance. Although the listeners performed significantly better on trained versus untrained words in isolation, once the trained words were embedded in sentences, no improvement in recognition over untrained words within the same sentences was shown. Older hearing-impaired listeners were able to significantly improve their word-recognition abilities through training with one talker and to the same degree as young normal-hearing listeners. The improved performance was maintained across talkers and across time. This might imply that training a listener using a standardized list and talker may still provide benefit when these same words are presented by novel talkers outside the clinic. However, training on isolated words was not sufficient to transfer to fluent speech for the specific sentence materials used within this study. Further investigation is needed regarding approaches to improve a hearing aid user's speech understanding in everyday communication situations.
Stropahl, Maren; Plotz, Karsten; Schönfeld, Rüdiger; Lenarz, Thomas; Sandmann, Pascale; Yovel, Galit; De Vos, Maarten; Debener, Stefan
2015-11-01
There is converging evidence that the auditory cortex takes over visual functions during a period of auditory deprivation. A residual pattern of cross-modal take-over may prevent the auditory cortex to adapt to restored sensory input as delivered by a cochlear implant (CI) and limit speech intelligibility with a CI. The aim of the present study was to investigate whether visual face processing in CI users activates auditory cortex and whether this has adaptive or maladaptive consequences. High-density electroencephalogram data were recorded from CI users (n=21) and age-matched normal hearing controls (n=21) performing a face versus house discrimination task. Lip reading and face recognition abilities were measured as well as speech intelligibility. Evaluation of event-related potential (ERP) topographies revealed significant group differences over occipito-temporal scalp regions. Distributed source analysis identified significantly higher activation in the right auditory cortex for CI users compared to NH controls, confirming visual take-over. Lip reading skills were significantly enhanced in the CI group and appeared to be particularly better after a longer duration of deafness, while face recognition was not significantly different between groups. However, auditory cortex activation in CI users was positively related to face recognition abilities. Our results confirm a cross-modal reorganization for ecologically valid visual stimuli in CI users. Furthermore, they suggest that residual takeover, which can persist even after adaptation to a CI is not necessarily maladaptive. Copyright © 2015 Elsevier Inc. All rights reserved.
Strategies for distant speech recognitionin reverberant environments
NASA Astrophysics Data System (ADS)
Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro
2015-12-01
Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.
2014-01-01
How speech is separated perceptually from other speech remains poorly understood. Recent research indicates that the ability of an extraneous formant to impair intelligibility depends on the variation of its frequency contour. This study explored the effects of manipulating the depth and pattern of that variation. Three formants (F1+F2+F3) constituting synthetic analogues of natural sentences were distributed across the 2 ears, together with a competitor for F2 (F2C) that listeners must reject to optimize recognition (left = F1+F2C; right = F2+F3). The frequency contours of F1 − F3 were each scaled to 50% of their natural depth, with little effect on intelligibility. Competitors were created either by inverting the frequency contour of F2 about its geometric mean (a plausibly speech-like pattern) or using a regular and arbitrary frequency contour (triangle wave, not plausibly speech-like) matched to the average rate and depth of variation for the inverted F2C. Adding a competitor typically reduced intelligibility; this reduction depended on the depth of F2C variation, being greatest for 100%-depth, intermediate for 50%-depth, and least for 0%-depth (constant) F2Cs. This suggests that competitor impact depends on overall depth of frequency variation, not depth relative to that for the target formants. The absence of tuning (i.e., no minimum in intelligibility for the 50% case) suggests that the ability to reject an extraneous formant does not depend on similarity in the depth of formant-frequency variation. Furthermore, triangle-wave competitors were as effective as their more speech-like counterparts, suggesting that the selection of formants from the ensemble also does not depend on speech-specific constraints. PMID:24842068
Optimal pattern synthesis for speech recognition based on principal component analysis
NASA Astrophysics Data System (ADS)
Korsun, O. N.; Poliyev, A. V.
2018-02-01
The algorithm for building an optimal pattern for the purpose of automatic speech recognition, which increases the probability of correct recognition, is developed and presented in this work. The optimal pattern forming is based on the decomposition of an initial pattern to principal components, which enables to reduce the dimension of multi-parameter optimization problem. At the next step the training samples are introduced and the optimal estimates for principal components decomposition coefficients are obtained by a numeric parameter optimization algorithm. Finally, we consider the experiment results that show the improvement in speech recognition introduced by the proposed optimization algorithm.
Age and measurement time-of-day effects on speech recognition in noise.
Veneman, Carrie E; Gordon-Salant, Sandra; Matthews, Lois J; Dubno, Judy R
2013-01-01
The purpose of this study was to determine the effect of measurement time of day on speech recognition in noise and the extent to which time-of-day effects differ with age. Older adults tend to have more difficulty understanding speech in noise than younger adults, even when hearing is normal. Two possible contributors to this age difference in speech recognition may be measurement time of day and inhibition. Most younger adults are "evening-type," showing peak circadian arousal in the evening, whereas most older adults are "morning-type," with circadian arousal peaking in the morning. Tasks that require inhibition of irrelevant information have been shown to be affected by measurement time of day, with maximum performance attained at one's peak time of day. The authors hypothesized that a change in inhibition will be associated with measurement time of day and therefore affect speech recognition in noise, with better performance in the morning for older adults and in the evening for younger adults. Fifteen younger evening-type adults (20-28 years) and 15 older morning-type adults with normal hearing (66-78 years) listened to the Hearing in Noise Test (HINT) and the Quick Speech in Noise (QuickSIN) test in the morning and evening (peak and off-peak times). Time of day preference was assessed using the Morningness-Eveningness Questionnaire. Sentences and noise were presented binaurally through insert earphones. During morning and evening sessions, participants solved word-association problems within the visual-distraction task (VDT), which was used as an estimate of inhibition. After each session, participants rated perceived mental demand of the tasks using a revised version of the NASA Task Load Index. Younger adults performed significantly better on the speech-in-noise tasks and rated themselves as requiring significantly less mental demand when tested at their peak (evening) than off-peak (morning) time of day. In contrast, time-of-day effects were not observed for the older adults on the speech recognition or rating tasks. Although older adults required significantly more advantageous signal-to-noise ratios than younger adults for equivalent speech-recognition performance, a significantly larger younger versus older age difference in speech recognition was observed in the evening than in the morning. Older adults performed significantly poorer than younger adults on the VDT, but performance was not affected by measurement time of day. VDT performance for misleading distracter items was significantly correlated with HINT and QuickSIN test performance at the peak measurement time of day. Although all participants had normal hearing, speech recognition in noise was significantly poorer for older than younger adults, with larger age-related differences in the evening (an off-peak time for older adults) than in the morning. The significant effect of measurement time of day suggests that this factor may impact the clinical assessment of speech recognition in noise for all individuals. It appears that inhibition, as estimated by a visual distraction task for misleading visual items, is a cognitive mechanism that is related to speech-recognition performance in noise, at least at a listener's peak time of day.
NASA Astrophysics Data System (ADS)
Mioulet, L.; Bideault, G.; Chatelain, C.; Paquet, T.; Brunessaux, S.
2015-01-01
The BLSTM-CTC is a novel recurrent neural network architecture that has outperformed previous state of the art algorithms in tasks such as speech recognition or handwriting recognition. It has the ability to process long term dependencies in temporal signals in order to label unsegmented data. This paper describes different ways of combining features using a BLSTM-CTC architecture. Not only do we explore the low level combination (feature space combination) but we also explore high level combination (decoding combination) and mid-level (internal system representation combination). The results are compared on the RIMES word database. Our results show that the low level combination works best, thanks to the powerful data modeling of the LSTM neurons.
Processing Electromyographic Signals to Recognize Words
NASA Technical Reports Server (NTRS)
Jorgensen, C. C.; Lee, D. D.
2009-01-01
A recently invented speech-recognition method applies to words that are articulated by means of the tongue and throat muscles but are otherwise not voiced or, at most, are spoken sotto voce. This method could satisfy a need for speech recognition under circumstances in which normal audible speech is difficult, poses a hazard, is disturbing to listeners, or compromises privacy. The method could also be used to augment traditional speech recognition by providing an additional source of information about articulator activity. The method can be characterized as intermediate between (1) conventional speech recognition through processing of voice sounds and (2) a method, not yet developed, of processing electroencephalographic signals to extract unspoken words directly from thoughts. This method involves computational processing of digitized electromyographic (EMG) signals from muscle innervation acquired by surface electrodes under a subject's chin near the tongue and on the side of the subject s throat near the larynx. After preprocessing, digitization, and feature extraction, EMG signals are processed by a neural-network pattern classifier, implemented in software, that performs the bulk of the recognition task as described.
Spontaneous Speech Collection for the CSR Corpus
1992-01-01
Menlo Park, California 94025 1. ABSTRACT As part of a pilot data collection for DARPA’s Continuous Speech Recognition ( CSR ) speech corpus, SRI...International experi- mented with the collection of spontaneous speeoh material. The bulk of the CSR pilot data was read versions of news articles from...variable. 2. INTRODUCTION The CSR (Continuous Speech Recognition) Corpus collec- tion can be considered the successor to the Resource Man- agemen t
Sommers, M S; Kirk, K I; Pisoni, D B
1997-04-01
The purpose of the present studies was to assess the validity of using closed-set response formats to measure two cognitive processes essential for recognizing spoken words---perceptual normalization (the ability to accommodate acoustic-phonetic variability) and lexical discrimination (the ability to isolate words in the mental lexicon). In addition, the experiments were designed to examine the effects of response format on evaluation of these two abilities in normal-hearing (NH), noise-masked normal-hearing (NMNH), and cochlear implant (CI) subject populations. The speech recognition performance of NH, NMNH, and CI listeners was measured using both open- and closed-set response formats under a number of experimental conditions. To assess talker normalization abilities, identification scores for words produced by a single talker were compared with recognition performance for items produced by multiple talkers. To examine lexical discrimination, performance for words that are phonetically similar to many other words (hard words) was compared with scores for items with few phonetically similar competitors (easy words). Open-set word identification for all subjects was significantly poorer when stimuli were produced in lists with multiple talkers compared with conditions in which all of the words were spoken by a single talker. Open-set word recognition also was better for lexically easy compared with lexically hard words. Closed-set tests, in contrast, failed to reveal the effects of either talker variability or lexical difficulty even when the response alternatives provided were systematically selected to maximize confusability with target items. These findings suggest that, although closed-set tests may provide important information for clinical assessment of speech perception, they may not adequately evaluate a number of cognitive processes that are necessary for recognizing spoken words. The parallel results obtained across all subject groups indicate that NH, NMNH, and CI listeners engage similar perceptual operations to identify spoken words. Implications of these findings for the design of new test batteries that can provide comprehensive evaluations of the individual capacities needed for processing spoken language are discussed.
Eyes and ears: Using eye tracking and pupillometry to understand challenges to speech recognition.
Van Engen, Kristin J; McLaughlin, Drew J
2018-05-04
Although human speech recognition is often experienced as relatively effortless, a number of common challenges can render the task more difficult. Such challenges may originate in talkers (e.g., unfamiliar accents, varying speech styles), the environment (e.g. noise), or in listeners themselves (e.g., hearing loss, aging, different native language backgrounds). Each of these challenges can reduce the intelligibility of spoken language, but even when intelligibility remains high, they can place greater processing demands on listeners. Noisy conditions, for example, can lead to poorer recall for speech, even when it has been correctly understood. Speech intelligibility measures, memory tasks, and subjective reports of listener difficulty all provide critical information about the effects of such challenges on speech recognition. Eye tracking and pupillometry complement these methods by providing objective physiological measures of online cognitive processing during listening. Eye tracking records the moment-to-moment direction of listeners' visual attention, which is closely time-locked to unfolding speech signals, and pupillometry measures the moment-to-moment size of listeners' pupils, which dilate in response to increased cognitive load. In this paper, we review the uses of these two methods for studying challenges to speech recognition. Copyright © 2018. Published by Elsevier B.V.
Robust recognition of loud and Lombard speech in the fighter cockpit environment
NASA Astrophysics Data System (ADS)
Stanton, Bill J., Jr.
1988-08-01
There are a number of challenges associated with incorporating speech recognition technology into the fighter cockpit. One of the major problems is the wide range of variability in the pilot's voice. That can result from changing levels of stress and workload. Increasing the training set to include abnormal speech is not an attractive option because of the innumerable conditions that would have to be represented and the inordinate amount of time to collect such a training set. A more promising approach is to study subsets of abnormal speech that have been produced under controlled cockpit conditions with the purpose of characterizing reliable shifts that occur relative to normal speech. Such was the initiative of this research. Analyses were conducted for 18 features on 17671 phoneme tokens across eight speakers for normal, loud, and Lombard speech. It was discovered that there was a consistent migration of energy in the sonorants. This discovery of reliable energy shifts led to the development of a method to reduce or eliminate these shifts in the Euclidean distances between LPC log magnitude spectra. This combination significantly improved recognition performance of loud and Lombard speech. Discrepancies in recognition error rates between normal and abnormal speech were reduced by approximately 50 percent for all eight speakers combined.
Automatic lip reading by using multimodal visual features
NASA Astrophysics Data System (ADS)
Takahashi, Shohei; Ohya, Jun
2013-12-01
Since long time ago, speech recognition has been researched, though it does not work well in noisy places such as in the car or in the train. In addition, people with hearing-impaired or difficulties in hearing cannot receive benefits from speech recognition. To recognize the speech automatically, visual information is also important. People understand speeches from not only audio information, but also visual information such as temporal changes in the lip shape. A vision based speech recognition method could work well in noisy places, and could be useful also for people with hearing disabilities. In this paper, we propose an automatic lip-reading method for recognizing the speech by using multimodal visual information without using any audio information such as speech recognition. First, the ASM (Active Shape Model) is used to track and detect the face and lip in a video sequence. Second, the shape, optical flow and spatial frequencies of the lip features are extracted from the lip detected by ASM. Next, the extracted multimodal features are ordered chronologically so that Support Vector Machine is performed in order to learn and classify the spoken words. Experiments for classifying several words show promising results of this proposed method.
Cullington, Helen E; Zeng, Fan-Gang
2011-02-01
Despite excellent performance in speech recognition in quiet, most cochlear implant users have great difficulty with speech recognition in noise, music perception, identifying tone of voice, and discriminating different talkers. This may be partly due to the pitch coding in cochlear implant speech processing. Most current speech processing strategies use only the envelope information; the temporal fine structure is discarded. One way to improve electric pitch perception is to use residual acoustic hearing via a hearing aid on the nonimplanted ear (bimodal hearing). This study aimed to test the hypothesis that bimodal users would perform better than bilateral cochlear implant users on tasks requiring good pitch perception. Four pitch-related tasks were used. 1. Hearing in Noise Test (HINT) sentences spoken by a male talker with a competing female, male, or child talker. 2. Montreal Battery of Evaluation of Amusia. This is a music test with six subtests examining pitch, rhythm and timing perception, and musical memory. 3. Aprosodia Battery. This has five subtests evaluating aspects of affective prosody and recognition of sarcasm. 4. Talker identification using vowels spoken by 10 different talkers (three men, three women, two boys, and two girls). Bilateral cochlear implant users were chosen as the comparison group. Thirteen bimodal and 13 bilateral adult cochlear implant users were recruited; all had good speech perception in quiet. There were no significant differences between the mean scores of the bimodal and bilateral groups on any of the tests, although the bimodal group did perform better than the bilateral group on almost all tests. Performance on the different pitch-related tasks was not correlated, meaning that if a subject performed one task well they would not necessarily perform well on another. The correlation between the bimodal users' hearing threshold levels in the aided ear and their performance on these tasks was weak. Although the bimodal cochlear implant group performed better than the bilateral group on most parts of the four pitch-related tests, the differences were not statistically significant. The lack of correlation between test results shows that the tasks used are not simply providing a measure of pitch ability. Even if the bimodal users have better pitch perception, the real-world tasks used are reflecting more diverse skills than pitch. This research adds to the existing speech perception, language, and localization studies that show no significant difference between bimodal and bilateral cochlear implant users.
Speech endpoint detection with non-language speech sounds for generic speech processing applications
NASA Astrophysics Data System (ADS)
McClain, Matthew; Romanowski, Brian
2009-05-01
Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known apriori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden-Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detection certain types of NLSS, such as breaths and clicks, while detection of other types of NLSS such as filled pauses will require future research.
Dmitrieva, E S; Gel'man, V Ia
2011-01-01
The listener-distinctive features of recognition of different emotional intonations (positive, negative and neutral) of male and female speakers in the presence or absence of background noise were studied in 49 adults aged 20-79 years. In all the listeners noise produced the most pronounced decrease in recognition accuracy for positive emotional intonation ("joy") as compared to other intonations, whereas it did not influence the recognition accuracy of "anger" in 65-79-year-old listeners. The higher emotion recognition rates of a noisy signal were observed for speech emotional intonations expressed by female speakers. Acoustic characteristics of noisy and clear speech signals underlying perception of speech emotional prosody were found for adult listeners of different age and gender.
Evaluating deep learning architectures for Speech Emotion Recognition.
Fayek, Haytham M; Lech, Margaret; Cavedon, Lawrence
2017-08-01
Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances. Copyright © 2017 Elsevier Ltd. All rights reserved.
Vocal Tract Representation in the Recognition of Cerebral Palsied Speech
ERIC Educational Resources Information Center
Rudzicz, Frank; Hirst, Graeme; van Lieshout, Pascal
2012-01-01
Purpose: In this study, the authors explored articulatory information as a means of improving the recognition of dysarthric speech by machine. Method: Data were derived chiefly from the TORGO database of dysarthric articulation (Rudzicz, Namasivayam, & Wolff, 2011) in which motions of various points in the vocal tract are measured during speech.…
Micro-Based Speech Recognition: Instructional Innovation for Handicapped Learners.
ERIC Educational Resources Information Center
Horn, Carin E.; Scott, Brian L.
A new voice based learning system (VBLS), which allows the handicapped user to interact with a microcomputer by voice commands, is described. Speech or voice recognition is the computerized process of identifying a spoken word or phrase, including those resulting from speech impediments. This new technology is helpful to the severely physically…
ERIC Educational Resources Information Center
Cordier, Deborah
2009-01-01
A renewed focus on foreign language (FL) learning and speech for communication has resulted in computer-assisted language learning (CALL) software developed with Automatic Speech Recognition (ASR). ASR features for FL pronunciation (Lafford, 2004) are functional components of CALL designs used for FL teaching and learning. The ASR features…
Noise Robust Speech Recognition Applied to Voice-Driven Wheelchair
NASA Astrophysics Data System (ADS)
Sasou, Akira; Kojima, Hiroaki
2009-12-01
Conventional voice-driven wheelchairs usually employ headset microphones that are capable of achieving sufficient recognition accuracy, even in the presence of surrounding noise. However, such interfaces require users to wear sensors such as a headset microphone, which can be an impediment, especially for the hand disabled. Conversely, it is also well known that the speech recognition accuracy drastically degrades when the microphone is placed far from the user. In this paper, we develop a noise robust speech recognition system for a voice-driven wheelchair. This system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors. We verified the effectiveness of our system in experiments in different environments, and confirmed that our system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors.
Studies in automatic speech recognition and its application in aerospace
NASA Astrophysics Data System (ADS)
Taylor, Michael Robinson
Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.
Zheng, Yingjun; Wu, Chao; Li, Juanhua; Li, Ruikeng; Peng, Hongjun; She, Shenglin; Ning, Yuping; Li, Liang
2018-04-04
Speech recognition under noisy "cocktail-party" environments involves multiple perceptual/cognitive processes, including target detection, selective attention, irrelevant signal inhibition, sensory/working memory, and speech production. Compared to health listeners, people with schizophrenia are more vulnerable to masking stimuli and perform worse in speech recognition under speech-on-speech masking conditions. Although the schizophrenia-related speech-recognition impairment under "cocktail-party" conditions is associated with deficits of various perceptual/cognitive processes, it is crucial to know whether the brain substrates critically underlying speech detection against informational speech masking are impaired in people with schizophrenia. Using functional magnetic resonance imaging (fMRI), this study investigated differences between people with schizophrenia (n = 19, mean age = 33 ± 10 years) and their matched healthy controls (n = 15, mean age = 30 ± 9 years) in intra-network functional connectivity (FC) specifically associated with target-speech detection under speech-on-speech-masking conditions. The target-speech detection performance under the speech-on-speech-masking condition in participants with schizophrenia was significantly worse than that in matched healthy participants (healthy controls). Moreover, in healthy controls, but not participants with schizophrenia, the strength of intra-network FC within the bilateral caudate was positively correlated with the speech-detection performance under the speech-masking conditions. Compared to controls, patients showed altered spatial activity pattern and decreased intra-network FC in the caudate. In people with schizophrenia, the declined speech-detection performance under speech-on-speech masking conditions is associated with reduced intra-caudate functional connectivity, which normally contributes to detecting target speech against speech masking via its functions of suppressing masking-speech signals.
Mendez, M F
2001-02-01
After a right temporoparietal stroke, a left-handed man lost the ability to understand speech and environmental sounds but developed greater appreciation for music. The patient had preserved reading and writing but poor verbal comprehension. Slower speech, single syllable words, and minimal written cues greatly facilitated his verbal comprehension. On identifying environmental sounds, he made predominant acoustic errors. Although he failed to name melodies, he could match, describe, and sing them. The patient had normal hearing except for presbyacusis, right-ear dominance for phonemes, and normal discrimination of basic psychoacoustic features and rhythm. Further testing disclosed difficulty distinguishing tone sequences and discriminating two clicks and short-versus-long tones, particularly in the left ear. Together, these findings suggest impairment in a direct route for temporal analysis and auditory word forms in his right hemisphere to Wernicke's area in his left hemisphere. The findings further suggest a separate and possibly rhythm-based mechanism for music recognition.
Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers
NASA Astrophysics Data System (ADS)
Caballero Morales, Santiago Omar; Cox, Stephen J.
2009-12-01
Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.
Application of advanced speech technology in manned penetration bombers
NASA Astrophysics Data System (ADS)
North, R.; Lea, W.
1982-03-01
This report documents research on the potential use of speech technology in a manned penetration bomber aircraft (B-52/G and H). The objectives of the project were to analyze the pilot/copilot crewstation tasks over a three-hour-and forty-minute mission and determine the tasks that would benefit the most from conversion to speech recognition/generation, determine the technological feasibility of each of the identified tasks, and prioritize these tasks based on these criteria. Secondary objectives of the program were to enunciate research strategies in the application of speech technologies in airborne environments, and develop guidelines for briefing user commands on the potential of using speech technologies in the cockpit. The results of this study indicated that for the B-52 crewmember, speech recognition would be most beneficial for retrieving chart and procedural data that is contained in the flight manuals. Technological feasibility of these tasks indicated that the checklist and procedural retrieval tasks would be highly feasible for a speech recognition system.
NASA Astrophysics Data System (ADS)
Selouani, Sid-Ahmed; O'Shaughnessy, Douglas
2003-12-01
Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axis issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms that of the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to[InlineEquation not available: see fulltext.] dB. We also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.
The Effect of Dynamic Pitch on Speech Recognition in Temporally Modulated Noise
ERIC Educational Resources Information Center
Shen, Jung; Souza, Pamela E.
2017-01-01
Purpose: This study investigated the effect of dynamic pitch in target speech on older and younger listeners' speech recognition in temporally modulated noise. First, we examined whether the benefit from dynamic-pitch cues depends on the temporal modulation of noise. Second, we tested whether older listeners can benefit from dynamic-pitch cues for…
Introduction and Overview of the Vicens-Reddy Speech Recognition System.
ERIC Educational Resources Information Center
Kameny, Iris; Ritea, H.
The Vicens-Reddy System is unique in the sense that it approaches the problem of speech recognition as a whole, rather than treating particular aspects of the problems as in previous attempts. For example, where earlier systems treated only segmentation of speech into phoneme groups, or detected phonemes in a given context, the Vicens-Reddy System…
Churchill, Tyler H; Kan, Alan; Goupell, Matthew J; Litovsky, Ruth Y
2014-09-01
Most contemporary cochlear implant (CI) processing strategies discard acoustic temporal fine structure (TFS) information, and this may contribute to the observed deficits in bilateral CI listeners' ability to localize sounds when compared to normal hearing listeners. Additionally, for best speech envelope representation, most contemporary speech processing strategies use high-rate carriers (≥900 Hz) that exceed the limit for interaural pulse timing to provide useful binaural information. Many bilateral CI listeners are sensitive to interaural time differences (ITDs) in low-rate (<300 Hz) constant-amplitude pulse trains. This study explored the trade-off between superior speech temporal envelope representation with high-rate carriers and binaural pulse timing sensitivity with low-rate carriers. The effects of carrier pulse rate and pulse timing on ITD discrimination, ITD lateralization, and speech recognition in quiet were examined in eight bilateral CI listeners. Stimuli consisted of speech tokens processed at different electrical stimulation rates, and pulse timings that either preserved or did not preserve acoustic TFS cues. Results showed that CI listeners were able to use low-rate pulse timing cues derived from acoustic TFS when presented redundantly on multiple electrodes for ITD discrimination and lateralization of speech stimuli.
Accommodation and Compliance Series: Employees with Arthritis
... handed keyboard, an articulating keyboard tray, speech recognition software, a trackball, and office equipment for a workstation ... space heater, additional window insulation, and speech recognition software. An insurance clerk with arthritis from systemic lupus ...
[Research on Barrier-free Home Environment System Based on Speech Recognition].
Zhu, Husheng; Yu, Hongliu; Shi, Ping; Fang, Youfang; Jian, Zhuo
2015-10-01
The number of people with physical disabilities is increasing year by year, and the trend of population aging is more and more serious. In order to improve the quality of the life, a control system of accessible home environment for the patients with serious disabilities was developed to control the home electrical devices with the voice of the patients. The control system includes a central control platform, a speech recognition module, a terminal operation module, etc. The system combines the speech recognition control technology and wireless information transmission technology with the embedded mobile computing technology, and interconnects the lamp, electronic locks, alarms, TV and other electrical devices in the home environment as a whole system through a wireless network node. The experimental results showed that speech recognition success rate was more than 84% in the home environment.
Noise-robust speech recognition through auditory feature detection and spike sequence decoding.
Schafer, Phillip B; Jin, Dezhe Z
2014-03-01
Speech recognition in noisy conditions is a major challenge for computer systems, but the human brain performs it routinely and accurately. Automatic speech recognition (ASR) systems that are inspired by neuroscience can potentially bridge the performance gap between humans and machines. We present a system for noise-robust isolated word recognition that works by decoding sequences of spikes from a population of simulated auditory feature-detecting neurons. Each neuron is trained to respond selectively to a brief spectrotemporal pattern, or feature, drawn from the simulated auditory nerve response to speech. The neural population conveys the time-dependent structure of a sound by its sequence of spikes. We compare two methods for decoding the spike sequences--one using a hidden Markov model-based recognizer, the other using a novel template-based recognition scheme. In the latter case, words are recognized by comparing their spike sequences to template sequences obtained from clean training data, using a similarity measure based on the length of the longest common sub-sequence. Using isolated spoken digits from the AURORA-2 database, we show that our combined system outperforms a state-of-the-art robust speech recognizer at low signal-to-noise ratios. Both the spike-based encoding scheme and the template-based decoding offer gains in noise robustness over traditional speech recognition methods. Our system highlights potential advantages of spike-based acoustic coding and provides a biologically motivated framework for robust ASR development.
Hemispheric association and dissociation of voice and speech information processing in stroke.
Jones, Anna B; Farrall, Andrew J; Belin, Pascal; Pernet, Cyril R
2015-10-01
As we listen to someone speaking, we extract both linguistic and non-linguistic information. Knowing how these two sets of information are processed in the brain is fundamental for the general understanding of social communication, speech recognition and therapy of language impairments. We investigated the pattern of performances in phoneme versus gender categorization in left and right hemisphere stroke patients, and found an anatomo-functional dissociation in the right frontal cortex, establishing a new syndrome in voice discrimination abilities. In addition, phoneme and gender performances were most often associated than dissociated in the left hemisphere patients, suggesting a common neural underpinnings. Copyright © 2015 Elsevier Ltd. All rights reserved.
Speech-recognition interfaces for music information retrieval
NASA Astrophysics Data System (ADS)
Goto, Masataka
2005-09-01
This paper describes two hands-free music information retrieval (MIR) systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. Our MIR-based jukebox systems employ two different speech-recognition interfaces for MIR, speech completion and speech spotter, which exploit intentionally controlled nonverbal speech information in original ways. The first is a music retrieval system with the speech-completion interface that is suitable for music stores and car-driving situations. When a user only remembers part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface that can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces. (Video clips: http://staff.aist.go.jp/m.goto/MIR/speech-if.html)
Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.
ERIC Educational Resources Information Center
Makhoul, John I.
The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. A fixed vocabulary of 55 syllables is used in the recognition system which contains 11 stops and fricatives and five tense vowels. The results of an experiment on speaker…
The effect of background noise on the word activation process in nonnative spoken-word recognition.
Scharenborg, Odette; Coumans, Juul M J; van Hout, Roeland
2018-02-01
This article investigates 2 questions: (1) does the presence of background noise lead to a differential increase in the number of simultaneously activated candidate words in native and nonnative listening? And (2) do individual differences in listeners' cognitive and linguistic abilities explain the differential effect of background noise on (non-)native speech recognition? English and Dutch students participated in an English word recognition experiment, in which either a word's onset or offset was masked by noise. The native listeners outperformed the nonnative listeners in all listening conditions. Importantly, however, the effect of noise on the multiple activation process was found to be remarkably similar in native and nonnative listening. The presence of noise increased the set of candidate words considered for recognition in both native and nonnative listening. The results indicate that the observed performance differences between the English and Dutch listeners should not be primarily attributed to a differential effect of noise, but rather to the difference between native and nonnative listening. Additional analyses showed that word-initial information was found to be more important than word-final information during spoken-word recognition. When word-initial information was no longer reliably available word recognition accuracy dropped and word frequency information could no longer be used suggesting that word frequency information is strongly tied to the onset of words and the earliest moments of lexical access. Proficiency and inhibition ability were found to influence nonnative spoken-word recognition in noise, with a higher proficiency in the nonnative language and worse inhibition ability leading to improved recognition performance. (PsycINFO Database Record (c) 2018 APA, all rights reserved).
Duke, Mila Morais; Wolfe, Jace; Schafer, Erin
2016-05-01
Cochlear implant (CI) recipients often experience difficulty understanding speech in noise and speech that originates from a distance. Many CI recipients also experience difficulty understanding speech originating from a television. Use of hearing assistance technology (HAT) may improve speech recognition in noise and for signals that originate from more than a few feet from the listener; however, there are no published studies evaluating the potential benefits of a wireless HAT designed to deliver audio signals from a television directly to a CI sound processor. The objective of this study was to compare speech recognition in quiet and in noise of CI recipients with the use of their CI alone and with the use of their CI and a wireless HAT (Cochlear Wireless TV Streamer). A two-way repeated measures design was used to evaluate performance differences obtained in quiet and in competing noise (65 dBA) with the CI sound processor alone and with the sound processor coupled to the Cochlear Wireless TV Streamer. Sixteen users of Cochlear Nucleus 24 Freedom, CI512, and CI422 implants were included in the study. Participants were evaluated in four conditions including use of the sound processor alone and use of the sound processor with the wireless streamer in quiet and in the presence of competing noise at 65 dBA. Speech recognition was evaluated in each condition with two full lists of Computer-Assisted Speech Perception Testing and Training Sentence-Level Test sentences presented from a light-emitting diode television. Speech recognition in noise was significantly better with use of the wireless streamer compared to participants' performance with their CI sound processor alone. There was also a nonsignificant trend toward better performance in quiet with use of the TV Streamer. Performance was significantly poorer when evaluated in noise compared to performance in quiet when the TV Streamer was not used. Use of the Cochlear Wireless TV Streamer designed to stream audio from a television directly to a CI sound processor provides better speech recognition in quiet and in noise when compared to performance obtained with use of the CI sound processor alone. American Academy of Audiology.
Development and preliminary evaluation of a pediatric Spanish-English speech perception task.
Calandruccio, Lauren; Gomez, Bianca; Buss, Emily; Leibold, Lori J
2014-06-01
The purpose of this study was to develop a task to evaluate children's English and Spanish speech perception abilities in either noise or competing speech maskers. Eight bilingual Spanish-English and 8 age-matched monolingual English children (ages 4.9-16.4 years) were tested. A forced-choice, picture-pointing paradigm was selected for adaptively estimating masked speech reception thresholds. Speech stimuli were spoken by simultaneous bilingual Spanish-English talkers. The target stimuli were 30 disyllabic English and Spanish words, familiar to 5-year-olds and easily illustrated. Competing stimuli included either 2-talker English or 2-talker Spanish speech (corresponding to target language) and spectrally matched noise. For both groups of children, regardless of test language, performance was significantly worse for the 2-talker than for the noise masker condition. No difference in performance was found between bilingual and monolingual children. Bilingual children performed significantly better in English than in Spanish in competing speech. For all listening conditions, performance improved with increasing age. Results indicated that the stimuli and task were appropriate for speech recognition testing in both languages, providing a more conventional measure of speech-in-noise perception as well as a measure of complex listening. Further research is needed to determine performance for Spanish-dominant listeners and to evaluate the feasibility of implementation into routine clinical use.
Development and preliminary evaluation of a pediatric Spanish/English speech perception task
Calandruccio, Lauren; Gomez, Bianca; Buss, Emily; Leibold, Lori J.
2014-01-01
Purpose To develop a task to evaluate children’s English and Spanish speech perception abilities in either noise or competing speech maskers. Methods Eight bilingual Spanish/English and eight age matched monolingual English children (ages 4.9 –16.4 years) were tested. A forced-choice, picture-pointing paradigm was selected for adaptively estimating masked speech reception thresholds. Speech stimuli were spoken by simultaneous bilingual Spanish/English talkers. The target stimuli were thirty disyllabic English and Spanish words, familiar to five-year-olds, and easily illustrated. Competing stimuli included either two-talker English or two-talker Spanish speech (corresponding to target language) and spectrally matched noise. Results For both groups of children, regardless of test language, performance was significantly worse for the two-talker than the noise masker. No difference in performance was found between bilingual and monolingual children. Bilingual children performed significantly better in English than in Spanish in competing speech. For all listening conditions, performance improved with increasing age. Conclusions Results indicate that the stimuli and task are appropriate for speech recognition testing in both languages, providing a more conventional measure of speech-in-noise perception as well as a measure of complex listening. Further research is needed to determine performance for Spanish-dominant listeners and to evaluate the feasibility of implementation into routine clinical use. PMID:24686915
1989-06-01
12 1.7 Application of the Modified Speech Transmission Index to Monaural and Binaural Speech Recognition in Normal and Impaired...describe all of the data from both groups. 1.7 Application of the Modified Speech Transmission Index to Monaural and Binaural Speech Recognition in...were obtained for materials presented to each ear separately (monaurally) and to both ears ( binaurally ). Results from the normal listeners are accurately
Emotion recognition from speech: tools and challenges
NASA Astrophysics Data System (ADS)
Al-Talabani, Abdulbasit; Sellahewa, Harin; Jassim, Sabah A.
2015-05-01
Human emotion recognition from speech is studied frequently for its importance in many applications, e.g. human-computer interaction. There is a wide diversity and non-agreement about the basic emotion or emotion-related states on one hand and about where the emotion related information lies in the speech signal on the other side. These diversities motivate our investigations into extracting Meta-features using the PCA approach, or using a non-adaptive random projection RP, which significantly reduce the large dimensional speech feature vectors that may contain a wide range of emotion related information. Subsets of Meta-features are fused to increase the performance of the recognition model that adopts the score-based LDC classifier. We shall demonstrate that our scheme outperform the state of the art results when tested on non-prompted databases or acted databases (i.e. when subjects act specific emotions while uttering a sentence). However, the huge gap between accuracy rates achieved on the different types of datasets of speech raises questions about the way emotions modulate the speech. In particular we shall argue that emotion recognition from speech should not be dealt with as a classification problem. We shall demonstrate the presence of a spectrum of different emotions in the same speech portion especially in the non-prompted data sets, which tends to be more "natural" than the acted datasets where the subjects attempt to suppress all but one emotion.
Zhu, Lianzhang; Chen, Leiming; Zhao, Dehai
2017-01-01
Accurate emotion recognition from speech is important for applications like smart health care, smart entertainment, and other smart services. High accuracy emotion recognition from Chinese speech is challenging due to the complexities of the Chinese language. In this paper, we explore how to improve the accuracy of speech emotion recognition, including speech signal feature extraction and emotion classification methods. Five types of features are extracted from a speech sample: mel frequency cepstrum coefficient (MFCC), pitch, formant, short-term zero-crossing rate and short-term energy. By comparing statistical features with deep features extracted by a Deep Belief Network (DBN), we attempt to find the best features to identify the emotion status for speech. We propose a novel classification method that combines DBN and SVM (support vector machine) instead of using only one of them. In addition, a conjugate gradient method is applied to train DBN in order to speed up the training process. Gender-dependent experiments are conducted using an emotional speech database created by the Chinese Academy of Sciences. The results show that DBN features can reflect emotion status better than artificial features, and our new classification approach achieves an accuracy of 95.8%, which is higher than using either DBN or SVM separately. Results also show that DBN can work very well for small training databases if it is properly designed. PMID:28737705
Hearing Handicap and Speech Recognition Correlate With Self-Reported Listening Effort and Fatigue.
Alhanbali, Sara; Dawes, Piers; Lloyd, Simon; Munro, Kevin J
To investigate the correlations between hearing handicap, speech recognition, listening effort, and fatigue. Eighty-four adults with hearing loss (65 to 85 years) completed three self-report questionnaires: the Fatigue Assessment Scale, the Effort Assessment Scale, and the Hearing Handicap Inventory for Elderly. Audiometric assessment included pure-tone audiometry and speech recognition in noise. There was a significant positive correlation between handicap and fatigue (r = 0.39, p < 0.05) and handicap and effort (r = 0.73, p < 0.05). There were significant (but lower) correlations between speech recognition and fatigue (r = 0.22, p < 0.05) or effort (r = 0.32, p< 0.05). There was no significant correlation between hearing level and fatigue or effort. Hearing handicap and speech recognition both correlate with self-reported listening effort and fatigue, which is consistent with a model of listening effort and fatigue where perceived difficulty is related to sustained effort and fatigue for unrewarding tasks over which the listener has low control. A clinical implication is that encouraging clients to recognize and focus on the pleasure and positive experiences of listening may result in greater satisfaction and benefit from hearing aid use.
NASA Astrophysics Data System (ADS)
Whang, Tom; Ratib, Osman M.; Umamoto, Kathleen; Grant, Edward G.; McCoy, Michael J.
2002-05-01
The goal of this study is to determine the financial value and workflow improvements achievable by replacing traditional transcription services with a speech recognition system in a large, university hospital setting. Workflow metrics were measured at two hospitals, one of which exclusively uses a transcription service (UCLA Medical Center), and the other which exclusively uses speech recognition (West Los Angeles VA Hospital). Workflow metrics include time spent per report (the sum of time spent interpreting, dictating, reviewing, and editing), transcription turnaround, and total report turnaround. Compared to traditional transcription, speech recognition resulted in radiologists spending 13-32% more time per report, but it also resulted in reduction of report turnaround time by 22-62% and reduction of marginal cost per report by 94%. The model developed here helps justify the introduction of a speech recognition system by showing that the benefits of reduced operating costs and decreased turnaround time outweigh the cost of increased time spent per report. Whether the ultimate goal is to achieve a financial objective or to improve operational efficiency, it is important to conduct a thorough analysis of workflow before implementation.
Peng, Shu-Chen; Lu, Nelson; Chatterjee, Monita
2009-01-01
Cochlear implant (CI) recipients have only limited access to fundamental frequency (F0) information, and thus exhibit deficits in speech intonation recognition. For speech intonation, F0 serves as the primary cue, and other potential acoustic cues (e.g. intensity properties) may also contribute. This study examined the effects of cooperating or conflicting acoustic cues on speech intonation recognition by adult CI and normal hearing (NH) listeners with full-spectrum and spectrally degraded speech stimuli. Identification of speech intonation that signifies question and statement contrasts was measured in 13 CI recipients and 4 NH listeners, using resynthesized bi-syllabic words, where F0 and intensity properties were systematically manipulated. The stimulus set was comprised of tokens whose acoustic cues (i.e. F0 contour and intensity patterns) were either cooperating or conflicting. Subjects identified if each stimulus is a 'statement' or a 'question' in a single-interval, 2-alternative forced-choice (2AFC) paradigm. Logistic models were fitted to the data, and estimated coefficients were compared under cooperating and conflicting conditions, between the subject groups (CI vs. NH), and under full-spectrum and spectrally degraded conditions for NH listeners. The results indicated that CI listeners' intonation recognition was enhanced by cooperating F0 contour and intensity cues, but was adversely affected by these cues being conflicting. On the other hand, with full-spectrum stimuli, NH listeners' intonation recognition was not affected by cues being cooperating or conflicting. The effects of cues being cooperating or conflicting were comparable between the CI group and NH listeners with spectrally degraded stimuli. These findings suggest the importance of taking multiple acoustic sources for speech recognition into consideration in aural rehabilitation for CI recipients. Copyright (C) 2009 S. Karger AG, Basel.
Peng, Shu-Chen; Lu, Nelson; Chatterjee, Monita
2009-01-01
Cochlear implant (CI) recipients have only limited access to fundamental frequency (F0) information, and thus exhibit deficits in speech intonation recognition. For speech intonation, F0 serves as the primary cue, and other potential acoustic cues (e.g., intensity properties) may also contribute. This study examined the effects of acoustic cues being cooperating or conflicting on speech intonation recognition by adult cochlear implant (CI), and normal-hearing (NH) listeners with full-spectrum and spectrally degraded speech stimuli. Identification of speech intonation that signifies question and statement contrasts was measured in 13 CI recipients and 4 NH listeners, using resynthesized bi-syllabic words, where F0 and intensity properties were systematically manipulated. The stimulus set was comprised of tokens whose acoustic cues, i.e., F0 contour and intensity patterns, were either cooperating or conflicting. Subjects identified if each stimulus is a “statement” or a “question” in a single-interval, two-alternative forced-choice (2AFC) paradigm. Logistic models were fitted to the data, and estimated coefficients were compared under cooperating and conflicting conditions, between the subject groups (CI vs. NH), and under full-spectrum and spectrally degraded conditions for NH listeners. The results indicated that CI listeners’ intonation recognition was enhanced by F0 contour and intensity cues being cooperating, but was adversely affected by these cues being conflicting. On the other hand, with full-spectrum stimuli, NH listeners’ intonation recognition was not affected by cues being cooperating or conflicting. The effects of cues being cooperating or conflicting were comparable between the CI group and NH listeners with spectrally-degraded stimuli. These findings suggest the importance of taking multiple acoustic sources for speech recognition into consideration in aural rehabilitation for CI recipients. PMID:19372651
Kam, Anna Chi Shan; Sung, John Ka Keung; Lee, Tan; Wong, Terence Ka Cheong; van Hasselt, Andrew
In this study, the authors evaluated the effect of personalized amplification on mobile phone speech recognition in people with and without hearing loss. This prospective study used double-blind, within-subjects, repeated measures, controlled trials to evaluate the effectiveness of applying personalized amplification based on the hearing level captured on the mobile device. The personalized amplification settings were created using modified one-third gain targets. The participants in this study included 100 adults of age between 20 and 78 years (60 with age-adjusted normal hearing and 40 with hearing loss). The performance of the participants with personalized amplification and standard settings was compared using both subjective and speech-perception measures. Speech recognition was measured in quiet and in noise using Cantonese disyllabic words. Subjective ratings on the quality, clarity, and comfortableness of the mobile signals were measured with an 11-point visual analog scale. Subjective preferences of the settings were also obtained by a paired-comparison procedure. The personalized amplification application provided better speech recognition via the mobile phone both in quiet and in noise for people with hearing impairment (improved 8 to 10%) and people with normal hearing (improved 1 to 4%). The improvement in speech recognition was significantly better for people with hearing impairment. When the average device output level was matched, more participants preferred to have the individualized gain than not to have it. The personalized amplification application has the potential to improve speech recognition for people with mild-to-moderate hearing loss, as well as people with normal hearing, in particular when listening in noisy environments.
Mehraei, Golbarg; Gallun, Frederick J.; Leek, Marjorie R.; Bernstein, Joshua G. W.
2014-01-01
Poor speech understanding in noise by hearing-impaired (HI) listeners is only partly explained by elevated audiometric thresholds. Suprathreshold-processing impairments such as reduced temporal or spectral resolution or temporal fine-structure (TFS) processing ability might also contribute. Although speech contains dynamic combinations of temporal and spectral modulation and TFS content, these capabilities are often treated separately. Modulation-depth detection thresholds for spectrotemporal modulation (STM) applied to octave-band noise were measured for normal-hearing and HI listeners as a function of temporal modulation rate (4–32 Hz), spectral ripple density [0.5–4 cycles/octave (c/o)] and carrier center frequency (500–4000 Hz). STM sensitivity was worse than normal for HI listeners only for a low-frequency carrier (1000 Hz) at low temporal modulation rates (4–12 Hz) and a spectral ripple density of 2 c/o, and for a high-frequency carrier (4000 Hz) at a high spectral ripple density (4 c/o). STM sensitivity for the 4-Hz, 4-c/o condition for a 4000-Hz carrier and for the 4-Hz, 2-c/o condition for a 1000-Hz carrier were correlated with speech-recognition performance in noise after partialling out the audiogram-based speech-intelligibility index. Poor speech-reception and STM-detection performance for HI listeners may be related to a combination of reduced frequency selectivity and a TFS-processing deficit limiting the ability to track spectral-peak movements. PMID:24993215
NASA Astrophysics Data System (ADS)
Nishiura, Takanobu; Nakamura, Satoshi
2003-10-01
Humans communicate with each other through speech by focusing on the target speech among environmental sounds in real acoustic environments. We can easily identify the target sound from other environmental sounds. For hands-free speech recognition, the identification of the target speech from environmental sounds is imperative. This mechanism may also be important for a self-moving robot to sense the acoustic environments and communicate with humans. Therefore, this paper first proposes hidden Markov model (HMM)-based environmental sound source identification. Environmental sounds are modeled by three states of HMMs and evaluated using 92 kinds of environmental sounds. The identification accuracy was 95.4%. This paper also proposes a new HMM composition method that composes speech HMMs and an HMM of categorized environmental sounds for robust environmental sound-added speech recognition. As a result of the evaluation experiments, we confirmed that the proposed HMM composition outperforms the conventional HMM composition with speech HMMs and a noise (environmental sound) HMM trained using noise periods prior to the target speech in a captured signal. [Work supported by Ministry of Public Management, Home Affairs, Posts and Telecommunications of Japan.
Speech therapy and voice recognition instrument
NASA Technical Reports Server (NTRS)
Cohen, J.; Babcock, M. L.
1972-01-01
Characteristics of electronic circuit for examining variations in vocal excitation for diagnostic purposes and in speech recognition for determiniog voice patterns and pitch changes are described. Operation of the circuit is discussed and circuit diagram is provided.
Steinke, W R; Cuddy, L L; Jakobson, L S
2001-07-01
This study describes an amateur musician, KB, who became amusic following a right-hemisphere stroke. A series of assessments conducted post-stroke revealed that KB functioned in the normal range for most verbal skills. However, compared with controls matched in age and music training, KB showed severe loss of pitch and rhythmic processing abilities. His ability to recognise and identify familiar instrumental melodies was also lost. Despite these deficits, KB performed remarkably well when asked to recognise and identify familiar song melodies presented without accompanying lyrics. This dissociation between the ability to recognise/identify song vs. instrumental melodies was replicated across different sets of musical materials, including newly learned melodies. Analyses of the acoustical and musical features of song and instrumental melodies discounted an explanation of the dissociation based on these features alone. Rather, the results suggest a functional dissociation resulting from a focal brain lesion. We propose that, in the case of song melodies, there remains sufficient activation in KB's melody analysis system to coactivate an intact representation of both associative information and the lyrics in the speech lexicon, making recognition and identification possible. In the case of instrumental melodies, no such associative processes exist; thus recognition and identification do not occur.
Multilevel Analysis in Analyzing Speech Data
ERIC Educational Resources Information Center
Guddattu, Vasudeva; Krishna, Y.
2011-01-01
The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…
NASA Astrophysics Data System (ADS)
Tanioka, Toshimasa; Egashira, Hiroyuki; Takata, Mayumi; Okazaki, Yasuhisa; Watanabe, Kenzi; Kondo, Hiroki
We have designed and implemented a PC operation support system for a physically disabled person with a speech impediment via voice. Voice operation is an effective method for a physically disabled person with involuntary movement of the limbs and the head. We have applied a commercial speech recognition engine to develop our system for practical purposes. Adoption of a commercial engine reduces development cost and will contribute to make our system useful to another speech impediment people. We have customized commercial speech recognition engine so that it can recognize the utterance of a person with a speech impediment. We have restricted the words that the recognition engine recognizes and separated a target words from similar words in pronunciation to avoid misrecognition. Huge number of words registered in commercial speech recognition engines cause frequent misrecognition for speech impediments' utterance, because their utterance is not clear and unstable. We have solved this problem by narrowing the choice of input down in a small number and also by registering their ambiguous pronunciations in addition to the original ones. To realize all character inputs and all PC operation with a small number of words, we have designed multiple input modes with categorized dictionaries and have introduced two-step input in each mode except numeral input to enable correct operation with small number of words. The system we have developed is in practical level. The first author of this paper is physically disabled with a speech impediment. He has been able not only character input into PC but also to operate Windows system smoothly by using this system. He uses this system in his daily life. This paper is written by him with this system. At present, the speech recognition is customized to him. It is, however, possible to customize for other users by changing words and registering new pronunciation according to each user's utterance.
Multilingual Phoneme Models for Rapid Speech Processing System Development
2006-09-01
processes are used to develop an Arabic speech recognition system starting from monolingual English models, In- ternational Phonetic Association (IPA...clusters. It was found that multilingual bootstrapping methods out- perform monolingual English bootstrapping methods on the Arabic evaluation data initially...International Phonetic Alphabet . . . . . . . . . 7 2.3.2 Multilingual vs. Monolingual Speech Recognition 7 2.3.3 Data-Driven Approaches
ERIC Educational Resources Information Center
Wald, Mike
2006-01-01
The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents students and staff to provide captioning of speech online or in classrooms for deaf or hard of hearing students and assist blind, visually impaired or dyslexic learners to read and search…
A novel probabilistic framework for event-based speech recognition
NASA Astrophysics Data System (ADS)
Juneja, Amit; Espy-Wilson, Carol
2003-10-01
One of the reasons for unsatisfactory performance of the state-of-the-art automatic speech recognition (ASR) systems is the inferior acoustic modeling of low-level acoustic-phonetic information in the speech signal. An acoustic-phonetic approach to ASR, on the other hand, explicitly targets linguistic information in the speech signal, but such a system for continuous speech recognition (CSR) is not known to exist. A probabilistic and statistical framework for CSR based on the idea of the representation of speech sounds by bundles of binary valued articulatory phonetic features is proposed. Multiple probabilistic sequences of linguistically motivated landmarks are obtained using binary classifiers of manner phonetic features-syllabic, sonorant and continuant-and the knowledge-based acoustic parameters (APs) that are acoustic correlates of those features. The landmarks are then used for the extraction of knowledge-based APs for source and place phonetic features and their binary classification. Probabilistic landmark sequences are constrained using manner class language models for isolated or connected word recognition. The proposed method could overcome the disadvantages encountered by the early acoustic-phonetic knowledge-based systems that led the ASR community to switch to systems highly dependent on statistical pattern analysis methods and probabilistic language or grammar models.
Kim, Min-Beom; Chung, Won-Ho; Choi, Jeesun; Hong, Sung Hwa; Cho, Yang-Sun; Park, Gyuseok; Lee, Sangmin
2014-06-01
The object was to evaluate speech perception improvement through Bluetooth-implemented hearing aids in hearing-impaired adults. Thirty subjects with bilateral symmetric moderate sensorineural hearing loss participated in this study. A Bluetooth-implemented hearing aid was fitted unilaterally in all study subjects. Objective speech recognition score and subjective satisfaction were measured with a Bluetooth-implemented hearing aid to replace the acoustic connection from either a cellular phone or a loudspeaker system. In each system, participants were assigned to 4 conditions: wireless speech signal transmission into hearing aid (wireless mode) in quiet or noisy environment and conventional speech signal transmission using external microphone of hearing aid (conventional mode) in quiet or noisy environment. Also, participants completed questionnaires to investigate subjective satisfaction. Both cellular phone and loudspeaker system situation, participants showed improvements in sentence and word recognition scores with wireless mode compared to conventional mode in both quiet and noise conditions (P < .001). Participants also reported subjective improvements, including better sound quality, less noise interference, and better accuracy naturalness, when using the wireless mode (P < .001). Bluetooth-implemented hearing aids helped to improve subjective and objective speech recognition performances in quiet and noisy environments during the use of electronic audio devices.
Speech recognition in individuals with sensorineural hearing loss.
de Andrade, Adriana Neves; Iorio, Maria Cecilia Martinelli; Gil, Daniela
2016-01-01
Hearing loss can negatively influence the communication performance of individuals, who should be evaluated with suitable material and in situations of listening close to those found in everyday life. To analyze and compare the performance of patients with mild-to-moderate sensorineural hearing loss in speech recognition tests carried out in silence and with noise, according to the variables ear (right and left) and type of stimulus presentation. The study included 19 right-handed individuals with mild-to-moderate symmetrical bilateral sensorineural hearing loss, submitted to the speech recognition test with words in different modalities and speech test with white noise and pictures. There was no significant difference between right and left ears in any of the tests. The mean number of correct responses in the speech recognition test with pictures, live voice, and recorded monosyllables was 97.1%, 85.9%, and 76.1%, respectively, whereas after the introduction of noise, the performance decreased to 72.6% accuracy. The best performances in the Speech Recognition Percentage Index were obtained using monosyllabic stimuli, represented by pictures presented in silence, with no significant differences between the right and left ears. After the introduction of competitive noise, there was a decrease in individuals' performance. Copyright © 2015 Associação Brasileira de Otorrinolaringologia e Cirurgia Cérvico-Facial. Published by Elsevier Editora Ltda. All rights reserved.
McMurray, Bob; Munson, Cheyenne; Tomblin, J Bruce
2014-08-01
The authors examined speech perception deficits associated with individual differences in language ability, contrasting auditory, phonological, or lexical accounts by asking whether lexical competition is differentially sensitive to fine-grained acoustic variation. Adolescents with a range of language abilities (N = 74, including 35 impaired) participated in an experiment based on McMurray, Tanenhaus, and Aslin (2002). Participants heard tokens from six 9-step voice onset time (VOT) continua spanning 2 words (beach/peach, beak/peak, etc.) while viewing a screen containing pictures of those words and 2 unrelated objects. Participants selected the referent while eye movements to each picture were monitored as a measure of lexical activation. Fixations were examined as a function of both VOT and language ability. Eye movements were sensitive to within-category VOT differences: As VOT approached the boundary, listeners made more fixations to the competing word. This did not interact with language ability, suggesting that language impairment is not associated with differential auditory sensitivity or phonetic categorization. Listeners with poorer language skills showed heightened competitors fixations overall, suggesting a deficit in lexical processes. Language impairment may be better characterized by a deficit in lexical competition (inability to suppress competing words), rather than differences in phonological categorization or auditory abilities.
ERIC Educational Resources Information Center
Suttora, Chiara; Salerni, Nicoletta; Zanchi, Paola; Zampini, Laura; Spinelli, Maria; Fasolo, Mirco
2017-01-01
This study aimed to investigate specific associations between structural and acoustic characteristics of infant-directed (ID) speech and word recognition. Thirty Italian-acquiring children and their mothers were tested when the children were 1;3. Children's word recognition was measured with the looking-while-listening task. Maternal ID speech was…
ERIC Educational Resources Information Center
Moberly, Aaron C.; Harris, Michael S.; Boyce, Lauren; Nittrouer, Susan
2017-01-01
Purpose: Models of speech recognition suggest that "top-down" linguistic and cognitive functions, such as use of phonotactic constraints and working memory, facilitate recognition under conditions of degradation, such as in noise. The question addressed in this study was what happens to these functions when a listener who has experienced…
The Effect of Asymmetrical Signal Degradation on Binaural Speech Recognition in Children and Adults.
ERIC Educational Resources Information Center
Rothpletz, Ann M.; Tharpe, Anne Marie; Grantham, D. Wesley
2004-01-01
To determine the effect of asymmetrical signal degradation on binaural speech recognition, 28 children and 14 adults were administered a sentence recognition task amidst multitalker babble. There were 3 listening conditions: (a) monaural, with mild degradation in 1 ear; (b) binaural, with mild degradation in both ears (symmetric degradation); and…
Learning Models and Real-Time Speech Recognition.
ERIC Educational Resources Information Center
Danforth, Douglas G.; And Others
This report describes the construction and testing of two "psychological" learning models for the purpose of computer recognition of human speech over the telephone. One of the two models was found to be superior in all tests. A regression analysis yielded a 92.3% recognition rate for 14 subjects ranging in age from 6 to 13 years. Tests…
Use of Authentic-Speech Technique for Teaching Sound Recognition to EFL Students
ERIC Educational Resources Information Center
Sersen, William J.
2011-01-01
The main objective of this research was to test an authentic-speech technique for improving the sound-recognition skills of EFL (English as a foreign language) students at Roi-Et Rajabhat University. The secondary objective was to determine the correlation, if any, between students' self-evaluation of sound-recognition progress and the actual…
What happens to the motor theory of perception when the motor system is damaged?
Stasenko, Alena; Garcea, Frank E; Mahon, Bradford Z
2013-09-01
Motor theories of perception posit that motor information is necessary for successful recognition of actions. Perhaps the most well known of this class of proposals is the motor theory of speech perception, which argues that speech recognition is fundamentally a process of identifying the articulatory gestures (i.e. motor representations) that were used to produce the speech signal. Here we review neuropsychological evidence from patients with damage to the motor system, in the context of motor theories of perception applied to both manual actions and speech. Motor theories of perception predict that patients with motor impairments will have impairments for action recognition. Contrary to that prediction, the available neuropsychological evidence indicates that recognition can be spared despite profound impairments to production. These data falsify strong forms of the motor theory of perception, and frame new questions about the dynamical interactions that govern how information is exchanged between input and output systems.
What happens to the motor theory of perception when the motor system is damaged?
Stasenko, Alena; Garcea, Frank E.; Mahon, Bradford Z.
2016-01-01
Motor theories of perception posit that motor information is necessary for successful recognition of actions. Perhaps the most well known of this class of proposals is the motor theory of speech perception, which argues that speech recognition is fundamentally a process of identifying the articulatory gestures (i.e. motor representations) that were used to produce the speech signal. Here we review neuropsychological evidence from patients with damage to the motor system, in the context of motor theories of perception applied to both manual actions and speech. Motor theories of perception predict that patients with motor impairments will have impairments for action recognition. Contrary to that prediction, the available neuropsychological evidence indicates that recognition can be spared despite profound impairments to production. These data falsify strong forms of the motor theory of perception, and frame new questions about the dynamical interactions that govern how information is exchanged between input and output systems. PMID:26823687
Schädler, Marc René; Warzybok, Anna; Meyer, Bernd T.; Brand, Thomas
2016-01-01
To characterize the individual patient’s hearing impairment as obtained with the matrix sentence recognition test, a simulation Framework for Auditory Discrimination Experiments (FADE) is extended here using the Attenuation and Distortion (A+D) approach by Plomp as a blueprint for setting the individual processing parameters. FADE has been shown to predict the outcome of both speech recognition tests and psychoacoustic experiments based on simulations using an automatic speech recognition system requiring only few assumptions. It builds on the closed-set matrix sentence recognition test which is advantageous for testing individual speech recognition in a way comparable across languages. Individual predictions of speech recognition thresholds in stationary and in fluctuating noise were derived using the audiogram and an estimate of the internal level uncertainty for modeling the individual Plomp curves fitted to the data with the Attenuation (A-) and Distortion (D-) parameters of the Plomp approach. The “typical” audiogram shapes from Bisgaard et al with or without a “typical” level uncertainty and the individual data were used for individual predictions. As a result, the individualization of the level uncertainty was found to be more important than the exact shape of the individual audiogram to accurately model the outcome of the German Matrix test in stationary or fluctuating noise for listeners with hearing impairment. The prediction accuracy of the individualized approach also outperforms the (modified) Speech Intelligibility Index approach which is based on the individual threshold data only. PMID:27604782
Winn, Matthew B; Won, Jong Ho; Moon, Il Joon
This study was conducted to measure auditory perception by cochlear implant users in the spectral and temporal domains, using tests of either categorization (using speech-based cues) or discrimination (using conventional psychoacoustic tests). The authors hypothesized that traditional nonlinguistic tests assessing spectral and temporal auditory resolution would correspond to speech-based measures assessing specific aspects of phonetic categorization assumed to depend on spectral and temporal auditory resolution. The authors further hypothesized that speech-based categorization performance would ultimately be a superior predictor of speech recognition performance, because of the fundamental nature of speech recognition as categorization. Nineteen cochlear implant listeners and 10 listeners with normal hearing participated in a suite of tasks that included spectral ripple discrimination, temporal modulation detection, and syllable categorization, which was split into a spectral cue-based task (targeting the /ba/-/da/ contrast) and a timing cue-based task (targeting the /b/-/p/ and /d/-/t/ contrasts). Speech sounds were manipulated to contain specific spectral or temporal modulations (formant transitions or voice onset time, respectively) that could be categorized. Categorization responses were quantified using logistic regression to assess perceptual sensitivity to acoustic phonetic cues. Word recognition testing was also conducted for cochlear implant listeners. Cochlear implant users were generally less successful at utilizing both spectral and temporal cues for categorization compared with listeners with normal hearing. For the cochlear implant listener group, spectral ripple discrimination was significantly correlated with the categorization of formant transitions; both were correlated with better word recognition. Temporal modulation detection using 100- and 10-Hz-modulated noise was not correlated either with the cochlear implant subjects' categorization of voice onset time or with word recognition. Word recognition was correlated more closely with categorization of the controlled speech cues than with performance on the psychophysical discrimination tasks. When evaluating people with cochlear implants, controlled speech-based stimuli are feasible to use in tests of auditory cue categorization, to complement traditional measures of auditory discrimination. Stimuli based on specific speech cues correspond to counterpart nonlinguistic measures of discrimination, but potentially show better correspondence with speech perception more generally. The ubiquity of the spectral (formant transition) and temporal (voice onset time) stimulus dimensions across languages highlights the potential to use this testing approach even in cases where English is not the native language.
Winn, Matthew B.; Won, Jong Ho; Moon, Il Joon
2016-01-01
Objectives This study was conducted to measure auditory perception by cochlear implant users in the spectral and temporal domains, using tests of either categorization (using speech-based cues) or discrimination (using conventional psychoacoustic tests). We hypothesized that traditional nonlinguistic tests assessing spectral and temporal auditory resolution would correspond to speech-based measures assessing specific aspects of phonetic categorization assumed to depend on spectral and temporal auditory resolution. We further hypothesized that speech-based categorization performance would ultimately be a superior predictor of speech recognition performance, because of the fundamental nature of speech recognition as categorization. Design Nineteen CI listeners and 10 listeners with normal hearing (NH) participated in a suite of tasks that included spectral ripple discrimination (SRD), temporal modulation detection (TMD), and syllable categorization, which was split into a spectral-cue-based task (targeting the /ba/-/da/ contrast) and a timing-cue-based task (targeting the /b/-/p/ and /d/-/t/ contrasts). Speech sounds were manipulated in order to contain specific spectral or temporal modulations (formant transitions or voice onset time, respectively) that could be categorized. Categorization responses were quantified using logistic regression in order to assess perceptual sensitivity to acoustic phonetic cues. Word recognition testing was also conducted for CI listeners. Results CI users were generally less successful at utilizing both spectral and temporal cues for categorization compared to listeners with normal hearing. For the CI listener group, SRD was significantly correlated with the categorization of formant transitions; both were correlated with better word recognition. TMD using 100 Hz and 10 Hz modulated noise was not correlated with the CI subjects’ categorization of VOT, nor with word recognition. Word recognition was correlated more closely with categorization of the controlled speech cues than with performance on the psychophysical discrimination tasks. Conclusions When evaluating people with cochlear implants, controlled speech-based stimuli are feasible to use in tests of auditory cue categorization, to complement traditional measures of auditory discrimination. Stimuli based on specific speech cues correspond to counterpart non-linguistic measures of discrimination, but potentially show better correspondence with speech perception more generally. The ubiquity of the spectral (formant transition) and temporal (VOT) stimulus dimensions across languages highlights the potential to use this testing approach even in cases where English is not the native language. PMID:27438871
Reeder, Ruth M; Firszt, Jill B; Cadieux, Jamie H; Strube, Michael J
2017-01-01
Whether, and if so when, a second-ear cochlear implant should be provided to older, unilaterally implanted children is an ongoing clinical question. This study evaluated rate of speech recognition progress for the second implanted ear and with bilateral cochlear implants in older sequentially implanted children and evaluated localization abilities. A prospective longitudinal study included 24 bilaterally implanted children (mean ear surgeries at 5.11 and 14.25 years). Test intervals were every 3-6 months through 24 months postbilateral. Test conditions were each ear and bilaterally for speech recognition and localization. Overall, the rate of progress for the second implanted ear was gradual. Improvements in quiet continued through the second year of bilateral use. Improvements in noise were more modest and leveled off during the second year. On all measures, results from the second ear were poorer than the first. Bilateral scores were better than either ear alone for all measures except sentences in quiet and localization. Older sequentially implanted children with several years between surgeries may obtain speech understanding in the second implanted ear; however, performance may be limited and rate of progress gradual. Continued contralateral ear hearing aid use and reduced time between surgeries may enhance outcomes.
NASA Technical Reports Server (NTRS)
Simpson, C. A.
1985-01-01
In the present study of the responses of pairs of pilots to aircraft warning classification tasks using an isolated word, speaker-dependent speech recognition system, the induced stress was manipulated by means of different scoring procedures for the classification task and by the inclusion of a competitive manual control task. Both speech patterns and recognition accuracy were analyzed, and recognition errors were recorded by type for an isolated word speaker-dependent system and by an offline technique for a connected word speaker-dependent system. While errors increased with task loading for the isolated word system, there was no such effect for task loading in the case of the connected word system.
Speech recognition in advanced rotorcraft - Using speech controls to reduce manual control overload
NASA Technical Reports Server (NTRS)
Vidulich, Michael A.; Bortolussi, Michael R.
1988-01-01
An experiment has been conducted to ascertain the usefulness of helicopter pilot speech controls and their effect on time-sharing performance, under the impetus of multiple-resource theories of attention which predict that time-sharing should be more efficient with mixed manual and speech controls than with all-manual ones. The test simulation involved an advanced, single-pilot scout/attack helicopter. Performance and subjective workload levels obtained supported the claimed utility of speech recognition-based controls; specifically, time-sharing performance was improved while preparing a data-burst transmission of information during helicopter hover.
Incorporating Speech Recognition into a Natural User Interface
NASA Technical Reports Server (NTRS)
Chapa, Nicholas
2017-01-01
The Augmented/ Virtual Reality (AVR) Lab has been working to study the applicability of recent virtual and augmented reality hardware and software to KSC operations. This includes the Oculus Rift, HTC Vive, Microsoft HoloLens, and Unity game engine. My project in this lab is to integrate voice recognition and voice commands into an easy to modify system that can be added to an existing portion of a Natural User Interface (NUI). A NUI is an intuitive and simple to use interface incorporating visual, touch, and speech recognition. The inclusion of speech recognition capability will allow users to perform actions or make inquiries using only their voice. The simplicity of needing only to speak to control an on-screen object or enact some digital action means that any user can quickly become accustomed to using this system. Multiple programs were tested for use in a speech command and recognition system. Sphinx4 translates speech to text using a Hidden Markov Model (HMM) based Language Model, an Acoustic Model, and a word Dictionary running on Java. PocketSphinx had similar functionality to Sphinx4 but instead ran on C. However, neither of these programs were ideal as building a Java or C wrapper slowed performance. The most ideal speech recognition system tested was the Unity Engine Grammar Recognizer. A Context Free Grammar (CFG) structure is written in an XML file to specify the structure of phrases and words that will be recognized by Unity Grammar Recognizer. Using Speech Recognition Grammar Specification (SRGS) 1.0 makes modifying the recognized combinations of words and phrases very simple and quick to do. With SRGS 1.0, semantic information can also be added to the XML file, which allows for even more control over how spoken words and phrases are interpreted by Unity. Additionally, using a CFG with SRGS 1.0 produces a Finite State Machine (FSM) functionality limiting the potential for incorrectly heard words or phrases. The purpose of my project was to investigate options for a Speech Recognition System. To that end I attempted to integrate Sphinx4 into a user interface. Sphinx4 had great accuracy and is the only free program able to perform offline speech dictation. However it had a limited dictionary of words that could be recognized, single syllable words were almost impossible for it to hear, and since it ran on Java it could not be integrated into the Unity based NUI. PocketSphinx ran much faster than Sphinx4 which would've made it ideal as a plugin to the Unity NUI, unfortunately creating a C# wrapper for the C code made the program unusable with Unity due to the wrapper slowing code execution and class files becoming unreachable. Unity Grammar Recognizer is the ideal speech recognition interface, it is flexible in recognizing multiple variations of the same command. It is also the most accurate program in recognizing speech due to using an XML grammar to specify speech structure instead of relying solely on a Dictionary and Language model. The Unity Grammar Recognizer will be used with the NUI for these reasons as well as being written in C# which further simplifies the incorporation.
Branstetter, Brian K; DeLong, Caroline M; Dziedzic, Brandon; Black, Amy; Bakhtiari, Kimberly
2016-01-01
Bottlenose dolphins (Tursiops truncatus) use the frequency contour of whistles produced by conspecifics for individual recognition. Here we tested a bottlenose dolphin's (Tursiops truncatus) ability to recognize frequency modulated whistle-like sounds using a three alternative matching-to-sample paradigm. The dolphin was first trained to select a specific object (object A) in response to a specific sound (sound A) for a total of three object-sound associations. The sounds were then transformed by amplitude, duration, or frequency transposition while still preserving the frequency contour of each sound. For comparison purposes, 30 human participants completed an identical task with the same sounds, objects, and training procedure. The dolphin's ability to correctly match objects to sounds was robust to changes in amplitude with only a minor decrement in performance for short durations. The dolphin failed to recognize sounds that were frequency transposed by plus or minus ½ octaves. Human participants demonstrated robust recognition with all acoustic transformations. The results indicate that this dolphin's acoustic recognition of whistle-like sounds was constrained by absolute pitch. Unlike human speech, which varies considerably in average frequency, signature whistles are relatively stable in frequency, which may have selected for a whistle recognition system invariant to frequency transposition.
Branstetter, Brian K.; DeLong, Caroline M.; Dziedzic, Brandon; Black, Amy; Bakhtiari, Kimberly
2016-01-01
Bottlenose dolphins (Tursiops truncatus) use the frequency contour of whistles produced by conspecifics for individual recognition. Here we tested a bottlenose dolphin’s (Tursiops truncatus) ability to recognize frequency modulated whistle-like sounds using a three alternative matching-to-sample paradigm. The dolphin was first trained to select a specific object (object A) in response to a specific sound (sound A) for a total of three object-sound associations. The sounds were then transformed by amplitude, duration, or frequency transposition while still preserving the frequency contour of each sound. For comparison purposes, 30 human participants completed an identical task with the same sounds, objects, and training procedure. The dolphin’s ability to correctly match objects to sounds was robust to changes in amplitude with only a minor decrement in performance for short durations. The dolphin failed to recognize sounds that were frequency transposed by plus or minus ½ octaves. Human participants demonstrated robust recognition with all acoustic transformations. The results indicate that this dolphin’s acoustic recognition of whistle-like sounds was constrained by absolute pitch. Unlike human speech, which varies considerably in average frequency, signature whistles are relatively stable in frequency, which may have selected for a whistle recognition system invariant to frequency transposition. PMID:26863519
A novel speech processing algorithm based on harmonicity cues in cochlear implant
NASA Astrophysics Data System (ADS)
Wang, Jian; Chen, Yousheng; Zhang, Zongping; Chen, Yan; Zhang, Weifeng
2017-08-01
This paper proposed a novel speech processing algorithm in cochlear implant, which used harmonicity cues to enhance tonal information in Mandarin Chinese speech recognition. The input speech was filtered by a 4-channel band-pass filter bank. The frequency ranges for the four bands were: 300-621, 621-1285, 1285-2657, and 2657-5499 Hz. In each pass band, temporal envelope and periodicity cues (TEPCs) below 400 Hz were extracted by full wave rectification and low-pass filtering. The TEPCs were modulated by a sinusoidal carrier, the frequency of which was fundamental frequency (F0) and its harmonics most close to the center frequency of each band. Signals from each band were combined together to obtain an output speech. Mandarin tone, word, and sentence recognition in quiet listening conditions were tested for the extensively used continuous interleaved sampling (CIS) strategy and the novel F0-harmonic algorithm. Results found that the F0-harmonic algorithm performed consistently better than CIS strategy in Mandarin tone, word, and sentence recognition. In addition, sentence recognition rate was higher than word recognition rate, as a result of contextual information in the sentence. Moreover, tone 3 and 4 performed better than tone 1 and tone 2, due to the easily identified features of the former. In conclusion, the F0-harmonic algorithm could enhance tonal information in cochlear implant speech processing due to the use of harmonicity cues, thereby improving Mandarin tone, word, and sentence recognition. Further study will focus on the test of the F0-harmonic algorithm in noisy listening conditions.
Working memory capacity may influence perceived effort during aided speech recognition in noise.
Rudner, Mary; Lunner, Thomas; Behrens, Thomas; Thorén, Elisabet Sundewall; Rönnberg, Jerker
2012-09-01
Recently there has been interest in using subjective ratings as a measure of perceived effort during speech recognition in noise. Perceived effort may be an indicator of cognitive load. Thus, subjective effort ratings during speech recognition in noise may covary both with signal-to-noise ratio (SNR) and individual cognitive capacity. The present study investigated the relation between subjective ratings of the effort involved in listening to speech in noise, speech recognition performance, and individual working memory (WM) capacity in hearing impaired hearing aid users. In two experiments, participants with hearing loss rated perceived effort during aided speech perception in noise. Noise type and SNR were manipulated in both experiments, and in the second experiment hearing aid compression release settings were also manipulated. Speech recognition performance was measured along with WM capacity. There were 46 participants in all with bilateral mild to moderate sloping hearing loss. In Experiment 1 there were 16 native Danish speakers (eight women and eight men) with a mean age of 63.5 yr (SD = 12.1) and average pure tone (PT) threshold of 47. 6 dB (SD = 9.8). In Experiment 2 there were 30 native Swedish speakers (19 women and 11 men) with a mean age of 70 yr (SD = 7.8) and average PT threshold of 45.8 dB (SD = 6.6). A visual analog scale (VAS) was used for effort rating in both experiments. In Experiment 1, effort was rated at individually adapted SNRs while in Experiment 2 it was rated at fixed SNRs. Speech recognition in noise performance was measured using adaptive procedures in both experiments with Dantale II sentences in Experiment 1 and Hagerman sentences in Experiment 2. WM capacity was measured using a letter-monitoring task in Experiment 1 and the reading span task in Experiment 2. In both experiments, there was a strong and significant relation between rated effort and SNR that was independent of individual WM capacity, whereas the relation between rated effort and noise type seemed to be influenced by individual WM capacity. Experiment 2 showed that hearing aid compression setting influenced rated effort. Subjective ratings of the effort involved in speech recognition in noise reflect SNRs, and individual cognitive capacity seems to influence relative rating of noise type. American Academy of Audiology.
Blind speech separation system for humanoid robot with FastICA for audio filtering and separation
NASA Astrophysics Data System (ADS)
Budiharto, Widodo; Santoso Gunawan, Alexander Agung
2016-07-01
Nowadays, there are many developments in building intelligent humanoid robot, mainly in order to handle voice and image. In this research, we propose blind speech separation system using FastICA for audio filtering and separation that can be used in education or entertainment. Our main problem is to separate the multi speech sources and also to filter irrelevant noises. After speech separation step, the results will be integrated with our previous speech and face recognition system which is based on Bioloid GP robot and Raspberry Pi 2 as controller. The experimental results show the accuracy of our blind speech separation system is about 88% in command and query recognition cases.
New Ideas for Speech Recognition and Related Technologies
DOE Office of Scientific and Technical Information (OSTI.GOV)
Holzrichter, J F
The ideas relating to the use of organ motion sensors for the purposes of speech recognition were first described by.the author in spring 1994. During the past year, a series of productive collaborations between the author, Tom McEwan and Larry Ng ensued and have lead to demonstrations, new sensor ideas, and algorithmic descriptions of a large number of speech recognition concepts. This document summarizes the basic concepts of recognizing speech once organ motions have been obtained. Micro power radars and their uses for the measurement of body organ motions, such as those of the heart and lungs, have been demonstratedmore » by Tom McEwan over the past two years. McEwan and I conducted a series of experiments, using these instruments, on vocal organ motions beginning in late spring, during which we observed motions of vocal folds (i.e., cords), tongue, jaw, and related organs that are very useful for speech recognition and other purposes. These will be reviewed in a separate paper. Since late summer 1994, Lawrence Ng and I have worked to make many of the initial recognition ideas more rigorous and to investigate the applications of these new ideas to new speech recognition algorithms, to speech coding, and to speech synthesis. I introduce some of those ideas in section IV of this document, and we describe them more completely in the document following this one, UCRL-UR-120311. For the design and operation of micro-power radars and their application to body organ motions, the reader may contact Tom McEwan directly. The capability for using EM sensors (i.e., radar units) to measure body organ motions and positions has been available for decades. Impediments to their use appear to have been size, excessive power, lack of resolution, and lack of understanding of the value of organ motion measurements, especially as applied to speech related technologies. However, with the invention of very low power, portable systems as demonstrated by McEwan at LLNL researchers have begun to think differently about practical applications of such radars. In particular, his demonstrations of heart and lung motions have opened up many new areas of application for human and animal measurements.« less
Human phoneme recognition depending on speech-intrinsic variability.
Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger
2010-11-01
The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).
A voice-input voice-output communication aid for people with severe speech impairment.
Hawley, Mark S; Cunningham, Stuart P; Green, Phil D; Enderby, Pam; Palmer, Rebecca; Sehgal, Siddharth; O'Neill, Peter
2013-01-01
A new form of augmentative and alternative communication (AAC) device for people with severe speech impairment-the voice-input voice-output communication aid (VIVOCA)-is described. The VIVOCA recognizes the disordered speech of the user and builds messages, which are converted into synthetic speech. System development was carried out employing user-centered design and development methods, which identified and refined key requirements for the device. A novel methodology for building small vocabulary, speaker-dependent automatic speech recognizers with reduced amounts of training data, was applied. Experiments showed that this method is successful in generating good recognition performance (mean accuracy 96%) on highly disordered speech, even when recognition perplexity is increased. The selected message-building technique traded off various factors including speed of message construction and range of available message outputs. The VIVOCA was evaluated in a field trial by individuals with moderate to severe dysarthria and confirmed that they can make use of the device to produce intelligible speech output from disordered speech input. The trial highlighted some issues which limit the performance and usability of the device when applied in real usage situations, with mean recognition accuracy of 67% in these circumstances. These limitations will be addressed in future work.
Manning, Candice; Mermagen, Timothy; Scharine, Angelique
2017-06-01
Military personnel are at risk for hearing loss due to noise exposure during deployment (USACHPPM, 2008). Despite mandated use of hearing protection, hearing loss and tinnitus are prevalent due to reluctance to use hearing protection. Bone conduction headsets can offer good speech intelligibility for normal hearing (NH) listeners while allowing the ears to remain open in quiet environments and the use of hearing protection when needed. Those who suffer from tinnitus, the experience of perceiving a sound not produced by an external source, often show degraded speech recognition; however, it is unclear whether this is a result of decreased hearing sensitivity or increased distractibility (Moon et al., 2015). It has been suggested that the vibratory stimulation of a bone conduction headset might ameliorate the effects of tinnitus on speech perception; however, there is currently no research to support or refute this claim (Hoare et al., 2014). Speech recognition of words presented over air conduction and bone conduction headsets was measured for three groups of listeners: NH, sensorineural hearing impaired, and/or tinnitus sufferers. Three levels of speech-to-noise (SNR = 0, -6, -12 dB) were created by embedding speech items in pink noise. Better speech recognition performance was observed with the bone conduction headset regardless of hearing profile, and speech intelligibility was a function of SNR. Discussion will include study limitations and the implications of these findings for those serving in the military. Published by Elsevier B.V.
Speaker normalization for chinese vowel recognition in cochlear implants.
Luo, Xin; Fu, Qian-Jie
2005-07-01
Because of the limited spectra-temporal resolution associated with cochlear implants, implant patients often have greater difficulty with multitalker speech recognition. The present study investigated whether multitalker speech recognition can be improved by applying speaker normalization techniques to cochlear implant speech processing. Multitalker Chinese vowel recognition was tested with normal-hearing Chinese-speaking subjects listening to a 4-channel cochlear implant simulation, with and without speaker normalization. For each subject, speaker normalization was referenced to the speaker that produced the best recognition performance under conditions without speaker normalization. To match the remaining speakers to this "optimal" output pattern, the overall frequency range of the analysis filter bank was adjusted for each speaker according to the ratio of the mean third formant frequency values between the specific speaker and the reference speaker. Results showed that speaker normalization provided a small but significant improvement in subjects' overall recognition performance. After speaker normalization, subjects' patterns of recognition performance across speakers changed, demonstrating the potential for speaker-dependent effects with the proposed normalization technique.
ERIC Educational Resources Information Center
Pisoni, David B.; And Others
The results of three projects concerned with auditory word recognition and the structure of the lexicon are reported in this paper. The first project described was designed to test experimentally several specific predictions derived from MACS, a simulation model of the Cohort Theory of word recognition. The second project description provides the…
ERIC Educational Resources Information Center
Ventura, Paulo; Morais, Jose; Kolinsky, Regine
2007-01-01
The influence of orthography on children's on-line auditory word recognition was studied from the end of Grade 2 to the end of Grade 4, by examining the orthographic consistency effect [Ziegler, J. C., & Ferrand, L. (1998). Orthography shapes the perception of speech: The consistency effect in auditory recognition. "Psychonomic Bulletin & Review",…
Test of a motor theory of long-term auditory memory
Schulze, Katrin; Vargha-Khadem, Faraneh; Mishkin, Mortimer
2012-01-01
Monkeys can easily form lasting central representations of visual and tactile stimuli, yet they seem unable to do the same with sounds. Humans, by contrast, are highly proficient in auditory long-term memory (LTM). These mnemonic differences within and between species raise the question of whether the human ability is supported in some way by speech and language, e.g., through subvocal reproduction of speech sounds and by covert verbal labeling of environmental stimuli. If so, the explanation could be that storing rapidly fluctuating acoustic signals requires assistance from the motor system, which is uniquely organized to chain-link rapid sequences. To test this hypothesis, we compared the ability of normal participants to recognize lists of stimuli that can be easily reproduced, labeled, or both (pseudowords, nonverbal sounds, and words, respectively) versus their ability to recognize a list of stimuli that can be reproduced or labeled only with great difficulty (reversed words, i.e., words played backward). Recognition scores after 5-min delays filled with articulatory-suppression tasks were relatively high (75–80% correct) for all sound types except reversed words; the latter yielded scores that were not far above chance (58% correct), even though these stimuli were discriminated nearly perfectly when presented as reversed-word pairs at short intrapair intervals. The combined results provide preliminary support for the hypothesis that participation of the oromotor system may be essential for laying down the memory of speech sounds and, indeed, that speech and auditory memory may be so critically dependent on each other that they had to coevolve. PMID:22511719
Test of a motor theory of long-term auditory memory.
Schulze, Katrin; Vargha-Khadem, Faraneh; Mishkin, Mortimer
2012-05-01
Monkeys can easily form lasting central representations of visual and tactile stimuli, yet they seem unable to do the same with sounds. Humans, by contrast, are highly proficient in auditory long-term memory (LTM). These mnemonic differences within and between species raise the question of whether the human ability is supported in some way by speech and language, e.g., through subvocal reproduction of speech sounds and by covert verbal labeling of environmental stimuli. If so, the explanation could be that storing rapidly fluctuating acoustic signals requires assistance from the motor system, which is uniquely organized to chain-link rapid sequences. To test this hypothesis, we compared the ability of normal participants to recognize lists of stimuli that can be easily reproduced, labeled, or both (pseudowords, nonverbal sounds, and words, respectively) versus their ability to recognize a list of stimuli that can be reproduced or labeled only with great difficulty (reversed words, i.e., words played backward). Recognition scores after 5-min delays filled with articulatory-suppression tasks were relatively high (75-80% correct) for all sound types except reversed words; the latter yielded scores that were not far above chance (58% correct), even though these stimuli were discriminated nearly perfectly when presented as reversed-word pairs at short intrapair intervals. The combined results provide preliminary support for the hypothesis that participation of the oromotor system may be essential for laying down the memory of speech sounds and, indeed, that speech and auditory memory may be so critically dependent on each other that they had to coevolve.
Do What I Say! Voice Recognition Makes Major Advances.
ERIC Educational Resources Information Center
Ruley, C. Dorsey
1994-01-01
Explains voice recognition technology applications in the workplace, schools, and libraries. Highlights include a voice-controlled work station using the DragonDictate system that can be used with dyslexic students, converting text to speech, and converting speech to text. (LRW)
Emotional recognition from the speech signal for a virtual education agent
NASA Astrophysics Data System (ADS)
Tickle, A.; Raghu, S.; Elshaw, M.
2013-06-01
This paper explores the extraction of features from the speech wave to perform intelligent emotion recognition. A feature extract tool (openSmile) was used to obtain a baseline set of 998 acoustic features from a set of emotional speech recordings from a microphone. The initial features were reduced to the most important ones so recognition of emotions using a supervised neural network could be performed. Given that the future use of virtual education agents lies with making the agents more interactive, developing agents with the capability to recognise and adapt to the emotional state of humans is an important step.
Robotics control using isolated word recognition of voice input
NASA Technical Reports Server (NTRS)
Weiner, J. M.
1977-01-01
A speech input/output system is presented that can be used to communicate with a task oriented system. Human speech commands and synthesized voice output extend conventional information exchange capabilities between man and machine by utilizing audio input and output channels. The speech input facility is comprised of a hardware feature extractor and a microprocessor implemented isolated word or phrase recognition system. The recognizer offers a medium sized (100 commands), syntactically constrained vocabulary, and exhibits close to real time performance. The major portion of the recognition processing required is accomplished through software, minimizing the complexity of the hardware feature extractor.
Nittrouer, Susan; Tarr, Eric; Wucinich, Taylor; Moberly, Aaron C.; Lowenstein, Joanna H.
2015-01-01
Broadened auditory filters associated with sensorineural hearing loss have clearly been shown to diminish speech recognition in noise for adults, but far less is known about potential effects for children. This study examined speech recognition in noise for adults and children using simulated auditory filters of different widths. Specifically, 5 groups (20 listeners each) of adults or children (5 and 7 yrs), were asked to recognize sentences in speech-shaped noise. Seven-year-olds listened at 0 dB signal-to-noise ratio (SNR) only; 5-yr-olds listened at +3 or 0 dB SNR; and adults listened at 0 or −3 dB SNR. Sentence materials were processed both to smear the speech spectrum (i.e., simulate broadened filters), and to enhance the spectrum (i.e., simulate narrowed filters). Results showed: (1) Spectral smearing diminished recognition for listeners of all ages; (2) spectral enhancement did not improve recognition, and in fact diminished it somewhat; and (3) interactions were observed between smearing and SNR, but only for adults. That interaction made age effects difficult to gauge. Nonetheless, it was concluded that efforts to diagnose the extent of broadening of auditory filters and to develop techniques to correct this condition could benefit patients with hearing loss, especially children. PMID:25920851
Histogram equalization with Bayesian estimation for noise robust speech recognition.
Suh, Youngjoo; Kim, Hoirin
2018-02-01
The histogram equalization approach is an efficient feature normalization technique for noise robust automatic speech recognition. However, it suffers from performance degradation when some fundamental conditions are not satisfied in the test environment. To remedy these limitations of the original histogram equalization methods, class-based histogram equalization approach has been proposed. Although this approach showed substantial performance improvement under noise environments, it still suffers from performance degradation due to the overfitting problem when test data are insufficient. To address this issue, the proposed histogram equalization technique employs the Bayesian estimation method in the test cumulative distribution function estimation. It was reported in a previous study conducted on the Aurora-4 task that the proposed approach provided substantial performance gains in speech recognition systems based on the acoustic modeling of the Gaussian mixture model-hidden Markov model. In this work, the proposed approach was examined in speech recognition systems with deep neural network-hidden Markov model (DNN-HMM), the current mainstream speech recognition approach where it also showed meaningful performance improvement over the conventional maximum likelihood estimation-based method. The fusion of the proposed features with the mel-frequency cepstral coefficients provided additional performance gains in DNN-HMM systems, which otherwise suffer from performance degradation in the clean test condition.
A longitudinal study of the bilateral benefit in children with bilateral cochlear implants.
Asp, Filip; Mäki-Torkko, Elina; Karltorp, Eva; Harder, Henrik; Hergils, Leif; Eskilsson, Gunnar; Stenfelt, Stefan
2015-02-01
To study the development of the bilateral benefit in children using bilateral cochlear implants by measurements of speech recognition and sound localization. Bilateral and unilateral speech recognition in quiet, in multi-source noise, and horizontal sound localization was measured at three occasions during a two-year period, without controlling for age or implant experience. Longitudinal and cross-sectional analyses were performed. Results were compared to cross-sectional data from children with normal hearing. Seventy-eight children aged 5.1-11.9 years, with a mean bilateral cochlear implant experience of 3.3 years and a mean age of 7.8 years, at inclusion in the study. Thirty children with normal hearing aged 4.8-9.0 years provided normative data. For children with cochlear implants, bilateral and unilateral speech recognition in quiet was comparable whereas a bilateral benefit for speech recognition in noise and sound localization was found at all three test occasions. Absolute performance was lower than in children with normal hearing. Early bilateral implantation facilitated sound localization. A bilateral benefit for speech recognition in noise and sound localization continues to exist over time for children with bilateral cochlear implants, but no relative improvement is found after three years of bilateral cochlear implant experience.
Phonological mismatch makes aided speech recognition in noise cognitively taxing.
Rudner, Mary; Foo, Catharina; Rönnberg, Jerker; Lunner, Thomas
2007-12-01
The working memory framework for Ease of Language Understanding predicts that speech processing becomes more effortful, thus requiring more explicit cognitive resources, when there is mismatch between speech input and phonological representations in long-term memory. To test this prediction, we changed the compression release settings in the hearing instruments of experienced users and allowed them to train for 9 weeks with the new settings. After training, aided speech recognition in noise was tested with both the trained settings and orthogonal settings. We postulated that training would lead to acclimatization to the trained setting, which in turn would involve establishment of new phonological representations in long-term memory. Further, we postulated that after training, testing with orthogonal settings would give rise to phonological mismatch, associated with more explicit cognitive processing. Thirty-two participants (mean=70.3 years, SD=7.7) with bilateral sensorineural hearing loss (pure-tone average=46.0 dB HL, SD=6.5), bilaterally fitted for more than 1 year with digital, two-channel, nonlinear signal processing hearing instruments and chosen from the patient population at the Linköping University Hospital were randomly assigned to 9 weeks training with new, fast (40 ms) or slow (640 ms), compression release settings in both channels. Aided speech recognition in noise performance was tested according to a design with three within-group factors: test occasion (T1, T2), test setting (fast, slow), and type of noise (unmodulated, modulated) and one between-group factor: experience setting (fast, slow) for two types of speech materials-the highly constrained Hagerman sentences and the less-predictable Hearing in Noise Test (HINT). Complex cognitive capacity was measured using the reading span and letter monitoring tests. PREDICTION: We predicted that speech recognition in noise at T2 with mismatched experience and test settings would be associated with more explicit cognitive processing and thus stronger correlations with complex cognitive measures, as well as poorer performance if complex cognitive capacity was exceeded. Under mismatch conditions, stronger correlations were found between performance on speech recognition with the Hagerman sentences and reading span, along with poorer speech recognition for participants with low reading span scores. No consistent mismatch effect was found with HINT. The mismatch prediction generated by the working memory framework for Ease of Language Understanding is supported for speech recognition in noise with the highly constrained Hagerman sentences but not the less-predictable HINT.
Improving language models for radiology speech recognition.
Paulett, John M; Langlotz, Curtis P
2009-02-01
Speech recognition systems have become increasingly popular as a means to produce radiology reports, for reasons both of efficiency and of cost. However, the suboptimal recognition accuracy of these systems can affect the productivity of the radiologists creating the text reports. We analyzed a database of over two million de-identified radiology reports to determine the strongest determinants of word frequency. Our results showed that body site and imaging modality had a similar influence on the frequency of words and of three-word phrases as did the identity of the speaker. These findings suggest that the accuracy of speech recognition systems could be significantly enhanced by further tailoring their language models to body site and imaging modality, which are readily available at the time of report creation.
Connected word recognition using a cascaded neuro-computational model
NASA Astrophysics Data System (ADS)
Hoya, Tetsuya; van Leeuwen, Cees
2016-10-01
We propose a novel framework for processing a continuous speech stream that contains a varying number of words, as well as non-speech periods. Speech samples are segmented into word-tokens and non-speech periods. An augmented version of an earlier-proposed, cascaded neuro-computational model is used for recognising individual words within the stream. Simulation studies using both a multi-speaker-dependent and speaker-independent digit string database show that the proposed method yields a recognition performance comparable to that obtained by a benchmark approach using hidden Markov models with embedded training.
Measuring listening effort: driving simulator vs. simple dual-task paradigm
Wu, Yu-Hsiang; Aksan, Nazan; Rizzo, Matthew; Stangl, Elizabeth; Zhang, Xuyang; Bentler, Ruth
2014-01-01
Objectives The dual-task paradigm has been widely used to measure listening effort. The primary objectives of the study were to (1) investigate the effect of hearing aid amplification and a hearing aid directional technology on listening effort measured by a complicated, more real world dual-task paradigm, and (2) compare the results obtained with this paradigm to a simpler laboratory-style dual-task paradigm. Design The listening effort of adults with hearing impairment was measured using two dual-task paradigms, wherein participants performed a speech recognition task simultaneously with either a driving task in a simulator or a visual reaction-time task in a sound-treated booth. The speech materials and road noises for the speech recognition task were recorded in a van traveling on the highway in three hearing aid conditions: unaided, aided with omni directional processing (OMNI), and aided with directional processing (DIR). The change in the driving task or the visual reaction-time task performance across the conditions quantified the change in listening effort. Results Compared to the driving-only condition, driving performance declined significantly with the addition of the speech recognition task. Although the speech recognition score was higher in the OMNI and DIR conditions than in the unaided condition, driving performance was similar across these three conditions, suggesting that listening effort was not affected by amplification and directional processing. Results from the simple dual-task paradigm showed a similar trend: hearing aid technologies improved speech recognition performance, but did not affect performance in the visual reaction-time task (i.e., reduce listening effort). The correlation between listening effort measured using the driving paradigm and the visual reaction-time task paradigm was significant. The finding showing that our older (56 to 85 years old) participants’ better speech recognition performance did not result in reduced listening effort was not consistent with literature that evaluated younger (approximately 20 years old), normal hearing adults. Because of this, a follow-up study was conducted. In the follow-up study, the visual reaction-time dual-task experiment using the same speech materials and road noises was repeated on younger adults with normal hearing. Contrary to findings with older participants, the results indicated that the directional technology significantly improved performance in both speech recognition and visual reaction-time tasks. Conclusions Adding a speech listening task to driving undermined driving performance. Hearing aid technologies significantly improved speech recognition while driving, but did not significantly reduce listening effort. Listening effort measured by dual-task experiments using a simulated real-world driving task and a conventional laboratory-style task was generally consistent. For a given listening environment, the benefit of hearing aid technologies on listening effort measured from younger adults with normal hearing may not be fully translated to older listeners with hearing impairment. PMID:25083599
Across-site patterns of modulation detection: Relation to speech recognitiona)
Garadat, Soha N.; Zwolan, Teresa A.; Pfingst, Bryan E.
2012-01-01
The aim of this study was to identify across-site patterns of modulation detection thresholds (MDTs) in subjects with cochlear implants and to determine if removal of sites with the poorest MDTs from speech processor programs would result in improved speech recognition. Five hundred millisecond trains of symmetric-biphasic pulses were modulated sinusoidally at 10 Hz and presented at a rate of 900 pps using monopolar stimulation. Subjects were asked to discriminate a modulated pulse train from an unmodulated pulse train for all electrodes in quiet and in the presence of an interleaved unmodulated masker presented on the adjacent site. Across-site patterns of masked MDTs were then used to construct two 10-channel MAPs such that one MAP consisted of sites with the best masked MDTs and the other MAP consisted of sites with the worst masked MDTs. Subjects’ speech recognition skills were compared when they used these two different MAPs. Results showed that MDTs were variable across sites and were elevated in the presence of a masker by various amounts across sites. Better speech recognition was observed when the processor MAP consisted of sites with best masked MDTs, suggesting that temporal modulation sensitivity has important contributions to speech recognition with a cochlear implant. PMID:22559376
Kollmeier, Birger; Schädler, Marc René; Warzybok, Anna; Meyer, Bernd T; Brand, Thomas
2016-09-07
To characterize the individual patient's hearing impairment as obtained with the matrix sentence recognition test, a simulation Framework for Auditory Discrimination Experiments (FADE) is extended here using the Attenuation and Distortion (A+D) approach by Plomp as a blueprint for setting the individual processing parameters. FADE has been shown to predict the outcome of both speech recognition tests and psychoacoustic experiments based on simulations using an automatic speech recognition system requiring only few assumptions. It builds on the closed-set matrix sentence recognition test which is advantageous for testing individual speech recognition in a way comparable across languages. Individual predictions of speech recognition thresholds in stationary and in fluctuating noise were derived using the audiogram and an estimate of the internal level uncertainty for modeling the individual Plomp curves fitted to the data with the Attenuation (A-) and Distortion (D-) parameters of the Plomp approach. The "typical" audiogram shapes from Bisgaard et al with or without a "typical" level uncertainty and the individual data were used for individual predictions. As a result, the individualization of the level uncertainty was found to be more important than the exact shape of the individual audiogram to accurately model the outcome of the German Matrix test in stationary or fluctuating noise for listeners with hearing impairment. The prediction accuracy of the individualized approach also outperforms the (modified) Speech Intelligibility Index approach which is based on the individual threshold data only. © The Author(s) 2016.
Fu, Q Y; Liang, Y; Zou, A; Wang, T; Zhao, X D; Wan, J
2016-04-07
To investigate the relationships between electrophysiological characteristic of speech evoked auditory brainstem response(s-ABR) and Mandarin phonetically balanced maximum(PBmax) at different hearing impairment, so as to provide more clues for the mechanism of speech cognitive behavior. Forty-one ears in 41 normal hearing adults(NH), thirty ears in 30 conductive hearing loss patients(CHL) and twenty-seven ears in 27 sensorineural hearing loss patients(SNHL) were included in present study. The speech discrimination scores were obtained by Mandarin phonemic-balanced monosyllable lists via speech audiometric software. Their s-ABRs were recorded with speech syllables /da/ with the intensity of phonetically balanced maximum(PBmax). The electrophysiological characteristic of s-ABR, as well as the relationships between PBmax and s-ABR parameters including latency in time domain, fundamental frequency(F0) and first formant(F1) in frequency domain were analyzed statistically. All subjects completed good speech perception tests and PBmax of CHL and SNHL had no significant difference (P>0.05), but both significantly less than that of NH (P<0.05). While divided the subjects into three groups by 90%
Science 101: How Does Speech-Recognition Software Work?
ERIC Educational Resources Information Center
Robertson, Bill
2016-01-01
This column provides background science information for elementary teachers. Many innovations with computer software begin with analysis of how humans do a task. This article takes a look at how humans recognize spoken words and explains the origins of speech-recognition software.
Modernising speech audiometry: using a smartphone application to test word recognition.
van Zyl, Marianne; Swanepoel, De Wet; Myburgh, Hermanus C
2018-04-20
This study aimed to develop and assess a method to measure word recognition abilities using a smartphone application (App) connected to an audiometer. Word lists were recorded in South African English and Afrikaans. Analyses were conducted to determine the effect of hardware used for presentation (computer, compact-disc player, or smartphone) on the frequency content of recordings. An Android App was developed to enable presentation of recorded materials via a smartphone connected to the auxiliary input of the audiometer. Experiments were performed to test feasibility and validity of the developed App and recordings. Participants were 100 young adults (18-30 years) with pure tone thresholds ≤15 dB across the frequency spectrum (250-8000 Hz). Hardware used for presentation had no significant effect on the frequency content of recordings. Listening experiments indicated good inter-list reliability for recordings in both languages, with no significant differences between scores on different lists at each of the tested intensities. Performance-intensity functions had slopes of 4.05%/dB for English and 4.75%/dB for Afrikaans lists at the 50% point. The developed smartphone App constitutes a feasible and valid method for measuring word recognition scores, and can support standardisation and accessibility of recorded speech audiometry.
Watch what you say, your computer might be listening: A review of automated speech recognition
NASA Technical Reports Server (NTRS)
Degennaro, Stephen V.
1991-01-01
Spoken language is the most convenient and natural means by which people interact with each other and is, therefore, a promising candidate for human-machine interactions. Speech also offers an additional channel for hands-busy applications, complementing the use of motor output channels for control. Current speech recognition systems vary considerably across a number of important characteristics, including vocabulary size, speaking mode, training requirements for new speakers, robustness to acoustic environments, and accuracy. Algorithmically, these systems range from rule-based techniques through more probabilistic or self-learning approaches such as hidden Markov modeling and neural networks. This tutorial begins with a brief summary of the relevant features of current speech recognition systems and the strengths and weaknesses of the various algorithmic approaches.
Soli, Sigfrid D; Amano-Kusumoto, Akiko; Clavier, Odile; Wilbur, Jed; Casto, Kristen; Freed, Daniel; Laroche, Chantal; Vaillancourt, Véronique; Giguère, Christian; Dreschler, Wouter A; Rhebergen, Koenraad S
2018-05-01
Validate use of the Extended Speech Intelligibility Index (ESII) for prediction of speech intelligibility in non-stationary real-world noise environments. Define a means of using these predictions for objective occupational hearing screening for hearing-critical public safety and law enforcement jobs. Analyses of predicted and measured speech intelligibility in recordings of real-world noise environments were performed in two studies using speech recognition thresholds (SRTs) and intelligibility measures. ESII analyses of the recordings were used to predict intelligibility. Noise recordings were made in prison environments and at US Army facilities for training ground and airborne forces. Speech materials included full bandwidth sentences and bandpass filtered sentences that simulated radio transmissions. A total of 22 adults with normal hearing (NH) and 15 with mild-moderate hearing impairment (HI) participated in the two studies. Average intelligibility predictions for individual NH and HI subjects were accurate in both studies (r 2 ≥ 0.94). Pooled predictions were slightly less accurate (0.78 ≤ r 2 ≤ 0.92). An individual's SRT and audiogram can accurately predict the likelihood of effective speech communication in noise environments with known ESII characteristics, where essential hearing-critical tasks are performed. These predictions provide an objective means of occupational hearing screening.
Investigations in mechanisms and strategies to enhance hearing with cochlear implants
NASA Astrophysics Data System (ADS)
Churchill, Tyler H.
Cochlear implants (CIs) produce hearing sensations by stimulating the auditory nerve (AN) with current pulses whose amplitudes are modulated by filtered acoustic temporal envelopes. While this technology has provided hearing for multitudinous CI recipients, even bilaterally-implanted listeners have more difficulty understanding speech in noise and localizing sounds than normal hearing (NH) listeners. Three studies reported here have explored ways to improve electric hearing abilities. Vocoders are often used to simulate CIs for NH listeners. Study 1 was a psychoacoustic vocoder study examining the effects of harmonic carrier phase dispersion and simulated CI current spread on speech intelligibility in noise. Results showed that simulated current spread was detrimental to speech understanding and that speech vocoded with carriers whose components' starting phases were equal was the least intelligible. Cross-correlogram analyses of AN model simulations confirmed that carrier component phase dispersion resulted in better neural envelope representation. Localization abilities rely on binaural processing mechanisms in the brainstem and mid-brain that are not fully understood. In Study 2, several potential mechanisms were evaluated based on the ability of metrics extracted from stereo AN simulations to predict azimuthal locations. Results suggest that unique across-frequency patterns of binaural cross-correlation may provide a strong cue set for lateralization and that interaural level differences alone cannot explain NH sensitivity to lateral position. While it is known that many bilateral CI users are sensitive to interaural time differences (ITDs) in low-rate pulsatile stimulation, most contemporary CI processing strategies use high-rate, constant-rate pulse trains. In Study 3, we examined the effects of pulse rate and pulse timing on ITD discrimination, ITD lateralization, and speech recognition by bilateral CI listeners. Results showed that listeners were able to use low-rate pulse timing cues presented redundantly on multiple electrodes for ITD discrimination and lateralization of speech stimuli even when mixed with high rates on other electrodes. These results have contributed to a better understanding of those aspects of the auditory system that support speech understanding and binaural hearing, suggested vocoder parameters that may simulate aspects of electric hearing, and shown that redundant, low-rate pulse timing supports improved spatial hearing for bilateral CI listeners.
Perception of Sung Speech in Bimodal Cochlear Implant Users.
Crew, Joseph D; Galvin, John J; Fu, Qian-Jie
2016-11-11
Combined use of a hearing aid (HA) and cochlear implant (CI) has been shown to improve CI users' speech and music performance. However, different hearing devices, test stimuli, and listening tasks may interact and obscure bimodal benefits. In this study, speech and music perception were measured in bimodal listeners for CI-only, HA-only, and CI + HA conditions, using the Sung Speech Corpus, a database of monosyllabic words produced at different fundamental frequencies. Sentence recognition was measured using sung speech in which pitch was held constant or varied across words, as well as for spoken speech. Melodic contour identification (MCI) was measured using sung speech in which the words were held constant or varied across notes. Results showed that sentence recognition was poorer with sung speech relative to spoken, with little difference between sung speech with a constant or variable pitch; mean performance was better with CI-only relative to HA-only, and best with CI + HA. MCI performance was better with constant words versus variable words; mean performance was better with HA-only than with CI-only and was best with CI + HA. Relative to CI-only, a strong bimodal benefit was observed for speech and music perception. Relative to the better ear, bimodal benefits remained strong for sentence recognition but were marginal for MCI. While variations in pitch and timbre may negatively affect CI users' speech and music perception, bimodal listening may partially compensate for these deficits. © The Author(s) 2016.
Sperry Univac speech communications technology
NASA Technical Reports Server (NTRS)
Medress, Mark F.
1977-01-01
Technology and systems for effective verbal communication with computers were developed. A continuous speech recognition system for verbal input, a word spotting system to locate key words in conversational speech, prosodic tools to aid speech analysis, and a prerecorded voice response system for speech output are described.
Król, Magdalena Ewa
2018-01-01
We investigated the effect of auditory noise added to speech on patterns of looking at faces in 40 toddlers. We hypothesised that noise would increase the difficulty of processing speech, making children allocate more attention to the mouth of the speaker to gain visual speech cues from mouth movements. We also hypothesised that this shift would cause a decrease in fixation time to the eyes, potentially decreasing the ability to monitor gaze. We found that adding noise increased the number of fixations to the mouth area, at the price of a decreased number of fixations to the eyes. Thus, to our knowledge, this is the first study demonstrating a mouth-eyes trade-off between attention allocated to social cues coming from the eyes and linguistic cues coming from the mouth. We also found that children with higher word recognition proficiency and higher average pupil response had an increased likelihood of fixating the mouth, compared to the eyes and the rest of the screen, indicating stronger motivation to decode the speech.
2018-01-01
We investigated the effect of auditory noise added to speech on patterns of looking at faces in 40 toddlers. We hypothesised that noise would increase the difficulty of processing speech, making children allocate more attention to the mouth of the speaker to gain visual speech cues from mouth movements. We also hypothesised that this shift would cause a decrease in fixation time to the eyes, potentially decreasing the ability to monitor gaze. We found that adding noise increased the number of fixations to the mouth area, at the price of a decreased number of fixations to the eyes. Thus, to our knowledge, this is the first study demonstrating a mouth-eyes trade-off between attention allocated to social cues coming from the eyes and linguistic cues coming from the mouth. We also found that children with higher word recognition proficiency and higher average pupil response had an increased likelihood of fixating the mouth, compared to the eyes and the rest of the screen, indicating stronger motivation to decode the speech. PMID:29558514
ERIC Educational Resources Information Center
Raskind, Marshall
1993-01-01
This article describes assistive technologies for persons with learning disabilities, including word processing, spell checking, proofreading programs, outlining/"brainstorming" programs, abbreviation expanders, speech recognition, speech synthesis/screen review, optical character recognition systems, personal data managers, free-form databases,…
Speech Recognition for A Digital Video Library.
ERIC Educational Resources Information Center
Witbrock, Michael J.; Hauptmann, Alexander G.
1998-01-01
Production of the meta-data supporting the Informedia Digital Video Library interface is automated using techniques derived from artificial intelligence research. Speech recognition and natural-language processing, information retrieval, and image analysis are applied to produce an interface that helps users locate information and navigate more…
Speech as a pilot input medium
NASA Technical Reports Server (NTRS)
Plummer, R. P.; Coler, C. R.
1977-01-01
The speech recognition system under development is a trainable pattern classifier based on a maximum-likelihood technique. An adjustable uncertainty threshold allows the rejection of borderline cases for which the probability of misclassification is high. The syntax of the command language spoken may be used as an aid to recognition, and the system adapts to changes in pronunciation if feedback from the user is available. Words must be separated by .25 second gaps. The system runs in real time on a mini-computer (PDP 11/10) and was tested on 120,000 speech samples from 10- and 100-word vocabularies. The results of these tests were 99.9% correct recognition for a vocabulary consisting of the ten digits, and 99.6% recognition for a 100-word vocabulary of flight commands, with a 5% rejection rate in each case. With no rejection, the recognition accuracies for the same vocabularies were 99.5% and 98.6% respectively.
Everyday listening questionnaire: correlation between subjective hearing and objective performance.
Brendel, Martina; Frohne-Buechner, Carolin; Lesinski-Schiedat, Anke; Lenarz, Thomas; Buechner, Andreas
2014-01-01
Clinical experience has demonstrated that speech understanding by cochlear implant (CI) recipients has improved over recent years with the development of new technology. The Everyday Listening Questionnaire 2 (ELQ 2) was designed to collect information regarding the challenges faced by CI recipients in everyday listening. The aim of this study was to compare self-assessment of CI users using ELQ 2 with objective speech recognition measures and to compare results between users of older and newer coding strategies. During their regular clinical review appointments a group of representative adult CI recipients implanted with the Advanced Bionics implant system were asked to complete the questionnaire. The first 100 patients who agreed to participate in this survey were recruited independent of processor generation and speech coding strategy. Correlations between subjectively scored hearing performance in everyday listening situations and objectively measured speech perception abilities were examined relative to the speech coding strategies used. When subjects were grouped by strategy there were significant differences between users of older 'standard' strategies and users of the newer, currently available strategies (HiRes and HiRes 120), especially in the categories of telephone use and music perception. Significant correlations were found between certain subjective ratings and the objective speech perception data in noise. There is a good correlation between subjective and objective data. Users of more recent speech coding strategies tend to have fewer problems in difficult hearing situations.
Listeners Experience Linguistic Masking Release in Noise-Vocoded Speech-in-Speech Recognition
ERIC Educational Resources Information Center
Viswanathan, Navin; Kokkinakis, Kostas; Williams, Brittany T.
2018-01-01
Purpose: The purpose of this study was to evaluate whether listeners with normal hearing perceiving noise-vocoded speech-in-speech demonstrate better intelligibility of target speech when the background speech was mismatched in language (linguistic release from masking [LRM]) and/or location (spatial release from masking [SRM]) relative to the…
Detection of target phonemes in spontaneous and read speech.
Mehta, G; Cutler, A
1988-01-01
Although spontaneous speech occurs more frequently in most listeners' experience than read speech, laboratory studies of human speech recognition typically use carefully controlled materials read from a script. The phonological and prosodic characteristics of spontaneous and read speech differ considerably, however, which suggests that laboratory results may not generalise to the recognition of spontaneous speech. In the present study listeners were presented with both spontaneous and read speech materials, and their response time to detect word-initial target phonemes was measured. Responses were, overall, equally fast in each speech mode. However, analysis of effects previously reported in phoneme detection studies revealed significant differences between speech modes. In read speech but not in spontaneous speech, later targets were detected more rapidly than targets preceded by short words. In contrast, in spontaneous speech but not in read speech, targets were detected more rapidly in accented than in unaccented words and in strong than in weak syllables. An explanation for this pattern is offered in terms of characteristic prosodic differences between spontaneous and read speech. The results support claims from previous work that listeners pay great attention to prosodic information in the process of recognising speech.
LANDMARK-BASED SPEECH RECOGNITION: REPORT OF THE 2004 JOHNS HOPKINS SUMMER WORKSHOP.
Hasegawa-Johnson, Mark; Baker, James; Borys, Sarah; Chen, Ken; Coogan, Emily; Greenberg, Steven; Juneja, Amit; Kirchhoff, Katrin; Livescu, Karen; Mohan, Srividya; Muller, Jennifer; Sonmez, Kemal; Wang, Tianyu
2005-01-01
Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.
Shen, Yi; Kern, Allison B.
2018-01-01
Individual differences in the recognition of monosyllabic words, either in isolation (NU6 test) or in sentence context (SPIN test), were investigated under the theoretical framework of the speech intelligibility index (SII). An adaptive psychophysical procedure, namely the quick-band-importance-function procedure, was developed to enable the fitting of the SII model to individual listeners. Using this procedure, the band importance function (i.e., the relative weights of speech information across the spectrum) and the link function relating the SII to recognition scores can be simultaneously estimated while requiring only 200 to 300 trials of testing. Octave-frequency band importance functions and link functions were estimated separately for NU6 and SPIN materials from 30 normal-hearing listeners who were naïve to speech recognition experiments. For each type of speech material, considerable individual differences in the spectral weights were observed in some but not all frequency regions. At frequencies where the greatest intersubject variability was found, the spectral weights were correlated between the two speech materials, suggesting that the variability in spectral weights reflected listener-originated factors. PMID:29532711
González Sánchez, María José; Framiñán Torres, José Manuel; Parra Calderón, Carlos Luis; Del Río Ortega, Juan Antonio; Vigil Martín, Eduardo; Nieto Cervera, Jaime
2008-01-01
We present a methodology based on Business Process Management to guide the development of a speech recognition system in a hospital in Spain. The methodology eases the deployment of the system by 1) involving the clinical staff in the process, 2) providing the IT professionals with a description of the process and its requirements, 3) assessing advantages and disadvantages of the speech recognition system, as well as its impact in the organisation, and 4) help reorganising the healthcare process before implementing the new technology in order to identify how it can better contribute to the overall objective of the organisation.
ERIC Educational Resources Information Center
Gelfand, Stanley A.; Gelfand, Jessica T.
2012-01-01
Method: Complete psychometric functions for phoneme and word recognition scores at 8 signal-to-noise ratios from -15 dB to 20 dB were generated for the first 10, 20, and 25, as well as all 50, three-word presentations of the Tri-Word or Computer Assisted Speech Recognition Assessment (CASRA) Test (Gelfand, 1998) based on the results of 12…
NASA Astrophysics Data System (ADS)
Kuznetsov, Michael V.
2006-05-01
For reliable teamwork of various systems of automatic telecommunication including transferring systems of optical communication networks it is necessary authentic recognition of signals for one- or two-frequency service signal system. The analysis of time parameters of an accepted signal allows increasing reliability of detection and recognition of the service signal system on a background of speech.
Walczak, Adam; Ahlstrom, Jayne; Denslow, Stewart; Horwitz, Amy; Dubno, Judy R.
2008-01-01
Speech recognition can be difficult and effortful for older adults, even for those with normal hearing. Declining frontal lobe cognitive control has been hypothesized to cause age-related speech recognition problems. This study examined age-related changes in frontal lobe function for 15 clinically normal hearing adults (21–75 years) when they performed a word recognition task that was made challenging by decreasing word intelligibility. Although there were no age-related changes in word recognition, there were age-related changes in the degree of activity within left middle frontal gyrus (MFG) and anterior cingulate (ACC) regions during word recognition. Older adults engaged left MFG and ACC regions when words were most intelligible compared to younger adults who engaged these regions when words were least intelligible. Declining gray matter volume within temporal lobe regions responsive to word intelligibility significantly predicted left MFG activity, even after controlling for total gray matter volume, suggesting that declining structural integrity of brain regions responsive to speech leads to the recruitment of frontal regions when words are easily understood. Electronic supplementary material The online version of this article (doi:10.1007/s10162-008-0113-3) contains supplementary material, which is available to authorized users. PMID:18274825
An audiovisual emotion recognition system
NASA Astrophysics Data System (ADS)
Han, Yi; Wang, Guoyin; Yang, Yong; He, Kun
2007-12-01
Human emotions could be expressed by many bio-symbols. Speech and facial expression are two of them. They are both regarded as emotional information which is playing an important role in human-computer interaction. Based on our previous studies on emotion recognition, an audiovisual emotion recognition system is developed and represented in this paper. The system is designed for real-time practice, and is guaranteed by some integrated modules. These modules include speech enhancement for eliminating noises, rapid face detection for locating face from background image, example based shape learning for facial feature alignment, and optical flow based tracking algorithm for facial feature tracking. It is known that irrelevant features and high dimensionality of the data can hurt the performance of classifier. Rough set-based feature selection is a good method for dimension reduction. So 13 speech features out of 37 ones and 10 facial features out of 33 ones are selected to represent emotional information, and 52 audiovisual features are selected due to the synchronization when speech and video fused together. The experiment results have demonstrated that this system performs well in real-time practice and has high recognition rate. Our results also show that the work in multimodules fused recognition will become the trend of emotion recognition in the future.
Noble, Jack H.; Camarata, Stephen M.; Sunderhaus, Linsey W.; Dwyer, Robert T.; Dawant, Benoit M.; Dietrich, Mary S.; Labadie, Robert F.
2018-01-01
Adult cochlear implant (CI) recipients demonstrate a reliable relationship between spectral modulation detection and speech understanding. Prior studies documenting this relationship have focused on postlingually deafened adult CI recipients—leaving an open question regarding the relationship between spectral resolution and speech understanding for adults and children with prelingual onset of deafness. Here, we report CI performance on the measures of speech recognition and spectral modulation detection for 578 CI recipients including 477 postlingual adults, 65 prelingual adults, and 36 prelingual pediatric CI users. The results demonstrated a significant correlation between spectral modulation detection and various measures of speech understanding for 542 adult CI recipients. For 36 pediatric CI recipients, however, there was no significant correlation between spectral modulation detection and speech understanding in quiet or in noise nor was spectral modulation detection significantly correlated with listener age or age at implantation. These findings suggest that pediatric CI recipients might not depend upon spectral resolution for speech understanding in the same manner as adult CI recipients. It is possible that pediatric CI users are making use of different cues, such as those contained within the temporal envelope, to achieve high levels of speech understanding. Further investigation is warranted to investigate the relationship between spectral and temporal resolution and speech recognition to describe the underlying mechanisms driving peripheral auditory processing in pediatric CI users. PMID:29716437
Speech Recognition Using Multiple Features and Multiple Recognizers
1991-12-03
6 2.1 Introduction ............................................... 6 2.2 Human Speech Communication Process...119 How to Setup ASRT.......................................... 119 How to Use Interactive Menus .................................. 120...recognize a word from an acoustic signal. The human ear and brain perform this type of recognition with incredible speed and precision. Even though
Holden, Laura K; Firszt, Jill B; Reeder, Ruth M; Uchanski, Rosalie M; Dwyer, Noël Y; Holden, Timothy A
2016-12-01
To identify primary biographic and audiologic factors contributing to cochlear implant (CI) performance variability in quiet and noise by controlling electrode array type and electrode position within the cochlea. Although CI outcomes have improved over time, considerable outcome variability still exists. Biographic, audiologic, and device-related factors have been shown to influence performance. Examining CI recipients with consistent array type and electrode position may allow focused investigation into outcome variability resulting from biographic and audiologic factors. Thirty-nine adults (40 ears) implanted for at least 6 months with a perimodiolar electrode array known (via computed tomography [CT] imaging) to be in scala tympani participated. Test materials, administered CI only, included monosyllabic words, sentences in quiet and noise, and spectral ripple discrimination. In quiet, scores were high with mean word and sentence scores of 76 and 87%, respectively; however, sentence scores decreased by an average of 35 percentage points when noise was added. A principal components (PC) analysis of biographic and audiologic factors found three distinct factors, PC1 Age, PC2 Duration, and PC3 Pre-op Hearing. PC1 Age was the only factor that correlated, albeit modestly, with speech recognition in quiet and noise. Spectral ripple discrimination strongly correlated with speech measures. For these recipients with consistent electrode position, PC1 Age was related to speech recognition performance. Consistent electrode position may have contributed to high speech understanding in quiet. Inter-subject variability in noise may have been influenced by auditory/cognitive processing, known to decline with age, and mechanisms that underlie spectral resolution ability.
Potts, Lisa G; Skinner, Margaret W; Litovsky, Ruth A; Strube, Michael J; Kuk, Francis
2009-06-01
The use of bilateral amplification is now common clinical practice for hearing aid users but not for cochlear implant recipients. In the past, most cochlear implant recipients were implanted in one ear and wore only a monaural cochlear implant processor. There has been recent interest in benefits arising from bilateral stimulation that may be present for cochlear implant recipients. One option for bilateral stimulation is the use of a cochlear implant in one ear and a hearing aid in the opposite nonimplanted ear (bimodal hearing). This study evaluated the effect of wearing a cochlear implant in one ear and a digital hearing aid in the opposite ear on speech recognition and localization. A repeated-measures correlational study was completed. Nineteen adult Cochlear Nucleus 24 implant recipients participated in the study. The participants were fit with a Widex Senso Vita 38 hearing aid to achieve maximum audibility and comfort within their dynamic range. Soundfield thresholds, loudness growth, speech recognition, localization, and subjective questionnaires were obtained six-eight weeks after the hearing aid fitting. Testing was completed in three conditions: hearing aid only, cochlear implant only, and cochlear implant and hearing aid (bimodal). All tests were repeated four weeks after the first test session. Repeated-measures analysis of variance was used to analyze the data. Significant effects were further examined using pairwise comparison of means or in the case of continuous moderators, regression analyses. The speech-recognition and localization tasks were unique, in that a speech stimulus presented from a variety of roaming azimuths (140 degree loudspeaker array) was used. Performance in the bimodal condition was significantly better for speech recognition and localization compared to the cochlear implant-only and hearing aid-only conditions. Performance was also different between these conditions when the location (i.e., side of the loudspeaker array that presented the word) was analyzed. In the bimodal condition, the speech-recognition and localization tasks were equal regardless of which side of the loudspeaker array presented the word, while performance was significantly poorer for the monaural conditions (hearing aid only and cochlear implant only) when the words were presented on the side with no stimulation. Binaural loudness summation of 1-3 dB was seen in soundfield thresholds and loudness growth in the bimodal condition. Measures of the audibility of sound with the hearing aid, including unaided thresholds, soundfield thresholds, and the Speech Intelligibility Index, were significant moderators of speech recognition and localization. Based on the questionnaire responses, participants showed a strong preference for bimodal stimulation. These findings suggest that a well-fit digital hearing aid worn in conjunction with a cochlear implant is beneficial to speech recognition and localization. The dynamic test procedures used in this study illustrate the importance of bilateral hearing for locating, identifying, and switching attention between multiple speakers. It is recommended that unilateral cochlear implant recipients, with measurable unaided hearing thresholds, be fit with a hearing aid.
The Effectiveness of Clear Speech as a Masker
ERIC Educational Resources Information Center
Calandruccio, Lauren; Van Engen, Kristin; Dhar, Sumitrajit; Bradlow, Ann R.
2010-01-01
Purpose: It is established that speaking clearly is an effective means of enhancing intelligibility. Because any signal-processing scheme modeled after known acoustic-phonetic features of clear speech will likely affect both target and competing speech, it is important to understand how speech recognition is affected when a competing speech signal…
Parker, Mark; Cunningham, Stuart; Enderby, Pam; Hawley, Mark; Green, Phil
2006-01-01
The STARDUST project developed robust computer speech recognizers for use by eight people with severe dysarthria and concomitant physical disability to access assistive technologies. Independent computer speech recognizers trained with normal speech are of limited functional use by those with severe dysarthria due to limited and inconsistent proximity to "normal" articulatory patterns. Severe dysarthric output may also be characterized by a small mass of distinguishable phonetic tokens making the acoustic differentiation of target words difficult. Speaker dependent computer speech recognition using Hidden Markov Models was achieved by the identification of robust phonetic elements within the individual speaker output patterns. A new system of speech training using computer generated visual and auditory feedback reduced the inconsistent production of key phonetic tokens over time.
Visual speech information: a help or hindrance in perceptual processing of dysarthric speech.
Borrie, Stephanie A
2015-03-01
This study investigated the influence of visual speech information on perceptual processing of neurologically degraded speech. Fifty listeners identified spastic dysarthric speech under both audio (A) and audiovisual (AV) conditions. Condition comparisons revealed that the addition of visual speech information enhanced processing of the neurologically degraded input in terms of (a) acuity (percent phonemes correct) of vowels and consonants and (b) recognition (percent words correct) of predictive and nonpredictive phrases. Listeners exploited stress-based segmentation strategies more readily in AV conditions, suggesting that the perceptual benefit associated with adding visual speech information to the auditory signal-the AV advantage-has both segmental and suprasegmental origins. Results also revealed that the magnitude of the AV advantage can be predicted, to some degree, by the extent to which an individual utilizes syllabic stress cues to inform word recognition in AV conditions. Findings inform the development of a listener-specific model of speech perception that applies to processing of dysarthric speech in everyday communication contexts.
Chatterjee, Monita; Peng, Shu-Chen
2008-01-01
Fundamental frequency (F0) processing by cochlear implant (CI) listeners was measured using a psychophysical task and a speech intonation recognition task. Listeners' Weber fractions for modulation frequency discrimination were measured using an adaptive, 3-interval, forced-choice paradigm: stimuli were presented through a custom research interface. In the speech intonation recognition task, listeners were asked to indicate whether resynthesized bisyllabic words, when presented in the free field through the listeners' everyday speech processor, were question-like or statement-like. The resynthesized tokens were systematically manipulated to have different initial-F0s to represent male vs. female voices, and different F0 contours (i.e. falling, flat, and rising) Although the CI listeners showed considerable variation in performance on both tasks, significant correlations were observed between the CI listeners' sensitivity to modulation frequency in the psychophysical task and their performance in intonation recognition. Consistent with their greater reliance on temporal cues, the CI listeners' performance in the intonation recognition task was significantly poorer with the higher initial-F0 stimuli than with the lower initial-F0 stimuli. Similar results were obtained with normal hearing listeners attending to noiseband-vocoded CI simulations with reduced spectral resolution.
Chatterjee, Monita; Peng, Shu-Chen
2008-01-01
Fundamental frequency (F0) processing by cochlear implant (CI) listeners was measured using a psychophysical task and a speech intonation recognition task. Listeners’ Weber fractions for modulation frequency discrimination were measured using an adaptive, 3-interval, forced-choice paradigm: stimuli were presented through a custom research interface. In the speech intonation recognition task, listeners were asked to indicate whether resynthesized bisyllabic words, when presented in the free field through the listeners’ everyday speech processor, were question-like or statement-like. The resynthesized tokens were systematically manipulated to have different initial F0s to represent male vs. female voices, and different F0 contours (i.e., falling, flat, and rising) Although the CI listeners showed considerable variation in performance on both tasks, significant correlations were observed between the CI listeners’ sensitivity to modulation frequency in the psychophysical task and their performance in intonation recognition. Consistent with their greater reliance on temporal cues, the CI listeners’ performance in the intonation recognition task was significantly poorer with the higher initial-F0 stimuli than with the lower initial-F0 stimuli. Similar results were obtained with normal hearing listeners attending to noiseband-vocoded CI simulations with reduced spectral resolution. PMID:18093766
Benefits of adaptive FM systems on speech recognition in noise for listeners who use hearing aids.
Thibodeau, Linda
2010-06-01
To compare the benefits of adaptive FM and fixed FM systems through measurement of speech recognition in noise with adults and students in clinical and real-world settings. Five adults and 5 students with moderate-to-severe hearing loss completed objective and subjective speech recognition in noise measures with the 2 types of FM processing. Sentence recognition was evaluated in a classroom for 5 competing noise levels ranging from 54 to 80 dBA while the FM microphone was positioned 6 in. from the signal loudspeaker to receive input at 84 dB SPL. The subjective measures included 2 classroom activities and 6 auditory lessons in a noisy, public aquarium. On the objective measures, adaptive FM processing resulted in significantly better speech recognition in noise than fixed FM processing for 68- and 73-dBA noise levels. On the subjective measures, all individuals preferred adaptive over fixed processing for half of the activities. Adaptive processing was also preferred by most (8-9) individuals for the remaining 4 activities. The adaptive FM processing resulted in significant improvements at the higher noise levels and was preferred by the majority of participants in most of the conditions.
Miller, Christi W; Stewart, Erin K; Wu, Yu-Hsiang; Bishop, Christopher; Bentler, Ruth A; Tremblay, Kelly
2017-08-16
This study evaluated the relationship between working memory (WM) and speech recognition in noise with different noise types as well as in the presence of visual cues. Seventy-six adults with bilateral, mild to moderately severe sensorineural hearing loss (mean age: 69 years) participated. Using a cross-sectional design, 2 measures of WM were taken: a reading span measure, and Word Auditory Recognition and Recall Measure (Smith, Pichora-Fuller, & Alexander, 2016). Speech recognition was measured with the Multi-Modal Lexical Sentence Test for Adults (Kirk et al., 2012) in steady-state noise and 4-talker babble, with and without visual cues. Testing was under unaided conditions. A linear mixed model revealed visual cues and pure-tone average as the only significant predictors of Multi-Modal Lexical Sentence Test outcomes. Neither WM measure nor noise type showed a significant effect. The contribution of WM in explaining unaided speech recognition in noise was negligible and not influenced by noise type or visual cues. We anticipate that with audibility partially restored by hearing aids, the effects of WM will increase. For clinical practice to be affected, more significant effect sizes are needed.
Neuronal Spoken Word Recognition: The Time Course of Processing Variation in the Speech Signal
ERIC Educational Resources Information Center
Schild, Ulrike; Roder, Brigitte; Friedrich, Claudia K.
2012-01-01
Recent neurobiological studies revealed evidence for lexical representations that are not specified for the coronal place of articulation (PLACE; Friedrich, Eulitz, & Lahiri, 2006; Friedrich, Lahiri, & Eulitz, 2008). Here we tested when these types of underspecified representations influence neuronal speech recognition. In a unimodal…
Recognizing Speech under a Processing Load: Dissociating Energetic from Informational Factors
ERIC Educational Resources Information Center
Mattys, Sven L.; Brooks, Joanna; Cooke, Martin
2009-01-01
Effects of perceptual and cognitive loads on spoken-word recognition have so far largely escaped investigation. This study lays the foundations of a psycholinguistic approach to speech recognition in adverse conditions that draws upon the distinction between energetic masking, i.e., listening environments leading to signal degradation, and…
Bilingual Computerized Speech Recognition Screening for Depression Symptoms
ERIC Educational Resources Information Center
Gonzalez, Gerardo; Carter, Colby; Blanes, Erika
2007-01-01
The Voice-Interactive Depression Assessment System (VIDAS) is a computerized speech recognition application for screening depression based on the Center for Epidemiological Studies--Depression scale in English and Spanish. Study 1 included 50 English and 47 Spanish speakers. Study 2 involved 108 English and 109 Spanish speakers. Participants…
Multichannel Compression, Temporal Cues, and Audibility.
ERIC Educational Resources Information Center
Souza, Pamela E.; Turner, Christopher W.
1998-01-01
The effect of the reduction of the temporal envelope produced by multichannel compression on recognition was examined in 16 listeners with hearing loss, with particular focus on audibility of the speech signal. Multichannel compression improved speech recognition when superior audibility was provided by a two-channel compression system over linear…
The effect of hearing aid technologies on listening in an automobile.
Wu, Yu-Hsiang; Stangl, Elizabeth; Bentler, Ruth A; Stanziola, Rachel W
2013-06-01
Communication while traveling in an automobile often is very difficult for hearing aid users. This is because the automobile/road noise level is usually high, and listeners/drivers often do not have access to visual cues. Since the talker of interest usually is not located in front of the listener/driver, conventional directional processing that places the directivity beam toward the listener's front may not be helpful and, in fact, could have a negative impact on speech recognition (when compared to omnidirectional processing). Recently, technologies have become available in commercial hearing aids that are designed to improve speech recognition and/or listening effort in noisy conditions where talkers are located behind or beside the listener. These technologies include (1) a directional microphone system that uses a backward-facing directivity pattern (Back-DIR processing), (2) a technology that transmits audio signals from the ear with the better signal-to-noise ratio (SNR) to the ear with the poorer SNR (Side-Transmission processing), and (3) a signal processing scheme that suppresses the noise at the ear with the poorer SNR (Side-Suppression processing). The purpose of the current study was to determine the effect of (1) conventional directional microphones and (2) newer signal processing schemes (Back-DIR, Side-Transmission, and Side-Suppression) on listener's speech recognition performance and preference for communication in a traveling automobile. A single-blinded, repeated-measures design was used. Twenty-five adults with bilateral symmetrical sensorineural hearing loss aged 44 through 84 yr participated in the study. The automobile/road noise and sentences of the Connected Speech Test (CST) were recorded through hearing aids in a standard van moving at a speed of 70 mph on a paved highway. The hearing aids were programmed to omnidirectional microphone, conventional adaptive directional microphone, and the three newer schemes. CST sentences were presented from the side and back of the hearing aids, which were placed on the ears of a manikin. The recorded stimuli were presented to listeners via earphones in a sound-treated booth to assess speech recognition performance and preference with each programmed condition. Compared to omnidirectional microphones, conventional adaptive directional processing had a detrimental effect on speech recognition when speech was presented from the back or side of the listener. Back-DIR and Side-Transmission processing improved speech recognition performance (relative to both omnidirectional and adaptive directional processing) when speech was from the back and side, respectively. The performance with Side-Suppression processing was better than with adaptive directional processing when speech was from the side. The participants' preferences for a given processing scheme were generally consistent with speech recognition results. The finding that performance with adaptive directional processing was poorer than with omnidirectional microphones demonstrates the importance of selecting the correct microphone technology for different listening situations. The results also suggest the feasibility of using hearing aid technologies to provide a better listening experience for hearing aid users in automobiles. American Academy of Audiology.
Improving speech perception in noise for children with cochlear implants.
Gifford, René H; Olund, Amy P; Dejong, Melissa
2011-10-01
Current cochlear implant recipients are achieving increasingly higher levels of speech recognition; however, the presence of background noise continues to significantly degrade speech understanding for even the best performers. Newer generation Nucleus cochlear implant sound processors can be programmed with SmartSound strategies that have been shown to improve speech understanding in noise for adult cochlear implant recipients. The applicability of these strategies for use in children, however, is not fully understood nor widely accepted. To assess speech perception for pediatric cochlear implant recipients in the presence of a realistic restaurant simulation generated by an eight-loudspeaker (R-SPACE™) array in order to determine whether Nucleus sound processor SmartSound strategies yield improved sentence recognition in noise for children who learn language through the implant. Single subject, repeated measures design. Twenty-two experimental subjects with cochlear implants (mean age 11.1 yr) and 25 control subjects with normal hearing (mean age 9.6 yr) participated in this prospective study. Speech reception thresholds (SRT) in semidiffuse restaurant noise originating from an eight-loudspeaker array were assessed with the experimental subjects' everyday program incorporating Adaptive Dynamic Range Optimization (ADRO) as well as with the addition of Autosensitivity control (ASC). Adaptive SRTs with the Hearing In Noise Test (HINT) sentences were obtained for all 22 experimental subjects, and performance-in percent correct-was assessed in a fixed +6 dB SNR (signal-to-noise ratio) for a six-subject subset. Statistical analysis using a repeated-measures analysis of variance (ANOVA) evaluated the effects of the SmartSound setting on the SRT in noise. The primary findings mirrored those reported previously with adult cochlear implant recipients in that the addition of ASC to ADRO significantly improved speech recognition in noise for pediatric cochlear implant recipients. The mean degree of improvement in the SRT with the addition of ASC to ADRO was 3.5 dB for a mean SRT of 10.9 dB SNR. Thus, despite the fact that these children have acquired auditory/oral speech and language through the use of their cochlear implant(s) equipped with ADRO, the addition of ASC significantly improved their ability to recognize speech in high levels of diffuse background noise. The mean SRT for the control subjects with normal hearing was 0.0 dB SNR. Given that the mean SRT for the experimental group was 10.9 dB SNR, despite the improvements in performance observed with the addition of ASC, cochlear implants still do not completely overcome the speech perception deficit encountered in noisy environments accompanying the diagnosis of severe-to-profound hearing loss. SmartSound strategies currently available in latest generation Nucleus cochlear implant sound processors are able to significantly improve speech understanding in a realistic, semidiffuse noise for pediatric cochlear implant recipients. Despite the reluctance of pediatric audiologists to utilize SmartSound settings for regular use, the results of the current study support the addition of ASC to ADRO for everyday listening environments to improve speech perception in a child's typical everyday program. American Academy of Audiology.
[Cochlear implantation in patients with Waardenburg syndrome type II].
Wan, Liangcai; Guo, Menghe; Chen, Shuaijun; Liu, Shuangriu; Chen, Hao; Gong, Jian
2010-05-01
To describe the multi-channel cochlear implantation in patients with Waardenburg syndrome including surgeries, pre and postoperative hearing assessments as well as outcomes of speech recognition. Multi-channel cochlear implantation surgeries have been performed in 12 cases with Waardenburg syndrome type II in our department from 2000 to 2008. All the patients received multi-channel cochlear implantation through transmastoid facial recess approach. The postoperative outcomes of 12 cases were compared with 12 cases with no inner ear malformation as a control group. The electrodes were totally inserted into the cochlear successfully, there was no facial paralysis and cerebrospinal fluid leakage occurred after operation. The hearing threshold in this series were similar to that of the normal cochlear implantation. After more than half a year of speech rehabilitation, the abilities of speech discrimination and spoken language of all the patients were improved compared with that of preoperation. Multi-channel cochlear implantation could be performed in the cases with Waardenburg syndrome, preoperative hearing and images assessments should be done.
Calandruccio, Lauren; Bradlow, Ann R; Dhar, Sumitrajit
2014-04-01
Masking release for an English sentence-recognition task in the presence of foreign-accented English speech compared with native-accented English speech was reported in Calandruccio et al (2010a). The masking release appeared to increase as the masker intelligibility decreased. However, it could not be ruled out that spectral differences between the speech maskers were influencing the significant differences observed. The purpose of the current experiment was to minimize spectral differences between speech maskers to determine how various amounts of linguistic information within competing speech Affiliationect masking release. A mixed-model design with within-subject (four two-talker speech maskers) and between-subject (listener group) factors was conducted. Speech maskers included native-accented English speech and high-intelligibility, moderate-intelligibility, and low-intelligibility Mandarin-accented English. Normalizing the long-term average speech spectra of the maskers to each other minimized spectral differences between the masker conditions. Three listener groups were tested, including monolingual English speakers with normal hearing, nonnative English speakers with normal hearing, and monolingual English speakers with hearing loss. The nonnative English speakers were from various native language backgrounds, not including Mandarin (or any other Chinese dialect). Listeners with hearing loss had symmetric mild sloping to moderate sensorineural hearing loss. Listeners were asked to repeat back sentences that were presented in the presence of four different two-talker speech maskers. Responses were scored based on the key words within the sentences (100 key words per masker condition). A mixed-model regression analysis was used to analyze the difference in performance scores between the masker conditions and listener groups. Monolingual English speakers with normal hearing benefited when the competing speech signal was foreign accented compared with native accented, allowing for improved speech recognition. Various levels of intelligibility across the foreign-accented speech maskers did not influence results. Neither the nonnative English-speaking listeners with normal hearing nor the monolingual English speakers with hearing loss benefited from masking release when the masker was changed from native-accented to foreign-accented English. Slight modifications between the target and the masker speech allowed monolingual English speakers with normal hearing to improve their recognition of native-accented English, even when the competing speech was highly intelligible. Further research is needed to determine which modifications within the competing speech signal caused the Mandarin-accented English to be less effective with respect to masking. Determining the influences within the competing speech that make it less effective as a masker or determining why monolingual normal-hearing listeners can take advantage of these differences could help improve speech recognition for those with hearing loss in the future. American Academy of Audiology.
Application of speech recognition and synthesis in the general aviation cockpit
NASA Technical Reports Server (NTRS)
North, R. A.; Mountford, S. J.; Bergeron, H.
1984-01-01
Interactive speech recognition/synthesis technology is assessed as a method for the aleviation of single-pilot IFR flight workloads. Attention was given during this series of evaluations to the conditions typical of general aviation twin-engine aircrft cockpits, covering several commonly encountered IFR flight condition scenarios. The most beneficial speech command tasks are noted to be in the data retrieval domain, which would allow the pilot access to uplinked data, checklists, and performance charts. Data entry tasks also appear to benefit from this technology.
Noise-immune multisensor transduction of speech
NASA Astrophysics Data System (ADS)
Viswanathan, Vishu R.; Henry, Claudia M.; Derr, Alan G.; Roucos, Salim; Schwartz, Richard M.
1986-08-01
Two types of configurations of multiple sensors were developed, tested and evaluated in speech recognition application for robust performance in high levels of acoustic background noise: One type combines the individual sensor signals to provide a single speech signal input, and the other provides several parallel inputs. For single-input systems, several configurations of multiple sensors were developed and tested. Results from formal speech intelligibility and quality tests in simulated fighter aircraft cockpit noise show that each of the two-sensor configurations tested outperforms the constituent individual sensors in high noise. Also presented are results comparing the performance of two-sensor configurations and individual sensors in speaker-dependent, isolated-word speech recognition tests performed using a commercial recognizer (Verbex 4000) in simulated fighter aircraft cockpit noise.
NASA Technical Reports Server (NTRS)
Simpson, Carol A.
1990-01-01
The U.S. Army Crew Station Research and Development Facility uses vintage 1984 speech recognizers. An evaluation was performed of newer off-the-shelf speech recognition devices to determine whether newer technology performance and capabilities are substantially better than that of the Army's current speech recognizers. The Phonetic Discrimination (PD-100) Test was used to compare recognizer performance in two ambient noise conditions: quiet office and helicopter noise. Test tokens were spoken by males and females and in isolated-word and connected-work mode. Better overall recognition accuracy was obtained from the newer recognizers. Recognizer capabilities needed to support the development of human factors design requirements for speech command systems in advanced combat helicopters are listed.
"Who" is saying "what"? Brain-based decoding of human voice and speech.
Formisano, Elia; De Martino, Federico; Bonte, Milene; Goebel, Rainer
2008-11-07
Can we decipher speech content ("what" is being said) and speaker identity ("who" is saying it) from observations of brain activity of a listener? Here, we combine functional magnetic resonance imaging with a data-mining algorithm and retrieve what and whom a person is listening to from the neural fingerprints that speech and voice signals elicit in the listener's auditory cortex. These cortical fingerprints are spatially distributed and insensitive to acoustic variations of the input so as to permit the brain-based recognition of learned speech from unknown speakers and of learned voices from previously unheard utterances. Our findings unravel the detailed cortical layout and computational properties of the neural populations at the basis of human speech recognition and speaker identification.
Word recognition materials for native speakers of Taiwan Mandarin.
Nissen, Shawn L; Harris, Richard W; Dukes, Alycia
2008-06-01
To select, digitally record, evaluate, and psychometrically equate word recognition materials that can be used to measure the speech perception abilities of native speakers of Taiwan Mandarin in quiet. Frequently used bisyllabic words produced by male and female talkers of Taiwan Mandarin were digitally recorded and subsequently evaluated using 20 native listeners with normal hearing at 10 intensity levels (-5 to 40 dB HL) in increments of 5 dB. Using logistic regression, 200 words with the steepest psychometric slopes were divided into 4 lists and 8 half-lists that were relatively equivalent in psychometric function slope. To increase auditory homogeneity of the lists, the intensity of words in each list was digitally adjusted so that the threshold of each list was equal to the midpoint between the mean thresholds of the male and female half-lists. Digital recordings of the word recognition lists and the associated clinical instructions are available on CD upon request.
Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation
Banks, Briony; Gowen, Emma; Munro, Kevin J.; Adank, Patti
2015-01-01
Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker’s facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants’ eye gaze was recorded to verify that they looked at the speaker’s face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation. PMID:26283946
Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.
Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti
2015-01-01
Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.
Wolfe, Jace; Morais, Mila; Schafer, Erin
2016-02-01
The goals of the present investigation were (1) to evaluate recognition of recorded speech presented over a mobile telephone for a group of adult bimodal cochlear implant users, and (2) to measure the potential benefits of wireless hearing assistance technology (HAT) for mobile telephone speech recognition using bimodal stimulation (i.e., a cochlear implant in one ear and a hearing aid on the other ear). A three-by-two-way repeated measures design was used to evaluate mobile telephone sentence-recognition performance differences obtained in quiet and in noise with and without the wireless HAT accessory coupled to the hearing aid alone, CI sound processor alone, and in the bimodal condition. Outpatient cochlear implant clinic. Sixteen bimodal users with Nucleus 24, Freedom, CI512, or CI422 cochlear implants participated in this study. Performance was measured with and without the use of a wireless HAT for the telephone used with the hearing aid alone, CI alone, and bimodal condition. CNC word recognition in quiet and in noise with and without the use of a wireless HAT telephone accessory in the hearing aid alone, CI alone, and bimodal conditions. Results suggested that the bimodal condition gave significantly better speech recognition on the mobile telephone with the wireless HAT. A wireless HAT for the mobile telephone provides bimodal users with significant improvement in word recognition in quiet and in noise over the mobile telephone.
NASA Astrophysics Data System (ADS)
Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.
2009-12-01
This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.
Automatic speech recognition in air traffic control
NASA Technical Reports Server (NTRS)
Karlsson, Joakim
1990-01-01
Automatic Speech Recognition (ASR) technology and its application to the Air Traffic Control system are described. The advantages of applying ASR to Air Traffic Control, as well as criteria for choosing a suitable ASR system are presented. Results from previous research and directions for future work at the Flight Transportation Laboratory are outlined.
Transcribe Your Class: Using Speech Recognition to Improve Access for At-Risk Students
ERIC Educational Resources Information Center
Bain, Keith; Lund-Lucas, Eunice; Stevens, Janice
2012-01-01
Through a project supported by Canada's Social Development Partnerships Program, a team of leading National Disability Organizations, universities, and industry partners are piloting a prototype Hosted Transcription Service that uses speech recognition to automatically create multimedia transcripts that can be used by students for study purposes.…
Automatic Speech Recognition: Reliability and Pedagogical Implications for Teaching Pronunciation
ERIC Educational Resources Information Center
Kim, In-Seok
2006-01-01
This study examines the reliability of automatic speech recognition (ASR) software used to teach English pronunciation, focusing on one particular piece of software, "FluSpeak, as a typical example." Thirty-six Korean English as a Foreign Language (EFL) college students participated in an experiment in which they listened to 15 sentences…
Speech Recognition Technology for Disabilities Education
ERIC Educational Resources Information Center
Tang, K. Wendy; Kamoua, Ridha; Sutan, Victor; Farooq, Omer; Eng, Gilbert; Chu, Wei Chern; Hou, Guofeng
2005-01-01
Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace or reduce the reliability on standard keyboard and mouse input. This can especially assist dyslexic students who have problems with character or word use and manipulation in a textual…
Automatic Speech Recognition Technology as an Effective Means for Teaching Pronunciation
ERIC Educational Resources Information Center
Elimat, Amal Khalil; AbuSeileek, Ali Farhan
2014-01-01
This study aimed to explore the effect of using automatic speech recognition technology (ASR) on the third grade EFL students' performance in pronunciation, whether teaching pronunciation through ASR is better than regular instruction, and the most effective teaching technique (individual work, pair work, or group work) in teaching pronunciation…
ERIC Educational Resources Information Center
Murphy, Harry; Higgins, Eleanor
This final report describes the activities and accomplishments of a 3-year study on the compensatory effectiveness of three assistive technologies, optical character recognition, speech synthesis, and speech recognition, on postsecondary students (N=140) with learning disabilities. These technologies were investigated relative to: (1) immediate…
Effects of Cognitive Load on Speech Recognition
ERIC Educational Resources Information Center
Mattys, Sven L.; Wiget, Lukas
2011-01-01
The effect of cognitive load (CL) on speech recognition has received little attention despite the prevalence of CL in everyday life, e.g., dual-tasking. To assess the effect of CL on the interaction between lexically-mediated and acoustically-mediated processes, we measured the magnitude of the "Ganong effect" (i.e., lexical bias on phoneme…
Implicit Processing of Phonotactic Cues: Evidence from Electrophysiological and Vascular Responses
ERIC Educational Resources Information Center
Rossi, Sonja; Jurgenson, Ina B.; Hanulikova, Adriana; Telkemeyer, Silke; Wartenburger, Isabell; Obrig, Hellmuth
2011-01-01
Spoken word recognition is achieved via competition between activated lexical candidates that match the incoming speech input. The competition is modulated by prelexical cues that are important for segmenting the auditory speech stream into linguistic units. One such prelexical cue that listeners rely on in spoken word recognition is phonotactics.…
Spoken Word Recognition of Chinese Words in Continuous Speech
ERIC Educational Resources Information Center
Yip, Michael C. W.
2015-01-01
The present study examined the role of positional probability of syllables played in recognition of spoken word in continuous Cantonese speech. Because some sounds occur more frequently at the beginning position or ending position of Cantonese syllables than the others, so these kinds of probabilistic information of syllables may cue the locations…
Phonotactics Constraints and the Spoken Word Recognition of Chinese Words in Speech
ERIC Educational Resources Information Center
Yip, Michael C.
2016-01-01
Two word-spotting experiments were conducted to examine the question of whether native Cantonese listeners are constrained by phonotactics information in spoken word recognition of Chinese words in speech. Because no legal consonant clusters occurred within an individual Chinese word, this kind of categorical phonotactics information of Chinese…
Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study
ERIC Educational Resources Information Center
van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer
2016-01-01
The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…
Motor Speech Disorders Associated with Primary Progressive Aphasia
Duffy, Joseph R.; Strand, Edythe A.; Josephs, Keith A.
2014-01-01
Background Primary progressive aphasia (PPA) and conditions that overlap with it can be accompanied by motor speech disorders. Recognition and understanding of motor speech disorders can contribute to a fuller clinical understanding of PPA and its management as well as its localization and underlying pathology. Aims To review the types of motor speech disorders that may occur with PPA, its primary variants, and its overlap syndromes (progressive supranuclear palsy syndrome, corticobasal syndrome, motor neuron disease), as well as with primary progressive apraxia of speech. Main Contribution The review should assist clinicians' and researchers' understanding of the relationship between motor speech disorders and PPA and its major variants. It also highlights the importance of recognizing neurodegenerative apraxia of speech as a condition that can occur with little or no evidence of aphasia. Conclusion Motor speech disorders can occur with PPA. Their recognition can contribute to clinical diagnosis and management of PPA and to understanding and predicting the localization and pathology associated with PPA variants and conditions that can overlap with them. PMID:25309017
Quadcopter Control Using Speech Recognition
NASA Astrophysics Data System (ADS)
Malik, H.; Darma, S.; Soekirno, S.
2018-04-01
This research reported a comparison from a success rate of speech recognition systems that used two types of databases they were existing databases and new databases, that were implemented into quadcopter as motion control. Speech recognition system was using Mel frequency cepstral coefficient method (MFCC) as feature extraction that was trained using recursive neural network method (RNN). MFCC method was one of the feature extraction methods that most used for speech recognition. This method has a success rate of 80% - 95%. Existing database was used to measure the success rate of RNN method. The new database was created using Indonesian language and then the success rate was compared with results from an existing database. Sound input from the microphone was processed on a DSP module with MFCC method to get the characteristic values. Then, the characteristic values were trained using the RNN which result was a command. The command became a control input to the single board computer (SBC) which result was the movement of the quadcopter. On SBC, we used robot operating system (ROS) as the kernel (Operating System).
Participation of the Classical Speech Areas in Auditory Long-Term Memory
Karabanov, Anke Ninija; Paine, Rainer; Chao, Chi Chao; Schulze, Katrin; Scott, Brian; Hallett, Mark; Mishkin, Mortimer
2015-01-01
Accumulating evidence suggests that storing speech sounds requires transposing rapidly fluctuating sound waves into more easily encoded oromotor sequences. If so, then the classical speech areas in the caudalmost portion of the temporal gyrus (pSTG) and in the inferior frontal gyrus (IFG) may be critical for performing this acoustic-oromotor transposition. We tested this proposal by applying repetitive transcranial magnetic stimulation (rTMS) to each of these left-hemisphere loci, as well as to a nonspeech locus, while participants listened to pseudowords. After 5 minutes these stimuli were re-presented together with new ones in a recognition test. Compared to control-site stimulation, pSTG stimulation produced a highly significant increase in recognition error rate, without affecting reaction time. By contrast, IFG stimulation led only to a weak, non-significant, trend toward recognition memory impairment. Importantly, the impairment after pSTG stimulation was not due to interference with perception, since the same stimulation failed to affect pseudoword discrimination examined with short interstimulus intervals. Our findings suggest that pSTG is essential for transforming speech sounds into stored motor plans for reproducing the sound. Whether or not the IFG also plays a role in speech-sound recognition could not be determined from the present results. PMID:25815813
Participation of the classical speech areas in auditory long-term memory.
Karabanov, Anke Ninija; Paine, Rainer; Chao, Chi Chao; Schulze, Katrin; Scott, Brian; Hallett, Mark; Mishkin, Mortimer
2015-01-01
Accumulating evidence suggests that storing speech sounds requires transposing rapidly fluctuating sound waves into more easily encoded oromotor sequences. If so, then the classical speech areas in the caudalmost portion of the temporal gyrus (pSTG) and in the inferior frontal gyrus (IFG) may be critical for performing this acoustic-oromotor transposition. We tested this proposal by applying repetitive transcranial magnetic stimulation (rTMS) to each of these left-hemisphere loci, as well as to a nonspeech locus, while participants listened to pseudowords. After 5 minutes these stimuli were re-presented together with new ones in a recognition test. Compared to control-site stimulation, pSTG stimulation produced a highly significant increase in recognition error rate, without affecting reaction time. By contrast, IFG stimulation led only to a weak, non-significant, trend toward recognition memory impairment. Importantly, the impairment after pSTG stimulation was not due to interference with perception, since the same stimulation failed to affect pseudoword discrimination examined with short interstimulus intervals. Our findings suggest that pSTG is essential for transforming speech sounds into stored motor plans for reproducing the sound. Whether or not the IFG also plays a role in speech-sound recognition could not be determined from the present results.
Syntactic error modeling and scoring normalization in speech recognition
NASA Technical Reports Server (NTRS)
Olorenshaw, Lex
1991-01-01
The objective was to develop the speech recognition system to be able to detect speech which is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Research was performed in the following areas: (1) syntactic error modeling; (2) score normalization; and (3) phoneme error modeling. The study into the types of errors that a reader makes will provide the basis for creating tests which will approximate the use of the system in the real world. NASA-Johnson will develop this technology into a 'Literacy Tutor' in order to bring innovative concepts to the task of teaching adults to read.
Tone classification of syllable-segmented Thai speech based on multilayer perception
NASA Astrophysics Data System (ADS)
Satravaha, Nuttavudh; Klinkhachorn, Powsiri; Lass, Norman
2002-05-01
Thai is a monosyllabic tonal language that uses tone to convey lexical information about the meaning of a syllable. Thus to completely recognize a spoken Thai syllable, a speech recognition system not only has to recognize a base syllable but also must correctly identify a tone. Hence, tone classification of Thai speech is an essential part of a Thai speech recognition system. Thai has five distinctive tones (``mid,'' ``low,'' ``falling,'' ``high,'' and ``rising'') and each tone is represented by a single fundamental frequency (F0) pattern. However, several factors, including tonal coarticulation, stress, intonation, and speaker variability, affect the F0 pattern of a syllable in continuous Thai speech. In this study, an efficient method for tone classification of syllable-segmented Thai speech, which incorporates the effects of tonal coarticulation, stress, and intonation, as well as a method to perform automatic syllable segmentation, were developed. Acoustic parameters were used as the main discriminating parameters. The F0 contour of a segmented syllable was normalized by using a z-score transformation before being presented to a tone classifier. The proposed system was evaluated on 920 test utterances spoken by 8 speakers. A recognition rate of 91.36% was achieved by the proposed system.
NASA Astrophysics Data System (ADS)
Kaddoura, Tarek; Vadlamudi, Karunakar; Kumar, Shine; Bobhate, Prashant; Guo, Long; Jain, Shreepal; Elgendi, Mohamed; Coe, James Y.; Kim, Daniel; Taylor, Dylan; Tymchak, Wayne; Schuurmans, Dale; Zemp, Roger J.; Adatia, Ian
2016-09-01
We hypothesized that an automated speech- recognition-inspired classification algorithm could differentiate between the heart sounds in subjects with and without pulmonary hypertension (PH) and outperform physicians. Heart sounds, electrocardiograms, and mean pulmonary artery pressures (mPAp) were recorded simultaneously. Heart sound recordings were digitized to train and test speech-recognition-inspired classification algorithms. We used mel-frequency cepstral coefficients to extract features from the heart sounds. Gaussian-mixture models classified the features as PH (mPAp ≥ 25 mmHg) or normal (mPAp < 25 mmHg). Physicians blinded to patient data listened to the same heart sound recordings and attempted a diagnosis. We studied 164 subjects: 86 with mPAp ≥ 25 mmHg (mPAp 41 ± 12 mmHg) and 78 with mPAp < 25 mmHg (mPAp 17 ± 5 mmHg) (p < 0.005). The correct diagnostic rate of the automated speech-recognition-inspired algorithm was 74% compared to 56% by physicians (p = 0.005). The false positive rate for the algorithm was 34% versus 50% (p = 0.04) for clinicians. The false negative rate for the algorithm was 23% and 68% (p = 0.0002) for physicians. We developed an automated speech-recognition-inspired classification algorithm for the acoustic diagnosis of PH that outperforms physicians that could be used to screen for PH and encourage earlier specialist referral.
NASA Astrophysics Data System (ADS)
Palaniswamy, Sumithra; Duraisamy, Prakash; Alam, Mohammad Showkat; Yuan, Xiaohui
2012-04-01
Automatic speech processing systems are widely used in everyday life such as mobile communication, speech and speaker recognition, and for assisting the hearing impaired. In speech communication systems, the quality and intelligibility of speech is of utmost importance for ease and accuracy of information exchange. To obtain an intelligible speech signal and one that is more pleasant to listen, noise reduction is essential. In this paper a new Time Adaptive Discrete Bionic Wavelet Thresholding (TADBWT) scheme is proposed. The proposed technique uses Daubechies mother wavelet to achieve better enhancement of speech from additive non- stationary noises which occur in real life such as street noise and factory noise. Due to the integration of human auditory system model into the wavelet transform, bionic wavelet transform (BWT) has great potential for speech enhancement which may lead to a new path in speech processing. In the proposed technique, at first, discrete BWT is applied to noisy speech to derive TADBWT coefficients. Then the adaptive nature of the BWT is captured by introducing a time varying linear factor which updates the coefficients at each scale over time. This approach has shown better performance than the existing algorithms at lower input SNR due to modified soft level dependent thresholding on time adaptive coefficients. The objective and subjective test results confirmed the competency of the TADBWT technique. The effectiveness of the proposed technique is also evaluated for speaker recognition task under noisy environment. The recognition results show that the TADWT technique yields better performance when compared to alternate methods specifically at lower input SNR.
Neher, Tobias
2014-01-01
Knowledge of how executive functions relate to preferred hearing aid (HA) processing is sparse and seemingly inconsistent with related knowledge for speech recognition outcomes. This study thus aimed to find out if (1) performance on a measure of reading span (RS) is related to preferred binaural noise reduction (NR) strength, (2) similar relations exist for two different, non-verbal measures of executive function, (3) pure-tone average hearing loss (PTA), signal-to-noise ratio (SNR), and microphone directionality (DIR) also influence preferred NR strength, and (4) preference and speech recognition outcomes are similar. Sixty elderly HA users took part. Six HA conditions consisting of omnidirectional or cardioid microphones followed by inactive, moderate, or strong binaural NR as well as linear amplification were tested. Outcome was assessed at fixed SNRs using headphone simulations of a frontal target talker in a busy cafeteria. Analyses showed positive effects of active NR and DIR on preference, and negative and positive effects of, respectively, strong NR and DIR on speech recognition. Also, while moderate NR was the most preferred NR setting overall, preference for strong NR increased with SNR. No relation between RS and preference was found. However, larger PTA was related to weaker preference for inactive NR and stronger preference for strong NR for both microphone modes. Equivalent (but weaker) relations between worse performance on one non-verbal measure of executive function and the HA conditions without DIR were found. For speech recognition, there were relations between HA condition, PTA, and RS, but their pattern differed from that for preference. Altogether, these results indicate that, while moderate NR works well in general, a notable proportion of HA users prefer stronger NR. Furthermore, PTA and executive functions can account for some of the variability in preference for, and speech recognition with, different binaural NR and DIR settings. PMID:25538547
Learning to perceptually organize speech signals in native fashion.
Nittrouer, Susan; Lowenstein, Joanna H
2010-03-01
The ability to recognize speech involves sensory, perceptual, and cognitive processes. For much of the history of speech perception research, investigators have focused on the first and third of these, asking how much and what kinds of sensory information are used by normal and impaired listeners, as well as how effective amounts of that information are altered by "top-down" cognitive processes. This experiment focused on perceptual processes, asking what accounts for how the sensory information in the speech signal gets organized. Two types of speech signals processed to remove properties that could be considered traditional acoustic cues (amplitude envelopes and sine wave replicas) were presented to 100 listeners in five groups: native English-speaking (L1) adults, 7-, 5-, and 3-year-olds, and native Mandarin-speaking adults who were excellent second-language (L2) users of English. The L2 adults performed more poorly than L1 adults with both kinds of signals. Children performed more poorly than L1 adults but showed disproportionately better performance for the sine waves than for the amplitude envelopes compared to both groups of adults. Sentence context had similar effects across groups, so variability in recognition was attributed to differences in perceptual organization of the sensory information, presumed to arise from native language experience.
Lessons Learned in Part-of-Speech Tagging of Conversational Speech
2010-10-01
for conversational speech recognition. In Plenary Meeting and Symposium on Prosody and Speech Processing. Slav Petrov and Dan Klein. 2007. Improved...inference for unlexicalized parsing. In HLT-NAACL. Slav Petrov. 2010. Products of random latent variable grammars. In HLT-NAACL. Brian Roark, Yang Liu
Potts, Lisa G.; Skinner, Margaret W.; Litovsky, Ruth A.; Strube, Michael J; Kuk, Francis
2010-01-01
Background The use of bilateral amplification is now common clinical practice for hearing aid users but not for cochlear implant recipients. In the past, most cochlear implant recipients were implanted in one ear and wore only a monaural cochlear implant processor. There has been recent interest in benefits arising from bilateral stimulation that may be present for cochlear implant recipients. One option for bilateral stimulation is the use of a cochlear implant in one ear and a hearing aid in the opposite nonimplanted ear (bimodal hearing). Purpose This study evaluated the effect of wearing a cochlear implant in one ear and a digital hearing aid in the opposite ear on speech recognition and localization. Research Design A repeated-measures correlational study was completed. Study Sample Nineteen adult Cochlear Nucleus 24 implant recipients participated in the study. Intervention The participants were fit with a Widex Senso Vita 38 hearing aid to achieve maximum audibility and comfort within their dynamic range. Data Collection and Analysis Soundfield thresholds, loudness growth, speech recognition, localization, and subjective questionnaires were obtained six–eight weeks after the hearing aid fitting. Testing was completed in three conditions: hearing aid only, cochlear implant only, and cochlear implant and hearing aid (bimodal). All tests were repeated four weeks after the first test session. Repeated-measures analysis of variance was used to analyze the data. Significant effects were further examined using pairwise comparison of means or in the case of continuous moderators, regression analyses. The speech-recognition and localization tasks were unique, in that a speech stimulus presented from a variety of roaming azimuths (140 degree loudspeaker array) was used. Results Performance in the bimodal condition was significantly better for speech recognition and localization compared to the cochlear implant–only and hearing aid–only conditions. Performance was also different between these conditions when the location (i.e., side of the loudspeaker array that presented the word) was analyzed. In the bimodal condition, the speech-recognition and localization tasks were equal regardless of which side of the loudspeaker array presented the word, while performance was significantly poorer for the monaural conditions (hearing aid only and cochlear implant only) when the words were presented on the side with no stimulation. Binaural loudness summation of 1–3 dB was seen in soundfield thresholds and loudness growth in the bimodal condition. Measures of the audibility of sound with the hearing aid, including unaided thresholds, soundfield thresholds, and the Speech Intelligibility Index, were significant moderators of speech recognition and localization. Based on the questionnaire responses, participants showed a strong preference for bimodal stimulation. Conclusions These findings suggest that a well-fit digital hearing aid worn in conjunction with a cochlear implant is beneficial to speech recognition and localization. The dynamic test procedures used in this study illustrate the importance of bilateral hearing for locating, identifying, and switching attention between multiple speakers. It is recommended that unilateral cochlear implant recipients, with measurable unaided hearing thresholds, be fit with a hearing aid. PMID:19594084
Cleft Audit Protocol for Speech (CAPS-A): A Comprehensive Training Package for Speech Analysis
ERIC Educational Resources Information Center
Sell, D.; John, A.; Harding-Bell, A.; Sweeney, T.; Hegarty, F.; Freeman, J.
2009-01-01
Background: The previous literature has largely focused on speech analysis systems and ignored process issues, such as the nature of adequate speech samples, data acquisition, recording and playback. Although there has been recognition of the need for training on tools used in speech analysis associated with cleft palate, little attention has been…
ERIC Educational Resources Information Center
US Department of Education, 2010
2010-01-01
The American Speech-Language-Hearing Association, Council on Academic Accreditation in Audiology and Speech-Language Pathology (CAA) is a national accrediting agency of graduate education programs in audiology or speech-language pathology. The CAA currently accredits or or preaccredits 319 programs (247 in speech-language pathology and 72 in…
Rajan, R; Cainer, K E
2008-06-23
In most everyday settings, speech is heard in the presence of competing sounds and understanding speech requires skills in auditory streaming and segregation, followed by identification and recognition, of the attended signals. Ageing leads to difficulties in understanding speech in noisy backgrounds. In addition to age-related changes in hearing-related factors, cognitive factors also play a role but it is unclear to what extent these are generalized or modality-specific cognitive factors. We examined how ageing in normal-hearing decade age cohorts from 20 to 69 years affected discrimination of open-set speech in background noise. We used two types of sentences of similar structural and linguistic characteristics but different masking levels (i.e. differences in signal-to-noise ratios required for detection of sentences in a standard masker) so as to vary sentence demand, and two background maskers (one causing purely energetic masking effects and the other causing energetic and informational masking) to vary load conditions. There was a decline in performance (measured as speech reception thresholds for perception of sentences in noise) in the oldest cohort for both types of sentences, but only in the presence of the more demanding informational masker. We interpret these results to indicate a modality-specific decline in cognitive processing, likely a decrease in the ability to use acoustic and phonetic cues efficiently to segregate speech from background noise, in subjects aged >60.
Barker, Lynne Ann; Morton, Nicholas; Romanowski, Charles A J; Gosden, Kevin
2013-10-24
We report a rare case of a patient unable to read (alexic) and write (agraphic) after a mild head injury. He had preserved speech and comprehension, could spell aloud, identify words spelt aloud and copy letter features. He was unable to visualise letters but showed no problems with digits. Neuropsychological testing revealed general visual memory, processing speed and imaging deficits. Imaging data revealed an 8 mm colloid cyst of the third ventricle that splayed the fornix. Little is known about functions mediated by fornical connectivity, but this region is thought to contribute to memory recall. Other regions thought to mediate letter recognition and letter imagery, visual word form area and visual pathways were intact. We remediated reading and writing by multimodal letter retraining. The study raises issues about the neural substrates of reading, role of fornical tracts to selective memory in the absence of other pathology, and effective remediation strategies for selective functional deficits.
Stewart, Erin K.; Wu, Yu-Hsiang; Bishop, Christopher; Bentler, Ruth A.; Tremblay, Kelly
2017-01-01
Purpose This study evaluated the relationship between working memory (WM) and speech recognition in noise with different noise types as well as in the presence of visual cues. Method Seventy-six adults with bilateral, mild to moderately severe sensorineural hearing loss (mean age: 69 years) participated. Using a cross-sectional design, 2 measures of WM were taken: a reading span measure, and Word Auditory Recognition and Recall Measure (Smith, Pichora-Fuller, & Alexander, 2016). Speech recognition was measured with the Multi-Modal Lexical Sentence Test for Adults (Kirk et al., 2012) in steady-state noise and 4-talker babble, with and without visual cues. Testing was under unaided conditions. Results A linear mixed model revealed visual cues and pure-tone average as the only significant predictors of Multi-Modal Lexical Sentence Test outcomes. Neither WM measure nor noise type showed a significant effect. Conclusion The contribution of WM in explaining unaided speech recognition in noise was negligible and not influenced by noise type or visual cues. We anticipate that with audibility partially restored by hearing aids, the effects of WM will increase. For clinical practice to be affected, more significant effect sizes are needed. PMID:28744550
Influence of auditory attention on sentence recognition captured by the neural phase.
Müller, Jana Annina; Kollmeier, Birger; Debener, Stefan; Brand, Thomas
2018-03-07
The aim of this study was to investigate whether attentional influences on speech recognition are reflected in the neural phase entrained by an external modulator. Sentences were presented in 7 Hz sinusoidally modulated noise while the neural response to that modulation frequency was monitored by electroencephalogram (EEG) recordings in 21 participants. We implemented a selective attention paradigm including three different attention conditions while keeping physical stimulus parameters constant. The participants' task was either to repeat the sentence as accurately as possible (speech recognition task), to count the number of decrements implemented in modulated noise (decrement detection task), or to do both (dual task), while the EEG was recorded. Behavioural analysis revealed reduced performance in the dual task condition for decrement detection, possibly reflecting limited cognitive resources. EEG analysis revealed no significant differences in power for the 7 Hz modulation frequency, but an attention-dependent phase difference between tasks. Further phase analysis revealed a significant difference 500 ms after sentence onset between trials with correct and incorrect responses for speech recognition, indicating that speech recognition performance and the neural phase are linked via selective attention mechanisms, at least shortly after sentence onset. However, the neural phase effects identified were small and await further investigation. © 2018 Federation of European Neuroscience Societies and John Wiley & Sons Ltd.
Hertrich, Ingo; Dietrich, Susanne; Ackermann, Hermann
2013-01-01
In blind people, the visual channel cannot assist face-to-face communication via lipreading or visual prosody. Nevertheless, the visual system may enhance the evaluation of auditory information due to its cross-links to (1) the auditory system, (2) supramodal representations, and (3) frontal action-related areas. Apart from feedback or top-down support of, for example, the processing of spatial or phonological representations, experimental data have shown that the visual system can impact auditory perception at more basic computational stages such as temporal signal resolution. For example, blind as compared to sighted subjects are more resistant against backward masking, and this ability appears to be associated with activity in visual cortex. Regarding the comprehension of continuous speech, blind subjects can learn to use accelerated text-to-speech systems for "reading" texts at ultra-fast speaking rates (>16 syllables/s), exceeding by far the normal range of 6 syllables/s. A functional magnetic resonance imaging study has shown that this ability, among other brain regions, significantly covaries with BOLD responses in bilateral pulvinar, right visual cortex, and left supplementary motor area. Furthermore, magnetoencephalographic measurements revealed a particular component in right occipital cortex phase-locked to the syllable onsets of accelerated speech. In sighted people, the "bottleneck" for understanding time-compressed speech seems related to higher demands for buffering phonological material and is, presumably, linked to frontal brain structures. On the other hand, the neurophysiological correlates of functions overcoming this bottleneck, seem to depend upon early visual cortex activity. The present Hypothesis and Theory paper outlines a model that aims at binding these data together, based on early cross-modal pathways that are already known from various audiovisual experiments on cross-modal adjustments during space, time, and object recognition.
Hertrich, Ingo; Dietrich, Susanne; Ackermann, Hermann
2013-01-01
In blind people, the visual channel cannot assist face-to-face communication via lipreading or visual prosody. Nevertheless, the visual system may enhance the evaluation of auditory information due to its cross-links to (1) the auditory system, (2) supramodal representations, and (3) frontal action-related areas. Apart from feedback or top-down support of, for example, the processing of spatial or phonological representations, experimental data have shown that the visual system can impact auditory perception at more basic computational stages such as temporal signal resolution. For example, blind as compared to sighted subjects are more resistant against backward masking, and this ability appears to be associated with activity in visual cortex. Regarding the comprehension of continuous speech, blind subjects can learn to use accelerated text-to-speech systems for “reading” texts at ultra-fast speaking rates (>16 syllables/s), exceeding by far the normal range of 6 syllables/s. A functional magnetic resonance imaging study has shown that this ability, among other brain regions, significantly covaries with BOLD responses in bilateral pulvinar, right visual cortex, and left supplementary motor area. Furthermore, magnetoencephalographic measurements revealed a particular component in right occipital cortex phase-locked to the syllable onsets of accelerated speech. In sighted people, the “bottleneck” for understanding time-compressed speech seems related to higher demands for buffering phonological material and is, presumably, linked to frontal brain structures. On the other hand, the neurophysiological correlates of functions overcoming this bottleneck, seem to depend upon early visual cortex activity. The present Hypothesis and Theory paper outlines a model that aims at binding these data together, based on early cross-modal pathways that are already known from various audiovisual experiments on cross-modal adjustments during space, time, and object recognition. PMID:23966968
Lee, Jong-Seok; Park, Cheol Hoon
2010-08-01
We propose a novel stochastic optimization algorithm, hybrid simulated annealing (SA), to train hidden Markov models (HMMs) for visual speech recognition. In our algorithm, SA is combined with a local optimization operator that substitutes a better solution for the current one to improve the convergence speed and the quality of solutions. We mathematically prove that the sequence of the objective values converges in probability to the global optimum in the algorithm. The algorithm is applied to train HMMs that are used as visual speech recognizers. While the popular training method of HMMs, the expectation-maximization algorithm, achieves only local optima in the parameter space, the proposed method can perform global optimization of the parameters of HMMs and thereby obtain solutions yielding improved recognition performance. The superiority of the proposed algorithm to the conventional ones is demonstrated via isolated word recognition experiments.
Speech-Enabled Interfaces for Travel Information Systems with Large Grammars
NASA Astrophysics Data System (ADS)
Zhao, Baoli; Allen, Tony; Bargiela, Andrzej
This paper introduces three grammar-segmentation methods capable of handling the large grammar issues associated with producing a real-time speech-enabled VXML bus travel application for London. Large grammars tend to produce relatively slow recognition interfaces and this work shows how this limitation can be successfully addressed. Comparative experimental results show that the novel last-word recognition based grammar segmentation method described here achieves an optimal balance between recognition rate, speed of processing and naturalness of interaction.
2015-05-28
recognition is simpler and requires less computational resources compared to other inputs such as facial expressions . The Berlin database of Emotional ...Processing Magazine, IEEE, vol. 18, no. 1, pp. 32– 80, 2001. [15] K. R. Scherer, T. Johnstone, and G. Klasmeyer, “Vocal expression of emotion ...Network for Real-Time Speech- Emotion Recognition 5a. CONTRACT NUMBER IN-HOUSE 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 62788F 6. AUTHOR(S) Q
Inferring Speaker Affect in Spoken Natural Language Communication
ERIC Educational Resources Information Center
Pon-Barry, Heather Roberta
2013-01-01
The field of spoken language processing is concerned with creating computer programs that can understand human speech and produce human-like speech. Regarding the problem of understanding human speech, there is currently growing interest in moving beyond speech recognition (the task of transcribing the words in an audio stream) and towards…
Automated Assessment of Speech Fluency for L2 English Learners
ERIC Educational Resources Information Center
Yoon, Su-Youn
2009-01-01
This dissertation provides an automated scoring method of speech fluency for second language learners of English (L2 learners) based that uses speech recognition technology. Non-standard pronunciation, frequent disfluencies, faulty grammar, and inappropriate lexical choices are crucial characteristics of L2 learners' speech. Due to the ease of…
Use of Computer Speech Technologies To Enhance Learning.
ERIC Educational Resources Information Center
Ferrell, Joe
1999-01-01
Discusses the design of an innovative learning system that uses new technologies for the man-machine interface, incorporating a combination of Automatic Speech Recognition (ASR) and Text To Speech (TTS) synthesis. Highlights include using speech technologies to mimic the attributes of the ideal tutor and design features. (AEF)
Enhancing Speech Intelligibility: Interactions among Context, Modality, Speech Style, and Masker
ERIC Educational Resources Information Center
Van Engen, Kristin J.; Phelps, Jasmine E. B.; Smiljanic, Rajka; Chandrasekaran, Bharath
2014-01-01
Purpose: The authors sought to investigate interactions among intelligibility-enhancing speech cues (i.e., semantic context, clearly produced speech, and visual information) across a range of masking conditions. Method: Sentence recognition in noise was assessed for 29 normal-hearing listeners. Testing included semantically normal and anomalous…
The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting
ERIC Educational Resources Information Center
Liaw, Meei-Ling
2014-01-01
This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…
ERIC Educational Resources Information Center
Sidgi, Lina Fathi Sidig; Shaari, Ahmad Jelani
2017-01-01
The use of technology, such as computer-assisted language learning (CALL), is used in teaching and learning in the foreign language classrooms where it is most needed. One promising emerging technology that supports language learning is automatic speech recognition (ASR). Integrating such technology, especially in the instruction of pronunciation…
Machine Learning Through Signature Trees. Applications to Human Speech.
ERIC Educational Resources Information Center
White, George M.
A signature tree is a binary decision tree used to classify unknown patterns. An attempt was made to develop a computer program for manipulating signature trees as a general research tool for exploring machine learning and pattern recognition. The program was applied to the problem of speech recognition to test its effectiveness for a specific…
ERIC Educational Resources Information Center
Baker, Elizabeth A.
2017-01-01
Informed by sociocultural and systems theory tenets, this study used ethnographic research methods to examine the feasibility of using speech recognition (SR) technology to support struggling readers in an early elementary classroom setting. Observations of eight first graders were conducted as they participated in a structured SR-supported…
User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning
ERIC Educational Resources Information Center
Ahn, Tae youn; Lee, Sangmin-Michelle
2016-01-01
With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…
[The role of temporal fine structure in tone recognition and music perception].
Zhou, Q; Gu, X; Liu, B
2017-11-07
The sound signal can be decomposed into temporal envelope and temporal fine structure information. The temporal envelope information is crucial for speech perception in quiet environment, and the temporal fine structure information plays an important role in speech perception in noise, Mandarin tone recognition and music perception, especially the pitch and melody perception.
ERIC Educational Resources Information Center
Franco, Horacio; Bratt, Harry; Rossier, Romain; Rao Gadde, Venkata; Shriberg, Elizabeth; Abrash, Victor; Precoda, Kristin
2010-01-01
SRI International's EduSpeak[R] system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to…
NASA Technical Reports Server (NTRS)
Arthur, Jarvis J., III; Shelton, Kevin J.; Prinzel, Lawrence J., III; Bailey, Randall E.
2016-01-01
During the flight trials known as Gulfstream-V Synthetic Vision Systems Integrated Technology Evaluation (GV-SITE), a Speech Recognition System (SRS) was used by the evaluation pilots. The SRS system was intended to be an intuitive interface for display control (rather than knobs, buttons, etc.). This paper describes the performance of the current "state of the art" Speech Recognition System (SRS). The commercially available technology was evaluated as an application for possible inclusion in commercial aircraft flight decks as a crew-to-vehicle interface. Specifically, the technology is to be used as an interface from aircrew to the onboard displays, controls, and flight management tasks. A flight test of a SRS as well as a laboratory test was conducted.
The effect of hearing aid technologies on listening in an automobile
Wu, Yu-Hsiang; Stangl, Elizabeth; Bentler, Ruth A.; Stanziola, Rachel W.
2014-01-01
Background Communication while traveling in an automobile often is very difficult for hearing aid users. This is because the automobile /road noise level is usually high, and listeners/drivers often do not have access to visual cues. Since the talker of interest usually is not located in front of the driver/listener, conventional directional processing that places the directivity beam toward the listener’s front may not be helpful, and in fact, could have a negative impact on speech recognition (when compared to omnidirectional processing). Recently, technologies have become available in commercial hearing aids that are designed to improve speech recognition and/or listening effort in noisy conditions where talkers are located behind or beside the listener. These technologies include (1) a directional microphone system that uses a backward-facing directivity pattern (Back-DIR processing), (2) a technology that transmits audio signals from the ear with the better signal-to-noise ratio (SNR) to the ear with the poorer SNR (Side-Transmission processing), and (3) a signal processing scheme that suppresses the noise at the ear with the poorer SNR (Side-Suppression processing). Purpose The purpose of the current study was to determine the effect of (1) conventional directional microphones and (2) newer signal processing schemes (Back-DIR, Side-Transmission, and Side-Suppression) on listener’s speech recognition performance and preference for communication in a traveling automobile. Research design A single-blinded, repeated-measures design was used. Study Sample Twenty-five adults with bilateral symmetrical sensorineural hearing loss aged 44 through 84 years participated in the study. Data Collection and Analysis The automobile/road noise and sentences of the Connected Speech Test (CST) were recorded through hearing aids in a standard van moving at a speed of 70 miles/hour on a paved highway. The hearing aids were programmed to omnidirectional microphone, conventional adaptive directional microphone, and the three newer schemes. CST sentences were presented from the side and back of the hearing aids, which were placed on the ears of a manikin. The recorded stimuli were presented to listeners via earphones in a sound treated booth to assess speech recognition performance and preference with each programmed condition. Results Compared to omnidirectional microphones, conventional adaptive directional processing had a detrimental effect on speech recognition when speech was presented from the back or side of the listener. Back-DIR and Side-Transmission processing improved speech recognition performance (relative to both omnidirectional and adaptive directional processing) when speech was from the back and side, respectively. The performance with Side-Suppression processing was better than with adaptive directional processing when speech was from the side. The participants’ preferences for a given processing scheme were generally consistent with speech recognition results. Conclusions The finding that performance with adaptive directional processing was poorer than with omnidirectional microphones demonstrates the importance of selecting the correct microphone technology for different listening situations. The results also suggest the feasibility of using hearing aid technologies to provide a better listening experience for hearing aid users in automobiles. PMID:23886425
Evaluating a smartphone digits-in-noise test as part of the audiometric test battery.
Potgieter, Jenni-Mari; Swanepoel, De Wet; Smits, Cas
2018-05-21
Speech-in-noise tests have become a valuable part of the audiometric test battery providing an indication of a listener's ability to function in background noise. A simple digits-in-noise (DIN) test could be valuable to support diagnostic hearing assessments, hearing aid fittings and counselling for both paediatric and adult populations. Objective: The objective of this study was to evaluate the South African English smartphone DIN test's performance as part of the audiometric test battery. Design: This descriptive study evaluated 109 adult subjects (43 male and 66 female subjects) with and without sensorineural hearing loss by comparing pure-tone air conduction thresholds, speech recognition monaural performance scores (SRS dB) and the DIN speech reception threshold (SRT). An additional nine adult hearing aid users (four male and five female subjects) were included in a subset to determine aided and unaided DIN SRTs. Results: The DIN SRT is strongly associated with the best ear 4 frequency pure-tone average (4FPTA) (rs = 0.81) and maximum SRS dB (r = 0.72). The DIN test had high sensitivity and specificity to identify abnormal pure-tone (0.88 and 0.88, respectively) and SRS dB (0.76 and 0.88, respectively) results. There was a mean signal-to-noise ratio (SNR) improvement in the aided condition that demonstrated an overall benefit of 0.84 SNR dB. Conclusion: The DIN SRT was significantly correlated with the best ear 4FPTA and maximum SRS dB. The DIN SRT provides a useful measure of speech recognition in noise that can evaluate hearing aid fittings, manage counselling and hearing expectations.
Automatic speech recognition using a predictive echo state network classifier.
Skowronski, Mark D; Harris, John G
2007-04-01
We have combined an echo state network (ESN) with a competitive state machine framework to create a classification engine called the predictive ESN classifier. We derive the expressions for training the predictive ESN classifier and show that the model was significantly more noise robust compared to a hidden Markov model in noisy speech classification experiments by 8+/-1 dB signal-to-noise ratio. The simple training algorithm and noise robustness of the predictive ESN classifier make it an attractive classification engine for automatic speech recognition.
A Mis-recognized Medical Vocabulary Correction System for Speech-based Electronic Medical Record
Seo, Hwa Jeong; Kim, Ju Han; Sakabe, Nagamasa
2002-01-01
Speech recognition as an input tool for electronic medical record (EMR) enables efficient data entry at the point of care. However, the recognition accuracy for medical vocabulary is much poorer than that for doctor-patient dialogue. We developed a mis-recognized medical vocabulary correction system based on syllable-by-syllable comparison of speech text against medical vocabulary database. Using specialty medical vocabulary, the algorithm detects and corrects mis-recognized medical vocabularies in narrative text. Our preliminary evaluation showed 94% of accuracy in mis-recognized medical vocabulary correction.
Robust Speaker Authentication Based on Combined Speech and Voiceprint Recognition
NASA Astrophysics Data System (ADS)
Malcangi, Mario
2009-08-01
Personal authentication is becoming increasingly important in many applications that have to protect proprietary data. Passwords and personal identification numbers (PINs) prove not to be robust enough to ensure that unauthorized people do not use them. Biometric authentication technology may offer a secure, convenient, accurate solution but sometimes fails due to its intrinsically fuzzy nature. This research aims to demonstrate that combining two basic speech processing methods, voiceprint identification and speech recognition, can provide a very high degree of robustness, especially if fuzzy decision logic is used.
NASA Astrophysics Data System (ADS)
Kattoju, Ravi Kiran; Barber, Daniel J.; Abich, Julian; Harris, Jonathan
2016-05-01
With increasing necessity for intuitive Soldier-robot communication in military operations and advancements in interactive technologies, autonomous robots have transitioned from assistance tools to functional and operational teammates able to service an array of military operations. Despite improvements in gesture and speech recognition technologies, their effectiveness in supporting Soldier-robot communication is still uncertain. The purpose of the present study was to evaluate the performance of gesture and speech interface technologies to facilitate Soldier-robot communication during a spatial-navigation task with an autonomous robot. Gesture and speech semantically based spatial-navigation commands leveraged existing lexicons for visual and verbal communication from the U.S Army field manual for visual signaling and a previously established Squad Level Vocabulary (SLV). Speech commands were recorded by a Lapel microphone and Microsoft Kinect, and classified by commercial off-the-shelf automatic speech recognition (ASR) software. Visual signals were captured and classified using a custom wireless gesture glove and software. Participants in the experiment commanded a robot to complete a simulated ISR mission in a scaled down urban scenario by delivering a sequence of gesture and speech commands, both individually and simultaneously, to the robot. Performance and reliability of gesture and speech hardware interfaces and recognition tools were analyzed and reported. Analysis of experimental results demonstrated the employed gesture technology has significant potential for enabling bidirectional Soldier-robot team dialogue based on the high classification accuracy and minimal training required to perform gesture commands.
Hidden Markov models in automatic speech recognition
NASA Astrophysics Data System (ADS)
Wrzoskowicz, Adam
1993-11-01
This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMM's designed to operate in a noisy environment. The author also describes a language for human-robot communications which was defined as a complex multilevel network from an HMM model of speech sounds geared towards Polish inflections. The author also added mandatory lexical and syntactic rules to the system for its communications vocabulary.
Calandruccio, Lauren; Bradlow, Ann R.; Dhar, Sumitrajit
2013-01-01
Background Masking release for an English sentence-recognition task in the presence of foreign-accented English speech compared to native-accented English speech was reported in Calandruccio, Dhar and Bradlow (2010). The masking release appeared to increase as the masker intelligibility decreased. However, it could not be ruled out that spectral differences between the speech maskers were influencing the significant differences observed. Purpose The purpose of the current experiment was to minimize spectral differences between speech maskers to determine how various amounts of linguistic information within competing speech affect masking release. Research Design A mixed model design with within- (four two-talker speech maskers) and between-subject (listener group) factors was conducted. Speech maskers included native-accented English speech, and high-intelligibility, moderate-intelligibility and low-intelligibility Mandarin-accented English. Normalizing the long-term average speech spectra of the maskers to each other minimized spectral differences between the masker conditions. Study Sample Three listener groups were tested including monolingual English speakers with normal hearing, non-native speakers of English with normal hearing, and monolingual speakers of English with hearing loss. The non-native speakers of English were from various native-language backgrounds, not including Mandarin (or any other Chinese dialect). Listeners with hearing loss had symmetrical, mild sloping to moderate sensorineural hearing loss. Data Collection and Analysis Listeners were asked to repeat back sentences that were presented in the presence of four different two-talker speech maskers. Responses were scored based on the keywords within the sentences (100 keywords/masker condition). A mixed-model regression analysis was used to analyze the difference in performance scores between the masker conditions and the listener groups. Results Monolingual speakers of English with normal hearing benefited when the competing speech signal was foreign-accented compared to native-accented allowing for improved speech recognition. Various levels of intelligibility across the foreign-accented speech maskers did not influence results. Neither the non-native English listeners with normal hearing, nor the monolingual English speakers with hearing loss benefited from masking release when the masker was changed from native-accented to foreign-accented English. Conclusions Slight modifications between the target and the masker speech allowed monolingual speakers of English with normal hearing to improve their recognition of native-accented English even when the competing speech was highly intelligible. Further research is needed to determine which modifications within the competing speech signal caused the Mandarin-accented English to be less effective with respect to masking. Determining the influences within the competing speech that make it less effective as a masker, or determining why monolingual normal-hearing listeners can take advantage of these differences could help improve speech recognition for those with hearing loss in the future. PMID:25126683
Fifty years of progress in acoustic phonetics
NASA Astrophysics Data System (ADS)
Stevens, Kenneth N.
2004-10-01
Three events that occurred 50 or 60 years ago shaped the study of acoustic phonetics, and in the following few decades these events influenced research and applications in speech disorders, speech development, speech synthesis, speech recognition, and other subareas in speech communication. These events were: (1) the source-filter theory of speech production (Chiba and Kajiyama; Fant); (2) the development of the sound spectrograph and its interpretation (Potter, Kopp, and Green; Joos); and (3) the birth of research that related distinctive features to acoustic patterns (Jakobson, Fant, and Halle). Following these events there has been systematic exploration of the articulatory, acoustic, and perceptual bases of phonological categories, and some quantification of the sources of variability in the transformation of this phonological representation of speech into its acoustic manifestations. This effort has been enhanced by studies of how children acquire language in spite of this variability and by research on speech disorders. Gaps in our knowledge of this inherent variability in speech have limited the directions of applications such as synthesis and recognition of speech, and have led to the implementation of data-driven techniques rather than theoretical principles. Some examples of advances in our knowledge, and limitations of this knowledge, are reviewed.
Cochlear implantation in adults with asymmetric hearing loss.
Firszt, Jill B; Holden, Laura K; Reeder, Ruth M; Cowdrey, Lisa; King, Sarah
2012-01-01
Bilateral severe to profound sensorineural hearing loss is a standard criterion for cochlear implantation. Increasingly, patients are implanted in one ear and continue to use a hearing aid in the nonimplanted ear to improve abilities such as sound localization and speech understanding in noise. Patients with severe to profound hearing loss in one ear and a more moderate hearing loss in the other ear (i.e., asymmetric hearing) are not typically considered candidates for cochlear implantation. Amplification in the poorer ear is often unsuccessful because of limited benefit, restricting the patient to unilateral listening from the better ear alone. The purpose of this study was to determine whether patients with asymmetric hearing loss could benefit from cochlear implantation in the poorer ear with continued use of a hearing aid in the better ear. Ten adults with asymmetric hearing between ears participated. In the poorer ear, all participants met cochlear implant candidacy guidelines; seven had postlingual onset, and three had pre/perilingual onset of severe to profound hearing loss. All had open-set speech recognition in the better-hearing ear. Assessment measures included word and sentence recognition in quiet, sentence recognition in fixed noise (four-talker babble) and in diffuse restaurant noise using an adaptive procedure, localization of word stimuli, and a hearing handicap scale. Participants were evaluated preimplant with hearing aids and postimplant with the implant alone, the hearing aid alone in the better ear, and bimodally (the implant and hearing aid in combination). Postlingual participants were evaluated at 6 mo postimplant, and pre/perilingual participants were evaluated at 6 and 12 mo postimplant. Data analysis compared the following results: (1) the poorer-hearing ear preimplant (with hearing aid) and postimplant (with cochlear implant); (2) the device(s) used for everyday listening pre- and postimplant; and (3) the hearing aid-alone and bimodal listening conditions postimplant. The postlingual participants showed significant improvements in speech recognition after 6 mo cochlear implant use in the poorer ear. Five postlingual participants had a bimodal advantage over the hearing aid-alone condition on at least one test measure. On average, the postlingual participants had significantly improved localization with bimodal input compared with the hearing aid-alone. Only one pre/perilingual participant had open-set speech recognition with the cochlear implant. This participant had better hearing than the other two pre/perilingual participants in both the poorer and better ear. Localization abilities were not significantly different between the bimodal and hearing aid-alone conditions for the pre/perilingual participants. Mean hearing handicap ratings improved postimplant for all participants indicating perceived benefit in everyday life with the addition of the cochlear implant. Patients with asymmetric hearing loss who are not typical cochlear implant candidates can benefit from using a cochlear implant in the poorer ear with continued use of a hearing aid in the better ear. For this group of 10, the 7 postlingually deafened participants showed greater benefits with the cochlear implant than the pre/perilingual participants; however, further study is needed to determine maximum benefit for those with early onset of hearing loss.
Cochlear Implantation in Adults with Asymmetric Hearing Loss
Firszt, Jill B.; Holden, Laura K.; Reeder, Ruth M.; Cowdrey, Lisa; King, Sarah
2012-01-01
Objective Bilateral severe-to-profound sensorineural hearing loss is a standard criterion for cochlear implantation. Increasingly, patients are implanted in one ear and continue to use a hearing aid in the non-implanted ear to improve abilities such as sound localization and speech understanding in noise. Patients with severe-to-profound hearing loss in one ear and a more moderate hearing loss in the other ear (i.e., asymmetric hearing) are not typically considered candidates for cochlear implantation. Amplification in the poorer ear is often unsuccessful due to limited benefit, restricting the patient to unilateral listening from the better ear alone. The purpose of this study was to determine if patients with asymmetric hearing loss could benefit from cochlear implantation in the poorer ear with continued use of a hearing aid in the better ear. Design Ten adults with asymmetric hearing between ears participated. In the poorer ear, all participants met cochlear implant candidacy guidelines; seven had postlingual onset and three had pre/perilingual onset of severe-to-profound hearing loss. All had open-set speech recognition in the better hearing ear. Assessment measures included word and sentence recognition in quiet, sentence recognition in fixed noise (four-talker babble) and in diffuse restaurant noise using an adaptive procedure, localization of word stimuli and a hearing handicap scale. Participants were evaluated pre-implant with hearing aids and post-implant with the implant alone, the hearing aid alone in the better ear and bimodally (the implant and hearing aid in combination). Postlingual participants were evaluated at six months post-implant and pre/perilingual participants were evaluated at six and 12 months post-implant. Data analysis compared results 1) of the poorer hearing ear pre-implant (with hearing aid) and post-implant (with cochlear implant), 2) with the device(s) used for everyday listening pre- and post-implant and, 3) between the hearing aid-alone and bimodal listening conditions post-implant. Results The postlingual participants showed significant improvements in speech recognition after six months cochlear implant use in the poorer ear. Five postlingual participants had a bimodal advantage over the hearing aid-alone condition on at least one test measure. On average, the postlingual participants had significantly improved localization with bimodal input compared to the hearing aid-alone. Only one pre/perilingual participant had open-set speech recognition with the cochlear implant. This participant had better hearing than the other two pre/perilingual participants in both the poorer and better ear. Localization abilities were not significantly different between the bimodal and hearing aid-alone conditions for the pre/perilingual participants. Mean hearing handicap ratings improved post-implant for all participants indicating perceived benefit in everyday life with the addition of the cochlear implant. Conclusions Patients with asymmetric hearing loss who are not typical cochlear implant candidates can benefit from using a cochlear implant in the poorer ear with continued use of a hearing aid in the better ear. For this group of ten, the seven postlingually deafened participants showed greater benefits with the cochlear implant than the pre/perilingual participants; however, further study is needed to determine maximum benefit for those with early onset of hearing loss. PMID:22441359
NASA Astrophysics Data System (ADS)
Green, Tim; Faulkner, Andrew; Rosen, Stuart; Macherey, Olivier
2005-07-01
Standard continuous interleaved sampling processing, and a modified processing strategy designed to enhance temporal cues to voice pitch, were compared on tests of intonation perception, and vowel perception, both in implant users and in acoustic simulations. In standard processing, 400 Hz low-pass envelopes modulated either pulse trains (implant users) or noise carriers (simulations). In the modified strategy, slow-rate envelope modulations, which convey dynamic spectral variation crucial for speech understanding, were extracted by low-pass filtering (32 Hz). In addition, during voiced speech, higher-rate temporal modulation in each channel was provided by 100% amplitude-modulation by a sawtooth-like wave form whose periodicity followed the fundamental frequency (F0) of the input. Channel levels were determined by the product of the lower- and higher-rate modulation components. Both in acoustic simulations and in implant users, the ability to use intonation information to identify sentences as question or statement was significantly better with modified processing. However, while there was no difference in vowel recognition in the acoustic simulation, implant users performed worse with modified processing both in vowel recognition and in formant frequency discrimination. It appears that, while enhancing pitch perception, modified processing harmed the transmission of spectral information.
Halim, Zahid; Abbas, Ghulam
2015-01-01
Sign language provides hearing and speech impaired individuals with an interface to communicate with other members of the society. Unfortunately, sign language is not understood by most of the common people. For this, a gadget based on image processing and pattern recognition can provide with a vital aid for detecting and translating sign language into a vocal language. This work presents a system for detecting and understanding the sign language gestures by a custom built software tool and later translating the gesture into a vocal language. For the purpose of recognizing a particular gesture, the system employs a Dynamic Time Warping (DTW) algorithm and an off-the-shelf software tool is employed for vocal language generation. Microsoft(®) Kinect is the primary tool used to capture video stream of a user. The proposed method is capable of successfully detecting gestures stored in the dictionary with an accuracy of 91%. The proposed system has the ability to define and add custom made gestures. Based on an experiment in which 10 individuals with impairments used the system to communicate with 5 people with no disability, 87% agreed that the system was useful.
Restoring the missing features of the corrupted speech using linear interpolation methods
NASA Astrophysics Data System (ADS)
Rassem, Taha H.; Makbol, Nasrin M.; Hasan, Ali Muttaleb; Zaki, Siti Syazni Mohd; Girija, P. N.
2017-10-01
One of the main challenges in the Automatic Speech Recognition (ASR) is the noise. The performance of the ASR system reduces significantly if the speech is corrupted by noise. In spectrogram representation of a speech signal, after deleting low Signal to Noise Ratio (SNR) elements, the incomplete spectrogram is obtained. In this case, the speech recognizer should make modifications to the spectrogram in order to restore the missing elements, which is one direction. In another direction, speech recognizer should be able to restore the missing elements due to deleting low SNR elements before performing the recognition. This is can be done using different spectrogram reconstruction methods. In this paper, the geometrical spectrogram reconstruction methods suggested by some researchers are implemented as a toolbox. In these geometrical reconstruction methods, the linear interpolation along time or frequency methods are used to predict the missing elements between adjacent observed elements in the spectrogram. Moreover, a new linear interpolation method using time and frequency together is presented. The CMU Sphinx III software is used in the experiments to test the performance of the linear interpolation reconstruction method. The experiments are done under different conditions such as different lengths of the window and different lengths of utterances. Speech corpus consists of 20 males and 20 females; each one has two different utterances are used in the experiments. As a result, 80% recognition accuracy is achieved with 25% SNR ratio.
Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems
NASA Technical Reports Server (NTRS)
Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan
2010-01-01
A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It can enjoy a fast rate of data/text entry, small overall size, and can be lightweight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaption, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMMbased ASR components were developed. They can help real-time ASR system designers select proper tasks when in the face of constraints in computational resources.
Plant, Kerrie; Babic, Leanne
2016-01-01
The aim of the study was to quantify the benefit provided by having access to amplified acoustic hearing in the implanted ear for use in combination with contralateral acoustic hearing and the electrical stimulation provided by the cochlear implant. Measures of spatial and non-spatial hearing abilities were obtained to compare performance obtained with different configurations of acoustic hearing in combination with electrical stimulation. In the combined listening condition participants had access to bilateral acoustic hearing whereas the bimodal condition used acoustic hearing contralateral to the implanted ear only. Experience was provided with each of the listening conditions using a repeated-measures A-B-B-A experimental design. Sixteen post-linguistically hearing-impaired adults participated in the study. Group mean benefit was obtained with use of the combined mode on measures of speech recognition in coincident speech in noise, localization ability, subjective ratings of real-world benefit, and musical sound quality ratings. Access to bilateral acoustic hearing after cochlear implantation provides significant benefit on a range of functional measures.