Sample records for speech recognition errors

  1. Supporting Dictation Speech Recognition Error Correction: The Impact of External Information

    ERIC Educational Resources Information Center

    Shi, Yongmei; Zhou, Lina

    2011-01-01

    Although speech recognition technology has made remarkable progress, its wide adoption is still restricted by notable effort made and frustration experienced by users while correcting speech recognition errors. One of the promising ways to improve error correction is by providing user support. Although support mechanisms have been proposed for…

  2. Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training

    NASA Technical Reports Server (NTRS)

    Olorenshaw, Lex; Trawick, David

    1991-01-01

    The purpose was to develop a speech recognition system to be able to detect speech which is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Better mechanisms are provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was provided. In continuous speech, the system was tested to be able to provide above 80 pct. correct acceptance of words, while correctly rejecting over 80 pct. of incorrectly pronounced words.

  3. Automatic concept extraction from spoken medical reports.

    PubMed

    Happe, André; Pouliquen, Bruno; Burgun, Anita; Cuggia, Marc; Le Beux, Pierre

    2003-07-01

    The objective of this project is to investigate methods whereby a combination of speech recognition and automated indexing methods substitute for current transcription and indexing practices. We based our study on existing speech recognition software programs and on NOMINDEX, a tool that extracts MeSH concepts from medical text in natural language and that is mainly based on a French medical lexicon and on the UMLS. For each document, the process consists of three steps: (1) dictation and digital audio recording, (2) speech recognition, (3) automatic indexing. The evaluation consisted of a comparison between the set of concepts extracted by NOMINDEX after the speech recognition phase and the set of keywords manually extracted from the initial document. The method was evaluated on a set of 28 patient discharge summaries extracted from the MENELAS corpus in French, corresponding to in-patients admitted for coronarography. The overall precision was 73% and the overall recall was 90%. Indexing errors were mainly due to word sense ambiguity and abbreviations. A specific issue was the fact that the standard French translation of MeSH terms lacks diacritics. A preliminary evaluation of speech recognition tools showed that the rate of accurate recognition was higher than 98%. Only 3% of the indexing errors were generated by inadequate speech recognition. We discuss several areas to focus on to improve this prototype. However, the very low rate of indexing errors due to speech recognition errors highlights the potential benefits of combining speech recognition techniques and automatic indexing.
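
    The evaluation described above compares two sets of concepts per document. A minimal sketch of that set-based precision and recall computation follows; the concept labels are hypothetical and are not taken from the MENELAS corpus or from NOMINDEX output.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) for two collections of concept labels."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical concept sets for one discharge summary (illustration only).
auto_concepts = {"angina pectoris", "coronary angiography", "stenosis", "aspirin"}
manual_keywords = {"angina pectoris", "coronary angiography", "stenosis"}

precision, recall = precision_recall(auto_concepts, manual_keywords)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

    In this toy case one automatically extracted concept is absent from the manual keywords, which lowers precision while recall stays complete.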

  4. Leveraging Automatic Speech Recognition Errors to Detect Challenging Speech Segments in TED Talks

    ERIC Educational Resources Information Center

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya

    2016-01-01

    This study investigates the use of Automatic Speech Recognition (ASR) systems to epitomize second language (L2) listeners' problems in perception of TED talks. ASR-generated transcripts of videos often involve recognition errors, which may indicate difficult segments for L2 listeners. This paper aims to discover the root-causes of the ASR errors…

  5. Syntactic and semantic errors in radiology reports associated with speech recognition software.

    PubMed

    Ringler, Michael D; Goss, Brian C; Bartholmai, Brian J

    2017-03-01

    Speech recognition software can increase the frequency of errors in radiology reports, which may affect patient care. We retrieved 213,977 speech recognition software-generated reports from 147 different radiologists and proofread them for errors. Errors were classified as "material" if they were believed to alter interpretation of the report. "Immaterial" errors were subclassified as intrusion/omission or spelling errors. The proportion of errors and error type were compared among individual radiologists, imaging subspecialty, and time periods. In all, 20,759 reports (9.7%) contained errors, of which 3992 (1.9%) were material errors. Among immaterial errors, spelling errors were more common than intrusion/omission errors (p < .001). Proportion of errors and fraction of material errors varied significantly among radiologists and between imaging subspecialties (p < .001). Errors were more common in cross-sectional reports, reports reinterpreting results of outside examinations, and procedural studies (all p < .001). Error rate decreased over time (p < .001), which suggests that a quality control program with regular feedback may reduce errors.
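
    The percentages in this abstract can be checked from the reported counts. The short calculation below uses only those counts and suggests that the 1.9% figure is a share of all reports, not of the reports that contained errors; the variable names are illustrative.

```python
total_reports = 213_977
reports_with_errors = 20_759
reports_with_material_errors = 3_992

# Share of all reports that contained any error.
share_with_errors = reports_with_errors / total_reports
# Material errors relative to all reports, and relative to erroneous reports only.
share_material_of_all = reports_with_material_errors / total_reports
share_material_of_erroneous = reports_with_material_errors / reports_with_errors

print(f"any error:                {share_with_errors:.1%}")
print(f"material (of all):        {share_material_of_all:.1%}")
print(f"material (of erroneous):  {share_material_of_erroneous:.1%}")
```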

  6. Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers

    NASA Astrophysics Data System (ADS)

    Caballero Morales, Santiago Omar; Cox, Stephen J.

    2009-12-01

    Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.

  7. Syntactic error modeling and scoring normalization in speech recognition

    NASA Technical Reports Server (NTRS)

    Olorenshaw, Lex

    1991-01-01

    The objective was to develop the speech recognition system to be able to detect speech which is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Research was performed in the following areas: (1) syntactic error modeling; (2) score normalization; and (3) phoneme error modeling. The study into the types of errors that a reader makes will provide the basis for creating tests which will approximate the use of the system in the real world. NASA-Johnson will develop this technology into a 'Literacy Tutor' in order to bring innovative concepts to the task of teaching adults to read.

  8. Studies in automatic speech recognition and its application in aerospace

    NASA Astrophysics Data System (ADS)

    Taylor, Michael Robinson

    Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.
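
    Dynamic Time Warping, one of the recognition strategies named above, aligns two sequences that differ in speaking rate. The following is a minimal sketch of the textbook algorithm on one-dimensional feature sequences, not the implementation used in the thesis; the digit templates are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b frame is repeated
                cost[i][j - 1],      # b advances, a frame is repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


# Hypothetical whole-word templates (illustration only).
templates = {"one": [1.0, 2.0, 3.0], "two": [3.0, 1.0, 1.0]}
print(recognize([1.0, 2.0, 2.0, 3.0], templates))
```

    The time-stretched input still matches the "one" template at zero cost because DTW allows a template frame to be repeated.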

  9. Effective Prediction of Errors by Non-native Speakers Using Decision Tree for Speech Recognition-Based CALL System

    NASA Astrophysics Data System (ADS)

    Wang, Hongcui; Kawahara, Tatsuya

    CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it still remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or the ASR grammar network. However, this approach easily falls in the trade-off of coverage of errors and the increase of perplexity. To solve the problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, to achieve both better coverage of errors and smaller perplexity, resulting in significant improvement in ASR accuracy.

  10. Speech variability effects on recognition accuracy associated with concurrent task performance by pilots

    NASA Technical Reports Server (NTRS)

    Simpson, C. A.

    1985-01-01

    In the present study of the responses of pairs of pilots to aircraft warning classification tasks using an isolated word, speaker-dependent speech recognition system, the induced stress was manipulated by means of different scoring procedures for the classification task and by the inclusion of a competitive manual control task. Both speech patterns and recognition accuracy were analyzed, and recognition errors were recorded by type for an isolated word speaker-dependent system and by an offline technique for a connected word speaker-dependent system. While errors increased with task loading for the isolated word system, there was no such effect for task loading in the case of the connected word system.

  11. Speech recognition technology: an outlook for human-to-machine interaction.

    PubMed

    Erdel, T; Crooks, S

    2000-01-01

    Speech recognition, as an enabling technology in healthcare-systems computing, is a topic that has been discussed for quite some time, but is just now coming to fruition. Traditionally, speech-recognition software has been constrained by hardware, but improved processors and increased memory capacities are starting to remove some of these limitations. With these barriers removed, companies that create software for the healthcare setting have the opportunity to write more successful applications. Among the criticisms of speech-recognition applications are the high rates of error and steep training curves. However, even in the face of such negative perceptions, there remains significant opportunities for speech recognition to allow healthcare providers and, more specifically, physicians, to work more efficiently and ultimately spend more time with their patients and less time completing necessary documentation. This article will identify opportunities for inclusion of speech-recognition technology in the healthcare setting and examine major categories of speech-recognition software--continuous speech recognition, command and control, and text-to-speech. We will discuss the advantages and disadvantages of each area, the limitations of the software today, and how future trends might affect them.

  12. Incidence of speech recognition errors in the emergency department.

    PubMed

    Goss, Foster R; Zhou, Li; Weiner, Scott G

    2016-09-01

    Physician use of computerized speech recognition (SR) technology has risen in recent years due to its ease of use and efficiency at the point of care. However, error rates between 10 and 23% have been observed, raising concern about the number of errors being entered into the permanent medical record, their impact on quality of care and medical liability that may arise. Our aim was to determine the incidence and types of SR errors introduced by this technology in the emergency department (ED). The setting was a Level 1 emergency department with 42,000 visits/year in a tertiary academic teaching hospital. A random sample of 100 notes dictated by attending emergency physicians (EPs) using SR software was collected from the ED electronic health record between January and June 2012. Two board-certified EPs annotated the notes and conducted error analysis independently. An existing classification schema was adopted to classify errors into eight error types. Critical errors deemed to potentially impact patient care were identified. There were 128 errors in total or 1.3 errors per note, and 14.8% (n=19) of errors were judged to be critical. 71% of notes contained errors, and 15% contained one or more critical errors. Annunciation errors were the highest at 53.9% (n=69), followed by deletions at 18.0% (n=23) and added words at 11.7% (n=15). Nonsense errors, homonyms and spelling errors accounted for 10.9% (n=14), 4.7% (n=6), and 0.8% (n=1) of errors, respectively. There were no suffix or dictionary errors. Inter-annotator agreement was 97.8%. This is the first estimate at classifying speech recognition errors in dictated emergency department notes. Speech recognition errors occur commonly with annunciation errors being the most frequent. Error rates were comparable if not lower than previous studies. 15% of errors were deemed critical, potentially leading to miscommunication that could affect patient care.
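
    The per-type percentages in this abstract follow from the reported counts when each count is divided by the 128 total errors. The sketch below reproduces that arithmetic; the dictionary keys are shorthand labels chosen here.

```python
# Counts per error type as reported in the abstract.
error_counts = {
    "annunciation": 69,
    "deletion": 23,
    "added words": 15,
    "nonsense": 14,
    "homonym": 6,
    "spelling": 1,
}
critical_errors = 19
notes = 100

total_errors = sum(error_counts.values())
shares = {kind: round(100 * n / total_errors, 1) for kind, n in error_counts.items()}

print(f"total errors: {total_errors}, per note: {total_errors / notes:.1f}")
print(f"critical share: {100 * critical_errors / total_errors:.1f}%")
for kind, pct in shares.items():
    print(f"{kind:>12}: {pct}%")
```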

  13. Hybrid Speaker Recognition Using Universal Acoustic Model

    NASA Astrophysics Data System (ADS)

    Nishimura, Jun; Kuroda, Tadahiro

    We propose a novel speaker recognition approach using a speaker-independent universal acoustic model (UAM) for sensornet applications. In sensornet applications such as “Business Microscope”, interactions among knowledge workers in an organization can be visualized by sensing face-to-face communication using wearable sensor nodes. In conventional studies, speakers are detected by comparing energy of input speech signals among the nodes. However, there are often synchronization errors among the nodes which degrade the speaker recognition performance. By focusing on property of the speaker's acoustic channel, UAM can provide robustness against the synchronization error. The overall speaker recognition accuracy is improved by combining UAM with the energy-based approach. For 0.1s speech inputs and 4 subjects, speaker recognition accuracy of 94% is achieved at the synchronization error less than 100ms.

  14. A preliminary comparison of speech recognition functionality in dental practice management systems.

    PubMed

    Irwin, Jeannie Y; Schleyer, Titus

    2008-11-06

    In this study, we examined speech recognition functionality in four leading dental practice management systems. Twenty dental students used voice to chart a simulated patient with 18 findings in each system. Results show it can take over a minute to chart one finding and that users frequently have to repeat commands. Limited functionality, poor usability and a high error rate appear to retard adoption of speech recognition in dentistry.

  15. Effects and modeling of phonetic and acoustic confusions in accented speech.

    PubMed

    Fung, Pascale; Liu, Yi

    2005-11-01

    Accented speech recognition is more challenging than standard speech recognition due to the effects of phonetic and acoustic confusions. Phonetic confusion in accented speech occurs when an expected phone is pronounced as a different one, which leads to erroneous recognition. Acoustic confusion occurs when the pronounced phone is found to lie acoustically between two baseform models and can be equally recognized as either one. We propose that it is necessary to analyze and model these confusions separately in order to improve accented speech recognition without degrading standard speech recognition. Since low phonetic confusion units in accented speech do not give rise to automatic speech recognition errors, we focus on analyzing and reducing phonetic and acoustic confusability under high phonetic confusion conditions. We propose using likelihood ratio test to measure phonetic confusion, and asymmetric acoustic distance to measure acoustic confusion. Only accent-specific phonetic units with low acoustic confusion are used in an augmented pronunciation dictionary, while phonetic units with high acoustic confusion are reconstructed using decision tree merging. Experimental results show that our approach is effective and superior to methods modeling phonetic confusion or acoustic confusion alone in accented speech, with a significant 5.7% absolute WER reduction, without degrading standard speech recognition.
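
    The result above is stated as an absolute reduction in word error rate (WER). As a reminder of what that measures, here is a minimal sketch of WER as word-level edit distance divided by reference length, followed by the difference between absolute and relative reduction. The sentences and the baseline figure are hypothetical; only the 5.7-point gap comes from the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

# Hypothetical baseline WER; the improved value is chosen to give a 5.7-point gap.
baseline_wer, improved_wer = 0.300, 0.243
absolute_reduction = baseline_wer - improved_wer
relative_reduction = absolute_reduction / baseline_wer
print(f"absolute: {absolute_reduction:.3f}, relative: {relative_reduction:.1%}")
```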

  16. Speech Recognition for Medical Dictation: Overview in Quebec and Systematic Review.

    PubMed

    Poder, Thomas G; Fisette, Jean-François; Déry, Véronique

    2018-04-03

    Speech recognition is increasingly used in medical reporting. The aim of this article is to identify in the literature the strengths and weaknesses of this technology, as well as barriers to and facilitators of its implementation. A systematic review of systematic reviews was performed using PubMed, Scopus, the Cochrane Library and the Center for Reviews and Dissemination through August 2017. The gray literature has also been consulted. The quality of systematic reviews has been assessed with the AMSTAR checklist. The main inclusion criterion was use of speech recognition for medical reporting (front-end or back-end). A survey has also been conducted in Quebec, Canada, to identify the dissemination of this technology in this province, as well as the factors leading to the success or failure of its implementation. Five systematic reviews were identified. These reviews indicated a high level of heterogeneity across studies. The quality of the studies reported was generally poor. Speech recognition is not as accurate as human transcription, but it can dramatically reduce turnaround times for reporting. In front-end use, medical doctors need to spend more time on dictation and correction than required with human transcription. With speech recognition, major errors occur up to three times more frequently. In back-end use, a potential increase in productivity of transcriptionists was noted. In conclusion, speech recognition offers several advantages for medical reporting. However, these advantages are countered by an increased burden on medical doctors and by risks of additional errors in medical reports. It is also hard to identify for which medical specialties and which clinical activities the use of speech recognition will be the most beneficial.

  17. Is talking to an automated teller machine natural and fun?

    PubMed

    Chan, F Y; Khalid, H M

    Usability and affective issues of using automatic speech recognition technology to interact with an automated teller machine (ATM) are investigated in two experiments. The first uncovered dialogue patterns of ATM users for the purpose of designing the user interface for a simulated speech ATM system. Applying the Wizard-of-Oz methodology, multiple mapping and word spotting techniques, the speech driven ATM accommodates bilingual users of Bahasa Melayu and English. The second experiment evaluates the usability of a hybrid speech ATM, comparing it with a simulated manual ATM. The aim is to investigate how natural and fun can talking to a speech ATM be for these first-time users. Subjects performed the withdrawal and balance enquiry tasks. The ANOVA was performed on the usability and affective data. The results showed significant differences between systems in the ability to complete the tasks as well as in transaction errors. Performance was measured on the time taken by subjects to complete the task and the number of speech recognition errors that occurred. On the basis of user emotions, it can be said that the hybrid speech system enabled pleasurable interaction. Despite the limitations of speech recognition technology, users are set to talk to the ATM when it becomes available for public use.

  18. Does the cost function matter in Bayes decision rule?

    PubMed

    Schlüter, Ralf; Nussbaum-Thom, Markus; Ney, Hermann

    2012-02-01

    In many tasks in pattern recognition, such as automatic speech recognition (ASR), optical character recognition (OCR), part-of-speech (POS) tagging, and other string recognition tasks, we are faced with a well-known inconsistency: The Bayes decision rule is usually used to minimize string (symbol sequence) error, whereas, in practice, we want to minimize symbol (word, character, tag, etc.) error. When comparing different recognition systems, we do indeed use symbol error rate as an evaluation measure. The topic of this work is to analyze the relation between string (i.e., 0-1) and symbol error (i.e., metric, integer valued) cost functions in the Bayes decision rule, for which fundamental analytic results are derived. Simple conditions are derived for which the Bayes decision rule with integer-valued metric cost function and with 0-1 cost gives the same decisions or leads to classes with limited cost. The corresponding conditions can be tested with complexity linear in the number of classes. The results obtained do not make any assumption w.r.t. the structure of the underlying distributions or the classification problem. Nevertheless, the general analytic results are analyzed via simulations of string recognition problems with Levenshtein (edit) distance cost function. The results support earlier findings that considerable improvements are to be expected when initial error rates are high.
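
    The inconsistency discussed above can be illustrated with a toy posterior over candidate strings. In this sketch the Bayes decision rule is applied once with 0-1 (string error) cost and once with Levenshtein (symbol error) cost. The posterior values are invented for illustration, and the candidate set is restricted to the listed hypotheses, which is a simplifying assumption and not part of the paper's analysis.

```python
def levenshtein(a, b):
    """Edit distance between two strings (unit cost per symbol edit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def zero_one(a, b):
    """String error cost: 0 if identical, 1 otherwise."""
    return 0 if a == b else 1


def bayes_decision(posterior, cost):
    """Pick the candidate with the minimum expected cost under the posterior."""
    def expected_cost(candidate):
        return sum(p * cost(candidate, truth) for truth, p in posterior.items())
    return min(posterior, key=expected_cost)


# Hypothetical posterior over three candidate strings.
posterior = {"aaa": 0.4, "abb": 0.3, "bbb": 0.3}

print("0-1 cost:        ", bayes_decision(posterior, zero_one))
print("Levenshtein cost:", bayes_decision(posterior, levenshtein))
```

    With 0-1 cost the rule reduces to picking the most probable string, while with edit-distance cost it can prefer a less probable string that lies closer on average to the alternatives.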

  19. Recognizing Whispered Speech Produced by an Individual with Surgically Reconstructed Larynx Using Articulatory Movement Data

    PubMed Central

    Cao, Beiming; Kim, Myungjong; Mau, Ted; Wang, Jun

    2017-01-01

    Individuals with larynx (vocal folds) impaired have problems in controlling their glottal vibration, producing whispered speech with extreme hoarseness. Standard automatic speech recognition using only acoustic cues is typically ineffective for whispered speech because the corresponding spectral characteristics are distorted. Articulatory cues such as the tongue and lip motion may help in recognizing whispered speech since articulatory motion patterns are generally not affected. In this paper, we investigated whispered speech recognition for patients with reconstructed larynx using articulatory movement data. A data set with both acoustic and articulatory motion data was collected from a patient with surgically reconstructed larynx using an electromagnetic articulograph. Two speech recognition systems, Gaussian mixture model-hidden Markov model (GMM-HMM) and deep neural network-HMM (DNN-HMM), were used in the experiments. Experimental results showed adding either tongue or lip motion data to acoustic features such as mel-frequency cepstral coefficient (MFCC) significantly reduced the phone error rates on both speech recognition systems. Adding both tongue and lip data achieved the best performance. PMID:29423453

  20. Intonation and dialog context as constraints for speech recognition.

    PubMed

    Taylor, P; King, S; Isard, S; Wright, H

    1998-01-01

    This paper describes a way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Maptask corpus, a corpus of spontaneous task-oriented dialog speech. This corpus has been tagged according to a dialog analysis scheme that assigns each utterance to one of 12 "move types," such as "acknowledge," "query-yes/no" or "instruct." Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the "correct" move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops. Of course when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type we use an intonation model combined with a dialog model that puts constraints on possible sequences of move types, as well as the speech recognizer likelihoods for the different move-specific models. In the full recognition system, the combination of automatic move type recognition with the move specific language models reduces the overall word error rate by a small but significant amount when compared with a baseline system that does not take intonation or dialog acts into account. Interestingly, the word error improvement is restricted to "initiating" move types, where word recognition is important. In "response" move types, where the important information is conveyed by the move type itself--for example, positive versus negative response--there is no word error improvement, but recognition of the response types themselves is good. The paper discusses the intonation model, the language models, and the dialog model in detail and describes the architecture in which they are combined.
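
    The idea of one bigram language model per move type can be sketched as follows. This is a simplified illustration with add-one smoothing and a few invented training utterances. It selects the move type from language model likelihood alone, whereas the system described above also combines an intonation model and a dialog model; the class and function names are hypothetical.

```python
import math
from collections import Counter


class BigramLM:
    """Add-one smoothed bigram model trained on whitespace-tokenized sentences."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            for prev, word in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, word)] += 1

    def log_prob(self, sentence, vocab_size):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((self.bigram_counts[(prev, word)] + 1)
                     / (self.context_counts[prev] + vocab_size))
            for prev, word in zip(tokens, tokens[1:])
        )


# Invented training utterances for two of the move types (illustration only).
training = {
    "acknowledge": ["okay", "right okay", "yes okay"],
    "instruct": ["go left", "go down to the lake", "turn left at the lake"],
}
vocab = {w for sents in training.values() for s in sents for w in s.split()}
vocab |= {"<s>", "</s>"}
models = {move: BigramLM(sents) for move, sents in training.items()}


def best_move_type(utterance):
    """Return the move type whose bigram model scores the utterance highest."""
    return max(models, key=lambda move: models[move].log_prob(utterance, len(vocab)))


print(best_move_type("go left at the lake"))
```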

  1. Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.

    PubMed

    Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger

    2011-01-01

    The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.

  2. Hierarchical singleton-type recurrent neural fuzzy networks for noisy speech recognition.

    PubMed

    Juang, Chia-Feng; Chiou, Chyi-Tian; Lai, Chun-Lung

    2007-05-01

    This paper proposes noisy speech recognition using hierarchical singleton-type recurrent neural fuzzy networks (HSRNFNs). The proposed HSRNFN is a hierarchical connection of two singleton-type recurrent neural fuzzy networks (SRNFNs), where one is used for noise filtering and the other for recognition. The SRNFN is constructed by recurrent fuzzy if-then rules with fuzzy singletons in the consequences, and their recurrent properties make them suitable for processing speech patterns with temporal characteristics. In n words recognition, n SRNFNs are created for modeling n words, where each SRNFN receives the current frame feature and predicts the next one of its modeling word. The prediction error of each SRNFN is used as recognition criterion. In filtering, one SRNFN is created, and each SRNFN recognizer is connected to the same SRNFN filter, which filters noisy speech patterns in the feature domain before feeding them to the SRNFN recognizer. Experiments with Mandarin word recognition under different types of noise are performed. Other recognizers, including multilayer perceptron (MLP), time-delay neural networks (TDNNs), and hidden Markov models (HMMs), are also tested and compared. These experiments and comparisons demonstrate good results with HSRNFN for noisy speech recognition tasks.

  3. Speech recognition training for enhancing written language generation by a traumatic brain injury survivor.

    PubMed

    Manasse, N J; Hux, K; Rankin-Erickson, J L

    2000-11-01

    Impairments in motor functioning, language processing, and cognitive status may impact the written language performance of traumatic brain injury (TBI) survivors. One strategy to minimize the impact of these impairments is to use a speech recognition system. The purpose of this study was to explore the effect of mild dysarthria and mild cognitive-communication deficits secondary to TBI on a 19-year-old survivor's mastery and use of such a system, specifically Dragon NaturallySpeaking. Data included the percentage of the participant's words accurately perceived by the system over time, the participant's accuracy over time in using commands for navigation and error correction, and quantitative and qualitative changes in the participant's written texts generated with and without the use of the speech recognition system. Results showed that Dragon NaturallySpeaking was approximately 80% accurate in perceiving words spoken by the participant, and the participant quickly and easily mastered all navigation and error correction commands presented. Quantitatively, the participant produced a greater amount of text using traditional word processing and a standard keyboard than using the speech recognition system. Minimal qualitative differences appeared between writing samples. Discussion of factors that may have contributed to the obtained results and that may affect the generalization of the findings to other TBI survivors is provided.

  4. [Vocal recognition in dental and oral radiology].

    PubMed

    La Fianza, A; Giorgetti, S; Marelli, P; Campani, R

    1993-10-01

    Speech reporting benefits by units which can recognize sentences in any natural language in real time. The use of this method in the everyday practice of radiology departments shows its possible application fields. We used the speech recognition method to report orthopantomographic exams in order to evaluate the advantages the method offers to the management and quality of reporting the exams which are difficult to fit in other closed computed reporting systems. Both speech recognition and the conventional reporting method (tape recording and typewriting) were used to report 760 orthopantomographs. The average time needed to make the report, the legibility (or Flesch) index, as adapted for the Italian language, and finally a clinical index (the subjective opinion of 4 odontostomatologists) were evaluated for each exam, with both techniques. Moreover, errors in speech reporting (crude, human and overall errors) were also evaluated. The advantages of speech reporting consisted in the shorter time needed for the report to become available (2.24 vs 2.99 minutes) (p < 0.0005), in the improved Flesch index (30.62 vs 28.9) and in the clinical index. The data obtained from speech reporting in odontostomatologic radiology were useful not only to reduce the mean reporting time of orthopantomographic exams but also to improve report quality by reducing both grammar and transmission mistakes. However, the basic condition for such results to be obtained is the speaker's skills to make a good report.

  5. Robust recognition of loud and Lombard speech in the fighter cockpit environment

    NASA Astrophysics Data System (ADS)

    Stanton, Bill J., Jr.

    1988-08-01

    There are a number of challenges associated with incorporating speech recognition technology into the fighter cockpit. One of the major problems is the wide range of variability in the pilot's voice. That can result from changing levels of stress and workload. Increasing the training set to include abnormal speech is not an attractive option because of the innumerable conditions that would have to be represented and the inordinate amount of time to collect such a training set. A more promising approach is to study subsets of abnormal speech that have been produced under controlled cockpit conditions with the purpose of characterizing reliable shifts that occur relative to normal speech. Such was the initiative of this research. Analyses were conducted for 18 features on 17671 phoneme tokens across eight speakers for normal, loud, and Lombard speech. It was discovered that there was a consistent migration of energy in the sonorants. This discovery of reliable energy shifts led to the development of a method to reduce or eliminate these shifts in the Euclidean distances between LPC log magnitude spectra. This combination significantly improved recognition performance of loud and Lombard speech. Discrepancies in recognition error rates between normal and abnormal speech were reduced by approximately 50 percent for all eight speakers combined.

  6. [The endpoint detection of cough signal in continuous speech].

    PubMed

    Yang, Guoqing; Mo, Hongqiang; Li, Wen; Lian, Lianfang; Zheng, Zeguang

    2010-06-01

    The endpoint detection of cough signals in continuous speech has been researched in order to improve the efficiency and accuracy of manual recognition or computer-based automatic recognition. First, the short-time zero-crossing ratio (ZCR) is used to identify suspicious coughs, and the threshold of short-time energy is obtained from the acoustic characteristics of cough. Then, the short-time energy is combined with the short-time ZCR in order to implement the endpoint detection of cough in continuous speech. To evaluate the effect of the method, first, the virtual number of coughs in each recording was identified by two experienced doctors using the graphical user interface (GUI). Second, the recordings were analyzed by an automatic endpoint detection program under MATLAB 7.0. Finally, the comparison between these two results showed that the error rate of undetected cough is 2.18%, and that 98.13% of noise, silence and speech were removed. The way of setting the short-time energy threshold is robust. The endpoint detection program can remove most speech and noise, thus maintaining a lower rate of error.
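
    The combination of short-time energy and short-time ZCR described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' MATLAB program: the frame length, hop size, both thresholds, the rule that a frame is active only when both features exceed their thresholds, and the function names `frame_features` and `detect_segments` are all assumptions made for illustration.

```python
def frame_features(samples, frame_len=256, hop=128):
    """Return a (short-time energy, zero-crossing ratio) pair for each frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Mean squared amplitude of the frame.
        energy = sum(x * x for x in frame) / frame_len
        # Fraction of adjacent sample pairs whose signs differ.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        feats.append((energy, crossings / (frame_len - 1)))
    return feats


def detect_segments(samples, energy_thr, zcr_thr, frame_len=256, hop=128):
    """Return (start_sample, end_sample) spans of consecutive active frames.

    Assumption: a frame is active when BOTH its energy and its ZCR reach
    the given thresholds; the abstract does not state the exact rule.
    """
    feats = frame_features(samples, frame_len, hop)
    segments = []
    seg_start = None
    for i, (energy, zcr) in enumerate(feats):
        active = energy >= energy_thr and zcr >= zcr_thr
        if active and seg_start is None:
            seg_start = i * hop
        elif not active and seg_start is not None:
            # The segment ends with the last sample of the previous frame.
            segments.append((seg_start, (i - 1) * hop + frame_len))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, (len(feats) - 1) * hop + frame_len))
    return segments
```

    In practice the thresholds would have to be tuned on labelled recordings; the values used to exercise this sketch are arbitrary.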

  7. A posteriori error estimates in voice source recovery

    NASA Astrophysics Data System (ADS)

    Leonov, A. S.; Sorokin, V. N.

    2017-12-01

    The inverse problem of voice source pulse recovery from a segment of a speech signal is under consideration. A special mathematical model that relates these quantities is used for the solution. A variational method of solving the inverse problem of voice source recovery for a new parametric class of sources, namely piecewise-linear sources (PWL-sources), is proposed. Also, a technique for a posteriori numerical error estimation for the obtained solutions is presented. A computer study of the adequacy of the adopted speech production model with PWL-sources is performed in solving the inverse problems for various types of voice signals, as well as a corresponding study of a posteriori error estimates. Numerical experiments for speech signals show satisfactory properties of the proposed a posteriori error estimates, which represent the upper bounds of possible errors in solving the inverse problem. The estimate of the most probable error in determining the source-pulse shapes is about 7-8% for the investigated speech material. It is noted that a posteriori error estimates can be used as a criterion of quality for the obtained voice source pulses in application to speaker recognition.

  8. The cingulo-opercular network provides word-recognition benefit.

    PubMed

    Vaden, Kenneth I; Kuchinsky, Stefanie E; Cute, Stephanie L; Ahlstrom, Jayne B; Dubno, Judy R; Eckert, Mark A

    2013-11-27

    Recognizing speech in difficult listening conditions requires considerable focus of attention that is often demonstrated by elevated activity in putative attention systems, including the cingulo-opercular network. We tested the prediction that elevated cingulo-opercular activity provides word-recognition benefit on a subsequent trial. Eighteen healthy, normal-hearing adults (10 females; aged 20-38 years) performed word recognition (120 trials) in multi-talker babble at +3 and +10 dB signal-to-noise ratios during a sparse sampling functional magnetic resonance imaging (fMRI) experiment. Blood oxygen level-dependent (BOLD) contrast was elevated in the anterior cingulate cortex, anterior insula, and frontal operculum in response to poorer speech intelligibility and response errors. These brain regions exhibited significantly greater correlated activity during word recognition compared with rest, supporting the premise that word-recognition demands increased the coherence of cingulo-opercular network activity. Consistent with an adaptive control network explanation, general linear mixed model analyses demonstrated that increased magnitude and extent of cingulo-opercular network activity was significantly associated with correct word recognition on subsequent trials. These results indicate that elevated cingulo-opercular network activity is not simply a reflection of poor performance or error but also supports word recognition in difficult listening conditions.

  9. Approximated mutual information training for speech recognition using myoelectric signals.

    PubMed

    Guo, Hua J; Chan, A D C

    2006-01-01

    A new training algorithm called the approximated maximum mutual information (AMMI) is proposed to improve the accuracy of myoelectric speech recognition using hidden Markov models (HMMs). Previous studies have demonstrated that automatic speech recognition can be performed using myoelectric signals from articulatory muscles of the face. Classification of facial myoelectric signals can be performed using HMMs that are trained using the maximum likelihood (ML) algorithm; however, this algorithm maximizes the likelihood of the observations in the training sequence, which is not directly associated with optimal classification accuracy. The AMMI training algorithm attempts to maximize the mutual information, thereby training the HMMs to optimize their parameters for discrimination. Our results show that AMMI training consistently reduces the error rates compared to those obtained by ML training, increasing the accuracy by approximately 3% on average.
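
    The contrast between ML and mutual-information training drawn in this abstract can be written out with the standard textbook objectives. The notation is assumed here, not taken from the paper: O_r is the r-th training sequence, c_r its correct class, \lambda_c the HMM for class c, and P(c) a class prior. The abstract does not say how AMMI approximates the second objective, so only the exact MMI form is shown.

```latex
% ML training: each class model is fitted to its own training sequences.
\[
\hat{\lambda}_{\mathrm{ML}}
  = \arg\max_{\lambda} \sum_{r=1}^{R} \log p\left(O_r \mid \lambda_{c_r}\right)
\]

% MMI training: the correct class is scored against all competing classes,
% so the objective rewards discrimination rather than fit alone.
\[
F_{\mathrm{MMI}}(\lambda)
  = \sum_{r=1}^{R} \log
    \frac{p\left(O_r \mid \lambda_{c_r}\right) P(c_r)}
         {\sum_{c} p\left(O_r \mid \lambda_{c}\right) P(c)}
\]
```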

  10. Analysis of Factors Affecting System Performance in the ASpIRE Challenge

    DTIC Science & Technology

    2015-12-13

    performance in the ASpIRE (Automatic Speech recognition In Reverberant Environments) challenge. In particular, overall word error rate (WER) of the solver...systems is analyzed as a function of room, distance between talker and microphone, and microphone type. We also analyze speech activity detection...analysis will inform the design of future challenges and provide insight into the efficacy of current solutions addressing noisy reverberant speech

  11. Radiological reporting that combine continuous speech recognition with error correction by transcriptionists.

    PubMed

    Ichikawa, Tamaki; Kitanosono, Takashi; Koizumi, Jun; Ogushi, Yoichi; Tanaka, Osamu; Endo, Jun; Hashimoto, Takeshi; Kawada, Shuichi; Saito, Midori; Kobayashi, Makiko; Imai, Yutaka

    2007-12-20

    We evaluated the usefulness of radiological reporting that combines continuous speech recognition (CSR) and error correction by transcriptionists. Four transcriptionists (two with more than 10 years' and two with less than 3 months' transcription experience) listened to the same 100 dictation files and created radiological reports using conventional transcription and a method that combined CSR with manual error correction by the transcriptionists. We compared the 2 groups using the 2 methods for accuracy and report creation time and evaluated the transcriptionists' inter-personal dependence on accuracy rate and report creation time. We used a CSR system that did not require the training of the system to recognize the user's voice. We observed no significant difference in accuracy between the 2 groups and 2 methods that we tested, though transcriptionists with greater experience transcribed faster than those with less experience using conventional transcription. Using the combined method, error correction speed was not significantly different between two groups of transcriptionists with different levels of experience. Combining CSR and manual error correction by transcriptionists enabled convenient and accurate radiological reporting.

  12. Continuous Speech Recognition for Clinicians

    PubMed Central

    Zafar, Atif; Overhage, J. Marc; McDonald, Clement J.

    1999-01-01

    The current generation of continuous speech recognition systems claims to offer high accuracy (greater than 95 percent) speech recognition at natural speech rates (150 words per minute) on low-cost (under $2000) platforms. This paper presents a state-of-the-technology summary, along with insights the authors have gained through testing one such product extensively and other products superficially. The authors have identified a number of issues that are important in managing accuracy and usability. First, for efficient recognition users must start with a dictionary containing the phonetic spellings of all words they anticipate using. The authors dictated 50 discharge summaries using one inexpensive internal medicine dictionary ($30) and found that they needed to add an additional 400 terms to get recognition rates of 98 percent. However, if they used either of two more expensive and extensive commercial medical vocabularies ($349 and $695), they did not need to add terms to get a 98 percent recognition rate. Second, users must speak clearly and continuously, distinctly pronouncing all syllables. Users must also correct errors as they occur, because accuracy improves with error correction by at least 5 percent over two weeks. Users may find it difficult to train the system to recognize certain terms, regardless of the amount of training, and appropriate substitutions must be created. For example, the authors had to substitute “twice a day” for “bid” when using the less expensive dictionary, but not when using the other two dictionaries. From trials they conducted in settings ranging from an emergency room to hospital wards and clinicians' offices, they learned that ambient noise has minimal effect. Finally, they found that a minimal “usable” hardware configuration (which keeps up with dictation) comprises a 300-MHz Pentium processor with 128 MB of RAM and a “speech quality” sound card (e.g., SoundBlaster, $99). Anything less powerful will result in the system lagging behind the speaking rate. The authors obtained 97 percent accuracy with just 30 minutes of training when using the latest edition of one of the speech recognition systems supplemented by a commercial medical dictionary. This technology has advanced considerably in recent years and is now a serious contender to replace some or all of the increasingly expensive alternative methods of dictation with human transcription. PMID:10332653

  13. Cost-sensitive learning for emotion robust speaker recognition.

    PubMed

    Li, Dongdong; Yang, Yingchun; Dai, Weihui

    2014-01-01

    In the field of information security, voice is one of the most important parts in biometrics. In particular, with the development of voice communication through the Internet or telephone system, huge voice data resources are accessed. In speaker recognition, voiceprint can be applied as the unique password for the user to prove his/her identity. However, speech with various emotions can cause an unacceptably high error rate and degrade the performance of a speaker recognition system. This paper deals with this problem by introducing a cost-sensitive learning technology to reweight the probability of test affective utterances at the pitch envelope level, which can enhance the robustness in emotion-dependent speaker recognition effectively. Based on that technology, a new architecture of recognition system as well as its components is proposed in this paper. The experiment conducted on the Mandarin Affective Speech Corpus shows that an 8% improvement in identification rate over traditional speaker recognition is achieved.

  14. A preliminary analysis of human factors affecting the recognition accuracy of a discrete word recognizer for C3 systems

    NASA Astrophysics Data System (ADS)

    Yellen, H. W.

    1983-03-01

    Literature pertaining to Voice Recognition abounds with information relevant to the assessment of transitory speech recognition devices. In the past, engineering requirements have dictated the path this technology followed. But, other factors do exist that influence recognition accuracy. This thesis explores the impact of Human Factors on the successful recognition of speech, principally addressing the differences or variability among users. A Threshold Technology T-600 was used for a 100-utterance vocabulary to test 44 subjects. A statistical analysis was conducted on 5 generic categories of Human Factors: Occupational, Operational, Psychological, Physiological and Personal. How the equipment is trained and the experience level of the speaker were found to be key characteristics influencing recognition accuracy. To a lesser extent computer experience, time or week, accent, vital capacity and rate of air flow, speaker cooperativeness and anxiety were found to affect overall error rates.

  15. Cost-Sensitive Learning for Emotion Robust Speaker Recognition

    PubMed Central

    Li, Dongdong; Yang, Yingchun

    2014-01-01

    In the field of information security, voice is one of the most important parts in biometrics. In particular, with the development of voice communication through the Internet or telephone system, huge voice data resources are accessed. In speaker recognition, voiceprint can be applied as the unique password for the user to prove his/her identity. However, speech with various emotions can cause an unacceptably high error rate and degrade the performance of a speaker recognition system. This paper deals with this problem by introducing a cost-sensitive learning technology to reweight the probability of test affective utterances at the pitch envelope level, which can enhance the robustness in emotion-dependent speaker recognition effectively. Based on that technology, a new architecture of recognition system as well as its components is proposed in this paper. The experiment conducted on the Mandarin Affective Speech Corpus shows that an 8% improvement in identification rate over traditional speaker recognition is achieved. PMID:24999492

  16. Participation of the Classical Speech Areas in Auditory Long-Term Memory

    PubMed Central

    Karabanov, Anke Ninija; Paine, Rainer; Chao, Chi Chao; Schulze, Katrin; Scott, Brian; Hallett, Mark; Mishkin, Mortimer

    2015-01-01

    Accumulating evidence suggests that storing speech sounds requires transposing rapidly fluctuating sound waves into more easily encoded oromotor sequences. If so, then the classical speech areas in the caudalmost portion of the temporal gyrus (pSTG) and in the inferior frontal gyrus (IFG) may be critical for performing this acoustic-oromotor transposition. We tested this proposal by applying repetitive transcranial magnetic stimulation (rTMS) to each of these left-hemisphere loci, as well as to a nonspeech locus, while participants listened to pseudowords. After 5 minutes these stimuli were re-presented together with new ones in a recognition test. Compared to control-site stimulation, pSTG stimulation produced a highly significant increase in recognition error rate, without affecting reaction time. By contrast, IFG stimulation led only to a weak, non-significant, trend toward recognition memory impairment. Importantly, the impairment after pSTG stimulation was not due to interference with perception, since the same stimulation failed to affect pseudoword discrimination examined with short interstimulus intervals. Our findings suggest that pSTG is essential for transforming speech sounds into stored motor plans for reproducing the sound. Whether or not the IFG also plays a role in speech-sound recognition could not be determined from the present results. PMID:25815813

  17. Participation of the classical speech areas in auditory long-term memory.

    PubMed

    Karabanov, Anke Ninija; Paine, Rainer; Chao, Chi Chao; Schulze, Katrin; Scott, Brian; Hallett, Mark; Mishkin, Mortimer

    2015-01-01

    Accumulating evidence suggests that storing speech sounds requires transposing rapidly fluctuating sound waves into more easily encoded oromotor sequences. If so, then the classical speech areas in the caudalmost portion of the temporal gyrus (pSTG) and in the inferior frontal gyrus (IFG) may be critical for performing this acoustic-oromotor transposition. We tested this proposal by applying repetitive transcranial magnetic stimulation (rTMS) to each of these left-hemisphere loci, as well as to a nonspeech locus, while participants listened to pseudowords. After 5 minutes these stimuli were re-presented together with new ones in a recognition test. Compared to control-site stimulation, pSTG stimulation produced a highly significant increase in recognition error rate, without affecting reaction time. By contrast, IFG stimulation led only to a weak, non-significant, trend toward recognition memory impairment. Importantly, the impairment after pSTG stimulation was not due to interference with perception, since the same stimulation failed to affect pseudoword discrimination examined with short interstimulus intervals. Our findings suggest that pSTG is essential for transforming speech sounds into stored motor plans for reproducing the sound. Whether or not the IFG also plays a role in speech-sound recognition could not be determined from the present results.

  18. Improving Acoustic Models by Watching Television

    NASA Technical Reports Server (NTRS)

    Witbrock, Michael J.; Hauptmann, Alexander G.

    1998-01-01

    Obtaining sufficient labelled training data is a persistent difficulty for speech recognition research. Although well transcribed data is expensive to produce, there is a constant stream of challenging speech data and poor transcription broadcast as closed-captioned television. We describe a reliable unsupervised method for identifying accurately transcribed sections of these broadcasts, and show how these segments can be used to train a recognition system. Starting from acoustic models trained on the Wall Street Journal database, a single iteration of our training method reduced the word error rate on an independent broadcast television news test set from 62.2% to 59.5%.
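
    Word error rate, the metric reported in this abstract, is conventionally the word-level edit distance between the reference and the recognizer output divided by the number of reference words. The Python sketch below assumes that conventional definition and whitespace tokenization; the function names are illustrative and the scoring tool used in the paper is not stated. By this arithmetic the reported drop from 62.2% to 59.5% is 2.7 percentage points absolute, or roughly 4.3% relative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Relative error-rate reduction, as a fraction of the starting rate."""
    return (before - after) / before
```

    For example, `relative_reduction(62.2, 59.5)` evaluates the relative improvement for the figures quoted above.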

  19. A multi-views multi-learners approach towards dysarthric speech recognition using multi-nets artificial neural networks.

    PubMed

    Shahamiri, Seyed Reza; Salim, Siti Salwah Binti

    2014-09-01

    Automatic speech recognition (ASR) can be very helpful for speakers who suffer from dysarthria, a neurological disability that damages the control of motor speech articulators. Although a few attempts have been made to apply ASR technologies to sufferers of dysarthria, previous studies show that such ASR systems have not attained an adequate level of performance. In this study, a dysarthric multi-networks speech recognizer (DM-NSR) model is provided using a realization of the multi-views multi-learners approach called multi-nets artificial neural networks, which tolerates variability of dysarthric speech. In particular, the DM-NSR model employs several ANNs (as learners) to approximate the likelihood of ASR vocabulary words and to deal with the complexity of dysarthric speech. The proposed DM-NSR approach was presented as both speaker-dependent and speaker-independent paradigms. In order to highlight the performance of the proposed model over legacy models, multi-views single-learner models of the DM-NSRs were also provided and their efficiencies were compared in detail. Moreover, a comparison among the prominent dysarthric ASR methods and the proposed one is provided. The results show that the DM-NSR improved the recognition rate by up to 24.67% and reduced the error rate by up to 8.63% over the reference model.

  20. Improved Open-Microphone Speech Recognition

    NASA Astrophysics Data System (ADS)

    Abrash, Victor

    2002-12-01

    Many current and future NASA missions make extreme demands on mission personnel both in terms of work load and in performing under difficult environmental conditions. In situations where hands are impeded or needed for other tasks, eyes are busy attending to the environment, or tasks are sufficiently complex that ease of use of the interface becomes critical, spoken natural language dialog systems offer unique input and output modalities that can improve efficiency and safety. They also offer new capabilities that would not otherwise be available. For example, many NASA applications require astronauts to use computers in micro-gravity or while wearing space suits. Under these circumstances, command and control systems that allow users to issue commands or enter data in hands- and eyes-busy situations become critical. Speech recognition technology designed for current commercial applications limits the performance of the open-ended state-of-the-art dialog systems being developed at NASA. For example, today's recognition systems typically listen to user input only during short segments of the dialog, and user input outside of these short time windows is lost. Mistakes detecting the start and end times of user utterances can lead to mistakes in the recognition output, and the dialog system as a whole has no way to recover from this, or any other, recognition error. Systems also often require the user to signal when that user is going to speak, which is impractical in a hands-free environment, or only allow a system-initiated dialog requiring the user to speak immediately following a system prompt. In this project, SRI has developed software to enable speech recognition in a hands-free, open-microphone environment, eliminating the need for a push-to-talk button or other signaling mechanism. The software continuously captures a user's speech and makes it available to one or more recognizers. By constantly monitoring and storing the audio stream, it provides the spoken dialog manager extra flexibility to recognize the signal with no audio gaps between recognition requests, as well as to rerecognize portions of the signal, or to rerecognize speech with different grammars, acoustic models, recognizers, start times, and so on. SRI expects that this new open-mic functionality will enable NASA to develop better error-correction mechanisms for spoken dialog systems, and may also enable new interaction strategies.

  1. Improved Open-Microphone Speech Recognition

    NASA Technical Reports Server (NTRS)

    Abrash, Victor

    2002-01-01

    Many current and future NASA missions make extreme demands on mission personnel both in terms of work load and in performing under difficult environmental conditions. In situations where hands are impeded or needed for other tasks, eyes are busy attending to the environment, or tasks are sufficiently complex that ease of use of the interface becomes critical, spoken natural language dialog systems offer unique input and output modalities that can improve efficiency and safety. They also offer new capabilities that would not otherwise be available. For example, many NASA applications require astronauts to use computers in micro-gravity or while wearing space suits. Under these circumstances, command and control systems that allow users to issue commands or enter data in hands- and eyes-busy situations become critical. Speech recognition technology designed for current commercial applications limits the performance of the open-ended state-of-the-art dialog systems being developed at NASA. For example, today's recognition systems typically listen to user input only during short segments of the dialog, and user input outside of these short time windows is lost. Mistakes detecting the start and end times of user utterances can lead to mistakes in the recognition output, and the dialog system as a whole has no way to recover from this, or any other, recognition error. Systems also often require the user to signal when that user is going to speak, which is impractical in a hands-free environment, or only allow a system-initiated dialog requiring the user to speak immediately following a system prompt. In this project, SRI has developed software to enable speech recognition in a hands-free, open-microphone environment, eliminating the need for a push-to-talk button or other signaling mechanism. The software continuously captures a user's speech and makes it available to one or more recognizers. By constantly monitoring and storing the audio stream, it provides the spoken dialog manager extra flexibility to recognize the signal with no audio gaps between recognition requests, as well as to rerecognize portions of the signal, or to rerecognize speech with different grammars, acoustic models, recognizers, start times, and so on. SRI expects that this new open-mic functionality will enable NASA to develop better error-correction mechanisms for spoken dialog systems, and may also enable new interaction strategies.

  2. Perceptual learning of degraded speech by minimizing prediction error.

    PubMed

    Sohoglu, Ediz; Davis, Matthew H

    2016-03-22

    Human perception is shaped by past experience on multiple timescales. Sudden and dramatic changes in perception occur when prior knowledge or expectations match stimulus content. These immediate effects contrast with the longer-term, more gradual improvements that are characteristic of perceptual learning. Despite extensive investigation of these two experience-dependent phenomena, there is considerable debate about whether they result from common or dissociable neural mechanisms. Here we test single- and dual-mechanism accounts of experience-dependent changes in perception using concurrent magnetoencephalographic and EEG recordings of neural responses evoked by degraded speech. When speech clarity was enhanced by prior knowledge obtained from matching text, we observed reduced neural activity in a peri-auditory region of the superior temporal gyrus (STG). Critically, longer-term improvements in the accuracy of speech recognition following perceptual learning resulted in reduced activity in a nearly identical STG region. Moreover, short-term neural changes caused by prior knowledge and longer-term neural changes arising from perceptual learning were correlated across subjects with the magnitude of learning-induced changes in recognition accuracy. These experience-dependent effects on neural processing could be dissociated from the neural effect of hearing physically clearer speech, which similarly enhanced perception but increased rather than decreased STG responses. Hence, the observed neural effects of prior knowledge and perceptual learning cannot be attributed to epiphenomenal changes in listening effort that accompany enhanced perception. Instead, our results support a predictive coding account of speech perception; computational simulations show how a single mechanism, minimization of prediction error, can drive immediate perceptual effects of prior knowledge and longer-term perceptual learning of degraded speech.

  3. Perceptual learning of degraded speech by minimizing prediction error

    PubMed Central

    Sohoglu, Ediz

    2016-01-01

    Human perception is shaped by past experience on multiple timescales. Sudden and dramatic changes in perception occur when prior knowledge or expectations match stimulus content. These immediate effects contrast with the longer-term, more gradual improvements that are characteristic of perceptual learning. Despite extensive investigation of these two experience-dependent phenomena, there is considerable debate about whether they result from common or dissociable neural mechanisms. Here we test single- and dual-mechanism accounts of experience-dependent changes in perception using concurrent magnetoencephalographic and EEG recordings of neural responses evoked by degraded speech. When speech clarity was enhanced by prior knowledge obtained from matching text, we observed reduced neural activity in a peri-auditory region of the superior temporal gyrus (STG). Critically, longer-term improvements in the accuracy of speech recognition following perceptual learning resulted in reduced activity in a nearly identical STG region. Moreover, short-term neural changes caused by prior knowledge and longer-term neural changes arising from perceptual learning were correlated across subjects with the magnitude of learning-induced changes in recognition accuracy. These experience-dependent effects on neural processing could be dissociated from the neural effect of hearing physically clearer speech, which similarly enhanced perception but increased rather than decreased STG responses. Hence, the observed neural effects of prior knowledge and perceptual learning cannot be attributed to epiphenomenal changes in listening effort that accompany enhanced perception. Instead, our results support a predictive coding account of speech perception; computational simulations show how a single mechanism, minimization of prediction error, can drive immediate perceptual effects of prior knowledge and longer-term perceptual learning of degraded speech. PMID:26957596

  4. Automatic Speech Recognition in Air Traffic Control: a Human Factors Perspective

    NASA Technical Reports Server (NTRS)

    Karlsson, Joakim

    1990-01-01

    The introduction of Automatic Speech Recognition (ASR) technology into the Air Traffic Control (ATC) system has the potential to improve overall safety and efficiency. However, because ASR technology is inherently a part of the man-machine interface between the user and the system, the human factors issues involved must be addressed. Here, some of the human factors problems are identified and related methods of investigation are presented. Research at M.I.T.'s Flight Transportation Laboratory is being conducted from a human factors perspective, focusing on intelligent parser design, presentation of feedback, error correction strategy design, and optimal choice of input modalities.

  5. Noisy text categorization.

    PubMed

    Vinciarelli, Alessandro

    2005-12-01

    This work presents categorization experiments performed over noisy texts. By noisy, we mean any text obtained through an extraction process (affected by errors) from media other than digital texts (e.g., transcriptions of speech recordings extracted with a recognition system). The performance of a categorization system over the clean and noisy (Word Error Rate between approximately 10 and approximately 50 percent) versions of the same documents is compared. The noisy texts are obtained through handwriting recognition and simulation of optical character recognition. The results show that the performance loss is acceptable for Recall values up to 60-70 percent depending on the noise sources. New measures of the extraction process performance, allowing a better explanation of the categorization results, are proposed.

  6. Kurzweil Reading Machine: A Partial Evaluation of Its Optical Character Recognition Error Rate.

    ERIC Educational Resources Information Center

    Goodrich, Gregory L.; And Others

    1979-01-01

    A study designed to assess the ability of the Kurzweil reading machine (a speech reading device for the visually handicapped) to read three different type styles produced by five different means indicated that the machines tested had different error rates depending upon the means of producing the copy and upon the type style used. (Author/CL)

  7. Analytic study of the Tadoma method: background and preliminary results.

    PubMed

    Norton, S J; Schultz, M C; Reed, C M; Braida, L D; Durlach, N I; Rabinowitz, W M; Chomsky, C

    1977-09-01

    Certain deaf-blind persons have been taught, through the Tadoma method of speechreading, to use vibrotactile cues from the face and neck to understand speech. This paper reports the results of preliminary tests of the speechreading ability of one adult Tadoma user. The tests were of four major types: (1) discrimination of speech stimuli; (2) recognition of words in isolation and in sentences; (3) interpretation of prosodic and syntactic features in sentences; and (4) comprehension of written (Braille) and oral speech. Words in highly contextual environments were much better perceived than were words in low-context environments. Many of the word errors involved phonemic substitutions which shared articulatory features with the target phonemes, with a higher error rate for vowels than consonants. Relative to performance on word-recognition tests, performance on some of the discrimination tests was worse than expected. Perception of sentences appeared to be mildly sensitive to rate of talking and to speaker differences. Results of the tests on perception of prosodic and syntactic features, while inconclusive, indicate that many of the features tested were not used in interpreting sentences. On an English comprehension test, a higher score was obtained for items administered in Braille than through oral presentation.

  8. Sonority contours in word recognition

    NASA Astrophysics Data System (ADS)

    McLennan, Sean

    2003-04-01

    Contrary to the Generativist distinction between competence and performance which asserts that speech or perception errors are due to random, nonlinguistic factors, it seems likely that errors are principled and possibly governed by some of the same constraints as language. A preliminary investigation of errors modeled after the child's "Chain Whisper" game (a degraded stimulus task) suggests that a significant number of recognition errors can be characterized as an improvement in syllable sonority contour towards the linguistically least-marked, voiceless-stop-plus-vowel syllable. An independent study of sonority contours showed that approximately half of the English lexicon can be uniquely identified by their contour alone. Additionally, "sororities" (groups of words that share a single sonority contour), surprisingly, show no correlation to familiarity or frequency in either size or membership. Together these results imply that sonority contours may be an important factor in word recognition and in defining word "neighborhoods." Moreover, they suggest that linguistic markedness constraints may be more prevalent in performance-related phenomena than previously accepted.

  9. Brain-to-text: decoding spoken phrases from phone representations in the brain.

    PubMed

    Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja

    2015-01-01

    It has long been speculated whether communication between humans and machines based on natural speech related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech.

  10. Brain-to-text: decoding spoken phrases from phone representations in the brain

    PubMed Central

    Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja

    2015-01-01

    It has long been speculated whether communication between humans and machines based on natural speech related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech. PMID:26124702

  11. Dereverberation and denoising based on generalized spectral subtraction by multi-channel LMS algorithm using a small-scale microphone array

    NASA Astrophysics Data System (ADS)

    Wang, Longbiao; Odani, Kyohei; Kai, Atsuhiko

    2012-12-01

    A blind dereverberation method based on power spectral subtraction (SS) using a multi-channel least mean squares algorithm was previously proposed to suppress the reverberant speech without additive noise. The results of isolated word speech recognition experiments showed that this method achieved significant improvements over conventional cepstral mean normalization (CMN) in a reverberant environment. In this paper, we propose a blind dereverberation method based on generalized spectral subtraction (GSS), which has been shown to be effective for noise reduction, instead of power SS. Furthermore, we extend the missing feature theory (MFT), which was initially proposed to enhance robustness against additive noise, to dereverberation. A one-stage dereverberation and denoising method based on GSS is presented to simultaneously suppress both the additive noise and nonstationary multiplicative noise (reverberation). The proposed dereverberation method based on GSS with MFT is evaluated on a large vocabulary continuous speech recognition task. When the additive noise was absent, the dereverberation method based on GSS with MFT using only 2 microphones achieves a relative word error reduction rate of 11.4% and 32.6% compared to the dereverberation method based on power SS and the conventional CMN, respectively. For the reverberant and noisy speech, the dereverberation and denoising method based on GSS achieves a relative word error reduction rate of 12.8% compared to the conventional CMN with GSS-based additive noise reduction method. We also analyze the effective factors of the compensation parameter estimation for the dereverberation method based on SS, such as the number of channels (the number of microphones), the length of reverberation to be suppressed, and the length of the utterance used for parameter estimation. The experimental results showed that the SS-based method is robust in a variety of reverberant environments for both isolated and continuous speech recognition and under various parameter estimation conditions.
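
    The subtraction step at the core of spectral subtraction can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the parameters gamma, alpha and beta, and their default values are assumptions made for illustration, and the interference magnitudes (which the record says are obtained with a multi-channel LMS algorithm) are simply taken as given inputs. With gamma = 2 the sketch reduces to power spectral subtraction; other exponents give a generalized form.

```python
from typing import List, Sequence


def spectral_subtract(
    observed: Sequence[float],
    interference: Sequence[float],
    gamma: float = 2.0,   # exponent; 2.0 corresponds to power subtraction (assumed default)
    alpha: float = 1.0,   # over-subtraction factor (hypothetical value)
    beta: float = 0.01,   # spectral floor factor (hypothetical value)
) -> List[float]:
    """Subtract an interference estimate from observed magnitudes, bin by bin."""
    if len(observed) != len(interference):
        raise ValueError("observed and interference must have the same length")
    cleaned = []
    for x, n in zip(observed, interference):
        diff = x ** gamma - alpha * n ** gamma
        floor = beta * x ** gamma          # keeps the result non-negative
        cleaned.append(max(diff, floor) ** (1.0 / gamma))
    return cleaned


if __name__ == "__main__":
    # Two illustrative frequency bins (made-up magnitudes).
    print(spectral_subtract([5.0, 1.0], [3.0, 2.0]))
```

    The floor term is a common design choice: when the interference estimate exceeds the observation, the bin is attenuated rather than set to a negative or zero value.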

  12. Spoken Language Processing in the Clarissa Procedure Browser

    NASA Technical Reports Server (NTRS)

    Rayner, M.; Hockey, B. A.; Renders, J.-M.; Chatzichrisafis, N.; Farrell, K.

    2005-01-01

    Clarissa, an experimental voice-enabled procedure browser that has recently been deployed on the International Space Station, is as far as we know the first spoken dialog system in space. We describe the objectives of the Clarissa project and the system's architecture. In particular, we focus on three key problems: grammar-based speech recognition using the Regulus toolkit; methods for open-mic speech recognition; and robust side-effect free dialogue management for handling undos, corrections and confirmations. We first describe the grammar-based recogniser we have built using Regulus, and report experiments where we compare it against a class N-gram recogniser trained off the same 3297 utterance dataset. We obtained a 15% relative improvement in WER and a 37% improvement in semantic error rate. The grammar-based recogniser moreover outperforms the class N-gram version for utterances of all lengths from 1 to 9 words inclusive. The central problem in building an open-mic speech recognition system is being able to distinguish between commands directed at the system, and other material (cross-talk), which should be rejected. Most spoken dialogue systems make the accept/reject decision by applying a threshold to the recognition confidence score. We show how a simple and general method, based on standard approaches to document classification using Support Vector Machines, can give substantially better performance, and report experiments showing a relative reduction in the task-level error rate by about 25% compared to the baseline confidence threshold method. Finally, we describe a general side-effect free dialogue management architecture that we have implemented in Clarissa, which extends the "update semantics" framework by including task as well as dialogue information in the information state. We show that this enables elegant treatments of several dialogue management problems, including corrections, confirmations, querying of the environment, and regression testing.

  13. Words from spontaneous conversational speech can be recognized with human-like accuracy by an error-driven learning algorithm that discriminates between meanings straight from smart acoustic features, bypassing the phoneme as recognition unit.

    PubMed

    Arnold, Denis; Tomaschek, Fabian; Sering, Konstantin; Lopez, Florence; Baayen, R Harald

    2017-01-01

    Sound units play a pivotal role in cognitive models of auditory comprehension. The general consensus is that during perception listeners break down speech into auditory words and subsequently phones. Indeed, cognitive speech recognition is typically taken to be computationally intractable without phones. Here we present a computational model trained on 20 hours of conversational speech that recognizes word meanings within the range of human performance (model 25%, native speakers 20-44%), without making use of phone or word form representations. Our model also successfully generates predictions about the speed and accuracy of human auditory comprehension. At the heart of the model is a 'wide' yet sparse two-layer artificial neural network with some hundred thousand input units representing summaries of changes in acoustic frequency bands, and proxies for lexical meanings as output units. We believe that our model holds promise for resolving longstanding theoretical problems surrounding the notion of the phone in linguistic theory.

  14. Automatic speech recognition technology development at ITT Defense Communications Division

    NASA Technical Reports Server (NTRS)

    White, George M.

    1977-01-01

    An assessment of the applications of automatic speech recognition to defense communication systems is presented. Future research efforts include investigations into the following areas: (1) dynamic programming; (2) recognition of speech degraded by noise; (3) speaker independent recognition; (4) large vocabulary recognition; (5) word spotting and continuous speech recognition; and (6) isolated word recognition.

  15. Fifty years of progress in speech and speaker recognition

    NASA Astrophysics Data System (ADS)

    Furui, Sadaoki

    2004-10-01

    Speech and speaker recognition technology has made very significant progress in the past 50 years. The progress can be summarized by the following changes: (1) from template matching to corpus-based statistical modeling, e.g., HMM and n-grams, (2) from filter bank/spectral resonance to Cepstral features (Cepstrum + DCepstrum + DDCepstrum), (3) from heuristic time-normalization to DTW/DP matching, (4) from "distance"-based to likelihood-based methods, (5) from maximum likelihood to discriminative approach, e.g., MCE/GPD and MMI, (6) from isolated word to continuous speech recognition, (7) from small vocabulary to large vocabulary recognition, (8) from context-independent units to context-dependent units for recognition, (9) from clean speech to noisy/telephone speech recognition, (10) from single speaker to speaker-independent/adaptive recognition, (11) from monologue to dialogue/conversation recognition, (12) from read speech to spontaneous speech recognition, (13) from recognition to understanding, (14) from single-modality (audio signal only) to multi-modal (audio/visual) speech recognition, (15) from hardware recognizer to software recognizer, and (16) from no commercial application to many practical commercial applications. Most of these advances have taken place in both the fields of speech recognition and speaker recognition. The majority of technological changes have been directed toward the purpose of increasing robustness of recognition, including many other additional important techniques not noted above.

  16. A Method for Determining the Timing of Displaying the Speaker's Face and Captions for a Real-Time Speech-to-Caption System

    NASA Astrophysics Data System (ADS)

    Kuroki, Hayato; Ino, Shuichi; Nakano, Satoko; Hori, Kotaro; Ifukube, Tohru

    The authors of this paper have been studying a real-time speech-to-caption system using speech recognition technology with a “repeat-speaking” method. In this system, they used a “repeat-speaker” who listens to a lecturer's voice and then speaks back the lecturer's speech utterances into a speech recognition computer. This system showed that the accuracy of the captions is about 97% in Japanese-Japanese conversion and the conversion time from voices to captions is about 4 seconds in English-English conversion at some international conferences. Achieving this level of performance was, however, costly. In human communications, speech understanding depends not only on verbal information but also on non-verbal information such as the speaker's gestures, and face and mouth movements. The authors therefore propose displaying the captions together with images of the speaker's face movements in a suitable way, after briefly storing the information in a computer, to achieve higher comprehension. In this paper, we investigate the relationship of the display sequence and display timing between captions that have speech recognition errors and the speaker's face movement images. The results show that the sequence “to display the caption before the speaker's face image” improves the comprehension of the captions. The sequence “to display both simultaneously” shows an improvement only a few percent higher than the question sentence, and the sequence “to display the speaker's face image before the caption” shows almost no change. In addition, the sequence “to display the caption 1 second before the speaker's face” shows the most significant improvement of all the conditions.

  17. Pilot study on the feasibility of a computerized speech recognition charting system.

    PubMed

    Feldman, C A; Stevens, D

    1990-08-01

    The objective of this study was to determine the feasibility of developing and using a voice recognition computerized charting system to record dental clinical examination data. More specifically, the study was designed to analyze the time and error differential between the traditional examiner/recorder method (ASSISTANT) and computerized voice recognition method (VOICE). DMFS examinations were performed twice on 20 patients using the traditional ASSISTANT and the VOICE charting system. A statistically significant difference was found when comparing the mean ASSISTANT time of 2.69 min to the VOICE time of 3.72 min (P < 0.001). No statistically significant difference was found when comparing the mean ASSISTANT recording errors of 0.1 to VOICE recording errors of 0.6 (P = 0.059). 90% of the patients indicated they felt comfortable with the dentist talking to a computer and only 5% of the sample indicated they opposed VOICE. Results from this pilot study indicate that a charting system utilizing voice recognition technology could be considered a viable alternative to traditional examiner/recorder methods of clinical charting.

  18. Severity-Based Adaptation with Limited Data for ASR to Aid Dysarthric Speakers

    PubMed Central

    Mustafa, Mumtaz Begum; Salim, Siti Salwah; Mohamed, Noraini; Al-Qatab, Bassam; Siong, Chng Eng

    2014-01-01

    Automatic speech recognition (ASR) is currently used in many assistive technologies, such as helping individuals with speech impairment in their communication ability. One challenge in ASR for speech-impaired individuals is the difficulty in obtaining a good speech database of impaired speakers for building an effective speech acoustic model. Because there are very few existing databases of impaired speech, which are also limited in size, the obvious solution to build a speech acoustic model of impaired speech is by employing adaptation techniques. However, issues that have not been addressed in existing studies in the area of adaptation for speech impairment are as follows: (1) identifying the most effective adaptation technique for impaired speech; and (2) the use of suitable source models to build an effective impaired-speech acoustic model. This research investigates the above-mentioned two issues on dysarthria, a type of speech impairment affecting millions of people. We applied both unimpaired and impaired speech as the source model with well-known adaptation techniques like the maximum likelihood linear regression (MLLR) and the constrained-MLLR (C-MLLR). The recognition accuracy of each impaired speech acoustic model is measured in terms of word error rate (WER), with further assessments, including phoneme insertion, substitution and deletion rates. Unimpaired speech when combined with limited high-quality speech-impaired data improves performance of ASR systems in recognising severely impaired dysarthric speech. The C-MLLR adaptation technique was also found to be better than MLLR in recognising mildly and moderately impaired speech based on the statistical analysis of the WER. It was found that phoneme substitution was the biggest contributing factor in WER in dysarthric speech for all levels of severity. The results show that the speech acoustic models derived from suitable adaptation techniques improve the performance of ASR systems in recognising impaired speech with limited adaptation data. PMID:24466004
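
    Word error rate and its substitution, deletion and insertion components, as used in the record above, are conventionally obtained from a minimum edit-distance alignment between the reference and the recognizer output. The following is a minimal sketch of that standard computation, not the study's own scoring tool; the names ErrorCounts and align_counts are hypothetical, and a non-empty reference is assumed.

```python
from typing import NamedTuple, Sequence


class ErrorCounts(NamedTuple):
    substitutions: int
    deletions: int
    insertions: int
    ref_len: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / N, with N the number of reference tokens.
        return (self.substitutions + self.deletions + self.insertions) / self.ref_len


def align_counts(ref: Sequence[str], hyp: Sequence[str]) -> ErrorCounts:
    """Count S/D/I errors along a minimum edit-distance alignment."""
    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j].
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)            # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)            # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no extra cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            sub = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            dele = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            ins = (c + 1, s, d, n + 1)
            dp[i][j] = min(sub, dele, ins)   # lowest cost wins
    _, s, d, n = dp[-1][-1]
    return ErrorCounts(s, d, n, len(ref))


if __name__ == "__main__":
    counts = align_counts("the cat sat on the mat".split(),
                          "the cat sit on mat".split())
    print(counts, counts.wer)
```

    The same function works on phoneme sequences instead of words, which is how phoneme-level insertion, substitution and deletion rates would be obtained.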

  19. A Hybrid Acoustic and Pronunciation Model Adaptation Approach for Non-native Speech Recognition

    NASA Astrophysics Data System (ADS)

    Oh, Yoo Rhee; Kim, Hong Kook

    In this paper, we propose a hybrid model adaptation approach in which pronunciation and acoustic models are adapted by incorporating the pronunciation and acoustic variabilities of non-native speech in order to improve the performance of non-native automatic speech recognition (ASR). Specifically, the proposed hybrid model adaptation can be performed at either the state-tying or triphone-modeling level, depending on the level at which acoustic model adaptation is performed. In both methods, we first analyze the pronunciation variant rules of non-native speakers and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts pronunciation models and acoustic models by accommodating the pronunciation variants in the pronunciation dictionary and by clustering the states of triphone acoustic models using the acoustic variants, respectively. On the other hand, the triphone-modeling level hybrid method initially adapts pronunciation models in the same way as in the state-tying level hybrid method; however, for the acoustic model adaptation, the triphone acoustic models are then re-estimated based on the adapted pronunciation models and the states of the re-estimated triphone acoustic models are clustered using the acoustic variants. From the Korean-spoken English speech recognition experiments, it is shown that ASR systems employing the state-tying and triphone-modeling level adaptation methods achieve relative reductions in average word error rate (WER) of 17.1% and 22.1% for non-native speech, respectively, when compared to a baseline ASR system.
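
    A relative WER reduction of the kind reported above is the drop in WER expressed as a fraction of the baseline WER. The short sketch below shows the arithmetic; the function name and the example WER values are hypothetical and are not taken from the paper, which reports only the relative figures.

```python
def relative_error_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return (baseline - adapted) / baseline as a fraction."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up absolute WERs: a drop from 20.0% to 15.0% is a 25% relative reduction.
    print(f"{relative_error_reduction(20.0, 15.0):.1%}")
```

    Note that a relative reduction says nothing about the absolute WER on its own, so the two figures should not be compared directly.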

  20. Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise

    PubMed Central

    Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther

    2016-01-01

    Vocabulary size has been suggested as a useful measure of “verbal abilities” that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18–35 years) and 22 older (60–78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults’ poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access; with an average vocabulary size similar to that of younger adults, they were still slower in lexical access. PMID:27458400

  1. Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise.

    PubMed

    Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther

    2016-01-01

    Vocabulary size has been suggested as a useful measure of "verbal abilities" that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18-35 years) and 22 older (60-78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults' poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access; with an average vocabulary size similar to that of younger adults, they were still slower in lexical access.

  2. Objective Prediction of Hearing Aid Benefit Across Listener Groups Using Machine Learning: Speech Recognition Performance With Binaural Noise-Reduction Algorithms.

    PubMed

    Schädler, Marc R; Warzybok, Anna; Kollmeier, Birger

    2018-01-01

    The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than -20 dB could not be predicted.

  3. Objective Prediction of Hearing Aid Benefit Across Listener Groups Using Machine Learning: Speech Recognition Performance With Binaural Noise-Reduction Algorithms

    PubMed Central

    Schädler, Marc R.; Warzybok, Anna; Kollmeier, Birger

    2018-01-01

    The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than −20 dB could not be predicted. PMID:29692200

  4. Should visual speech cues (speechreading) be considered when fitting hearing aids?

    NASA Astrophysics Data System (ADS)

    Grant, Ken

    2002-05-01

    When talker and listener are face-to-face, visual speech cues become an important part of the communication environment, and yet, these cues are seldom considered when designing hearing aids. Models of auditory-visual speech recognition highlight the importance of complementary versus redundant speech information for predicting auditory-visual recognition performance. Thus, for hearing aids to work optimally when visual speech cues are present, it is important to know whether the cues provided by amplification and the cues provided by speechreading complement each other. In this talk, data will be reviewed that show nonmonotonicity between auditory-alone speech recognition and auditory-visual speech recognition, suggesting that efforts designed solely to improve auditory-alone recognition may not always result in improved auditory-visual recognition. Data will also be presented showing that one of the most important speech cues for enhancing auditory-visual speech recognition performance, voicing, is often the cue that benefits least from amplification.

  5. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    NASA Astrophysics Data System (ADS)

    Kayasith, Prakasith; Theeramunkong, Thanaruk

    It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. Considering two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce a consistent speech signal for a given word and distinguishable speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate for an individual dysarthric speaker before the actual exhaustive implementation of an automatic speech recognition system for that speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard methods, the articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.

  6. Automated Intelligibility Assessment of Pathological Speech Using Phonological Features

    NASA Astrophysics Data System (ADS)

    Middag, Catherine; Martens, Jean-Pierre; Van Nuffelen, Gwen; De Bodt, Marc

    2009-12-01

    It is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. People have therefore put a lot of effort in the design of perceptual intelligibility rating tests. These tests usually have the drawback that they employ unnatural speech material (e.g., nonsense words) and that they cannot fully exclude errors due to listener bias. Therefore, there is a growing interest in the application of objective automatic speech recognition technology to automate the intelligibility assessment. Current research is headed towards the design of automated methods which can be shown to produce ratings that correspond well with those emerging from a well-designed and well-performed perceptual test. In this paper, a novel methodology that is built on previous work (Middag et al., 2008) is presented. It utilizes phonological features, automatic speech alignment based on acoustic models that were trained on normal speech, context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system reveals that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.

  7. SAM: speech-aware applications in medicine to support structured data entry.

    PubMed Central

    Wormek, A. K.; Ingenerf, J.; Orthner, H. F.

    1997-01-01

    In the last two years, improvement in speech recognition technology has directed the medical community's interest to porting and using such innovations in clinical systems. The acceptance of speech recognition systems in clinical domains increases with recognition speed, large medical vocabulary, high accuracy, continuous speech recognition, and speaker independence. Although some commercial speech engines approach these requirements, the greatest benefit can be achieved in adapting a speech recognizer to a specific medical application. The goals of our work are first, to develop a speech-aware core component which is able to establish connections to speech recognition engines of different vendors. This is realized in SAM. Second, with applications based on SAM we want to support the physician in his/her routine clinical care activities. Within the STAMP project (STAndardized Multimedia report generator in Pathology), we extend SAM by combining a structured data entry approach with speech recognition technology. Another speech-aware application in the field of Diabetes care is connected to a terminology server. The server delivers a controlled vocabulary which can be used for speech recognition. PMID:9357730

  8. Development of a Mandarin-English Bilingual Speech Recognition System for Real World Music Retrieval

    NASA Astrophysics Data System (ADS)

    Zhang, Qingqing; Pan, Jielin; Lin, Yang; Shao, Jian; Yan, Yonghong

    In recent decades, there has been a great deal of research into the problem of bilingual speech recognition: developing a recognizer that can handle inter- and intra-sentential language switching between two languages. This paper presents our recent work on the development of a grammar-constrained, Mandarin-English bilingual Speech Recognition System (MESRS) for real world music retrieval. Two of the main difficulties in building bilingual speech recognition systems for real world applications are tackled in this paper. One is to balance the performance and the complexity of the bilingual speech recognition system; the other is to deal effectively with the matrix language accents in the embedded language. In order to process the intra-sentential language switching and reduce the amount of data required to robustly estimate statistical models, a compact single set of bilingual acoustic models derived by phone set merging and clustering is developed instead of using two separate monolingual models for each language. In our study, a novel Two-pass phone clustering method based on Confusion Matrix (TCM) is presented and compared with the log-likelihood measure method. Experiments show that TCM can achieve better performance. Since potential system users' native language is Mandarin, which is regarded as the matrix language in our application, their pronunciations of English as the embedded language usually contain Mandarin accents. In order to deal with the matrix language accents in the embedded language, different non-native adaptation approaches are investigated. Experiments show that the model retraining method outperforms other common adaptation methods such as Maximum A Posteriori (MAP). With the effective incorporation of approaches on phone clustering and non-native adaptation, the Phrase Error Rate (PER) of MESRS for English utterances was reduced by 24.47% relative to the baseline monolingual English system, while the PER on Mandarin utterances was comparable to that of the baseline monolingual Mandarin system. The performance for bilingual utterances achieved a 22.37% relative PER reduction.
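
    A relative error-rate reduction, as reported above, is the absolute reduction divided by the baseline rate. The short sketch below shows the arithmetic; the baseline and system PER values are invented for illustration, since the abstract gives only the relative figures.

```python
def relative_reduction(baseline_rate, new_rate):
    """Relative reduction of an error rate with respect to a baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate

# Hypothetical phrase error rates in percent (not taken from the paper).
baseline_per = 20.0
system_per = 15.0

print(f"{relative_reduction(baseline_per, system_per):.2%}")  # prints 25.00%
```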

  9. I Hear You Eat and Speak: Automatic Recognition of Eating Condition and Food Type, Use-Cases, and Impact on ASR Performance

    PubMed Central

    Hantke, Simone; Weninger, Felix; Kurle, Richard; Ringeval, Fabien; Batliner, Anton; Mousa, Amr El-Desoky; Schuller, Björn

    2016-01-01

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i. e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i. e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i. e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient. PMID:27176486
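
    The leave-one-speaker-out protocol and the average-recall metric mentioned above can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the function names are hypothetical, the classifier itself is omitted, and "average recall" is assumed to mean the unweighted mean of per-class recalls.

```python
from collections import defaultdict

def leave_one_speaker_out(speaker_ids):
    """Yield (held_out_speaker, train_indices, test_indices), one fold per
    speaker, so no speaker appears in both the training and the test set."""
    for held_out in sorted(set(speaker_ids)):
        train = [i for i, s in enumerate(speaker_ids) if s != held_out]
        test = [i for i, s in enumerate(speaker_ids) if s == held_out]
        yield held_out, train, test

def unweighted_average_recall(true_labels, predicted_labels):
    """Mean of per-class recalls; every class counts equally,
    regardless of how many samples it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical utterance-level speaker labels.
speakers = ["s1", "s1", "s2", "s3"]
for held_out, train_idx, test_idx in leave_one_speaker_out(speakers):
    print(held_out, train_idx, test_idx)
```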

  10. Effects of Hearing Loss and Cognitive Load on Speech Recognition with Competing Talkers.

    PubMed

    Meister, Hartmut; Schreitmüller, Stefan; Ortmann, Magdalene; Rählmann, Sebastian; Walger, Martin

    2016-01-01

    Everyday communication frequently comprises situations with more than one talker speaking at a time. These situations are challenging since they pose high attentional and memory demands placing cognitive load on the listener. Hearing impairment additionally exacerbates communication problems under these circumstances. We examined the effects of hearing loss and attention tasks on speech recognition with competing talkers in older adults with and without hearing impairment. We hypothesized that hearing loss would affect word identification, talker separation and word recall and that the difficulties experienced by the hearing impaired listeners would be especially pronounced in a task with high attentional and memory demands. Two listener groups closely matched for their age and neuropsychological profile but differing in hearing acuity were examined regarding their speech recognition with competing talkers in two different tasks. One task required repeating back words from one target talker (1TT) while ignoring the competing talker whereas the other required repeating back words from both talkers (2TT). The competing talkers differed with respect to their voice characteristics. Moreover, sentences either with low or high context were used in order to consider linguistic properties. Compared to their normal hearing peers, listeners with hearing loss revealed limited speech recognition in both tasks. Their difficulties were especially pronounced in the more demanding 2TT task. In order to shed light on the underlying mechanisms, different error sources, namely having misunderstood, confused, or omitted words were investigated. Misunderstanding and omitting words were more frequently observed in the hearing impaired than in the normal hearing listeners. In line with common speech perception models, it is suggested that these effects are related to impaired object formation and taxed working memory capacity (WMC). In a post-hoc analysis, the listeners were further separated with respect to their WMC. It appeared that higher capacity could be used in the sense of a compensatory mechanism with respect to the adverse effects of hearing loss, especially with low context speech.

  11. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    NASA Astrophysics Data System (ADS)

    Heracleous, Panikos; Kaino, Tomomi; Saruwatari, Hiroshi; Shikano, Kiyohiro

    2006-12-01

    We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20 k dictation task a [InlineEquation not available: see fulltext.] word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.

  12. Relationship between listeners' nonnative speech recognition and categorization abilities

    PubMed Central

    Atagi, Eriko; Bent, Tessa

    2015-01-01

    Enhancement of the perceptual encoding of talker characteristics (indexical information) in speech can facilitate listeners' recognition of linguistic content. The present study explored this indexical-linguistic relationship in nonnative speech processing by examining listeners' performance on two tasks: nonnative accent categorization and nonnative speech-in-noise recognition. Results indicated substantial variability across listeners in their performance on both the accent categorization and nonnative speech recognition tasks. Moreover, listeners' accent categorization performance correlated with their nonnative speech-in-noise recognition performance. These results suggest that having more robust indexical representations for nonnative accents may allow listeners to more accurately recognize the linguistic content of nonnative speech. PMID:25618098

  13. Method and apparatus for obtaining complete speech signals for speech recognition applications

    NASA Technical Reports Server (NTRS)

    Abrash, Victor (Inventor); Cesari, Federico (Inventor); Franco, Horacio (Inventor); George, Christopher (Inventor); Zheng, Jing (Inventor)

    2009-01-01

    The present invention relates to a method and apparatus for obtaining complete speech signals for speech recognition applications. In one embodiment, the method continuously records an audio stream comprising a sequence of frames to a circular buffer. When a user command to commence or terminate speech recognition is received, the method obtains a number of frames of the audio stream occurring before or after the user command in order to identify an augmented audio signal for speech recognition processing. In further embodiments, the method analyzes the augmented audio signal in order to locate starting and ending speech endpoints that bound at least a portion of speech to be processed for recognition. At least one of the speech endpoints is located using a Hidden Markov Model.
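
    The circular-buffer idea described above can be illustrated with a bounded deque: audio frames are recorded continuously, the oldest are overwritten, and the most recent ones are retrieved when a user command arrives. This is a minimal sketch, not the patented method; the class name, method names, and integer "frames" are hypothetical stand-ins.

```python
from collections import deque

class PreRollRecorder:
    """Keeps only the most recent `capacity_frames` audio frames."""

    def __init__(self, capacity_frames):
        # A deque with maxlen silently discards the oldest item when full,
        # which gives circular-buffer behaviour.
        self._ring = deque(maxlen=capacity_frames)

    def push(self, frame):
        """Record one incoming frame of the continuous audio stream."""
        self._ring.append(frame)

    def frames_before_command(self, n_frames):
        """Return up to `n_frames` of the most recent frames, oldest first,
        i.e. the audio that preceded the user command."""
        if n_frames <= 0:
            return []
        return list(self._ring)[-n_frames:]

recorder = PreRollRecorder(capacity_frames=5)
for frame_id in range(8):       # integers stand in for audio frames
    recorder.push(frame_id)

print(recorder.frames_before_command(3))  # prints [5, 6, 7]
```

    Prepending these pre-command frames to the audio captured afterwards yields an augmented signal, so speech that started slightly before the command is not cut off.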

  14. Temporal Sensitivity Measured Shortly After Cochlear Implantation Predicts 6-Month Speech Recognition Outcome.

    PubMed

    Erb, Julia; Ludwig, Alexandra Annemarie; Kunke, Dunja; Fuchs, Michael; Obleser, Jonas

    2018-04-24

    Psychoacoustic tests assessed shortly after cochlear implantation are useful predictors of the rehabilitative speech outcome. While largely independent, both spectral and temporal resolution tests are important to provide an accurate prediction of speech recognition. However, rapid tests of temporal sensitivity are currently lacking. Here, we propose a simple amplitude modulation rate discrimination (AMRD) paradigm that is validated by predicting future speech recognition in adult cochlear implant (CI) patients. In 34 newly implanted patients, we used an adaptive AMRD paradigm, where broadband noise was modulated at the speech-relevant rate of ~4 Hz. In a longitudinal study, speech recognition in quiet was assessed using the closed-set Freiburger number test shortly after cochlear implantation (t0) as well as the open-set Freiburger monosyllabic word test 6 months later (t6). Both AMRD thresholds at t0 (r = -0.51) and speech recognition scores at t0 (r = 0.56) predicted speech recognition scores at t6. However, AMRD and speech recognition at t0 were uncorrelated, suggesting that those measures capture partially distinct perceptual abilities. A multiple regression model predicting 6-month speech recognition outcome with deafness duration and speech recognition at t0 improved from adjusted R² = 0.30 to adjusted R² = 0.44 when AMRD threshold was added as a predictor. These findings identify AMRD thresholds as a reliable, nonredundant predictor above and beyond established speech tests for CI outcome. This AMRD test could potentially be developed into a rapid clinical temporal-resolution test to be integrated into the postoperative test battery to improve the reliability of speech outcome prognosis.

  15. Understanding native Russian listeners' errors on an English word recognition test: model-based analysis of phoneme confusion.

    PubMed

    Shi, Lu-Feng; Morozova, Natalia

    2012-08-01

    Word recognition is a basic component in a comprehensive hearing evaluation, but data are lacking for listeners speaking two languages. This study obtained such data for Russian natives in the US and analysed the data using the perceptual assimilation model (PAM) and speech learning model (SLM). Listeners were randomly presented 200 NU-6 words in quiet. Listeners responded verbally and in writing. Performance was scored on words and phonemes (word-initial consonants, vowels, and word-final consonants). Seven normal-hearing, adult monolingual English natives (NM), 16 English-dominant (ED), and 15 Russian-dominant (RD) Russian natives participated. ED and RD listeners differed significantly in their language background. Consistent with the SLM, NM outperformed ED listeners and ED outperformed RD listeners, whether responses were scored on words or phonemes. NM and ED listeners shared similar phoneme error patterns, whereas RD listeners' errors had unique patterns that could be largely understood via the PAM. RD listeners had particular difficulty differentiating vowel contrasts /i-ɪ/, /æ-ɛ/, and /ɑ-ʌ/, word-initial consonant contrasts /p-h/ and /b-f/, and word-final contrasts /f-v/. Both first-language phonology and second-language learning history affect word and phoneme recognition. Current findings may help clinicians differentiate word recognition errors due to language background from hearing pathologies.

  16. Acoustic landmarks contain more information about the phone string than other frames for automatic speech recognition with deep neural network acoustic model

    NASA Astrophysics Data System (ADS)

    He, Di; Lim, Boon Pang; Yang, Xuesong; Hasegawa-Johnson, Mark; Chen, Deming

    2018-06-01

    Most mainstream Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory is based on a contradictory idea, that some frames are more important than others. Acoustic landmark theory exploits quantal non-linearities in the articulatory-acoustic and acoustic-perceptual relations to define landmark times at which the speech spectrum abruptly changes or reaches an extremum; frames overlapping landmarks have been demonstrated to be sufficient for speech perception. In this work, we conduct experiments on the TIMIT corpus, with both GMM and DNN based ASR systems, and find that frames containing landmarks are more informative for ASR than others. We find that altering the level of emphasis on landmarks by re-weighting acoustic likelihood tends to reduce the phone error rate (PER). Furthermore, by leveraging the landmark as a heuristic, one of our hybrid DNN frame dropping strategies maintained a PER within 0.44% of optimal when scoring less than half (45.8% to be precise) of the frames. This hybrid strategy outperforms other non-heuristic-based methods and demonstrates the potential of landmarks for reducing computation.
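
    One simple way to picture re-weighting and frame dropping is a per-frame weight applied to each frame's acoustic log-likelihood. The sketch below is an assumption-laden illustration, not the authors' method: the function name, its parameters, and the log-likelihood values are hypothetical, and a weight of zero is used here to stand for a dropped (unscored) frame.

```python
def reweighted_log_likelihood(frame_log_likelihoods, is_landmark,
                              landmark_weight=1.0, other_weight=1.0):
    """Sum per-frame acoustic log-likelihoods, scaling landmark frames by
    `landmark_weight` and all other frames by `other_weight`."""
    if len(frame_log_likelihoods) != len(is_landmark):
        raise ValueError("one landmark flag is needed per frame")
    return sum(
        (landmark_weight if landmark else other_weight) * log_likelihood
        for log_likelihood, landmark in zip(frame_log_likelihoods, is_landmark)
    )

# Hypothetical per-frame log-likelihoods and landmark flags.
log_likelihoods = [-1.0, -2.0, -3.0]
landmark_flags = [False, True, False]

# Emphasise landmark frames.
print(reweighted_log_likelihood(log_likelihoods, landmark_flags, landmark_weight=2.0))
# Score only landmark frames (all other frames dropped).
print(reweighted_log_likelihood(log_likelihoods, landmark_flags, other_weight=0.0))
```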

  17. Audibility-based predictions of speech recognition for children and adults with normal hearing.

    PubMed

    McCreery, Ryan W; Stelmachowicz, Patricia G

    2011-12-01

    This study investigated the relationship between audibility and predictions of speech recognition for children and adults with normal hearing. The Speech Intelligibility Index (SII) is used to quantify the audibility of speech signals and can be applied to transfer functions to predict speech recognition scores. Although the SII is used clinically with children, relatively few studies have evaluated SII predictions of children's speech recognition directly. Children have required more audibility than adults to reach maximum levels of speech understanding in previous studies. Furthermore, children may require greater bandwidth than adults for optimal speech understanding, which could influence frequency-importance functions used to calculate the SII. Speech recognition was measured for 116 children and 19 adults with normal hearing. Stimulus bandwidth and background noise level were varied systematically in order to evaluate speech recognition as predicted by the SII and derive frequency-importance functions for children and adults. Results suggested that children required greater audibility to reach the same level of speech understanding as adults. However, differences in performance between adults and children did not vary across frequency bands. © 2011 Acoustical Society of America
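
    In its general form, the SII is a sum over frequency bands of each band's importance multiplied by its audibility. The sketch below shows only that weighted sum; the four bands and their weights are invented for illustration and are not the standardized frequency-importance functions or the ones derived in the study.

```python
def speech_intelligibility_index(band_importance, band_audibility):
    """Importance-weighted sum of band audibilities.

    band_importance: weights that sum to 1, one per frequency band.
    band_audibility: proportion of speech audible in each band (0 to 1).
    """
    if len(band_importance) != len(band_audibility):
        raise ValueError("one audibility value is needed per band")
    if abs(sum(band_importance) - 1.0) > 1e-6:
        raise ValueError("band importance weights must sum to 1")
    return sum(
        weight * min(max(audibility, 0.0), 1.0)   # clamp audibility to [0, 1]
        for weight, audibility in zip(band_importance, band_audibility)
    )

# Hypothetical four-band example: the top band is fully inaudible,
# as it might be under a reduced stimulus bandwidth.
importance = [0.2, 0.3, 0.3, 0.2]
audibility = [1.0, 1.0, 0.5, 0.0]

print(round(speech_intelligibility_index(importance, audibility), 2))  # prints 0.65
```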

  18. The Suitability of Cloud-Based Speech Recognition Engines for Language Learning

    ERIC Educational Resources Information Center

    Daniels, Paul; Iwago, Koji

    2017-01-01

    As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with CALL software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines--Apple's…

  19. Individual differences in language and working memory affect children's speech recognition in noise.

    PubMed

    McCreery, Ryan W; Spratford, Meredith; Kirby, Benjamin; Brennan, Marc

    2017-05-01

    We examined how cognitive and linguistic skills affect speech recognition in noise for children with normal hearing. Children with better working memory and language abilities were expected to have better speech recognition in noise than peers with poorer skills in these domains. As part of a prospective, cross-sectional study, children with normal hearing completed speech recognition in noise for three types of stimuli: (1) monosyllabic words, (2) syntactically correct but semantically anomalous sentences and (3) semantically and syntactically anomalous word sequences. Measures of vocabulary, syntax and working memory were used to predict individual differences in speech recognition in noise. Participants were ninety-six children with normal hearing, who were between 5 and 12 years of age. Higher working memory was associated with better speech recognition in noise for all three stimulus types. Higher vocabulary abilities were associated with better recognition in noise for sentences and word sequences, but not for words. Working memory and language both influence children's speech recognition in noise, but the relationships vary across types of stimuli. These findings suggest that clinical assessment of speech recognition is likely to reflect underlying cognitive and linguistic abilities, in addition to a child's auditory skills, consistent with the Ease of Language Understanding model.

  20. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar

    PubMed Central

    Shin, Young Hoon; Seo, Jiwon

    2016-01-01

    People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing. PMID:27801867

  1. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar.

    PubMed

    Shin, Young Hoon; Seo, Jiwon

    2016-10-29

    People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.

  2. Difficulties in Automatic Speech Recognition of Dysarthric Speakers and Implications for Speech-Based Applications Used by the Elderly: A Literature Review

    ERIC Educational Resources Information Center

    Young, Victoria; Mihailidis, Alex

    2010-01-01

    Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…

  3. The Effect of Dynamic Pitch on Speech Recognition in Temporally Modulated Noise.

    PubMed

    Shen, Jing; Souza, Pamela E

    2017-09-18

    This study investigated the effect of dynamic pitch in target speech on older and younger listeners' speech recognition in temporally modulated noise. First, we examined whether the benefit from dynamic-pitch cues depends on the temporal modulation of noise. Second, we tested whether older listeners can benefit from dynamic-pitch cues for speech recognition in noise. Last, we explored the individual factors that predict the amount of dynamic-pitch benefit for speech recognition in noise. Younger listeners with normal hearing and older listeners with varying levels of hearing sensitivity participated in the study, in which speech reception thresholds were measured with sentences in nonspeech noise. The younger listeners benefited more from dynamic pitch for speech recognition in temporally modulated noise than unmodulated noise. Older listeners were able to benefit from the dynamic-pitch cues but received less benefit from noise modulation than the younger listeners. For those older listeners with hearing loss, the amount of hearing loss strongly predicted the dynamic-pitch benefit for speech recognition in noise. Dynamic-pitch cues aid speech recognition in noise, particularly when noise has temporal modulation. Hearing loss negatively affects the dynamic-pitch benefit to older listeners with significant hearing loss.

  4. The Effect of Dynamic Pitch on Speech Recognition in Temporally Modulated Noise

    PubMed Central

    Souza, Pamela E.

    2017-01-01

    Purpose: This study investigated the effect of dynamic pitch in target speech on older and younger listeners' speech recognition in temporally modulated noise. First, we examined whether the benefit from dynamic-pitch cues depends on the temporal modulation of noise. Second, we tested whether older listeners can benefit from dynamic-pitch cues for speech recognition in noise. Last, we explored the individual factors that predict the amount of dynamic-pitch benefit for speech recognition in noise. Method: Younger listeners with normal hearing and older listeners with varying levels of hearing sensitivity participated in the study, in which speech reception thresholds were measured with sentences in nonspeech noise. Results: The younger listeners benefited more from dynamic pitch for speech recognition in temporally modulated noise than unmodulated noise. Older listeners were able to benefit from the dynamic-pitch cues but received less benefit from noise modulation than the younger listeners. For those older listeners with hearing loss, the amount of hearing loss strongly predicted the dynamic-pitch benefit for speech recognition in noise. Conclusions: Dynamic-pitch cues aid speech recognition in noise, particularly when noise has temporal modulation. Hearing loss negatively affects the dynamic-pitch benefit to older listeners with significant hearing loss. PMID:28800370

  5. [Creating a language model of the forensic medicine domain for developing an autopsy recording system by automatic speech recognition].

    PubMed

    Niijima, H; Ito, N; Ogino, S; Takatori, T; Iwase, H; Kobayashi, M

    2000-11-01

    To make speech recognition technology practical for recording forensic autopsies, a language model specialized for forensic autopsy was developed for a speech recording system. The language model was created by applying a 3-gram model, and it was used together with a Hidden Markov Model acoustic model for Japanese speech recognition to customize the speech recognition engine for forensic autopsy. A forensic vocabulary set of over 10,000 words was compiled and some 300,000 sentence patterns were made to create the forensic language model, which was then properly mixed with a general language model to attain high accuracy. When tested by dictating autopsy findings, the system achieved a recognition rate of about 95%, which appears to reach practical usability as speech recognition software, though there remains room for improving its hardware and application-layer software.
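
    A 3-gram model estimates the probability of a word from the two words before it, and a domain model can be mixed with a general one. The sketch below uses unsmoothed maximum-likelihood counts and linear interpolation; the abstract does not say how the two models were mixed, so the interpolation and its weight are assumptions, and the English toy sentences are invented.

```python
from collections import Counter

def trigram_model(sentences):
    """Build an unsmoothed maximum-likelihood 3-gram model.
    Returns prob(w1, w2, w3) = P(w3 | w1, w2)."""
    trigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            context_counts[(tokens[i - 2], tokens[i - 1])] += 1
            trigram_counts[(tokens[i - 2], tokens[i - 1], tokens[i])] += 1

    def prob(w1, w2, w3):
        seen = context_counts[(w1, w2)]
        return trigram_counts[(w1, w2, w3)] / seen if seen else 0.0

    return prob

def interpolate(domain_prob, general_prob, domain_weight):
    """Linear interpolation of two models (an assumed mixing scheme)."""
    def prob(w1, w2, w3):
        return (domain_weight * domain_prob(w1, w2, w3)
                + (1.0 - domain_weight) * general_prob(w1, w2, w3))
    return prob

# Toy corpora standing in for the forensic and general text collections.
forensic = trigram_model(["the liver is enlarged", "the liver is normal"])
general = trigram_model(["the weather is fine"])
mixed = interpolate(forensic, general, domain_weight=0.8)

print(forensic("liver", "is", "enlarged"))
print(mixed("liver", "is", "enlarged"))
```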

  6. Task-dependent modulation of the visual sensory thalamus assists visual-speech recognition.

    PubMed

    Díaz, Begoña; Blank, Helen; von Kriegstein, Katharina

    2018-05-14

    The cerebral cortex modulates early sensory processing via feed-back connections to sensory pathway nuclei. The functions of this top-down modulation for human behavior are poorly understood. Here, we show that top-down modulation of the visual sensory thalamus (the lateral geniculate body, LGN) is involved in visual-speech recognition. In two independent functional magnetic resonance imaging (fMRI) studies, LGN response increased when participants processed fast-varying features of articulatory movements required for visual-speech recognition, as compared to temporally more stable features required for face identification with the same stimulus material. The LGN response during the visual-speech task correlated positively with the visual-speech recognition scores across participants. In addition, the task-dependent modulation was present for speech movements and did not occur for control conditions involving non-speech biological movements. In face-to-face communication, visual speech recognition is used to enhance or even enable understanding what is said. Speech recognition is commonly explained in frameworks focusing on cerebral cortex areas. Our findings suggest that task-dependent modulation at subcortical sensory stages has an important role for communication: Together with similar findings in the auditory modality the findings imply that task-dependent modulation of the sensory thalami is a general mechanism to optimize speech recognition. Copyright © 2018. Published by Elsevier Inc.

  7. Bridging automatic speech recognition and psycholinguistics: Extending Shortlist to an end-to-end model of human speech recognition (L)

    NASA Astrophysics Data System (ADS)

    Scharenborg, Odette; ten Bosch, Louis; Boves, Lou; Norris, Dennis

    2003-12-01

    This letter evaluates potential benefits of combining human speech recognition (HSR) and automatic speech recognition by building a joint model of an automatic phone recognizer (APR) and a computational model of HSR, viz., Shortlist [Norris, Cognition 52, 189-234 (1994)]. Experiments based on "real-life" speech highlight critical limitations posed by some of the simplifying assumptions made in models of human speech recognition. These limitations could be overcome by avoiding hard phone decisions at the output side of the APR, and by using a match between the input and the internal lexicon that flexibly copes with deviations from canonical phonemic representations.

  8. Automated smartphone audiometry: Validation of a word recognition test app.

    PubMed

    Dewyer, Nicholas A; Jiradejvong, Patpong; Henderson Sabes, Jennifer; Limb, Charles J

    2018-03-01

    Develop and validate an automated smartphone word recognition test. Cross-sectional case-control diagnostic test comparison. An automated word recognition test was developed as an app for a smartphone with earphones. English-speaking adults with recent audiograms and various levels of hearing loss were recruited from an audiology clinic and were administered the smartphone word recognition test. Word recognition scores determined by the smartphone app and the gold standard speech audiometry test performed by an audiologist were compared. Test scores for 37 ears were analyzed. Word recognition scores determined by the smartphone app and audiologist testing were in agreement, with 86% of the data points within a clinically acceptable margin of error and a linear correlation value between test scores of 0.89. The WordRec automated smartphone app accurately determines word recognition scores. 3b. Laryngoscope, 128:707-712, 2018. © 2017 The American Laryngological, Rhinological and Otological Society, Inc.

  9. Learning curve of speech recognition.

    PubMed

    Kauppinen, Tomi A; Kaipio, Johanna; Koivikko, Mika P

    2013-12-01

    Speech recognition (SR) speeds patient care processes by reducing report turnaround times. However, concerns have emerged about prolonged training and an added secretarial burden for radiologists. We assessed how much proofing radiologists who have years of experience with SR and radiologists new to SR must perform, and estimated how quickly the new users become as skilled as the experienced users. We studied SR log entries for 0.25 million reports from 154 radiologists and after careful exclusions, defined a group of 11 experienced radiologists and 71 radiologists new to SR (24,833 and 122,093 reports, respectively). Data were analyzed for sound file and report lengths, character-based error rates, and words unknown to the SR's dictionary. Experienced radiologists corrected 6 characters for each report and for new users, 11. Some users presented a very unfavorable learning curve, with error rates not declining as expected. New users' reports were longer, and data for the experienced users indicates that their reports, initially equally lengthy, shortened over a period of several years. For most radiologists, only minor corrections of dictated reports were necessary. While new users adopted SR quickly, with a subset outperforming experienced users from the start, identification of users struggling with SR will help facilitate troubleshooting and support.
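
    The abstract reports character-based error rates but does not define them. A common definition, assumed here, is the Levenshtein (edit) distance between the recognizer's draft and the corrected final report, divided by the length of the final report. The function names below are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions that turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(recognized: str, corrected: str) -> float:
    """Edits needed per character of the corrected (reference) text."""
    if not corrected:
        return 0.0 if not recognized else float("inf")
    return edit_distance(recognized, corrected) / len(corrected)


# One missing character in a 17-character report.
print(character_error_rate("no acute findngs", "no acute findings"))
```

    Under this definition, a figure such as "6 characters corrected per report" corresponds to the edit distance itself, before it is normalized by report length.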

  10. Increase in Speech Recognition due to Linguistic Mismatch Between Target and Masker Speech: Monolingual and Simultaneous Bilingual Performance

    PubMed Central

    Calandruccio, Lauren; Zhou, Haibo

    2014-01-01

    Purpose To examine whether improved speech recognition during linguistically mismatched target–masker experiments is due to linguistic unfamiliarity of the masker speech or linguistic dissimilarity between the target and masker speech. Method Monolingual English speakers (n = 20) and English–Greek simultaneous bilinguals (n = 20) listened to English sentences in the presence of competing English and Greek speech. Data were analyzed using mixed-effects regression models to determine differences in English recognition performance between the 2 groups and 2 masker conditions. Results Results indicated that English sentence recognition for monolinguals and simultaneous English–Greek bilinguals improved when the masker speech changed from competing English to competing Greek speech. Conclusion The improvement in speech recognition that has been observed for linguistically mismatched target–masker experiments cannot be simply explained by the masker language being linguistically unknown or unfamiliar to the listeners. Listeners can improve their speech recognition in linguistically mismatched target–masker experiments even when the listener is able to obtain meaningful linguistic information from the masker speech. PMID:24167230

  11. Estimating psycho-physiological state of a human by speech analysis

    NASA Astrophysics Data System (ADS)

    Ronzhin, A. L.

    2005-05-01

    Adverse effects of intoxication, fatigue and boredom could degrade performance of highly trained operators of complex technical systems with potentially catastrophic consequences. Existing physiological fitness for duty tests are time consuming, costly, invasive, and highly unpopular. Known non-physiological tests constitute a secondary task and interfere with the busy workload of the tested operator. Various attempts to assess the current status of the operator by processing of "normal operational data" often lead to an excessive amount of computation, poorly justified metrics, and ambiguous results. At the same time, speech analysis presents a natural, non-invasive approach based upon well-established, efficient data processing. In addition, it supports both behavioral and physiological biometrics. This paper presents an approach that facilitates a robust speech analysis/understanding process in spite of natural speech variability and background noise. Automatic speech recognition is suggested as a technique for detecting changes in the psycho-physiological state of a human, which typically manifest themselves as changes in the characteristics of the vocal tract and in the semantic-syntactic connectivity of conversation. Preliminary tests have confirmed that a statistically significant correlation exists between the error rate of automatic speech recognition and the extent of alcohol intoxication. In addition, the obtained data allowed exploring some interesting correlations and establishing some quantitative models. It is proposed to utilize this approach as part of a fitness for duty test and to compare its efficiency with analyses of iris, face geometry, thermography and other popular non-invasive biometric techniques.

  12. The Effect of Lexical Content on Dichotic Speech Recognition in Older Adults.

    PubMed

    Findlen, Ursula M; Roup, Christina M

    2016-01-01

    Age-related auditory processing deficits have been shown to negatively affect speech recognition for older adult listeners. In contrast, older adults gain benefit from their ability to make use of semantic and lexical content of the speech signal (i.e., top-down processing), particularly in complex listening situations. Assessment of auditory processing abilities among aging adults should take into consideration semantic and lexical content of the speech signal. The purpose of this study was to examine the effects of lexical and attentional factors on dichotic speech recognition performance characteristics for older adult listeners. A repeated measures design was used to examine differences in dichotic word recognition as a function of lexical and attentional factors. Thirty-five older adults (61-85 yr) with sensorineural hearing loss participated in this study. Dichotic speech recognition was evaluated using consonant-vowel-consonant (CVC) word and nonsense CVC syllable stimuli administered in the free recall, directed recall right, and directed recall left response conditions. Dichotic speech recognition performance for nonsense CVC syllables was significantly poorer than performance for CVC words. Dichotic recognition performance varied across response condition for both stimulus types, which is consistent with previous studies on dichotic speech recognition. Inspection of individual results revealed that five listeners demonstrated an auditory-based left ear deficit for one or both stimulus types. Lexical content of stimulus materials affects performance characteristics for dichotic speech recognition tasks in the older adult population. The use of nonsense CVC syllable material may provide a way to assess dichotic speech recognition performance while potentially lessening the effects of lexical content on performance (i.e., measuring bottom-up auditory function both with and without top-down processing). American Academy of Audiology.

  13. ChoiceKey: a real-time speech recognition program for psychology experiments with a small response set.

    PubMed

    Donkin, Christopher; Brown, Scott D; Heathcote, Andrew

    2009-02-01

    Psychological experiments often collect choice responses using buttonpresses. However, spoken responses are useful in many cases, for example when working with special clinical populations, when a paradigm demands vocalization, or when accurate response time measurements are desired. In these cases, spoken responses are typically collected using a voice key, which usually involves manual coding by experimenters in a tedious and error-prone manner. We describe ChoiceKey, an open-source speech recognition package for MATLAB. It can be optimized by training for small response sets and different speakers. We show ChoiceKey to be reliable with minimal training for most participants in experiments with two different responses. Problems presented by individual differences, and occasional atypical responses, are examined, and extensions to larger response sets are explored. The ChoiceKey source files and instructions may be downloaded as supplemental materials for this article from brm.psychonomic-journals.org/content/supplemental.

  14. Automatic measurement and representation of prosodic features

    NASA Astrophysics Data System (ADS)

    Ying, Goangshiuan Shawn

    Effective measurement and representation of prosodic features of the acoustic signal for use in automatic speech recognition and understanding systems is the goal of this work. Prosodic features (stress, duration, and intonation) are variations of the acoustic signal whose domains are beyond the boundaries of each individual phonetic segment. Listeners perceive prosodic features through a complex combination of acoustic correlates such as intensity, duration, and fundamental frequency (F0). We have developed new tools to measure F0 and intensity features. We apply a probabilistic global error correction routine to an Average Magnitude Difference Function (AMDF) pitch detector. A new short-term frequency-domain Teager energy algorithm is used to measure the energy of a speech signal. We have conducted a series of experiments performing lexical stress detection on words in continuous English speech from two speech corpora. We have experimented with two different approaches to lexical stress detection: a segment-based approach and a rhythm unit-based approach. The first approach uses pattern recognition with energy- and duration-based measurements as features to build Bayesian classifiers to detect the stress level of a vowel segment. In the second approach we define a rhythm unit and use only the F0-based measurement and a scoring system to determine the stressed segment in the rhythm unit. A duration-based segmentation routine was developed to break polysyllabic words into rhythm units. The long-term goal of this work is to develop a system that can effectively detect the stress pattern for each word in continuous speech utterances. Stress information will be integrated as a constraint for pruning the word hypotheses in a word recognition system based on hidden Markov models.
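
    As a rough illustration of the AMDF pitch detector named in the abstract, the sketch below computes the textbook average magnitude difference function and picks the lag with the smallest value. It does not reproduce the thesis's probabilistic global error correction routine. The function names and default search range are assumptions.

```python
import math


def amdf(frame, lag):
    """Average magnitude difference between the frame and itself shifted by `lag`."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return the F0 (Hz) whose period minimizes the AMDF within the search range."""
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag = min(range(lag_min, lag_max + 1), key=lambda lag: amdf(frame, lag))
    return sample_rate / best_lag


# Synthetic 200 Hz tone sampled at 8 kHz (period = 40 samples).
rate = 8000
tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(400)]
print(estimate_f0(tone, rate, f0_min=150.0, f0_max=400.0))
```

    The AMDF is also near zero at integer multiples of the true period, so a wide search range can produce octave errors. The example narrows the range to avoid this.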

  15. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hogden, J.

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  16. An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hogden, J.

    Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.

  17. Is Listening in Noise Worth It? The Neurobiology of Speech Recognition in Challenging Listening Conditions.

    PubMed

    Eckert, Mark A; Teubner-Rhodes, Susan; Vaden, Kenneth I

    2016-01-01

    This review examines findings from functional neuroimaging studies of speech recognition in noise to provide a neural systems level explanation for the effort and fatigue that can be experienced during speech recognition in challenging listening conditions. Neuroimaging studies of speech recognition consistently demonstrate that challenging listening conditions engage neural systems that are used to monitor and optimize performance across a wide range of tasks. These systems appear to improve speech recognition in younger and older adults, but sustained engagement of these systems also appears to produce an experience of effort and fatigue that may affect the value of communication. When considered in the broader context of the neuroimaging and decision making literature, the speech recognition findings from functional imaging studies indicate that the expected value, or expected level of speech recognition given the difficulty of listening conditions, should be considered when measuring effort and fatigue. The authors propose that the behavioral economics or neuroeconomics of listening can provide a conceptual and experimental framework for understanding effort and fatigue that may have clinical significance.

  18. Is Listening in Noise Worth It? The Neurobiology of Speech Recognition in Challenging Listening Conditions

    PubMed Central

    Eckert, Mark A.; Teubner-Rhodes, Susan; Vaden, Kenneth I.

    2016-01-01

    This review examines findings from functional neuroimaging studies of speech recognition in noise to provide a neural systems level explanation for the effort and fatigue that can be experienced during speech recognition in challenging listening conditions. Neuroimaging studies of speech recognition consistently demonstrate that challenging listening conditions engage neural systems that are used to monitor and optimize performance across a wide range of tasks. These systems appear to improve speech recognition in younger and older adults, but sustained engagement of these systems also appears to produce an experience of effort and fatigue that may affect the value of communication. When considered in the broader context of the neuroimaging and decision making literature, the speech recognition findings from functional imaging studies indicate that the expected value, or expected level of speech recognition given the difficulty of listening conditions, should be considered when measuring effort and fatigue. We propose that the behavioral economics and/or neuroeconomics of listening can provide a conceptual and experimental framework for understanding effort and fatigue that may have clinical significance. PMID:27355759

  19. A Cross-Lingual Mobile Medical Communication System Prototype for Foreigners and Subjects with Speech, Hearing, and Mental Disabilities Based on Pictograms

    PubMed Central

    Wołk, Agnieszka; Glinkowski, Wojciech

    2017-01-01

    People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they are dependent on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition are dependent on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer. PMID:29230254

  20. A Cross-Lingual Mobile Medical Communication System Prototype for Foreigners and Subjects with Speech, Hearing, and Mental Disabilities Based on Pictograms.

    PubMed

    Wołk, Krzysztof; Wołk, Agnieszka; Glinkowski, Wojciech

    2017-01-01

    People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they are dependent on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition are dependent on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer.

  1. Speech Recognition and Parent Ratings From Auditory Development Questionnaires in Children Who Are Hard of Hearing.

    PubMed

    McCreery, Ryan W; Walker, Elizabeth A; Spratford, Meredith; Oleson, Jacob; Bentler, Ruth; Holte, Lenore; Roush, Patricia

    2015-01-01

    Progress has been made in recent years in the provision of amplification and early intervention for children who are hard of hearing. However, children who use hearing aids (HAs) may have inconsistent access to their auditory environment due to limitations in speech audibility through their HAs or limited HA use. The effects of variability in children's auditory experience on parent-reported auditory skills questionnaires and on speech recognition in quiet and in noise were examined for a large group of children who were followed as part of the Outcomes of Children with Hearing Loss study. Parent ratings on auditory development questionnaires and children's speech recognition were assessed for 306 children who are hard of hearing. Children ranged in age from 12 months to 9 years. Three questionnaires involving parent ratings of auditory skill development and behavior were used, including the LittlEARS Auditory Questionnaire, Parents Evaluation of Oral/Aural Performance in Children rating scale, and an adaptation of the Speech, Spatial, and Qualities of Hearing scale. Speech recognition in quiet was assessed using the Open- and Closed-Set Test, Early Speech Perception test, Lexical Neighborhood Test, and Phonetically Balanced Kindergarten word lists. Speech recognition in noise was assessed using the Computer-Assisted Speech Perception Assessment. Children who are hard of hearing were compared with peers with normal hearing matched for age, maternal educational level, and nonverbal intelligence. The effects of aided audibility, HA use, and language ability on parent responses to auditory development questionnaires and on children's speech recognition were also examined. Children who are hard of hearing had poorer performance than peers with normal hearing on parent ratings of auditory skills and had poorer speech recognition. Significant individual variability among children who are hard of hearing was observed. 
    Children with greater aided audibility through their HAs, more hours of HA use, and better language abilities generally had higher parent ratings of auditory skills and better speech-recognition abilities in quiet and in noise than peers with less audibility, more limited HA use, or poorer language abilities. In addition to the auditory and language factors that were predictive for speech recognition in quiet, phonological working memory was also a positive predictor for word recognition abilities in noise. Children who are hard of hearing continue to experience delays in auditory skill development and speech-recognition abilities compared with peers with normal hearing. However, significant improvements in these domains have occurred in comparison to similar data reported before the adoption of universal newborn hearing screening and early intervention programs for children who are hard of hearing. Increasing the audibility of speech has a direct positive effect on auditory skill development and speech-recognition abilities and also may enhance these skills by improving language abilities in children who are hard of hearing. Greater number of hours of HA use also had a significant positive impact on parent ratings of auditory skills and children's speech recognition.

  2. Speaker diarization system on the 2007 NIST rich transcription meeting recognition evaluation

    NASA Astrophysics Data System (ADS)

    Sun, Hanwu; Nwe, Tin Lay; Koh, Eugene Chin Wei; Bin, Ma; Li, Haizhou

    2007-09-01

    This paper presents a speaker diarization system developed at the Institute for Infocomm Research (I2R) for the NIST Rich Transcription 2007 (RT-07) evaluation task. We describe in detail our primary approaches to speaker diarization under the Multiple Distant Microphones (MDM) condition in a conference room scenario. Our proposed system consists of six modules: (1) a normalized least-mean-squares (NLMS) adaptive filter for speaker direction estimation via Time Difference of Arrival (TDOA); (2) initial speaker clustering via a two-stage TDOA histogram distribution quantization approach; (3) multiple-microphone speaker data alignment via GCC-PHAT Time Delay Estimate (TDE) among all the distant microphone channel signals; (4) a speaker clustering algorithm based on a GMM modeling approach; (5) non-speech removal via a speech/non-speech verification mechanism; and (6) silence removal via the "Double-Layer Windowing" (DLW) method. We achieve an error rate of 31.02% on the 2006 Spring (RT-06s) MDM evaluation task and a competitive overall error rate of 15.32% on the NIST Rich Transcription 2007 (RT-07) MDM evaluation task.
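
    The GCC-PHAT time delay estimate used for channel alignment can be sketched as follows. This is the generic textbook form, not the I2R implementation. It uses a naive O(N^2) DFT only to stay self-contained, where a real system would use an FFT library. The function names and the max_lag parameter are assumptions.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (forward, or inverse with 1/N scaling)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(sig, ref, max_lag):
    """Return the lag in samples by which `sig` trails `ref` (negative if it leads)."""
    n = len(sig) + len(ref)  # zero-pad so the circular correlation does not wrap
    a = dft(list(sig) + [0.0] * (n - len(sig)))
    b = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [x * y.conjugate() for x, y in zip(a, b)]
    # PHAT weighting: discard magnitude and keep only the phase of the cross-spectrum.
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    corr = [v.real for v in dft(whitened, inverse=True)]
    # Negative lags live at the end of the circular correlation, hence k % n.
    return max(range(-max_lag, max_lag + 1), key=lambda k: corr[k % n])


# Made-up reference channel and a copy delayed by 3 samples.
ref = [((i * 37) % 11) - 5.0 for i in range(16)]
sig = [0.0, 0.0, 0.0] + ref
print(gcc_phat_delay(sig, ref, max_lag=8))
```

    Dividing the returned lag by the sampling rate gives the delay in seconds, which is the quantity a TDOA-based direction estimate would consume.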

  3. Speech Recognition as a Transcription Aid: A Randomized Comparison With Standard Transcription

    PubMed Central

    Mohr, David N.; Turner, David W.; Pond, Gregory R.; Kamath, Joseph S.; De Vos, Cathy B.; Carpenter, Paul C.

    2003-01-01

    Objective. Speech recognition promises to reduce information entry costs for clinical information systems. It is most likely to be accepted across an organization if physicians can dictate without concerning themselves with real-time recognition and editing; assistants can then edit and process the computer-generated document. Our objective was to evaluate the use of speech-recognition technology in a randomized controlled trial using our institutional infrastructure. Design. Clinical note dictations from physicians in two specialty divisions were randomized to either a standard transcription process or a speech-recognition process. Secretaries and transcriptionists also were assigned randomly to each of these processes. Measurements. The duration of each dictation was measured. The amount of time spent processing a dictation to yield a finished document also was measured. Secretarial and transcriptionist productivity, defined as hours of secretary work per minute of dictation processed, was determined for speech recognition and standard transcription. Results. Secretaries in the endocrinology division were 87.3% (confidence interval, 83.3%, 92.3%) as productive with the speech-recognition technology as implemented in this study as they were using standard transcription. Psychiatry transcriptionists and secretaries were similarly less productive. Author, secretary, and type of clinical note were significant (p < 0.05) predictors of productivity. Conclusion. When implemented in an organization with an existing document-processing infrastructure (which included training and interfaces of the speech-recognition editor with the existing document entry application), speech recognition did not improve the productivity of secretaries or transcriptionists. PMID:12509359

  4. Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions

    NASA Astrophysics Data System (ADS)

    Wang, Longbiao; Minami, Kazue; Yamamoto, Kazumasa; Nakagawa, Seiichi

    In this paper, we investigate the effectiveness of phase for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods are based on MFCCs, even in noisy conditions. For MFCCs, which dominantly capture vocal tract information, only the magnitude of the Fourier transform of time-domain speech frames is used and phase information has been ignored. The phase information is expected to complement MFCCs well because it includes rich voice source information. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, a phase information extraction method that normalizes the variation in the phase depending on the clipping position of the input speech was proposed, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with low energy/signal-to-noise (SN) ratio, and noisy speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database, with stationary/non-stationary noise added, were used to evaluate our proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. With clean speech training models, the individual result of the phase information was even better than that of MFCCs in many cases. By deleting unreliable frames (frames having low energy/SN), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error reduction rate was about 30%-60% compared with the standard MFCC-based method.
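
The abstract above mentions two ideas that are easy to illustrate: discarding unreliable (low-energy) frames and reporting a relative error reduction rate. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 30 dB margin, and the "energy relative to the loudest frame" criterion are all hypothetical choices made for illustration.

```python
import math


def frame_energy_db(frame):
    """Mean power of one frame in dB (a small floor avoids log of zero)."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))


def keep_reliable_frames(frames, margin_db=30.0):
    """Keep frames whose energy lies within margin_db of the loudest frame.

    The margin-based rule is an assumption for this sketch; the abstract
    only says that frames with low energy/SN were skipped.
    """
    energies = [frame_energy_db(f) for f in frames]
    floor = max(energies) - margin_db
    return [f for f, e in zip(frames, energies) if e >= floor]


def error_reduction_rate(baseline_error, new_error):
    """Relative error reduction, as a fraction of the baseline error."""
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Made-up frames: two loud ones and one near-silent one.
    frames = [
        [0.5, -0.5, 0.5, -0.5],
        [0.001, -0.001, 0.001, -0.001],
        [0.4, 0.4, -0.4, -0.4],
    ]
    print(len(keep_reliable_frames(frames)))
    # Made-up error rates (percent), baseline vs. combined system.
    print(error_reduction_rate(10.0, 6.0))
```

Under this definition, an error reduction rate of 30%-60% means the combined system makes between 0.7 and 0.4 times as many identification errors as the baseline.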

  5. Speech emotion recognition methods: A literature review

    NASA Astrophysics Data System (ADS)

    Basharirad, Babak; Moradhaseli, Mohammadreza

    2017-10-01

    Recently, research on emotional speech signals in human-machine interfaces has gained attention due to the availability of high computation capability. Many systems have been proposed in the literature to identify the emotional state through speech. Selecting suitable feature sets, designing proper classification methods, and preparing an appropriate dataset are the key issues of speech emotion recognition systems. This paper critically analyzes the currently available approaches to speech emotion recognition based on three evaluation parameters (feature set, classification of features, and accurate usage). In addition, this paper evaluates the performance and limitations of the available methods. Furthermore, it highlights the current promising directions for improvement of speech emotion recognition systems.

  6. Emotionally conditioning the target-speech voice enhances recognition of the target speech under "cocktail-party" listening conditions.

    PubMed

    Lu, Lingxi; Bao, Xiaohan; Chen, Jing; Qu, Tianshu; Wu, Xihong; Li, Liang

    2018-05-01

    Under a noisy "cocktail-party" listening condition with multiple people talking, listeners can use various perceptual/cognitive unmasking cues to improve recognition of the target speech against informational speech-on-speech masking. One potential unmasking cue is the emotion expressed in a speech voice, by means of certain acoustical features. However, it was unclear whether emotionally conditioning a target-speech voice that has none of the typical acoustical features of emotions (i.e., an emotionally neutral voice) can be used by listeners for enhancing target-speech recognition under speech-on-speech masking conditions. In this study we examined the recognition of target speech against a two-talker speech masker both before and after the emotionally neutral target voice was paired with a loud female screaming sound that has a marked negative emotional valence. The results showed that recognition of the target speech (especially the first keyword in a target sentence) was significantly improved by emotionally conditioning the target speaker's voice. Moreover, the emotional unmasking effect was independent of the unmasking effect of the perceived spatial separation between the target speech and the masker. Also, (skin conductance) electrodermal responses became stronger after emotional learning when the target speech and masker were perceptually co-located, suggesting an increase of listening efforts when the target speech was informationally masked. These results indicate that emotionally conditioning the target speaker's voice does not change the acoustical parameters of the target-speech stimuli, but the emotionally conditioned vocal features can be used as cues for unmasking target speech.

  7. Ongoing slow oscillatory phase modulates speech intelligibility in cooperation with motor cortical activity.

    PubMed

    Onojima, Takayuki; Kitajo, Keiichi; Mizuhara, Hiroaki

    2017-01-01

    Neural oscillation is attracting attention as an underlying mechanism for speech recognition. Speech intelligibility is enhanced by the synchronization of speech rhythms and slow neural oscillation, which is typically observed as human scalp electroencephalography (EEG). In addition to the effect of neural oscillation, it has been proposed that speech recognition is enhanced by the identification of a speaker's motor signals, which are used for speech production. To verify the relationship between the effect of neural oscillation and motor cortical activity, we measured scalp EEG, and simultaneous EEG and functional magnetic resonance imaging (fMRI) during a speech recognition task in which participants were required to recognize spoken words embedded in noise sound. We proposed an index to quantitatively evaluate the EEG phase effect on behavioral performance. The results showed that the delta and theta EEG phase before speech inputs modulated the participant's response time when conducting speech recognition tasks. The simultaneous EEG-fMRI experiment showed that slow EEG activity was correlated with motor cortical activity. These results suggested that the effect of the slow oscillatory phase was associated with the activity of the motor cortex during speech recognition.

  8. Address entry while driving: speech recognition versus a touch-screen keyboard.

    PubMed

    Tsimhoni, Omer; Smith, Daniel; Green, Paul

    2004-01-01

    A driving simulator experiment was conducted to determine the effects of entering addresses into a navigation system during driving. Participants drove on roads of varying visual demand while entering addresses. Three address entry methods were explored: word-based speech recognition, character-based speech recognition, and typing on a touch-screen keyboard. For each method, vehicle control and task measures, glance timing, and subjective ratings were examined. During driving, word-based speech recognition yielded the shortest total task time (15.3 s), followed by character-based speech recognition (41.0 s) and touch-screen keyboard (86.0 s). The standard deviation of lateral position when performing keyboard entry (0.21 m) was 60% higher than that for all other address entry methods (0.13 m). Degradation of vehicle control associated with address entry using a touch screen suggests that the use of speech recognition is favorable. Speech recognition systems with visual feedback, however, even with excellent accuracy, are not without performance consequences. Applications of this research include the design of in-vehicle navigation systems as well as other systems requiring significant driver input, such as E-mail, the Internet, and text messaging.

  9. Speech Processing and Recognition (SPaRe)

    DTIC Science & Technology

    2011-01-01

    results in the areas of automatic speech recognition (ASR), speech processing, machine translation (MT), natural language processing (NLP), and...Processing (NLP), Information Retrieval (IR)...Figure 9, the IOC was only expected to provide document submission and search; automatic speech recognition (ASR) for English, Spanish, Arabic, and

  10. Using Automatic Speech Recognition to Dictate Mathematical Expressions: The Development of the "TalkMaths" Application at Kingston University

    ERIC Educational Resources Information Center

    Wigmore, Angela; Hunter, Gordon; Pflugel, Eckhard; Denholm-Price, James; Binelli, Vincent

    2009-01-01

    Speech technology--especially automatic speech recognition--has now advanced to a level where it can be of great benefit both to able-bodied people and those with various disabilities. In this paper we describe an application "TalkMaths" which, using the output from a commonly-used conventional automatic speech recognition system,…

  11. Performing speech recognition research with HyperCard

    NASA Technical Reports Server (NTRS)

    Shepherd, Chip

    1993-01-01

    The purpose of this paper is to describe a HyperCard-based system for performing speech recognition research and to instruct Human Factors professionals on how to use the system to obtain detailed data about the user interface of a prototype speech recognition application.

  12. Speech recognition and parent-ratings from auditory development questionnaires in children who are hard of hearing

    PubMed Central

    McCreery, Ryan W.; Walker, Elizabeth A.; Spratford, Meredith; Oleson, Jacob; Bentler, Ruth; Holte, Lenore; Roush, Patricia

    2015-01-01

    Objectives: Progress has been made in recent years in the provision of amplification and early intervention for children who are hard of hearing. However, children who use hearing aids (HA) may have inconsistent access to their auditory environment due to limitations in speech audibility through their HAs or limited HA use. The effects of variability in children’s auditory experience on parent-report auditory skills questionnaires and on speech recognition in quiet and in noise were examined for a large group of children who were followed as part of the Outcomes of Children with Hearing Loss study. Design: Parent ratings on auditory development questionnaires and children’s speech recognition were assessed for 306 children who are hard of hearing. Children ranged in age from 12 months to 9 years of age. Three questionnaires involving parent ratings of auditory skill development and behavior were used, including the LittlEARS Auditory Questionnaire, Parents Evaluation of Oral/Aural Performance in Children Rating Scale, and an adaptation of the Speech, Spatial and Qualities of Hearing scale. Speech recognition in quiet was assessed using the Open and Closed set task, Early Speech Perception Test, Lexical Neighborhood Test, and Phonetically-balanced Kindergarten word lists. Speech recognition in noise was assessed using the Computer-Assisted Speech Perception Assessment. Children who are hard of hearing were compared to peers with normal hearing matched for age, maternal educational level and nonverbal intelligence. The effects of aided audibility, HA use and language ability on parent responses to auditory development questionnaires and on children’s speech recognition were also examined. Results: Children who are hard of hearing had poorer performance than peers with normal hearing on parent ratings of auditory skills and had poorer speech recognition. Significant individual variability among children who are hard of hearing was observed. Children with greater aided audibility through their HAs, more hours of HA use and better language abilities generally had higher parent ratings of auditory skills and better speech recognition abilities in quiet and in noise than peers with less audibility, more limited HA use or poorer language abilities. In addition to the auditory and language factors that were predictive for speech recognition in quiet, phonological working memory was also a positive predictor for word recognition abilities in noise. Conclusions: Children who are hard of hearing continue to experience delays in auditory skill development and speech recognition abilities compared to peers with normal hearing. However, significant improvements in these domains have occurred in comparison to similar data reported prior to the adoption of universal newborn hearing screening and early intervention programs for children who are hard of hearing. Increasing the audibility of speech has a direct positive effect on auditory skill development and speech recognition abilities, and may also enhance these skills by improving language abilities in children who are hard of hearing. Greater number of hours of HA use also had a significant positive impact on parent ratings of auditory skills and children’s speech recognition. PMID:26731160

  13. Perceptual learning for speech in noise after application of binary time-frequency masks

    PubMed Central

    Ahmadi, Mahnaz; Gross, Vauna L.; Sinex, Donal G.

    2013-01-01

    Ideal time-frequency (TF) masks can reject noise and improve the recognition of speech-noise mixtures. An ideal TF mask is constructed with prior knowledge of the target speech signal. The intelligibility of a processed speech-noise mixture depends upon the threshold criterion used to define the TF mask. The study reported here assessed the effect of training on the recognition of speech in noise after processing by ideal TF masks that did not restore perfect speech intelligibility. Two groups of listeners with normal hearing listened to speech-noise mixtures processed by TF masks calculated with different threshold criteria. For each group, a threshold criterion that initially produced word recognition scores between 0.56 and 0.69 was chosen for training. Listeners practiced with one set of TF-masked sentences until their word recognition performance approached asymptote. Perceptual learning was quantified by comparing word-recognition scores in the first and last training sessions. Word recognition scores improved with practice for all listeners, with the greatest improvement observed for the same materials used in training. PMID:23464038
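
As a rough illustration of how a threshold criterion defines an ideal binary TF mask, the sketch below keeps a time-frequency unit when its local signal-to-noise ratio exceeds the criterion and zeroes it otherwise. It assumes the target and noise energies are already available as small grids of per-unit energies (the "prior knowledge" the abstract refers to); the function names and the toy values are hypothetical and are not taken from the study.

```python
import math


def ideal_binary_mask(target_energy, noise_energy, threshold_db=0.0):
    """Return a 0/1 mask: 1 where the local SNR in dB exceeds threshold_db.

    target_energy and noise_energy are lists of rows (frequency channels),
    each row holding per-frame energies. A small floor avoids division by
    zero and log of zero.
    """
    mask = []
    for t_row, n_row in zip(target_energy, noise_energy):
        row = []
        for t, n in zip(t_row, n_row):
            snr_db = 10.0 * math.log10(max(t, 1e-12) / max(n, 1e-12))
            row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask


def apply_mask(mixture_energy, mask):
    """Zero out the mixture units that the mask rejects."""
    return [
        [value * keep for value, keep in zip(v_row, k_row)]
        for v_row, k_row in zip(mixture_energy, mask)
    ]


if __name__ == "__main__":
    # Toy 2x2 grids of made-up energies.
    target = [[4.0, 0.1], [1.0, 2.0]]
    noise = [[1.0, 1.0], [1.0, 0.5]]
    mixture = [[5.0, 1.1], [2.0, 2.5]]
    mask = ideal_binary_mask(target, noise, threshold_db=0.0)
    print(mask)
    print(apply_mask(mixture, mask))
```

Raising `threshold_db` rejects more units and lowering it retains more, which is one way to see why intelligibility depends on the chosen criterion.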

  14. The effects of reverberant self- and overlap-masking on speech recognition in cochlear implant listeners.

    PubMed

    Desmond, Jill M; Collins, Leslie M; Throckmorton, Chandra S

    2014-06-01

    Many cochlear implant (CI) listeners experience decreased speech recognition in reverberant environments [Kokkinakis et al., J. Acoust. Soc. Am. 129(5), 3221-3232 (2011)], which may be caused by a combination of self- and overlap-masking [Bolt and MacDonald, J. Acoust. Soc. Am. 21(6), 577-580 (1949)]. Determining the extent to which these effects decrease speech recognition for CI listeners may influence reverberation mitigation algorithms. This study compared speech recognition with ideal self-masking mitigation, with ideal overlap-masking mitigation, and with no mitigation. Under these conditions, mitigating either self- or overlap-masking resulted in significant improvements in speech recognition for both normal hearing subjects utilizing an acoustic model and for CI listeners using their own devices.

  15. Children with a cochlear implant: characteristics and determinants of speech recognition, speech-recognition growth rate, and speech production.

    PubMed

    Wie, Ona Bø; Falkenberg, Eva-Signe; Tvete, Ole; Tomblin, Bruce

    2007-05-01

    The objectives of the study were to describe the characteristics of the first 79 prelingually deaf cochlear implant users in Norway and to investigate to what degree the variation in speech recognition, speech-recognition growth rate, and speech production could be explained by the characteristics of the child, the cochlear implant, the family, and the educational setting. Data gathered longitudinally were analysed using descriptive statistics, multiple regression, and growth-curve analysis. The results show that more than 50% of the variation could be explained by these characteristics. Daily user-time, non-verbal intelligence, mode of communication, length of CI experience, and educational placement had the highest effect on the outcome. The results also indicate that children educated in a bilingual approach to education have better speech perception and faster speech perception growth rate with increased focus on spoken language.

  16. How does susceptibility to proactive interference relate to speech recognition in aided and unaided conditions?

    PubMed

    Ellis, Rachel J; Rönnberg, Jerker

    2015-01-01

    Proactive interference (PI) is the capacity to resist interference to the acquisition of new memories from information stored in the long-term memory. Previous research has shown that PI correlates significantly with the speech-in-noise recognition scores of younger adults with normal hearing. In this study, we report the results of an experiment designed to investigate the extent to which tests of visual PI relate to the speech-in-noise recognition scores of older adults with hearing loss, in aided and unaided conditions. The results suggest that measures of PI correlate significantly with speech-in-noise recognition only in the unaided condition. Furthermore, the relation between PI and speech-in-noise recognition differs from that observed in younger listeners without hearing loss. The findings suggest that the relation between PI tests and the speech-in-noise recognition scores of older adults with hearing loss relates to the capability of the test to index cognitive flexibility.

  17. How does susceptibility to proactive interference relate to speech recognition in aided and unaided conditions?

    PubMed Central

    Ellis, Rachel J.; Rönnberg, Jerker

    2015-01-01

    Proactive interference (PI) is the capacity to resist interference to the acquisition of new memories from information stored in the long-term memory. Previous research has shown that PI correlates significantly with the speech-in-noise recognition scores of younger adults with normal hearing. In this study, we report the results of an experiment designed to investigate the extent to which tests of visual PI relate to the speech-in-noise recognition scores of older adults with hearing loss, in aided and unaided conditions. The results suggest that measures of PI correlate significantly with speech-in-noise recognition only in the unaided condition. Furthermore, the relation between PI and speech-in-noise recognition differs from that observed in younger listeners without hearing loss. The findings suggest that the relation between PI tests and the speech-in-noise recognition scores of older adults with hearing loss relates to the capability of the test to index cognitive flexibility. PMID:26283981

  18. Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.

    PubMed

    Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina

    2015-07-01

    It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. 
These results are in line with the 'auditory-visual view' of auditory speech perception, which assumes that auditory speech recognition is optimized by using predictions from previously encoded speaker-specific audio-visual internal models.

  19. Automatic Speech Recognition from Neural Signals: A Focused Review.

    PubMed

    Herff, Christian; Schultz, Tanja

    2016-01-01

    Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, the risk of bothering bystanders, or an inability to produce speech (e.g., in patients suffering from locked-in syndrome). For these reasons it would be highly desirable to not speak but to simply envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition (ASR) technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for ASR from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data with a focus on invasively measured brain activity (electrocorticography). As a first example of ASR techniques applied to neural signals, we discuss the Brain-to-text system.

  20. Distributed Fusion in Sensor Networks with Information Genealogy

    DTIC Science & Technology

    2011-06-28

    image processing [2], acoustic and speech recognition [3], multitarget tracking [4], distributed fusion [5], and Bayesian inference [6-7]. For...Adaptation for Distant-Talking Speech Recognition," in Proc. Acoustics, Speech, and Signal Processing, 2004. [4] Y. Bar-Shalom and T. E. Fortmann...used in speech recognition and other classification applications [8]. But their use in underwater mine classification is limited. In this paper, we

  1. Prediction of consonant recognition in quiet for listeners with normal and impaired hearing using an auditory model.

    PubMed

    Jürgens, Tim; Ewert, Stephan D; Kollmeier, Birger; Brand, Thomas

    2014-03-01

    Consonant recognition was assessed in normal-hearing (NH) and hearing-impaired (HI) listeners in quiet as a function of speech level using a nonsense logatome test. Average recognition scores were analyzed and compared to recognition scores of a speech recognition model. In contrast to commonly used spectral speech recognition models operating on long-term spectra, a "microscopic" model operating in the time domain was used. Variations of the model (accounting for hearing impairment) and different model parameters (reflecting cochlear compression) were tested. Using these model variations this study examined whether speech recognition performance in quiet is affected by changes in cochlear compression, namely, a linearization, which is often observed in HI listeners. Consonant recognition scores for HI listeners were poorer than for NH listeners. The model accurately predicted the speech reception thresholds of the NH and most HI listeners. A partial linearization of the cochlear compression in the auditory model, while keeping audibility constant, produced higher recognition scores and improved the prediction accuracy. However, including listener-specific information about the exact form of the cochlear compression did not improve the prediction further.

  2. Mispronunciation Detection for Language Learning and Speech Recognition Adaptation

    ERIC Educational Resources Information Center

    Ge, Zhenhao

    2013-01-01

    The areas of "mispronunciation detection" (or "accent detection" more specifically) within the speech recognition community are receiving increased attention now. Two application areas, namely language learning and speech recognition adaptation, are largely driving this research interest and are the focal points of this work.…

  3. Longitudinal changes in speech recognition in older persons.

    PubMed

    Dubno, Judy R; Lee, Fu-Shing; Matthews, Lois J; Ahlstrom, Jayne B; Horwitz, Amy R; Mills, John H

    2008-01-01

    Recognition of isolated monosyllabic words in quiet and recognition of key words in low- and high-context sentences in babble were measured in a large sample of older persons enrolled in a longitudinal study of age-related hearing loss. Repeated measures were obtained yearly or every 2 to 3 years. To control for concurrent changes in pure-tone thresholds and speech levels, speech-recognition scores were adjusted using an importance-weighted speech-audibility metric (AI). Linear-regression slope estimated the rate of change in adjusted speech-recognition scores. Recognition of words in quiet declined significantly faster with age than predicted by declines in speech audibility. As subjects aged, observed scores deviated increasingly from AI-predicted scores, but this effect did not accelerate with age. Rate of decline in word recognition was significantly faster for females than males and for females with high serum progesterone levels, whereas noise history had no effect. Rate of decline did not accelerate with age but increased with degree of hearing loss, suggesting that with more severe injury to the auditory system, impairments to auditory function other than reduced audibility resulted in faster declines in word recognition as subjects aged. Recognition of key words in low- and high-context sentences in babble did not decline significantly with age.
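
The "linear-regression slope" used above as a rate-of-change estimate can be sketched with an ordinary least-squares fit of score against age. This is a minimal illustration only: the data are made up, the function name is hypothetical, and the scores are assumed to have already been adjusted for audibility (the AI-based adjustment itself is not reproduced here).

```python
def regression_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # Made-up repeated measures for one hypothetical subject:
    # age in years and adjusted word-recognition score in percent.
    ages = [60, 62, 65, 67]
    scores = [90, 88, 85, 83]
    # Slope is in percentage points per year; negative means decline.
    print(regression_slope(ages, scores))
```

A per-subject slope computed this way could then be compared across groups, for example by sex or degree of hearing loss, which is the kind of comparison the abstract reports.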

  4. From Birdsong to Human Speech Recognition: Bayesian Inference on a Hierarchy of Nonlinear Dynamical Systems

    PubMed Central

    Yildiz, Izzet B.; von Kriegstein, Katharina; Kiebel, Stefan J.

    2013-01-01

    Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents—an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments. PMID:24068902

  5. From birdsong to human speech recognition: Bayesian inference on a hierarchy of nonlinear dynamical systems.

    PubMed

    Yildiz, Izzet B; von Kriegstein, Katharina; Kiebel, Stefan J

    2013-01-01

    Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents, an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments.

  6. Statistical assessment of speech system performance

    NASA Technical Reports Server (NTRS)

    Moshier, Stephen L.

    1977-01-01

    Methods for the normalization of performance test results of speech recognition systems are presented. Technological accomplishments in speech recognition systems, as well as planned research activities, are described.

  7. Building Searchable Collections of Enterprise Speech Data.

    ERIC Educational Resources Information Center

    Cooper, James W.; Viswanathan, Mahesh; Byron, Donna; Chan, Margaret

    The study has applied speech recognition and text-mining technologies to a set of recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, a number of post-processing…

  8. Masked Speech Recognition and Reading Ability in School-Age Children: Is There a Relationship?

    ERIC Educational Resources Information Center

    Miller, Gabrielle; Lewis, Barbara; Benchek, Penelope; Buss, Emily; Calandruccio, Lauren

    2018-01-01

    Purpose: The relationship between reading (decoding) skills, phonological processing abilities, and masked speech recognition in typically developing children was explored. This experiment was designed to evaluate the relationship between phonological processing and decoding abilities and 2 aspects of masked speech recognition in typically…

  9. Six characteristics of effective structured reporting and the inevitable integration with speech recognition.

    PubMed

    Liu, David; Zucherman, Mark; Tulloss, William B

    2006-03-01

    The reporting of radiological images is undergoing dramatic changes due to the introduction of two new technologies: structured reporting and speech recognition. Each technology has its own unique advantages. The highly organized content of structured reporting facilitates data mining and billing, whereas speech recognition offers a natural succession from the traditional dictation-transcription process. This article clarifies the distinction between the process and outcome of structured reporting, describes fundamental requirements for any effective structured reporting system, and describes the potential development of a novel, easy-to-use, customizable structured reporting system that incorporates speech recognition. This system should have all the advantages derived from structured reporting, accommodate a wide variety of user needs, and incorporate speech recognition as a natural component and extension of the overall reporting process.

  10. Military applications of automatic speech recognition and future requirements

    NASA Technical Reports Server (NTRS)

    Beek, Bruno; Cupples, Edward J.

    1977-01-01

    An updated summary of the state-of-the-art of automatic speech recognition and its relevance to military applications is provided. A number of potential systems for military applications are under development. These include: (1) digital narrowband communication systems; (2) automatic speech verification; (3) on-line cartographic processing unit; (4) word recognition for militarized tactical data system; and (5) voice recognition and synthesis for aircraft cockpit.

  11. Speech Recognition as a Support Service for Deaf and Hard of Hearing Students: Adaptation and Evaluation. Final Report to Spencer Foundation.

    ERIC Educational Resources Information Center

    Stinson, Michael; Elliot, Lisa; McKee, Barbara; Coyne, Gina

    This report discusses a project that adapted new automatic speech recognition (ASR) technology to provide real-time speech-to-text transcription as a support service for students who are deaf and hard of hearing (D/HH). In this system, as the teacher speaks, a hearing intermediary, or captionist, dictates into the speech recognition system in a…

  12. Comparing auditory filter bandwidths, spectral ripple modulation detection, spectral ripple discrimination, and speech recognition: Normal and impaired hearing

    PubMed Central

    Davies-Venn, Evelyn; Nelson, Peggy; Souza, Pamela

    2015-01-01

    Some listeners with hearing loss show poor speech recognition scores in spite of using amplification that optimizes audibility. Beyond audibility, studies have suggested that suprathreshold abilities such as spectral and temporal processing may explain differences in amplified speech recognition scores. A variety of different methods has been used to measure spectral processing. However, the relationship between spectral processing and speech recognition is still inconclusive. This study evaluated the relationship between spectral processing and speech recognition in listeners with normal hearing and with hearing loss. Narrowband spectral resolution was assessed using auditory filter bandwidths estimated from simultaneous notched-noise masking. Broadband spectral processing was measured using the spectral ripple discrimination (SRD) task and the spectral ripple depth detection (SMD) task. Three different measures were used to assess unamplified and amplified speech recognition in quiet and noise. Stepwise multiple linear regression revealed that SMD at 2.0 cycles per octave (cpo) significantly predicted speech scores for amplified and unamplified speech in quiet and noise. Commonality analyses revealed that SMD at 2.0 cpo combined with SRD and equivalent rectangular bandwidth measures to explain most of the variance captured by the regression model. Results suggest that SMD and SRD may be promising clinical tools for diagnostic evaluation and predicting amplification outcomes. PMID:26233047

  13. Comparing auditory filter bandwidths, spectral ripple modulation detection, spectral ripple discrimination, and speech recognition: Normal and impaired hearing.

    PubMed

    Davies-Venn, Evelyn; Nelson, Peggy; Souza, Pamela

    2015-07-01

    Some listeners with hearing loss show poor speech recognition scores in spite of using amplification that optimizes audibility. Beyond audibility, studies have suggested that suprathreshold abilities such as spectral and temporal processing may explain differences in amplified speech recognition scores. A variety of different methods has been used to measure spectral processing. However, the relationship between spectral processing and speech recognition is still inconclusive. This study evaluated the relationship between spectral processing and speech recognition in listeners with normal hearing and with hearing loss. Narrowband spectral resolution was assessed using auditory filter bandwidths estimated from simultaneous notched-noise masking. Broadband spectral processing was measured using the spectral ripple discrimination (SRD) task and the spectral ripple depth detection (SMD) task. Three different measures were used to assess unamplified and amplified speech recognition in quiet and noise. Stepwise multiple linear regression revealed that SMD at 2.0 cycles per octave (cpo) significantly predicted speech scores for amplified and unamplified speech in quiet and noise. Commonality analyses revealed that SMD at 2.0 cpo combined with SRD and equivalent rectangular bandwidth measures to explain most of the variance captured by the regression model. Results suggest that SMD and SRD may be promising clinical tools for diagnostic evaluation and predicting amplification outcomes.

  14. Microscopic prediction of speech recognition for listeners with normal hearing in noise using an auditory model.

    PubMed

    Jürgens, Tim; Brand, Thomas

    2009-11-01

    This study compares the phoneme recognition performance in speech-shaped noise of a microscopic model for speech recognition with the performance of normal-hearing listeners. "Microscopic" has a twofold meaning for this model. First, the speech recognition rate is predicted on a phoneme-by-phoneme basis. Second, microscopic modeling means that the signal waveforms to be recognized are processed by mimicking elementary parts of human auditory processing. The model is based on an approach by Holube and Kollmeier [J. Acoust. Soc. Am. 100, 1703-1716 (1996)] and consists of a psychoacoustically and physiologically motivated preprocessing and a simple dynamic-time-warp speech recognizer. The model is evaluated while presenting nonsense speech in a closed-set paradigm. Averaged phoneme recognition rates, specific phoneme recognition rates, and phoneme confusions are analyzed. The influence of different perceptual distance measures and of the model's a-priori knowledge is investigated. The results show that human performance can be predicted by this model using an optimal detector, i.e., identical speech waveforms for both training of the recognizer and testing. The best model performance is yielded by distance measures which focus mainly on small perceptual distances and neglect outliers.
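
    The abstract above names a simple dynamic-time-warp recognizer. As a rough illustration only (not the authors' model, which operates on auditory-model representations with perceptual distance measures), a minimal Python sketch of template matching by dynamic time warping might look like the following. The function names, the one-dimensional features, and the absolute-difference local cost are assumptions made for this example.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D feature sequences.

    Assumption: the local cost is the absolute difference between samples.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b element is repeated
                cost[i][j - 1],      # b advances, a element is repeated
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def recognize(test, templates):
    """Return the label of the stored template closest to `test` under DTW."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))
```

    With hypothetical templates such as {"a": [1, 2, 3], "b": [3, 2, 1]}, a time-stretched input like [1, 2, 3, 3] is assigned to "a" because the warping absorbs the repeated final sample.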

  15. Preschool speech error patterns predict articulation and phonological awareness outcomes in children with histories of speech sound disorders.

    PubMed

    Preston, Jonathan L; Hull, Margaret; Edwards, Mary Louise

    2013-05-01

    To determine if speech error patterns in preschoolers with speech sound disorders (SSDs) predict articulation and phonological awareness (PA) outcomes almost 4 years later. Twenty-five children with histories of preschool SSDs (and normal receptive language) were tested at an average age of 4;6 (years;months) and were followed up at age 8;3. The frequency of occurrence of preschool distortion errors, typical substitution and syllable structure errors, and atypical substitution and syllable structure errors was used to predict later speech sound production, PA, and literacy outcomes. Group averages revealed below-average school-age articulation scores and low-average PA but age-appropriate reading and spelling. Preschool speech error patterns were related to school-age outcomes. Children for whom >10% of their speech sound errors were atypical had lower PA and literacy scores at school age than children who produced <10% atypical errors. Preschoolers who produced more distortion errors were likely to have lower school-age articulation scores than preschoolers who produced fewer distortion errors. Different preschool speech error patterns predict different school-age clinical outcomes. Many atypical speech sound errors in preschoolers may be indicative of weak phonological representations, leading to long-term PA weaknesses. Preschoolers' distortions may be resistant to change over time, leading to persisting speech sound production problems.

  16. The influence of speech rate and accent on access and use of semantic information.

    PubMed

    Sajin, Stanislav M; Connine, Cynthia M

    2017-04-01

    Circumstances in which the speech input is presented in sub-optimal conditions generally lead to processing costs affecting spoken word recognition. The current study indicates that some processing demands imposed by listening to difficult speech can be mitigated by feedback from semantic knowledge. A set of lexical decision experiments examined how foreign accented speech and word duration impact access to semantic knowledge in spoken word recognition. Results indicate that when listeners process accented speech, the reliance on semantic information increases. Speech rate was not observed to influence semantic access, except in the setting in which unusually slow accented speech was presented. These findings support interactive activation models of spoken word recognition in which attention is modulated based on speech demands.

  17. Use of intonation contours for speech recognition in noise by cochlear implant recipients.

    PubMed

    Meister, Hartmut; Landwehr, Markus; Pyschny, Verena; Grugel, Linda; Walger, Martin

    2011-05-01

    The corruption of intonation contours has detrimental effects on sentence-based speech recognition in normal-hearing listeners [Binns and Culling (2007). J. Acoust. Soc. Am. 122, 1765-1776]. This paper examines whether this finding also applies to cochlear implant (CI) recipients. The subjects' F0-discrimination and speech perception in the presence of noise were measured, using sentences with regular and inverted F0-contours. The results revealed that speech recognition for regular contours was significantly better than for inverted contours. This difference was related to the subjects' F0-discrimination, providing further evidence that the perception of intonation patterns is important for CI-mediated speech recognition in noise.

  18. Does quality of life depend on speech recognition performance for adult cochlear implant users?

    PubMed

    Capretta, Natalie R; Moberly, Aaron C

    2016-03-01

    Current postoperative clinical outcome measures for adults receiving cochlear implants (CIs) consist of testing speech recognition, primarily under quiet conditions. However, it is strongly suspected that results on these measures may not adequately reflect patients' quality of life (QOL) using their implants. This study aimed to evaluate whether QOL for CI users depends on speech recognition performance. Twenty-three postlingually deafened adults with CIs were assessed. Participants were tested for speech recognition (Central Institute for the Deaf word and AzBio sentence recognition in quiet) and completed three QOL measures (the Nijmegen Cochlear Implant Questionnaire; either the Hearing Handicap Inventory for Adults or the Hearing Handicap Inventory for the Elderly; and the Speech, Spatial and Qualities of Hearing Scale questionnaires) to assess a variety of QOL factors. Correlations were sought between speech recognition and QOL scores. Demographics, audiologic history, language, and cognitive skills were also examined as potential predictors of QOL. Only a few QOL scores significantly correlated with postoperative sentence or word recognition in quiet, and correlations were primarily isolated to speech-related subscales on QOL measures. Poorer pre- and postoperative unaided hearing predicted better QOL. Socioeconomic status, duration of deafness, age at implantation, duration of CI use, reading ability, vocabulary size, and cognitive status did not consistently predict QOL scores. For adult, postlingually deafened CI users, clinical speech recognition measures in quiet do not correlate broadly with QOL. Results suggest the need for additional outcome measures of the benefits and limitations of cochlear implantation. 4. Laryngoscope, 126:699-706, 2016.

  19. Visual abilities are important for auditory-only speech recognition: evidence from autism spectrum disorder.

    PubMed

    Schelinski, Stefanie; Riedel, Philipp; von Kriegstein, Katharina

    2014-12-01

    In auditory-only conditions, for example when we listen to someone on the phone, it is essential to recognize quickly and accurately what is said (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned by a video showing their face and three others were learned in a matched control condition without face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, performance in speech recognition was improved for speakers known by face in comparison to speakers learned in the matched control condition without face. The ASD group lacked such a performance benefit. For the ASD group auditory-only speech recognition was even worse for speakers known by face compared to speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group independent of whether the speakers were learned with or without face. Two additional visual experiments showed that the ASD group performed worse in lip-reading whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms. Further, they indicate that in ASD, speaker-specific dynamic visual information is not available to optimize auditory-only speech recognition.

  20. Speech processing using maximum likelihood continuity mapping

    DOEpatents

    Hogden, John E.

    2000-01-01

    Speech processing is obtained that, given a probabilistic mapping between static speech sounds and pseudo-articulator positions, allows sequences of speech sounds to be mapped to smooth sequences of pseudo-articulator positions. In addition, a method for learning a probabilistic mapping between static speech sounds and pseudo-articulator position is described. The method for learning the mapping between static speech sounds and pseudo-articulator position uses a set of training data composed only of speech sounds. The said speech processing can be applied to various speech analysis tasks, including speech recognition, speaker recognition, speech coding, speech synthesis, and voice mimicry.

  1. Speech processing using maximum likelihood continuity mapping

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hogden, J.E.

    Speech processing is obtained that, given a probabilistic mapping between static speech sounds and pseudo-articulator positions, allows sequences of speech sounds to be mapped to smooth sequences of pseudo-articulator positions. In addition, a method for learning a probabilistic mapping between static speech sounds and pseudo-articulator position is described. The method for learning the mapping between static speech sounds and pseudo-articulator position uses a set of training data composed only of speech sounds. The said speech processing can be applied to various speech analysis tasks, including speech recognition, speaker recognition, speech coding, speech synthesis, and voice mimicry.

  2. Application of an auditory model to speech recognition.

    PubMed

    Cohen, J R

    1989-06-01

    Some aspects of auditory processing are incorporated in a front end for the IBM speech-recognition system [F. Jelinek, "Continuous speech recognition by statistical methods," Proc. IEEE 64 (4), 532-556 (1976)]. This new process includes adaptation, loudness scaling, and mel warping. Tests show that the design is an improvement over previous algorithms.
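
    The abstract above lists mel warping as one stage of the front end but does not give the warping function. As a hedged illustration, the sketch below uses one widely used closed-form approximation of the mel scale; that this particular formula matches the cited system is an assumption, and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels (common approximation, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel: map mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(f_low_hz, f_high_hz, count):
    """Return `count` frequencies in Hz that are equally spaced on the mel axis."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (count - 1)
    return [mel_to_hz(lo + k * step) for k in range(count)]
```

    Frequencies that are equally spaced in mels bunch together at low frequencies and spread out at high frequencies, which is the sense in which the frequency axis is warped.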

  3. Automatically Detecting Likely Edits in Clinical Notes Created Using Automatic Speech Recognition

    PubMed Central

    Lybarger, Kevin; Ostendorf, Mari; Yetisgen, Meliha

    2017-01-01

    The use of automatic speech recognition (ASR) to create clinical notes has the potential to reduce costs associated with note creation for electronic medical records, but at current system accuracy levels, post-editing by practitioners is needed to ensure note quality. Aiming to reduce the time required to edit ASR transcripts, this paper investigates novel methods for automatic detection of edit regions within the transcripts, including both putative ASR errors and regions that are targets for cleanup or rephrasing. We create detection models using logistic regression and conditional random field models, exploring a variety of text-based features that consider the structure of clinical notes and exploit the medical context. Different medical text resources are used to improve feature extraction. Experimental results on a large corpus of practitioner-edited clinical notes show that 67% of sentence-level edits and 45% of word-level edits can be detected with a false detection rate of 15%. PMID:29854187

  4. Presentation video retrieval using automatically recovered slide and spoken text

    NASA Astrophysics Data System (ADS)

    Cooper, Matthew

    2013-03-01

    Video is becoming a prevalent medium for e-learning. Lecture videos contain text information in both the presentation slides and lecturer's speech. This paper examines the relative utility of automatically recovered text from these sources for lecture video retrieval. To extract the visual information, we automatically detect slides within the videos and apply optical character recognition to obtain their text. Automatic speech recognition is used similarly to extract spoken text from the recorded audio. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Results reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Experiments demonstrate that automatically extracted slide text enables higher precision video retrieval than automatically recovered spoken text.

  5. Generalized auditory agnosia with spared music recognition in a left-hander. Analysis of a case with a right temporal stroke.

    PubMed

    Mendez, M F

    2001-02-01

    After a right temporoparietal stroke, a left-handed man lost the ability to understand speech and environmental sounds but developed greater appreciation for music. The patient had preserved reading and writing but poor verbal comprehension. Slower speech, single syllable words, and minimal written cues greatly facilitated his verbal comprehension. On identifying environmental sounds, he made predominant acoustic errors. Although he failed to name melodies, he could match, describe, and sing them. The patient had normal hearing except for presbyacusis, right-ear dominance for phonemes, and normal discrimination of basic psychoacoustic features and rhythm. Further testing disclosed difficulty distinguishing tone sequences and discriminating two clicks and short-versus-long tones, particularly in the left ear. Together, these findings suggest impairment in a direct route for temporal analysis and auditory word forms in his right hemisphere to Wernicke's area in his left hemisphere. The findings further suggest a separate and possibly rhythm-based mechanism for music recognition.

  6. Specific acoustic models for spontaneous and dictated style in Indonesian speech recognition

    NASA Astrophysics Data System (ADS)

    Vista, C. B.; Satriawan, C. H.; Lestari, D. P.; Widyantoro, D. H.

    2018-03-01

    The performance of an automatic speech recognition system is affected by differences in speech style between the data the model is originally trained upon and incoming speech to be recognized. In this paper, the usage of GMM-HMM acoustic models for specific speech styles is investigated. We develop two systems for the experiments; the first employs a speech style classifier to predict the speech style of incoming speech, either spontaneous or dictated, then decodes this speech using an acoustic model specifically trained for that speech style. The second system uses both acoustic models to recognise incoming speech and decides upon a final result by calculating a confidence score of decoding. Results show that training specific acoustic models for spontaneous and dictated speech styles confers a slight recognition advantage as compared to a baseline model trained on a mixture of spontaneous and dictated training data. In addition, the speech style classifier approach of the first system produced slightly more accurate results than the confidence scoring employed in the second system.
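
    The two system designs described above can be contrasted in a short sketch. This is an illustration under stated assumptions, not the authors' implementation: each decoder is modeled as a hypothetical callable that returns a (text, confidence) pair, and the style classifier is a hypothetical callable that returns a style label.

```python
def decode_with_classifier(audio, classify_style, decoders):
    """System 1: predict the speech style first, then decode with that style's model."""
    style = classify_style(audio)          # e.g. "spontaneous" or "dictated"
    text, _confidence = decoders[style](audio)
    return text


def decode_with_confidence(audio, decoders):
    """System 2: decode with every style-specific model and keep the most confident result."""
    results = [decode(audio) for decode in decoders.values()]
    text, _confidence = max(results, key=lambda result: result[1])
    return text


# Stub decoders standing in for style-specific acoustic models (made-up outputs).
stub_decoders = {
    "spontaneous": lambda audio: ("uh hello", 0.4),
    "dictated": lambda audio: ("hello", 0.9),
}
```

    The first design runs a single decoder per utterance but depends on the classifier being right; the second runs both decoders and depends on the confidence scores being comparable across models.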

  7. Preschool speech error patterns predict articulation and phonological awareness outcomes in children with histories of speech sound disorders

    PubMed Central

    Preston, Jonathan L.; Hull, Margaret; Edwards, Mary Louise

    2012-01-01

    Purpose: To determine if speech error patterns in preschoolers with speech sound disorders (SSDs) predict articulation and phonological awareness (PA) outcomes almost four years later. Method: Twenty-five children with histories of preschool SSDs (and normal receptive language) were tested at an average age of 4;6 and followed up at 8;3. The frequency of occurrence of preschool distortion errors, typical substitution and syllable structure errors, and atypical substitution and syllable structure errors were used to predict later speech sound production, PA, and literacy outcomes. Results: Group averages revealed below-average school-age articulation scores and low-average PA, but age-appropriate reading and spelling. Preschool speech error patterns were related to school-age outcomes. Children for whom more than 10% of their speech sound errors were atypical had lower PA and literacy scores at school age than children who produced fewer than 10% atypical errors. Preschoolers who produced more distortion errors were likely to have lower school-age articulation scores. Conclusions: Different preschool speech error patterns predict different school-age clinical outcomes. Many atypical speech sound errors in preschool may be indicative of weak phonological representations, leading to long-term PA weaknesses. Preschool distortions may be resistant to change over time, leading to persisting speech sound production problems. PMID:23184137

  8. Increase in Speech Recognition Due to Linguistic Mismatch between Target and Masker Speech: Monolingual and Simultaneous Bilingual Performance

    ERIC Educational Resources Information Center

    Calandruccio, Lauren; Zhou, Haibo

    2014-01-01

    Purpose: To examine whether improved speech recognition during linguistically mismatched target-masker experiments is due to linguistic unfamiliarity of the masker speech or linguistic dissimilarity between the target and masker speech. Method: Monolingual English speakers (n = 20) and English-Greek simultaneous bilinguals (n = 20) listened to…

  9. Microscopic prediction of speech intelligibility in spatially distributed speech-shaped noise for normal-hearing listeners.

    PubMed

    Geravanchizadeh, Masoud; Fallah, Ali

    2015-12-01

    A binaural and psychoacoustically motivated intelligibility model, based on a well-known monaural microscopic model, is proposed. This model simulates a phoneme recognition task in the presence of spatially distributed speech-shaped noise in anechoic scenarios. In the proposed model, binaural advantage effects are considered by generating a feature vector for a dynamic-time-warping speech recognizer. This vector consists of three subvectors incorporating two monaural subvectors to model the better-ear hearing, and a binaural subvector to simulate the binaural unmasking effect. The binaural unit of the model is based on equalization-cancellation theory. This model operates blindly, which means separate recordings of speech and noise are not required for the predictions. Speech intelligibility tests were conducted with 12 normal hearing listeners by collecting speech reception thresholds (SRTs) in the presence of single and multiple sources of speech-shaped noise. The comparison of the model predictions with the measured binaural SRTs, and with the predictions of a macroscopic binaural model called extended equalization-cancellation, shows that this approach predicts the intelligibility in anechoic scenarios with good precision. The square of the correlation coefficient (r²) and the mean-absolute error between the model predictions and the measurements are 0.98 and 0.62 dB, respectively.
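
    The two agreement figures reported above, the squared correlation coefficient and the mean absolute error between predicted and measured SRTs, can be computed as in the sketch below. The function names are hypothetical and the SRT values in the usage note are made up for illustration; they are not data from the study.

```python
def squared_correlation(predicted, measured):
    """Square of the Pearson correlation coefficient between two equal-length lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov * cov / (var_p * var_m)


def mean_absolute_error(predicted, measured):
    """Average absolute difference, in the same unit as the inputs (here dB)."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)
```

    For made-up predictions [-7.0, -9.5, -12.0] dB and measurements [-6.5, -9.0, -11.5] dB, the squared correlation is 1 (up to rounding) while the mean absolute error is 0.5 dB, which shows why both figures are reported: a constant offset leaves the correlation untouched but still counts as prediction error.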

  10. The influence of audibility on speech recognition with nonlinear frequency compression for children and adults with hearing loss

    PubMed Central

    McCreery, Ryan W.; Alexander, Joshua; Brennan, Marc A.; Hoover, Brenda; Kopun, Judy; Stelmachowicz, Patricia G.

    2014-01-01

    Objective: The primary goal of nonlinear frequency compression (NFC) and other frequency lowering strategies is to increase the audibility of high-frequency sounds that are not otherwise audible with conventional hearing-aid processing due to the degree of hearing loss, limited hearing aid bandwidth or a combination of both factors. The aim of the current study was to compare estimates of speech audibility processed by NFC to improvements in speech recognition for a group of children and adults with high-frequency hearing loss. Design: Monosyllabic word recognition was measured in noise for twenty-four adults and twelve children with mild to severe sensorineural hearing loss. Stimuli were amplified based on each listener’s audiogram with conventional processing (CP) with amplitude compression or with NFC and presented under headphones using a software-based hearing aid simulator. A modification of the speech intelligibility index (SII) was used to estimate audibility of information in frequency-lowered bands. The mean improvement in SII was compared to the mean improvement in speech recognition. Results: All but two listeners experienced improvements in speech recognition with NFC compared to CP, consistent with the small increase in audibility that was estimated using the modification of the SII. Children and adults had similar improvements in speech recognition with NFC. Conclusion: Word recognition with NFC was higher than CP for children and adults with mild to severe hearing loss. The average improvement in speech recognition with NFC (7%) was consistent with the modified SII, which indicated that listeners experienced an increase in audibility with NFC compared to CP. Further studies are necessary to determine if changes in audibility with NFC are related to speech recognition with NFC for listeners with greater degrees of hearing loss, with a greater variety of compression settings, and using auditory training. PMID:24535558

  11. Speech Enhancement, Gain, and Noise Spectrum Adaptation Using Approximate Bayesian Estimation

    PubMed Central

    Hao, Jiucang; Attias, Hagai; Nagarajan, Srikantan; Lee, Te-Won; Sejnowski, Terrence J.

    2010-01-01

    This paper presents a new approximate Bayesian estimator for enhancing a noisy speech signal. The speech model is assumed to be a Gaussian mixture model (GMM) in the log-spectral domain. This is in contrast to most current models in the frequency domain. Exact signal estimation is a computationally intractable problem. We derive three approximations to enhance the efficiency of signal estimation. The Gaussian approximation transforms the log-spectral domain GMM into the frequency domain using a minimal Kullback–Leibler (KL) divergence criterion. The frequency domain Laplace method computes the maximum a posteriori (MAP) estimator for the spectral amplitude. Correspondingly, the log-spectral domain Laplace method computes the MAP estimator for the log-spectral amplitude. Further, the gain and noise spectrum adaptation are implemented using the expectation–maximization (EM) algorithm within the GMM under Gaussian approximation. The proposed algorithms are evaluated by applying them to enhance speech corrupted by speech-shaped noise (SSN). The experimental results demonstrate that the proposed algorithms offer improved signal-to-noise ratio, lower word recognition error rate, and less spectral distortion. PMID:20428253

  12. Speech endpoint detection with non-language speech sounds for generic speech processing applications

    NASA Astrophysics Data System (ADS)

    McClain, Matthew; Romanowski, Brian

    2009-05-01

    Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known a priori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden-Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detecting certain types of NLSS, such as breaths and clicks, while detection of other types of NLSS such as filled pauses will require future research.

  13. How much does language proficiency by non-native listeners influence speech audiometric tests in noise?

    PubMed

    Warzybok, Anna; Brand, Thomas; Wagener, Kirsten C; Kollmeier, Birger

    2015-01-01

    The current study investigates the extent to which the linguistic complexity of three commonly employed speech recognition tests and second language proficiency influence speech recognition thresholds (SRTs) in noise in non-native listeners. SRTs were measured for non-natives and natives using three German speech recognition tests: the digit triplet test (DTT), the Oldenburg sentence test (OLSA), and the Göttingen sentence test (GÖSA). Sixty-four non-native and eight native listeners participated. Non-natives can show native-like SRTs in noise only for the linguistically easy speech material (DTT). Furthermore, the limitation of phonemic-acoustical cues in digit triplets affects speech recognition to the same extent in non-natives and natives. For more complex and less familiar speech materials, non-natives, ranging from basic to advanced proficiency in German, require on average 3-dB better signal-to-noise ratio for the OLSA and 6-dB for the GÖSA to obtain 50% speech recognition compared to native listeners. In clinical audiology, SRT measurements with a closed-set speech test (i.e. DTT for screening or OLSA test for clinical purposes) should be used with non-native listeners rather than open-set speech tests (such as the GÖSA or HINT), especially if a closed-set version in the patient's own native language is available.
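
    The abstract above defines the SRT as the signal-to-noise ratio at which 50% speech recognition is obtained. Clinical tests usually estimate this point with adaptive procedures or fitted psychometric functions; the sketch below is a simplified illustration that assumes scores measured at a few fixed SNRs and finds the 50% point by linear interpolation. The function name and the example scores are hypothetical.

```python
def srt_from_scores(snrs_db, scores, target=0.5):
    """Interpolate the SNR (dB) at which the recognition score reaches `target`.

    Assumption: scores are proportions correct measured at fixed SNRs and
    rise with SNR around the target.
    """
    points = sorted(zip(snrs_db, scores))
    for (snr0, score0), (snr1, score1) in zip(points, points[1:]):
        if score0 <= target <= score1 and score1 > score0:
            fraction = (target - score0) / (score1 - score0)
            return snr0 + fraction * (snr1 - snr0)
    raise ValueError("target score is not bracketed by the measured scores")
```

    For made-up scores of 0.2, 0.4 and 0.8 at -10, -5 and 0 dB SNR, the interpolated SRT is about -3.75 dB. A listener group that needs a 3-dB better SNR, as reported for the OLSA, would have the same curve shifted 3 dB to the right.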

  14. Mandarin-Speaking Children's Speech Recognition: Developmental Changes in the Influences of Semantic Context and F0 Contours.

    PubMed

    Zhou, Hong; Li, Yu; Liang, Meng; Guan, Connie Qun; Zhang, Linjun; Shu, Hua; Zhang, Yang

    2017-01-01

    The goal of this developmental speech perception study was to assess whether and how age group modulated the influences of high-level semantic context and low-level fundamental frequency (F0) contours on the recognition of Mandarin speech by elementary and middle-school-aged children in quiet and interference backgrounds. The results revealed different patterns for semantic and F0 information. On the one hand, age group significantly modulated the use of F0 contours, indicating that elementary school children relied more on natural F0 contours than middle school children during Mandarin speech recognition. On the other hand, there was no significant modulation effect of age group on semantic context, indicating that children of both age groups used semantic context to assist speech recognition to a similar extent. Furthermore, the significant modulation effect of age group on the interaction between F0 contours and semantic context revealed that younger children could not make better use of semantic context in recognizing speech with flat F0 contours compared with natural F0 contours, while older children could benefit from semantic context even when natural F0 contours were altered, thus confirming the important role of F0 contours in Mandarin speech recognition by elementary school children. The developmental changes in the effects of high-level semantic and low-level F0 information on speech recognition might reflect the differences in auditory and cognitive resources associated with processing of the two types of information in speech perception.

  15. Talker variability in audio-visual speech perception

    PubMed Central

    Heald, Shannon L. M.; Nusbaum, Howard C.

    2014-01-01

    A change in talker is a change in the context for the phonetic interpretation of acoustic patterns of speech. Different talkers have different mappings between acoustic patterns and phonetic categories and listeners need to adapt to these differences. Despite this complexity, listeners are adept at comprehending speech in multiple-talker contexts, albeit at a slight but measurable performance cost (e.g., slower recognition). So far, this talker variability cost has been demonstrated only in audio-only speech. Other research in single-talker contexts has shown, however, that when listeners are able to see a talker’s face, speech recognition is improved under adverse listening (e.g., noise or distortion) conditions that can increase uncertainty in the mapping between acoustic patterns and phonetic categories. Does seeing a talker’s face reduce the cost of word recognition in multiple-talker contexts? We used a speeded word-monitoring task in which listeners make quick judgments about target word recognition in single- and multiple-talker contexts. Results show faster recognition performance in single-talker conditions compared to multiple-talker conditions for both audio-only and audio-visual speech. However, recognition time in a multiple-talker context was slower in the audio-visual condition compared to the audio-only condition. These results suggest that seeing a talker’s face during speech perception may slow recognition by increasing the importance of talker identification, signaling to the listener that a change in talker has occurred. PMID:25076919

  16. Talker variability in audio-visual speech perception.

    PubMed

    Heald, Shannon L M; Nusbaum, Howard C

    2014-01-01

    A change in talker is a change in the context for the phonetic interpretation of acoustic patterns of speech. Different talkers have different mappings between acoustic patterns and phonetic categories and listeners need to adapt to these differences. Despite this complexity, listeners are adept at comprehending speech in multiple-talker contexts, albeit at a slight but measurable performance cost (e.g., slower recognition). So far, this talker variability cost has been demonstrated only in audio-only speech. Other research in single-talker contexts has shown, however, that when listeners are able to see a talker's face, speech recognition is improved under adverse listening (e.g., noise or distortion) conditions that can increase uncertainty in the mapping between acoustic patterns and phonetic categories. Does seeing a talker's face reduce the cost of word recognition in multiple-talker contexts? We used a speeded word-monitoring task in which listeners make quick judgments about target word recognition in single- and multiple-talker contexts. Results show faster recognition performance in single-talker conditions compared to multiple-talker conditions for both audio-only and audio-visual speech. However, recognition time in a multiple-talker context was slower in the audio-visual condition compared to the audio-only condition. These results suggest that seeing a talker's face during speech perception may slow recognition by increasing the importance of talker identification, signaling to the listener that a change in talker has occurred.

  17. Methods and apparatus for non-acoustic speech characterization and recognition

    DOEpatents

    Holzrichter, John F.

    1999-01-01

    By simultaneously recording EM wave reflections and acoustic speech information, the positions and velocities of the speech organs as speech is articulated can be defined for each acoustic speech unit. Well defined time frames and feature vectors describing the speech, to the degree required, can be formed. Such feature vectors can uniquely characterize the speech unit being articulated each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit recognition, and organ mechanical parameters can be determined.

  18. Methods and apparatus for non-acoustic speech characterization and recognition

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Holzrichter, J.F.

    By simultaneously recording EM wave reflections and acoustic speech information, the positions and velocities of the speech organs as speech is articulated can be defined for each acoustic speech unit. Well defined time frames and feature vectors describing the speech, to the degree required, can be formed. Such feature vectors can uniquely characterize the speech unit being articulated each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit recognition, and organ mechanical parameters can be determined.

  19. Voice technology and BBN

    NASA Technical Reports Server (NTRS)

    Wolf, Jared J.

    1977-01-01

    The following research was discussed: (1) speech signal processing; (2) automatic speech recognition; (3) continuous speech understanding; (4) speaker recognition; (5) speech compression; (6) subjective and objective evaluation of speech communication system; (7) measurement of the intelligibility and quality of speech when degraded by noise or other masking stimuli; (8) speech synthesis; (9) instructional aids for second-language learning and for training of the deaf; and (10) investigation of speech correlates of psychological stress. Experimental psychology, control systems, and human factors engineering, which are often relevant to the proper design and operation of speech systems are described.

  20. Effects of Age and Working Memory Capacity on Speech Recognition Performance in Noise Among Listeners With Normal Hearing.

    PubMed

    Gordon-Salant, Sandra; Cole, Stacey Samuels

    2016-01-01

    This study aimed to determine if younger and older listeners with normal hearing who differ on working memory span perform differently on speech recognition tests in noise. Older adults typically exhibit poorer speech recognition scores in noise than younger adults, which is attributed primarily to poorer hearing sensitivity and more limited working memory capacity in older than younger adults. Previous studies typically tested older listeners with poorer hearing sensitivity and shorter working memory spans than younger listeners, making it difficult to discern the importance of working memory capacity on speech recognition. This investigation controlled for hearing sensitivity and compared speech recognition performance in noise by younger and older listeners who were subdivided into high and low working memory groups. Performance patterns were compared for different speech materials to assess whether or not the effect of working memory capacity varies with the demands of the specific speech test. The authors hypothesized that (1) normal-hearing listeners with low working memory span would exhibit poorer speech recognition performance in noise than those with high working memory span; (2) older listeners with normal hearing would show poorer speech recognition scores than younger listeners with normal hearing, when the two age groups were matched for working memory span; and (3) an interaction between age and working memory would be observed for speech materials that provide contextual cues. Twenty-eight older (61 to 75 years) and 25 younger (18 to 25 years) normal-hearing listeners were assigned to groups based on age and working memory status. Northwestern University Auditory Test No. 6 words and Institute of Electrical and Electronics Engineers sentences were presented in noise using an adaptive procedure to measure the signal-to-noise ratio corresponding to 50% correct performance. 
    Cognitive ability was evaluated with two tests of working memory (Listening Span Test and Reading Span Test) and two tests of processing speed (Paced Auditory Serial Addition Test and The Letter Digit Substitution Test). Significant effects of age and working memory capacity were observed on the speech recognition measures in noise, but these effects were mediated somewhat by the speech signal. Specifically, main effects of age and working memory were revealed for both words and sentences, but the interaction between the two was significant for sentences only. For these materials, effects of age were observed for listeners in the low working memory groups only. Although all cognitive measures were significantly correlated with speech recognition in noise, working memory span was the most important variable accounting for speech recognition performance. The results indicate that older adults with high working memory capacity are able to capitalize on contextual cues and perform as well as young listeners with high working memory capacity for sentence recognition. The data also suggest that listeners with normal hearing and low working memory capacity are less able to adapt to distortion of speech signals caused by background noise, which requires the allocation of more processing resources to earlier processing stages. These results indicate that both younger and older adults with low working memory capacity and normal hearing are at a disadvantage for recognizing speech in noise.
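
    The abstract above reports an adaptive procedure that finds the signal-to-noise ratio for 50% correct performance, without naming the rule used. A minimal sketch follows, assuming a simple 1-up/1-down staircase (a classic rule that converges on the 50% point). The function names, step size, trial count, and the idealized listener are all hypothetical and are not taken from the study.

```python
def staircase_srt(respond, start_snr_db=10.0, step_db=2.0,
                  n_trials=30, n_average=10):
    """Estimate the SNR (dB) giving about 50% correct with a 1-up/1-down rule.

    `respond(snr_db)` returns True for a correct response at that SNR.
    All parameter values here are illustrative assumptions.
    """
    snr = start_snr_db
    track = []
    for _ in range(n_trials):
        track.append(snr)
        if respond(snr):
            snr -= step_db  # correct response: make the next trial harder
        else:
            snr += step_db  # incorrect response: make the next trial easier
    tail = track[-n_average:]  # average the final presentation levels
    return sum(tail) / len(tail)


def ideal_listener(snr_db, true_srt_db=-3.0):
    """Hypothetical noise-free listener: correct whenever SNR >= threshold."""
    return snr_db >= true_srt_db


estimate = staircase_srt(ideal_listener)
print(f"estimated SRT: {estimate:.1f} dB SNR")
```

    With a real listener the responses are probabilistic, so the track wanders around the 50% point rather than alternating cleanly, and more trials are usually averaged.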

  1. Speech Recognition and Cognitive Skills in Bimodal Cochlear Implant Users

    ERIC Educational Resources Information Center

    Hua, Håkan; Johansson, Björn; Magnusson, Lennart; Lyxell, Björn; Ellis, Rachel J.

    2017-01-01

    Purpose: To examine the relation between speech recognition and cognitive skills in bimodal cochlear implant (CI) and hearing aid users. Method: Seventeen bimodal CI users (28-74 years) were recruited to the study. Speech recognition tests were carried out in quiet and in noise. The cognitive tests employed included the Reading Span Test and the…

  2. Visual Speech Primes Open-Set Recognition of Spoken Words

    ERIC Educational Resources Information Center

    Buchwald, Adam B.; Winters, Stephen J.; Pisoni, David B.

    2009-01-01

    Visual speech perception has become a topic of considerable interest to speech researchers. Previous research has demonstrated that perceivers neurally encode and use speech information from the visual modality, and this information has been found to facilitate spoken word recognition in tasks such as lexical decision (Kim, Davis, & Krins,…

  3. Significance of parametric spectral ratio methods in detection and recognition of whispered speech

    NASA Astrophysics Data System (ADS)

    Mathur, Arpit; Reddy, Shankar M.; Hegde, Rajesh M.

    2012-12-01

    In this article the significance of a new parametric spectral ratio method that can be used to detect whispered speech segments within normally phonated speech is described. Adaptation methods based on the maximum likelihood linear regression (MLLR) are then used to realize a mismatched train-test style speech recognition system. This proposed parametric spectral ratio method computes a ratio spectrum of the linear prediction (LP) and the minimum variance distortion-less response (MVDR) methods. The smoothed ratio spectrum is then used to detect whispered segments of speech within neutral speech segments effectively. The proposed LP-MVDR ratio method exhibits robustness at different SNRs as indicated by the whisper diarization experiments conducted on the CHAINS and the cell phone whispered speech corpus. The proposed method also performs reasonably better than the conventional methods for whisper detection. In order to integrate the proposed whisper detection method into a conventional speech recognition engine with minimal changes, adaptation methods based on the MLLR are used herein. The hidden Markov models corresponding to neutral mode speech are adapted to the whispered mode speech data in the whispered regions as detected by the proposed ratio method. The performance of this method is first evaluated on whispered speech data from the CHAINS corpus. The second set of experiments are conducted on the cell phone corpus of whispered speech. This corpus is collected using a set up that is used commercially for handling public transactions. The proposed whisper speech recognition system exhibits reasonably better performance when compared to several conventional methods. The results shown indicate the possibility of a whispered speech recognition system for cell phone based transactions.

  4. Current trends in small vocabulary speech recognition for equipment control

    NASA Astrophysics Data System (ADS)

    Doukas, Nikolaos; Bardis, Nikolaos G.

    2017-09-01

    Speech recognition systems allow human-machine communication to acquire an intuitive nature that approaches the simplicity of inter-human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit from the use of robust voice operated control components, as they would facilitate the interaction with their users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
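
    The state machine idea mentioned above can be illustrated with a small sketch. The abstract does not describe the paper's actual design, so the states, commands, and transition table below are hypothetical. The sketch only shows the general pattern: each state accepts a small set of recognized commands, and anything else is ignored.

```python
# Hypothetical transition table: (current state, recognized command) -> next state.
TRANSITIONS = {
    ("idle", "power on"): "ready",
    ("ready", "start"): "running",
    ("running", "stop"): "ready",
    ("ready", "power off"): "idle",
}


def step(state, command):
    """Return the next state; commands not valid in this state are ignored."""
    return TRANSITIONS.get((state, command), state)


def run(commands, state="idle"):
    """Feed a sequence of recognized commands through the state machine."""
    for command in commands:
        state = step(state, command)
    return state


# "start" is ignored while idle, then the equipment powers on and starts.
print(run(["start", "power on", "start"]))
```

    One plausible benefit of this layout is that the recognizer only has to discriminate among the few commands valid in the current state, though whether the paper exploits this is not stated in the abstract.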

  5. Preschoolers Benefit From Visually Salient Speech Cues

    PubMed Central

    Holt, Rachael Frush

    2015-01-01

    Purpose: This study explored visual speech influence in preschoolers using 3 developmentally appropriate tasks that vary in perceptual difficulty and task demands. The authors also examined developmental differences in the ability to use visually salient speech cues and visual phonological knowledge. Method: Twelve adults and 27 typically developing 3- and 4-year-old children completed 3 audiovisual (AV) speech integration tasks: matching, discrimination, and recognition. The authors compared AV benefit for visually salient and less visually salient speech discrimination contrasts and assessed the visual saliency of consonant confusions in auditory-only and AV word recognition. Results: Four-year-olds and adults demonstrated visual influence on all measures. Three-year-olds demonstrated visual influence on speech discrimination and recognition measures. All groups demonstrated greater AV benefit for the visually salient discrimination contrasts. AV recognition benefit in 4-year-olds and adults depended on the visual saliency of speech sounds. Conclusions: Preschoolers can demonstrate AV speech integration. Their AV benefit results from efficient use of visually salient speech cues. Four-year-olds, but not 3-year-olds, used visual phonological knowledge to take advantage of visually salient speech cues, suggesting possible developmental differences in the mechanisms of AV benefit. PMID:25322336

  6. Speech Perception in Noise by Children With Cochlear Implants

    PubMed Central

    Caldwell, Amanda; Nittrouer, Susan

    2013-01-01

    Purpose: Common wisdom suggests that listening in noise poses disproportionately greater difficulty for listeners with cochlear implants (CIs) than for peers with normal hearing (NH). The purpose of this study was to examine phonological, language, and cognitive skills that might help explain speech-in-noise abilities for children with CIs. Method: Three groups of kindergartners (NH, hearing aid wearers, and CI users) were tested on speech recognition in quiet and noise and on tasks thought to underlie the abilities that fit into the domains of phonological awareness, general language, and cognitive skills. These last measures were used as predictor variables in regression analyses with speech-in-noise scores as dependent variables. Results: Compared to children with NH, children with CIs did not perform as well on speech recognition in noise or on most other measures, including recognition in quiet. Two surprising results were that (a) noise effects were consistent across groups and (b) scores on other measures did not explain any group differences in speech recognition. Conclusions: Limitations of implant processing take their primary toll on recognition in quiet and account for poor speech recognition and language/phonological deficits in children with CIs. Implications are that teachers/clinicians need to teach language/phonology directly and maximize signal-to-noise levels in the classroom. PMID:22744138

  7. Simulation of talking faces in the human brain improves auditory speech recognition

    PubMed Central

    von Kriegstein, Katharina; Dogan, Özgür; Grüter, Martina; Giraud, Anne-Lise; Kell, Christian A.; Grüter, Thomas; Kleinschmidt, Andreas; Kiebel, Stefan J.

    2008-01-01

    Human face-to-face communication is essentially audiovisual. Typically, people talk to us face-to-face, providing concurrent auditory and visual input. Understanding someone is easier when there is visual input, because visual cues like mouth and tongue movements provide complementary information about speech content. Here, we hypothesized that, even in the absence of visual input, the brain optimizes both auditory-only speech and speaker recognition by harvesting speaker-specific predictions and constraints from distinct visual face-processing areas. To test this hypothesis, we performed behavioral and neuroimaging experiments in two groups: subjects with a face recognition deficit (prosopagnosia) and matched controls. The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for this person. In both prosopagnosics and controls, behavioral improvement in auditory-only speech recognition was based on an area typically involved in face-movement processing. Improvement in speaker recognition was only present in controls and was based on an area involved in face-identity processing. These findings challenge current unisensory models of speech processing, because they show that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. We suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face. PMID:18436648

  8. Speech coding, reconstruction and recognition using acoustics and electromagnetic waves

    DOEpatents

    Holzrichter, J.F.; Ng, L.C.

    1998-03-17

    The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching. 35 figs.

  9. Speech coding, reconstruction and recognition using acoustics and electromagnetic waves

    DOEpatents

    Holzrichter, John F.; Ng, Lawrence C.

    1998-01-01

    The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching.

  10. Characteristics of speaking style and implications for speech recognition.

    PubMed

    Shinozaki, Takahiro; Ostendorf, Mari; Atlas, Les

    2009-09-01

    Differences in speaking style are associated with more or less spectral variability, as well as different modulation characteristics. The greater variation in some styles (e.g., spontaneous speech and infant-directed speech) poses challenges for recognition but possibly also opportunities for learning more robust models, as evidenced by prior work and motivated by child language acquisition studies. In order to investigate this possibility, this work proposes a new method for characterizing speaking style (the modulation spectrum), examines spontaneous, read, adult-directed, and infant-directed styles in this space, and conducts pilot experiments in style detection and sampling for improved speech recognizer training. Speaking style classification is improved by using the modulation spectrum in combination with standard pitch and energy variation. Speech recognition experiments on a small vocabulary conversational speech recognition task show that sampling methods for training with a small amount of data benefit from the new features.

  11. How should a speech recognizer work?

    PubMed

    Scharenborg, Odette; Norris, Dennis; Bosch, Louis; McQueen, James M

    2005-11-12

    Although researchers studying human speech recognition (HSR) and automatic speech recognition (ASR) share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of communication follows largely from the fact that research in these related fields has focused on the mechanics of how speech can be recognized. In Marr's (1982) terms, emphasis has been on the algorithmic and implementational levels rather than on the computational level. In this article, we provide a computational-level analysis of the task of speech recognition, which reveals the close parallels between research concerned with HSR and ASR. We illustrate this relation by presenting a new computational model of human spoken-word recognition, built using techniques from the field of ASR that, in contrast to current existing models of HSR, recognizes words from real speech input. 2005 Lawrence Erlbaum Associates, Inc.

  12. Evaluation of Speech Recognition of Cochlear Implant Recipients Using Adaptive, Digital Remote Microphone Technology and a Speech Enhancement Sound Processing Algorithm.

    PubMed

    Wolfe, Jace; Morais, Mila; Schafer, Erin; Agrawal, Smita; Koch, Dawn

    2015-05-01

    Cochlear implant recipients often experience difficulty with understanding speech in the presence of noise. Cochlear implant manufacturers have developed sound processing algorithms designed to improve speech recognition in noise, and research has shown these technologies to be effective. Remote microphone technology utilizing adaptive, digital wireless radio transmission has also been shown to provide significant improvement in speech recognition in noise. There are no studies examining the potential improvement in speech recognition in noise when these two technologies are used simultaneously. The goal of this study was to evaluate the potential benefits and limitations associated with the simultaneous use of a sound processing algorithm designed to improve performance in noise (Advanced Bionics ClearVoice) and a remote microphone system that incorporates adaptive, digital wireless radio transmission (Phonak Roger). A two-by-two way repeated measures design was used to examine performance differences obtained without these technologies compared to the use of each technology separately as well as the simultaneous use of both technologies. Eleven Advanced Bionics (AB) cochlear implant recipients, ages 11 to 68 yr. AzBio sentence recognition was measured in quiet and in the presence of classroom noise ranging in level from 50 to 80 dBA in 5-dB steps. Performance was evaluated in four conditions: (1) No ClearVoice and no Roger, (2) ClearVoice enabled without the use of Roger, (3) ClearVoice disabled with Roger enabled, and (4) simultaneous use of ClearVoice and Roger. Speech recognition in quiet was better than speech recognition in noise for all conditions. Use of ClearVoice and Roger each provided significant improvement in speech recognition in noise. The best performance in noise was obtained with the simultaneous use of ClearVoice and Roger. ClearVoice and Roger technology each improves speech recognition in noise, particularly when used at the same time. 
    Because ClearVoice does not degrade performance in quiet settings, clinicians should consider recommending ClearVoice for routine, full-time use for AB implant recipients. Roger should be used in all instances in which remote microphone technology may assist the user in understanding speech in the presence of noise. American Academy of Audiology.

  13. Speech recognition: how good is good enough?

    PubMed

    Krohn, Richard

    2002-03-01

    Since its infancy in the early 1990s, the technology of speech recognition has undergone a rapid evolution. Not only has the reliability of the programming improved dramatically, the return on investment has become increasingly compelling. The author describes some of the latest health care applications of speech-recognition technology, and how the next advances will be made in this area.

  14. Speech recognition systems on the Cell Broadband Engine

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, Y; Jones, H; Vaidya, S

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  15. Lexical and phonological variability in preschool children with speech sound disorder.

    PubMed

    Macrae, Toby; Tyler, Ann A; Lewis, Kerry E

    2014-02-01

    The authors of this study examined relationships between measures of word and speech error variability and between these and other speech and language measures in preschool children with speech sound disorder (SSD). In this correlational study, 18 preschool children with SSD, age-appropriate receptive vocabulary, and normal oral motor functioning and hearing were assessed across 2 sessions. Experimental measures included word and speech error variability, receptive vocabulary, nonword repetition (NWR), and expressive language. Pearson product–moment correlation coefficients were calculated among the experimental measures. The correlation between word and speech error variability was slight and nonsignificant. The correlation between word variability and receptive vocabulary was moderate and negative, although nonsignificant. High word variability was associated with small receptive vocabularies. The correlations between speech error variability and NWR and between speech error variability and the mean length of children's utterances were moderate and negative, although both were nonsignificant. High speech error variability was associated with poor NWR and language scores. High word variability may reflect unstable lexical representations, whereas high speech error variability may reflect indistinct phonological representations. Preschool children with SSD who show abnormally high levels of different types of speech variability may require slightly different approaches to intervention.
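
    The Pearson product-moment correlation used in the study above can be sketched in a few lines. This is a generic textbook implementation, not the authors' analysis code, and the sample values in the usage line are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cross / (dev_x * dev_y)


# Made-up scores: higher variability paired with lower vocabulary scores.
print(pearson_r([10, 20, 30, 40], [95, 90, 80, 70]))
```

    A negative value, as in this made-up example, corresponds to the direction of the associations the abstract describes between variability and vocabulary or language scores.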

  16. The Swedish Hayling task, and its relation to working memory, verbal ability, and speech-recognition-in-noise.

    PubMed

    Stenbäck, Victoria; Hällgren, Mathias; Lyxell, Björn; Larsby, Birgitta

    2015-06-01

    Cognitive functions and speech-recognition-in-noise were evaluated with a cognitive test battery, assessing response inhibition using the Hayling task, working memory capacity (WMC) and verbal information processing, and an auditory test of speech recognition. The cognitive tests were performed in silence whereas the speech recognition task was presented in noise. Thirty young normally-hearing individuals participated in the study. The aim of the study was to investigate one executive function, response inhibition, and whether it is related to individual working memory capacity (WMC), and how speech-recognition-in-noise relates to WMC and inhibitory control. The results showed a significant difference between initiation and response inhibition, suggesting that the Hayling task taps cognitive activity responsible for executive control. Our findings also suggest that high verbal ability was associated with better performance in the Hayling task. We also present findings suggesting that individuals who perform well on tasks involving response inhibition, and WMC, also perform well on a speech-in-noise task. Our findings indicate that capacity to resist semantic interference can be used to predict performance on speech-in-noise tasks. © 2015 Scandinavian Psychological Associations and John Wiley & Sons Ltd.

  17. An algorithm of improving speech emotional perception for hearing aid

    NASA Astrophysics Data System (ADS)

    Xi, Ji; Liang, Ruiyu; Fei, Xianju

    2017-07-01

    In this paper, a speech emotion recognition (SER) algorithm was proposed to improve the emotional perception of hearing-impaired people. The algorithm utilizes multiple kernel technology to overcome the drawback of SVM: slow training speed. Firstly, in order to improve the adaptive performance of Gaussian Radial Basis Function (RBF), the parameter determining the nonlinear mapping was optimized on the basis of Kernel target alignment. Then, the obtained Kernel Function was used as the basis kernel of Multiple Kernel Learning (MKL) with slack variable that could solve the over-fitting problem. However, the slack variable also brings the error into the result. Therefore, a soft-margin MKL was proposed to balance the margin against the error. Moreover, the relatively iterative algorithm was used to solve the combination coefficients and hyper-plane equations. Experimental results show that the proposed algorithm can acquire an accuracy of 90% for five kinds of emotions including happiness, sadness, anger, fear and neutral. Compared with KPCA+CCA and PIM-FSVM, the proposed algorithm has the highest accuracy.
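
    As a rough illustration of the kernel target alignment step mentioned in the abstract, the sketch below scores an RBF kernel against binary labels and picks the width parameter from a small grid. The function names, the grid search, the toy data, and the two-class setting are assumptions made for illustration; the abstract does not describe the authors' actual optimization procedure or their five-class setup.

```python
import math

def rbf_kernel(xs, gamma):
    """Gram matrix of the Gaussian RBF kernel exp(-gamma * ||xi - xj||^2)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))
             for xj in xs] for xi in xs]

def alignment(kernel, labels):
    """Kernel target alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F).

    Assumes labels are +1 or -1, so that ||yy^T||_F equals len(labels).
    """
    n = len(labels)
    num = sum(kernel[i][j] * labels[i] * labels[j]
              for i in range(n) for j in range(n))
    k_norm = math.sqrt(sum(v * v for row in kernel for v in row))
    return num / (k_norm * n)

def best_gamma(xs, labels, grid):
    """Hypothetical helper: pick the grid value with the highest alignment."""
    return max(grid, key=lambda g: alignment(rbf_kernel(xs, g), labels))

# Toy one-dimensional data with two well-separated classes (made up).
xs = [[0.0], [0.1], [1.0], [1.1]]
labels = [1, 1, -1, -1]
gamma = best_gamma(xs, labels, [1e-6, 1.0, 10.0, 1e6])
```

    A very small width parameter makes every pair of points look alike and a very large one makes every point look unique, so the alignment tends to peak somewhere in between.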

  18. Contribution of auditory working memory to speech understanding in mandarin-speaking cochlear implant users.

    PubMed

    Tao, Duoduo; Deng, Rui; Jiang, Ye; Galvin, John J; Fu, Qian-Jie; Chen, Bing

    2014-01-01

    To investigate how auditory working memory relates to speech perception performance by Mandarin-speaking cochlear implant (CI) users. Auditory working memory and speech perception was measured in Mandarin-speaking CI and normal-hearing (NH) participants. Working memory capacity was measured using forward digit span and backward digit span; working memory efficiency was measured using articulation rate. Speech perception was assessed with: (a) word-in-sentence recognition in quiet, (b) word-in-sentence recognition in speech-shaped steady noise at +5 dB signal-to-noise ratio, (c) Chinese disyllable recognition in quiet, (d) Chinese lexical tone recognition in quiet. Self-reported school rank was also collected regarding performance in schoolwork. There was large inter-subject variability in auditory working memory and speech performance for CI participants. Working memory and speech performance were significantly poorer for CI than for NH participants. All three working memory measures were strongly correlated with each other for both CI and NH participants. Partial correlation analyses were performed on the CI data while controlling for demographic variables. Working memory efficiency was significantly correlated only with sentence recognition in quiet when working memory capacity was partialled out. Working memory capacity was correlated with disyllable recognition and school rank when efficiency was partialled out. There was no correlation between working memory and lexical tone recognition in the present CI participants. Mandarin-speaking CI users experience significant deficits in auditory working memory and speech performance compared with NH listeners. The present data suggest that auditory working memory may contribute to CI users' difficulties in speech understanding. 
    The present pattern of results with Mandarin-speaking CI users is consistent with previous auditory working memory studies with English-speaking CI users, suggesting that the lexical importance of voice pitch cues (albeit poorly coded by the CI) did not influence the relationship between working memory and speech perception.

  19. Implementation of the Intelligent Voice System for Kazakh

    NASA Astrophysics Data System (ADS)

    Yessenbayev, Zh; Saparkhojayev, N.; Tibeyev, T.

    2014-04-01

    Modern speech technologies are highly advanced and widely used in day-to-day applications. However, this mostly concerns the languages of well-developed countries, such as English, German, Japanese, Russian, etc. As for Kazakh, the situation is less prominent and research in this field is only starting to evolve. In this research and application-oriented project, we introduce an intelligent voice system for the fast deployment of call-centers and information desks supporting Kazakh speech. The demand for such a system is obvious if the country's large size and small population are considered. Landline and cell phones become the only means of communication for the distant villages and suburbs. The system features Kazakh speech recognition and synthesis modules as well as a web-GUI for efficient dialog management. For speech recognition we use the CMU Sphinx engine and for speech synthesis, MaryTTS. The web-GUI is implemented in Java, enabling operators to quickly create and manage the dialogs in a user-friendly graphical environment. The call routines are handled by Asterisk PBX and JBoss Application Server. The system supports such technologies and protocols as VoIP, VoiceXML, FastAGI, Java SpeechAPI and J2EE. For the speech recognition experiments we compiled and used the first Kazakh speech corpus with the utterances from 169 native speakers. The performance of the speech recognizer is 4.1% WER on isolated word recognition and 6.9% WER on clean continuous speech recognition tasks. The speech synthesis experiments include the training of male and female voices.
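
    The WER figures quoted above are presumably word error rates in the usual sense: substitutions, deletions, and insertions divided by the number of reference words. The abstract does not state how the scores were computed, so the following is only a minimal sketch of that standard definition; the function name and the placeholder word strings are assumptions, not material from the Kazakh corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Placeholder tokens: one substitution and one deletion over four words.
wer = word_error_rate("a b c d", "a x c")
```

    Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.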

  20. The NTID speech recognition test: NSRT(®).

    PubMed

    Bochner, Joseph H; Garrison, Wayne M; Doherty, Karen A

    2015-07-01

    The purpose of this study was to collect and analyse data necessary for expansion of the NSRT item pool and to evaluate the NSRT adaptive testing software. Participants were administered pure-tone and speech recognition tests including W-22 and QuickSIN, as well as a set of 323 new NSRT items and NSRT adaptive tests in quiet and background noise. Performance on the adaptive tests was compared to pure-tone thresholds and performance on other speech recognition measures. The 323 new items were subjected to Rasch scaling analysis. Seventy adults with mild to moderately severe hearing loss participated in this study. Their mean age was 62.4 years (sd = 20.8). The 323 new NSRT items fit very well with the original item bank, enabling the item pool to be more than doubled in size. Data indicate high reliability coefficients for the NSRT and moderate correlations with pure-tone thresholds (PTA and HFPTA) and other speech recognition measures (W-22, QuickSIN, and SRT). The adaptive NSRT is an efficient and effective measure of speech recognition, providing valid and reliable information concerning respondents' speech perception abilities.

  1. Speech coding, reconstruction and recognition using acoustics and electromagnetic waves

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Holzrichter, J.F.; Ng, L.C.

    The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching. 35 figs.

  2. Female voice communications in high level aircraft cockpit noises--part II: vocoder and automatic speech recognition systems.

    PubMed

    Nixon, C; Anderson, T; Morris, L; McCavitt, A; McKinley, R; Yeager, D; McDaniel, M

    1998-11-01

    The intelligibility of female and male speech is equivalent under most ordinary living conditions. However, due to small differences between their acoustic speech signals, called speech spectra, one can be more or less intelligible than the other in certain situations such as high levels of noise. Anecdotal information, supported by some empirical observations, suggests that some of the high intensity noise spectra of military aircraft cockpits may degrade the intelligibility of female speech more than that of male speech. In an applied research study, the intelligibility of female and male speech was measured in several high level aircraft cockpit noise conditions experienced in military aviation. In Part I, (Nixon CW, et al. Aviat Space Environ Med 1998; 69:675-83) female speech intelligibility measured in the spectra and levels of aircraft cockpit noises and with noise-canceling microphones was lower than that of the male speech in all conditions. However, the differences were small and only those at some of the highest noise levels were significant. Although speech intelligibility of both genders was acceptable during normal cruise noises, improvements are required in most of the highest levels of noise created during maximum aircraft operating conditions. These results are discussed in a Part I technical report. This Part II report examines the intelligibility in the same aircraft cockpit noises of vocoded female and male speech and the accuracy with which female and male speech in some of the cockpit noises were understood by automatic speech recognition systems. The intelligibility of vocoded female speech was generally the same as that of vocoded male speech. No significant differences were measured between the recognition accuracy of male and female speech by the automatic speech recognition systems. The intelligibility of female and male speech was equivalent for these conditions.

  3. Functional connectivity between face-movement and speech-intelligibility areas during auditory-only speech perception.

    PubMed

    Schall, Sonja; von Kriegstein, Katharina

    2014-01-01

    It has been proposed that internal simulation of the talking face of visually-known speakers facilitates auditory speech recognition. One prediction of this view is that brain areas involved in auditory-only speech comprehension interact with visual face-movement sensitive areas, even under auditory-only listening conditions. Here, we test this hypothesis using connectivity analyses of functional magnetic resonance imaging (fMRI) data. Participants (17 normal participants, 17 developmental prosopagnosics) first learned six speakers via brief voice-face or voice-occupation training (<2 min/speaker). This was followed by an auditory-only speech recognition task and a control task (voice recognition) involving the learned speakers' voices in the MRI scanner. As hypothesized, we found that, during speech recognition, familiarity with the speaker's face increased the functional connectivity between the face-movement sensitive posterior superior temporal sulcus (STS) and an anterior STS region that supports auditory speech intelligibility. There was no difference between normal participants and prosopagnosics. This was expected because previous findings have shown that both groups use the face-movement sensitive STS to optimize auditory-only speech comprehension. Overall, the present findings indicate that learned visual information is integrated into the analysis of auditory-only speech and that this integration results from the interaction of task-relevant face-movement and auditory speech-sensitive areas.

  4. Recognition of Time-Compressed and Natural Speech with Selective Temporal Enhancements by Young and Elderly Listeners

    ERIC Educational Resources Information Center

    Gordon-Salant, Sandra; Fitzgibbons, Peter J.; Friedman, Sarah A.

    2007-01-01

    Purpose: The goal of this experiment was to determine whether selective slowing of speech segments improves recognition performance by young and elderly listeners. The hypotheses were (a) the benefits of time expansion occur for rapid speech but not for natural-rate speech, (b) selective time expansion of consonants produces greater score…

  5. The Frame Constraint on Experimentally Elicited Speech Errors in Japanese.

    PubMed

    Saito, Akie; Inoue, Tomoyoshi

    2017-06-01

    The so-called syllable position effect in speech errors has been interpreted as reflecting constraints posed by the frame structure of a given language, which operates separately from linguistic content during speech production. The effect refers to the phenomenon that when a speech error occurs, replaced and replacing sounds tend to be in the same position within a syllable or word. Most of the evidence for the effect comes from analyses of naturally occurring speech errors in Indo-European languages, and there are few studies examining the effect in experimentally elicited speech errors and in other languages. This study examined whether experimentally elicited sound errors in Japanese exhibit the syllable position effect. In Japanese, the sub-syllabic unit known as "mora" is considered to be a basic sound unit in production. Results showed that the syllable position effect occurred in mora errors, suggesting that the frame constrains the ordering of sounds during speech production.

  6. Perceptual Bias in Speech Error Data Collection: Insights from Spanish Speech Errors

    ERIC Educational Resources Information Center

    Perez, Elvira; Santiago, Julio; Palma, Alfonso; O'Seaghdha, Padraig G.

    2007-01-01

    This paper studies the reliability and validity of naturalistic speech errors as a tool for language production research. Possible biases when collecting naturalistic speech errors are identified and specific predictions derived. These patterns are then contrasted with published reports from Germanic languages (English, German and Dutch) and one…

  7. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review

    NASA Astrophysics Data System (ADS)

    Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH

    2017-09-01

    This paper reviews the state of the art of automatic speech recognition (ASR) based approaches for speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR based solutions are increasingly being researched for speech and language therapy. ASR is a technology that transfers human speech into transcript text by matching with the system's library. This is particularly useful in speech rehabilitation therapies, as ASR systems provide accurate, real-time evaluation of speech input from an individual with a speech disorder. ASR based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback on their mistakes. However, the accuracy of ASR depends on many factors, such as phoneme recognition, speech continuity, speaker and environmental differences, as well as our depth of knowledge of human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.

  8. Robot Command Interface Using an Audio-Visual Speech Recognition System

    NASA Astrophysics Data System (ADS)

    Ceballos, Alexánder; Gómez, Juan; Prieto, Flavio; Redarce, Tanneguy

    In recent years audio-visual speech recognition has emerged as an active field of research thanks to advances in pattern recognition, signal processing and machine vision. Its ultimate goal is to allow human-computer communication using voice, taking into account the visual information contained in the audio-visual speech signal. This document presents an automatic command recognition system using audio-visual information. The system is expected to control the da Vinci laparoscopic robot. The audio signal is treated using the Mel Frequency Cepstral Coefficients parametrization method. In addition, features based on the points that define the mouth's outer contour according to the MPEG-4 standard are used to extract the visual speech information.

  9. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    PubMed Central

    Caballero-Morales, Santiago-Omar

    2013-01-01

    An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of emotional state of Mexican Spanish speech. PMID:23935410
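
    The final decision step described above, counting the emotion-specific vowels in the recognizer's output, can be sketched as follows. The "vowel_emotion" label format, the function name, and the tie-breaking rule are assumptions made for illustration; the abstract does not specify how the phoneme labels were encoded or how ties were resolved.

```python
from collections import Counter

# The four emotional states named in the abstract.
EMOTIONS = ("anger", "happiness", "neutral", "sadness")

def estimate_emotion(phonemes):
    """Return the emotion whose tagged vowels occur most often.

    Assumes emotion-specific vowels are written like "a_anger" and that
    consonants carry no emotion tag (a hypothetical label scheme).
    Returns None when no tagged vowel is present. Ties go to the emotion
    listed first in EMOTIONS, which is an arbitrary choice.
    """
    counts = Counter()
    for phoneme in phonemes:
        _, sep, tag = phoneme.rpartition("_")
        if sep and tag in EMOTIONS:
            counts[tag] += 1
    if not counts:
        return None
    return max(EMOTIONS, key=lambda emotion: counts[emotion])

# Made-up phoneme sequence standing in for the ASR output of one sentence.
decoded = ["k", "a_anger", "s", "a_anger", "e_neutral"]
label = estimate_emotion(decoded)
```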

  10. Deterioration of Speech Recognition Ability Over a Period of 5 Years in Adults Ages 18 to 70 Years: Results of the Dutch Online Speech-in-Noise Test.

    PubMed

    Stam, Mariska; Smits, Cas; Twisk, Jos W R; Lemke, Ulrike; Festen, Joost M; Kramer, Sophia E

    2015-01-01

    The first aim of the present study was to determine the change in speech recognition in noise over a period of 5 years in participants ages 18 to 70 years at baseline. The second aim was to investigate whether age, gender, educational level, the level of initial speech recognition in noise, and reported chronic conditions were associated with a change in speech recognition in noise. The baseline and 5-year follow-up data of 427 participants with and without hearing impairment participating in the National Longitudinal Study on Hearing (NL-SH) were analyzed. The ability to recognize speech in noise was measured twice with the online National Hearing Test, a digit-triplet speech-in-noise test. Speech-reception-threshold in noise (SRTn) scores were calculated, corresponding to 50% speech intelligibility. Unaided SRTn scores obtained with the same transducer (headphones or loudspeakers) at both test moments were included. Changes in SRTn were calculated as a raw shift (T1 - T0) and an adjusted shift for regression towards the mean. Paired t tests and multivariable linear regression analyses were applied. The mean increase (i.e., deterioration) in SRTn was 0.38-dB signal-to-noise ratio (SNR) over 5 years (p < 0.001). Results of the multivariable regression analyses showed that the age group of 50 to 59 years had a significantly larger deterioration in SRTn compared with the age group of 18 to 39 years (raw shift: beta: 0.64-dB SNR; 95% confidence interval: 0.07-1.22; p = 0.028, adjusted for initial speech recognition level - adjusted shift: beta: 0.82-dB SNR; 95% confidence interval: 0.27-1.34; p = 0.004). Gender, educational level, and the number of chronic conditions were not associated with a change in SRTn over time. No significant differences in increase of SRTn were found between the initial levels of speech recognition (i.e., good, insufficient, or poor) when taking into account the phenomenon regression towards the mean. 
    The study results indicate that hearing deterioration of speech recognition in noise over 5 years can also be detected in adults ages 18 to 70 years. This rather small numeric change might represent a relevant impact on an individual's ability to understand speech in everyday life.

  11. Enhancing speech recognition using improved particle swarm optimization based hidden Markov model.

    PubMed

    Selvaraj, Lokesh; Ganesan, Balakrishnan

    2014-01-01

    Enhancing speech recognition is the primary intention of this work. In this paper a novel speech recognition method based on vector quantization and improved particle swarm optimization (IPSO) is suggested. The suggested methodology contains four stages, namely, (i) denoising, (ii) feature mining, (iii) vector quantization, and (iv) IPSO based hidden Markov model (HMM) technique (IP-HMM). At first, the speech signals are denoised using a median filter. Next, characteristics such as peak, pitch spectrum, Mel frequency cepstral coefficients (MFCC), mean, standard deviation, and minimum and maximum of the signal are extracted from the denoised signal. Following that, to accomplish the training process, the extracted characteristics are given to genetic algorithm based codebook generation in vector quantization. The initial populations for the genetic algorithm process are created by selecting random code vectors from the training set for the codebooks, and IP-HMM performs the recognition. The novelty is introduced in one of the genetic operations, crossover. The proposed speech recognition technique offers 97.14% accuracy.
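
    The first stage of the pipeline above, median-filter denoising, can be illustrated with a short sketch. The window length and the handling of the signal edges are assumptions, since the abstract does not give the authors' settings.

```python
from statistics import median

def median_filter(signal, window=3):
    """Replace each sample with the median of its neighbourhood.

    The window is truncated at the edges of the signal, which is one of
    several common edge conventions and is assumed here.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(signal)
    return [median(signal[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]

# A single impulsive spike in an otherwise flat toy signal is removed.
denoised = median_filter([1, 1, 9, 1, 1], window=3)
```

    A median filter suppresses isolated spikes while preserving step edges better than a moving average of the same length, which is a common reason to choose it for impulsive noise.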

  12. Speaker recognition with temporal cues in acoustic and electric hearing

    NASA Astrophysics Data System (ADS)

    Vongphoe, Michael; Zeng, Fan-Gang

    2005-08-01

    Natural spoken language processing includes not only speech recognition but also identification of the speaker's gender, age, emotional, and social status. Our purpose in this study is to evaluate whether temporal cues are sufficient to support both speech and speaker recognition. Ten cochlear-implant and six normal-hearing subjects were presented with vowel tokens spoken by three men, three women, two boys, and two girls. In one condition, the subject was asked to recognize the vowel. In the other condition, the subject was asked to identify the speaker. Extensive training was provided for the speaker recognition task. Normal-hearing subjects achieved nearly perfect performance in both tasks. Cochlear-implant subjects achieved good performance in vowel recognition but poor performance in speaker recognition. The level of the cochlear implant performance was functionally equivalent to normal performance with eight spectral bands for vowel recognition but only to one band for speaker recognition. These results show a disassociation between speech and speaker recognition with primarily temporal cues, highlighting the limitation of current speech processing strategies in cochlear implants. Several methods, including explicit encoding of fundamental frequency and frequency modulation, are proposed to improve speaker recognition for current cochlear implant users.

  13. A speech-controlled environmental control system for people with severe dysarthria.

    PubMed

    Hawley, Mark S; Enderby, Pam; Green, Phil; Cunningham, Stuart; Brownsell, Simon; Carmichael, James; Parker, Mark; Hatzis, Athanassios; O'Neill, Peter; Palmer, Rebecca

    2007-06-01

    Automatic speech recognition (ASR) can provide a rapid means of controlling electronic assistive technology. Off-the-shelf ASR systems function poorly for users with severe dysarthria because of the increased variability of their articulations. We have developed a limited vocabulary speaker dependent speech recognition application which has greater tolerance to variability of speech, coupled with a computerised training package which assists dysarthric speakers to improve the consistency of their vocalisations and provides more data for recogniser training. These applications, and their implementation as the interface for a speech-controlled environmental control system (ECS), are described. The results of field trials to evaluate the training program and the speech-controlled ECS are presented. The user-training phase increased the recognition rate from 88.5% to 95.4% (p<0.001). Recognition rates were good for people with even the most severe dysarthria in everyday usage in the home (mean word recognition rate 86.9%). Speech-controlled ECS were less accurate (mean task completion accuracy 78.6% versus 94.8%) but were faster to use than switch-scanning systems, even taking into account the need to repeat unsuccessful operations (mean task completion time 7.7s versus 16.9s, p<0.001). It is concluded that a speech-controlled ECS is a viable alternative to switch-scanning systems for some people with severe dysarthria and would lead, in many cases, to more efficient control of the home.

  14. Improving speech-in-noise recognition for children with hearing loss: Potential effects of language abilities, binaural summation, and head shadow

    PubMed Central

    Nittrouer, Susan; Caldwell-Tarr, Amanda; Tarr, Eric; Lowenstein, Joanna H.; Rice, Caitlin; Moberly, Aaron C.

    2014-01-01

    Objective: This study examined speech recognition in noise for children with hearing loss, compared it to recognition for children with normal hearing, and examined mechanisms that might explain variance in children’s abilities to recognize speech in noise. Design: Word recognition was measured in two levels of noise, both when the speech and noise were co-located in front and when the noise came separately from one side. Four mechanisms were examined as factors possibly explaining variance: vocabulary knowledge, sensitivity to phonological structure, binaural summation, and head shadow. Study sample: Participants were 113 eight-year-old children. Forty-eight had normal hearing (NH) and 65 had hearing loss: 18 with hearing aids (HAs), 19 with one cochlear implant (CI), and 28 with two CIs. Results: Phonological sensitivity explained a significant amount of between-groups variance in speech-in-noise recognition. Little evidence of binaural summation was found. Head shadow was similar in magnitude for children with NH and with CIs, regardless of whether they wore one or two CIs. Children with HAs showed reduced head shadow effects. Conclusion: These outcomes suggest that in order to improve speech-in-noise recognition for children with hearing loss, intervention needs to be comprehensive, focusing on both language abilities and auditory mechanisms. PMID:23834373

  15. Auditory training of speech recognition with interrupted and continuous noise maskers by children with hearing impairment

    PubMed Central

    Sullivan, Jessica R.; Thibodeau, Linda M.; Assmann, Peter F.

    2013-01-01

    Previous studies have indicated that individuals with normal hearing (NH) experience a perceptual advantage for speech recognition in interrupted noise compared to continuous noise. In contrast, adults with hearing impairment (HI) and younger children with NH receive a minimal benefit. The objective of this investigation was to assess whether auditory training in interrupted noise would improve speech recognition in noise for children with HI and perhaps enhance their utilization of glimpsing skills. A partially-repeated measures design was used to evaluate the effectiveness of seven 1-h sessions of auditory training in interrupted and continuous noise. Speech recognition scores in interrupted and continuous noise were obtained from pre-, post-, and 3 months post-training from 24 children with moderate-to-severe hearing loss. Children who participated in auditory training in interrupted noise demonstrated a significantly greater improvement in speech recognition compared to those who trained in continuous noise. Those who trained in interrupted noise demonstrated similar improvements in both noise conditions while those who trained in continuous noise only showed modest improvements in the interrupted noise condition. This study presents direct evidence that auditory training in interrupted noise can be beneficial in improving speech recognition in noise for children with HI. PMID:23297921

  16. Automatic Speech Recognition Predicts Speech Intelligibility and Comprehension for Listeners with Simulated Age-Related Hearing Loss

    ERIC Educational Resources Information Center

    Fontan, Lionel; Ferrané, Isabelle; Farinas, Jérôme; Pinquier, Julien; Tardieu, Julien; Magnen, Cynthia; Gaillard, Pascal; Aumont, Xavier; Füllgrabe, Christian

    2017-01-01

    Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist…

  17. The Relationship between Binaural Benefit and Difference in Unilateral Speech Recognition Performance for Bilateral Cochlear Implant Users

    PubMed Central

    Yoon, Yang-soo; Li, Yongxin; Kang, Hou-Yong; Fu, Qian-Jie

    2011-01-01

    Objective: The full benefit of bilateral cochlear implants may depend on the unilateral performance with each device, the speech materials, processing ability of the user, and/or the listening environment. In this study, bilateral and unilateral speech performances were evaluated in terms of recognition of phonemes and sentences presented in quiet or in noise. Design: Speech recognition was measured for unilateral left, unilateral right, and bilateral listening conditions; speech and noise were presented at 0° azimuth. The “binaural benefit” was defined as the difference between bilateral performance and unilateral performance with the better ear. Study Sample: 9 adults with bilateral cochlear implants participated. Results: On average, results showed a greater binaural benefit in noise than in quiet for all speech tests. More importantly, the binaural benefit was greater when unilateral performance was similar across ears. As the difference in unilateral performance between ears increased, the binaural advantage decreased; this functional relationship was observed across the different speech materials and noise levels even though there was substantial intra- and inter-subject variability. Conclusions: The results indicate that subjects who show symmetry in speech recognition performance between implanted ears in general show a large binaural benefit. PMID:21696329
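
    The definition of binaural benefit given above amounts to a one-line calculation, sketched below together with the between-ear difference it was compared against. The function names are hypothetical, and the use of percent-correct scores where higher means better is an assumption; the example values are invented and are not data from the study.

```python
def binaural_benefit(bilateral, left, right):
    """Bilateral score minus the score of the better ear alone."""
    return bilateral - max(left, right)

def ear_asymmetry(left, right):
    """Absolute difference in unilateral performance between ears."""
    return abs(left - right)

# Invented percent-correct scores for a single listener.
benefit = binaural_benefit(bilateral=80.0, left=70.0, right=65.0)
asymmetry = ear_asymmetry(left=70.0, right=65.0)
```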

  18. Investigation of an HMM/ANN hybrid structure in pattern recognition application using cepstral analysis of dysarthric (distorted) speech signals.

    PubMed

    Polur, Prasad D; Miller, Gerald E

    2006-10-01

    Computer speech recognition of individuals with dysarthria, such as cerebral palsy patients, requires a robust technique that can handle conditions of very high variability and limited training data. In this study, application of a 10 state ergodic hidden Markov model (HMM)/artificial neural network (ANN) hybrid structure for a dysarthric speech (isolated word) recognition system, intended to act as an assistive tool, was investigated. A small size vocabulary spoken by three cerebral palsy subjects was chosen. The effect of such a structure on the recognition rate of the system was investigated by comparing it with an ergodic hidden Markov model as a control tool. This was done in order to determine if this modified technique contributed to enhanced recognition of dysarthric speech. The speech was sampled at 11 kHz. Mel frequency cepstral coefficients were extracted from the speech samples using 15 ms frames and served as training input to the hybrid model setup. The subsequent results demonstrated that the hybrid model structure was quite robust in its ability to handle the large variability and non-conformity of dysarthric speech. The level of variability in input dysarthric speech patterns sometimes limits the reliability of the system. However, its application as a rehabilitation/control tool to assist dysarthric motor impaired individuals holds sufficient promise.

  19. Effects of intelligibility on working memory demand for speech perception.

    PubMed

    Francis, Alexander L; Nusbaum, Howard C

    2009-08-01

    Understanding low-intelligibility speech is effortful. In three experiments, we examined the effects of intelligibility on working memory (WM) demands imposed by perception of synthetic speech. In all three experiments, a primary speeded word recognition task was paired with a secondary WM-load task designed to vary the availability of WM capacity during speech perception. Speech intelligibility was varied either by training listeners to use available acoustic cues in a more diagnostic manner (as in Experiment 1) or by providing listeners with more informative acoustic cues (i.e., better speech quality, as in Experiments 2 and 3). In the first experiment, training significantly improved intelligibility and recognition speed; increasing WM load significantly slowed recognition. A significant interaction between training and load indicated that the benefit of training on recognition speed was observed only under low memory load. In subsequent experiments, listeners received no training; intelligibility was manipulated by changing synthesizers. Improving intelligibility without training improved recognition accuracy, and increasing memory load still decreased it, but more intelligible speech did not produce more efficient use of available WM capacity. This suggests that perceptual learning modifies the way available capacity is used, perhaps by increasing the use of more phonetically informative features and/or by decreasing use of less informative ones.

  20. Self-Assessed Hearing Handicap in Older Adults With Poorer-Than-Predicted Speech Recognition in Noise.

    PubMed

    Eckert, Mark A; Matthews, Lois J; Dubno, Judy R

    2017-01-01

    Even older adults with relatively mild hearing loss report hearing handicap, suggesting that hearing handicap is not completely explained by reduced speech audibility. We examined the extent to which self-assessed ratings of hearing handicap using the Hearing Handicap Inventory for the Elderly (HHIE; Ventry & Weinstein, 1982) were significantly associated with measures of speech recognition in noise that controlled for differences in speech audibility. One hundred sixty-two middle-aged and older adults had HHIE total scores that were significantly associated with audibility-adjusted measures of speech recognition for low-context but not high-context sentences. These findings were driven by HHIE items involving negative feelings related to communication difficulties that also captured variance in subjective ratings of effort and frustration that predicted speech recognition. The average pure-tone threshold accounted for some of the variance in the association between the HHIE and audibility-adjusted speech recognition, suggesting an effect of central and peripheral auditory system decline related to elevated thresholds. The accumulation of difficult listening experiences appears to produce a self-assessment of hearing handicap resulting from (a) reduced audibility of stimuli, (b) declines in the central and peripheral auditory system function, and (c) additional individual variation in central nervous system function.

  1. Self-Assessed Hearing Handicap in Older Adults With Poorer-Than-Predicted Speech Recognition in Noise

    PubMed Central

    Matthews, Lois J.; Dubno, Judy R.

    2017-01-01

    Purpose Even older adults with relatively mild hearing loss report hearing handicap, suggesting that hearing handicap is not completely explained by reduced speech audibility. Method We examined the extent to which self-assessed ratings of hearing handicap using the Hearing Handicap Inventory for the Elderly (HHIE; Ventry & Weinstein, 1982) were significantly associated with measures of speech recognition in noise that controlled for differences in speech audibility. Results One hundred sixty-two middle-aged and older adults had HHIE total scores that were significantly associated with audibility-adjusted measures of speech recognition for low-context but not high-context sentences. These findings were driven by HHIE items involving negative feelings related to communication difficulties that also captured variance in subjective ratings of effort and frustration that predicted speech recognition. The average pure-tone threshold accounted for some of the variance in the association between the HHIE and audibility-adjusted speech recognition, suggesting an effect of central and peripheral auditory system decline related to elevated thresholds. Conclusion The accumulation of difficult listening experiences appears to produce a self-assessment of hearing handicap resulting from (a) reduced audibility of stimuli, (b) declines in the central and peripheral auditory system function, and (c) additional individual variation in central nervous system function. PMID:28060993

  2. Continuous multiword recognition performance of young and elderly listeners in ambient noise

    NASA Astrophysics Data System (ADS)

    Sato, Hiroshi

    2005-09-01

    Hearing threshold shift due to aging is known to be a dominant factor degrading speech recognition performance in noisy conditions. On the other hand, the cognitive factors of aging that relate to speech recognition performance in various speech-to-noise conditions are not well established. In this study, two kinds of speech test were performed to examine how working memory load relates to speech recognition performance. One is a word recognition test with high-familiarity, four-syllable Japanese words (single-word test). In this test, each word was presented to listeners; the listeners were asked to write the word down on paper with enough time to answer. In the other test, five continuous words were presented to listeners, and listeners were asked to write the words down only after all five words had been presented (multiword test). Both tests were done in various speech-to-noise ratios under 50-dBA Hoth spectrum noise with more than 50 young and elderly subjects. The results of the two experiments suggest that (1) hearing level is related to scores of both tests; (2) scores of the single-word test are well correlated with those of the multiword test; and (3) scores of the multiword test do not improve as the speech-to-noise ratio improves in the condition where scores of the single-word test reach their ceiling.

  3. Speech recognition: Acoustic phonetic and lexical knowledge representation

    NASA Astrophysics Data System (ADS)

    Zue, V. W.

    1983-02-01

    The purpose of this program is to develop a speech data base facility under which the acoustic characteristics of speech sounds in various contexts can be studied conveniently; investigate the phonological properties of a large lexicon of, say, 10,000 words, and determine to what extent the phonotactic constraints can be utilized in speech recognition; study the acoustic cues that are used to mark word boundaries; develop a test bed in the form of a large-vocabulary, IWR system to study the interactions of acoustic, phonetic and lexical knowledge; and develop a limited continuous speech recognition system with the goal of recognizing any English word from its spelling in order to assess the interactions of higher-level knowledge sources.

  4. 16p11.2 Deletion mice display cognitive deficits in touchscreen learning and novelty recognition tasks.

    PubMed

    Yang, Mu; Lewis, Freeman C; Sarvi, Michael S; Foley, Gillian M; Crawley, Jacqueline N

    2015-12-01

    Chromosomal 16p11.2 deletion syndrome frequently presents with intellectual disabilities, speech delays, and autism. Here we investigated the Dolmetsch line of 16p11.2 heterozygous (+/-) mice on a range of cognitive tasks with different neuroanatomical substrates. Robust novel object recognition deficits were replicated in two cohorts of 16p11.2+/- mice, confirming previous findings. A similarly robust deficit in object location memory was discovered in +/-, indicating impaired spatial novelty recognition. Generalizability of novelty recognition deficits in +/- mice extended to preference for social novelty. Robust learning deficits and cognitive inflexibility were detected using Bussey-Saksida touchscreen operant chambers. During acquisition of pairwise visual discrimination, +/- mice required significantly more training trials to reach criterion than wild-type littermates (+/+), and made more errors and correction errors than +/+. In the reversal phase, all +/+ reached criterion, whereas most +/- failed to reach criterion by the 30-d cutoff. Contextual and cued fear conditioning were normal in +/-. These cognitive phenotypes may be relevant to some aspects of cognitive impairments in humans with 16p11.2 deletion, and support the use of 16p11.2+/- mice as a model system for discovering treatments for cognitive impairments in 16p11.2 deletion syndrome. © 2015 Yang et al.; Published by Cold Spring Harbor Laboratory Press.

  5. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity

    NASA Astrophysics Data System (ADS)

    Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.

    2016-10-01

    Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
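    The decoding step described in this abstract (per-frame phoneme likelihoods combined with language-model transition probabilities) can be illustrated with a generic log-domain Viterbi sketch. This is not the authors' implementation: it assumes a bigram transition table, and the two-state toy numbers are invented for illustration only.

```python
import math


def viterbi(log_emissions, log_transitions, log_initial):
    """Return the most likely state (phoneme) index sequence.

    log_emissions[t][s]   : log-likelihood of frame t under state s
    log_transitions[p][s] : log P(s | p), e.g. from a bigram phoneme model
    log_initial[s]        : log P(s) for the first frame
    """
    n_states = len(log_initial)
    scores = [log_initial[s] + log_emissions[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_emissions[1:]:
        new_scores, pointers = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame.
            best_prev = max(range(n_states),
                            key=lambda p: scores[p] + log_transitions[p][s])
            new_scores.append(scores[best_prev] + log_transitions[best_prev][s] + frame[s])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def logs(rows):
    return [[math.log(x) for x in row] for row in rows]


# Toy, invented numbers: two states, three frames.
emissions = logs([[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]])
initial = [math.log(0.5), math.log(0.5)]
sticky = logs([[0.9, 0.1], [0.1, 0.9]])    # transitions favour staying in a state
uniform = logs([[0.5, 0.5], [0.5, 0.5]])   # no transition preference

path_with_model = viterbi(emissions, sticky, initial)
path_without_model = viterbi(emissions, uniform, initial)
print(path_with_model, path_without_model)
```

    With uniform transitions the decoder simply follows the per-frame maxima; with the "sticky" transition table the weakly supported middle frame is smoothed over, which is the kind of effect a language model contributes.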

  6. Functional Connectivity between Face-Movement and Speech-Intelligibility Areas during Auditory-Only Speech Perception

    PubMed Central

    Schall, Sonja; von Kriegstein, Katharina

    2014-01-01

    It has been proposed that internal simulation of the talking face of visually-known speakers facilitates auditory speech recognition. One prediction of this view is that brain areas involved in auditory-only speech comprehension interact with visual face-movement sensitive areas, even under auditory-only listening conditions. Here, we test this hypothesis using connectivity analyses of functional magnetic resonance imaging (fMRI) data. Participants (17 normal participants, 17 developmental prosopagnosics) first learned six speakers via brief voice-face or voice-occupation training (<2 min/speaker). This was followed by an auditory-only speech recognition task and a control task (voice recognition) involving the learned speakers’ voices in the MRI scanner. As hypothesized, we found that, during speech recognition, familiarity with the speaker’s face increased the functional connectivity between the face-movement sensitive posterior superior temporal sulcus (STS) and an anterior STS region that supports auditory speech intelligibility. There was no difference between normal participants and prosopagnosics. This was expected because previous findings have shown that both groups use the face-movement sensitive STS to optimize auditory-only speech comprehension. Overall, the present findings indicate that learned visual information is integrated into the analysis of auditory-only speech and that this integration results from the interaction of task-relevant face-movement and auditory speech-sensitive areas. PMID:24466026

  7. Two Stage Data Augmentation for Low Resourced Speech Recognition (Author’s Manuscript)

    DTIC Science & Technology

    2016-09-12

    speech recognition, deep neural networks, data augmentation 1. Introduction When training data is limited—whether it be audio or text—the obvious...Schwartz, and S. Tsakalidis, “Enhancing low resource keyword spotting with automatically retrieved web documents,” in Interspeech, 2015, pp. 839–843. [2...and F. Seide, “Feature learning in deep neural networks - a study on speech recognition tasks,” in International Conference on Learning Representations

  8. Getting What You Want: Accurate Document Filtering in a Terabyte World

    DTIC Science & Technology

    2002-11-01

    models are used widely in speech recognition and have shown promise for ad-hoc information retrieval (Ponte and Croft, 1998; Lafferty and Zhai, 2001...tasks is focused on developing techniques similar to those used in speech recognition. However the differing requirements of speech recognition and...Conference on Research and Development in Information Retrieval. ACM. 6. T.Ault, and Y. Yang. (2001.) kNN at TREC-9: A failure analysis. In

  9. V2S: Voice to Sign Language Translation System for Malaysian Deaf People

    NASA Astrophysics Data System (ADS)

    Mean Foong, Oi; Low, Tang Jung; La, Wai Wan

    The process of learning and understanding sign language may be cumbersome to some; therefore, this paper proposes a solution to this problem by providing a voice (English language) to sign language translation system using speech and image processing techniques. Speech processing, which includes speech recognition, is the study of recognizing the words being spoken, regardless of who the speaker is. This project uses template-based recognition as the main approach, in which the V2S system first needs to be trained with speech patterns based on some generic spectral parameter set. These spectral parameter sets are then stored as templates in a database. The system performs the recognition process by matching the parameter set of the input speech with the stored templates to finally display the sign language in video format. Empirical results show that the system has an 80.3% recognition rate.
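    The template-matching step described in this abstract can be sketched as a nearest-template lookup. The abstract does not state the distance measure or the feature layout, so the Euclidean distance, the fixed-length feature vectors, and the labels and values below are all assumptions made for illustration.

```python
import math


def recognize(features, templates):
    """Return the label of the stored template closest to the input features.

    templates maps a word label to a fixed-length spectral parameter vector.
    Euclidean distance is an assumed choice, not taken from the paper.
    """
    return min(templates, key=lambda label: math.dist(features, templates[label]))


# Hypothetical template database built during training.
templates = {
    "hello": [1.0, 0.2, 0.1],
    "thanks": [0.1, 0.9, 0.4],
}

word = recognize([0.9, 0.3, 0.1], templates)
print(word)
```

    In a full system the returned label would select the sign language video to display; real utterances vary in length, so a practical matcher would also need some form of time alignment.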

  10. Hemispheric lateralization of linguistic prosody recognition in comparison to speech and speaker recognition.

    PubMed

    Kreitewolf, Jens; Friederici, Angela D; von Kriegstein, Katharina

    2014-11-15

    Hemispheric specialization for linguistic prosody is a controversial issue. While it is commonly assumed that linguistic prosody and emotional prosody are preferentially processed in the right hemisphere, neuropsychological work directly comparing processes of linguistic prosody and emotional prosody suggests a predominant role of the left hemisphere for linguistic prosody processing. Here, we used two functional magnetic resonance imaging (fMRI) experiments to clarify the role of left and right hemispheres in the neural processing of linguistic prosody. In the first experiment, we sought to confirm previous findings showing that linguistic prosody processing compared to other speech-related processes predominantly involves the right hemisphere. Unlike previous studies, we controlled for stimulus influences by employing a prosody and speech task using the same speech material. The second experiment was designed to investigate whether a left-hemispheric involvement in linguistic prosody processing is specific to contrasts between linguistic prosody and emotional prosody or whether it also occurs when linguistic prosody is contrasted against other non-linguistic processes (i.e., speaker recognition). Prosody and speaker tasks were performed on the same stimulus material. In both experiments, linguistic prosody processing was associated with activity in temporal, frontal, parietal and cerebellar regions. Activation in temporo-frontal regions showed differential lateralization depending on whether the control task required recognition of speech or speaker: recognition of linguistic prosody predominantly involved right temporo-frontal areas when it was contrasted against speech recognition; when contrasted against speaker recognition, recognition of linguistic prosody predominantly involved left temporo-frontal areas. The results show that linguistic prosody processing involves functions of both hemispheres and suggest that recognition of linguistic prosody is based on an inter-hemispheric mechanism which exploits both a right-hemispheric sensitivity to pitch information and a left-hemispheric dominance in speech processing. Copyright © 2014 Elsevier Inc. All rights reserved.

  11. A speech pronunciation practice system for speech-impaired children: A study to measure its success.

    PubMed

    Salim, Siti Salwah; Mustafa, Mumtaz Begum Binti Peer; Asemi, Adeleh; Ahmad, Azila; Mohamed, Noraini; Ghazali, Kamila Binti

    2016-09-01

    The speech pronunciation practice (SPP) system enables children with speech impairments to practise and improve their speech pronunciation. However, little is known about the surrogate measures of the SPP system. This research aims to measure the success and effectiveness of the SPP system using three surrogate measures: usage (frequency of use), performance (recognition accuracy) and satisfaction (children's subjective reactions), and how these measures are aligned with the success of the SPP system, as well as to each other. We have measured the absolute change in the word error rate (WER) between the pre- and post-training, using the ANOVA test. Correlation coefficient (CC) analysis was conducted to test the relation between the surrogate measures, while a Structural Equation Model (SEM) was used to investigate the causal relations between the measures. The CC test results indicate a positive correlation between the surrogate measures. The SEM supports all the proposed hypotheses. The ANOVA results indicate that SPP is effective in reducing the WER of impaired speech. The SPP system is an effective assistive tool, especially for high levels of severity. We found that performance is a mediator of the relation between "usage" and "satisfaction". Copyright © 2016 Elsevier Ltd. All rights reserved.
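    The word error rate (WER) used as the performance measure in this abstract is conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below follows that standard definition; the abstract does not say how the study computed it, and the example sentences are invented. A non-empty reference is assumed.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented pre- and post-training recognizer outputs for one reference.
reference = "the cat sat down"
wer_pre = word_error_rate(reference, "the cat sit")
wer_post = word_error_rate(reference, "the cat sat dawn")
absolute_change = wer_pre - wer_post
print(wer_pre, wer_post, absolute_change)
```

    The absolute change is simply the pre-training WER minus the post-training WER, so a positive value means recognition of the child's speech improved.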

  12. Spoken Word Recognition Errors in Speech Audiometry: A Measure of Hearing Performance?

    PubMed Central

    Coene, Martine; van der Lee, Anneke; Govaerts, Paul J.

    2015-01-01

    This report provides a detailed analysis of incorrect responses from an open-set spoken word-repetition task which is part of a Dutch speech audiometric test battery. Single-consonant confusions were analyzed from 230 normal hearing participants in terms of the probability of choice of a particular response on the basis of acoustic-phonetic, lexical, and frequency variables. The results indicate that consonant confusions are better predicted by lexical knowledge than by acoustic properties of the stimulus word. A detailed analysis of the transmission of phonetic features indicates that “voicing” is best preserved whereas “manner of articulation” yields most perception errors. As consonant confusion matrices are often used to determine the degree and type of a patient's hearing impairment, to predict a patient's gain in hearing performance with hearing devices and to optimize the device settings in view of maximum output, the observed findings are highly relevant for the audiological practice. Based on our findings, speech audiometric outcomes provide a combined auditory-linguistic profile of the patient. The use of confusion matrices might therefore not be the method best suited to measure hearing performance. Ideally, they should be complemented by other listening task types that are known to have less linguistic bias, such as phonemic discrimination. PMID:26557717

  13. Linguistic contributions to speech-on-speech masking for native and non-native listeners: language familiarity and semantic content.

    PubMed

    Brouwer, Susanne; Van Engen, Kristin J; Calandruccio, Lauren; Bradlow, Ann R

    2012-02-01

    This study examined whether speech-on-speech masking is sensitive to variation in the degree of similarity between the target and the masker speech. Three experiments investigated whether speech-in-speech recognition varies across different background speech languages (English vs Dutch) for both English and Dutch targets, as well as across variation in the semantic content of the background speech (meaningful vs semantically anomalous sentences), and across variation in listener status vis-à-vis the target and masker languages (native, non-native, or unfamiliar). The results showed that the more similar the target speech is to the masker speech (e.g., same vs different language, same vs different levels of semantic content), the greater the interference on speech recognition accuracy. Moreover, the listener's knowledge of the target and the background language modulates the size of the release from masking. These factors had an especially strong effect on masking effectiveness in highly unfavorable listening conditions. Overall this research provided evidence that the degree of target-masker similarity plays a significant role in speech-in-speech recognition. The results also give insight into how listeners assign their resources differently depending on whether they are listening to their first or second language. © 2012 Acoustical Society of America

  14. Linguistic contributions to speech-on-speech masking for native and non-native listeners: Language familiarity and semantic content

    PubMed Central

    Brouwer, Susanne; Van Engen, Kristin J.; Calandruccio, Lauren; Bradlow, Ann R.

    2012-01-01

    This study examined whether speech-on-speech masking is sensitive to variation in the degree of similarity between the target and the masker speech. Three experiments investigated whether speech-in-speech recognition varies across different background speech languages (English vs Dutch) for both English and Dutch targets, as well as across variation in the semantic content of the background speech (meaningful vs semantically anomalous sentences), and across variation in listener status vis-à-vis the target and masker languages (native, non-native, or unfamiliar). The results showed that the more similar the target speech is to the masker speech (e.g., same vs different language, same vs different levels of semantic content), the greater the interference on speech recognition accuracy. Moreover, the listener’s knowledge of the target and the background language modulates the size of the release from masking. These factors had an especially strong effect on masking effectiveness in highly unfavorable listening conditions. Overall this research provided evidence that the degree of target-masker similarity plays a significant role in speech-in-speech recognition. The results also give insight into how listeners assign their resources differently depending on whether they are listening to their first or second language. PMID:22352516

  15. Speech Recognition in Adults With Cochlear Implants: The Effects of Working Memory, Phonological Sensitivity, and Aging.

    PubMed

    Moberly, Aaron C; Harris, Michael S; Boyce, Lauren; Nittrouer, Susan

    2017-04-14

    Models of speech recognition suggest that "top-down" linguistic and cognitive functions, such as use of phonotactic constraints and working memory, facilitate recognition under conditions of degradation, such as in noise. The question addressed in this study was what happens to these functions when a listener who has experienced years of hearing loss obtains a cochlear implant. Thirty adults with cochlear implants and 30 age-matched controls with age-normal hearing underwent testing of verbal working memory using digit span and serial recall of words. Phonological capacities were assessed using a lexical decision task and nonword repetition. Recognition of words in sentences in speech-shaped noise was measured. Implant users had only slightly poorer working memory accuracy than did controls and only on serial recall of words; however, phonological sensitivity was highly impaired. Working memory did not facilitate speech recognition in noise for either group. Phonological sensitivity predicted sentence recognition for implant users but not for listeners with normal hearing. Clinical speech recognition outcomes for adult implant users relate to the ability of these users to process phonological information. Results suggest that phonological capacities may serve as potential clinical targets through rehabilitative training. Such novel interventions may be particularly helpful for older adult implant users.

  16. Speech Recognition in Adults With Cochlear Implants: The Effects of Working Memory, Phonological Sensitivity, and Aging

    PubMed Central

    Harris, Michael S.; Boyce, Lauren; Nittrouer, Susan

    2017-01-01

    Purpose Models of speech recognition suggest that “top-down” linguistic and cognitive functions, such as use of phonotactic constraints and working memory, facilitate recognition under conditions of degradation, such as in noise. The question addressed in this study was what happens to these functions when a listener who has experienced years of hearing loss obtains a cochlear implant. Method Thirty adults with cochlear implants and 30 age-matched controls with age-normal hearing underwent testing of verbal working memory using digit span and serial recall of words. Phonological capacities were assessed using a lexical decision task and nonword repetition. Recognition of words in sentences in speech-shaped noise was measured. Results Implant users had only slightly poorer working memory accuracy than did controls and only on serial recall of words; however, phonological sensitivity was highly impaired. Working memory did not facilitate speech recognition in noise for either group. Phonological sensitivity predicted sentence recognition for implant users but not for listeners with normal hearing. Conclusion Clinical speech recognition outcomes for adult implant users relate to the ability of these users to process phonological information. Results suggest that phonological capacities may serve as potential clinical targets through rehabilitative training. Such novel interventions may be particularly helpful for older adult implant users. PMID:28384805

  17. Effects of Semantic Context and Fundamental Frequency Contours on Mandarin Speech Recognition by Second Language Learners.

    PubMed

    Zhang, Linjun; Li, Yu; Wu, Han; Li, Xin; Shu, Hua; Zhang, Yang; Li, Ping

    2016-01-01

    Speech recognition by second language (L2) learners in optimal and suboptimal conditions has been examined extensively with English as the target language in most previous studies. This study extended existing experimental protocols (Wang et al., 2013) to investigate Mandarin speech recognition by Japanese learners of Mandarin at two different levels (elementary vs. intermediate) of proficiency. The overall results showed that in addition to L2 proficiency, semantic context, F0 contours, and listening condition all affected the recognition performance on the Mandarin sentences. However, the effects of semantic context and F0 contours on L2 speech recognition diverged to some extent. Specifically, there was significant modulation effect of listening condition on semantic context, indicating that L2 learners made use of semantic context less efficiently in the interfering background than in quiet. In contrast, no significant modulation effect of listening condition on F0 contours was found. Furthermore, there was significant interaction between semantic context and F0 contours, indicating that semantic context becomes more important for L2 speech recognition when F0 information is degraded. None of these effects were found to be modulated by L2 proficiency. The discrepancy in the effects of semantic context and F0 contours on L2 speech recognition in the interfering background might be related to differences in processing capacities required by the two types of information in adverse listening conditions.

  18. Speech Recognition Thresholds for Multilingual Populations.

    ERIC Educational Resources Information Center

    Ramkissoon, Ishara

    2001-01-01

    This article traces the development of speech audiometry in the United States and reports on the current status, focusing on the needs of a multilingual population in terms of measuring speech recognition threshold (SRT). It also discusses sociolinguistic considerations, alternative SRT stimuli for second language learners, and research on using…

  19. Random Deep Belief Networks for Recognizing Emotions from Speech Signals.

    PubMed

    Wen, Guihua; Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang

    2017-01-01

    Human emotions can now be recognized from speech signals using machine learning methods; however, these methods are challenged by lower recognition accuracies in real applications due to a lack of rich representation ability. Deep belief networks (DBN) can automatically discover multiple levels of representations in speech signals. To make full use of their advantages, this paper presents an ensemble of random deep belief networks (RDBN) method for speech emotion recognition. It first extracts the low-level features of the input speech signal and then uses them to construct many random subspaces. Each random subspace is then provided to a DBN to yield higher-level features as the input of a classifier that outputs an emotion label. All output emotion labels are then fused through majority voting to decide the final emotion label for the input speech signal. Experimental results on benchmark speech emotion databases show that RDBN has better accuracy than the compared methods for speech emotion recognition.
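    Two of the steps named in this abstract, drawing random feature subspaces and fusing the per-subspace labels by majority voting, can be sketched without the DBN itself. The function names, the seed, and the stand-in labels below are hypothetical; the per-subspace DBN and classifier are deliberately left out rather than guessed at.

```python
import random
from collections import Counter


def random_subspaces(n_features, n_subspaces, subspace_size, seed=0):
    """Draw feature-index subsets; each subset defines one random subspace."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subspace_size))
            for _ in range(n_subspaces)]


def majority_vote(labels):
    """Return the most frequent label among the ensemble members' outputs."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical setup: 10 low-level features, 5 subspaces of 4 features each.
subspaces = random_subspaces(n_features=10, n_subspaces=5, subspace_size=4)

# Stand-in labels, as if one classifier per subspace had already been run.
member_labels = ["angry", "happy", "angry", "neutral", "angry"]
final_label = majority_vote(member_labels)
print(subspaces, final_label)
```

    Each index list would select the columns of the low-level feature vector that one ensemble member sees; the vote then decides the final emotion label for the utterance.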

  20. Random Deep Belief Networks for Recognizing Emotions from Speech Signals

    PubMed Central

    Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang

    2017-01-01

    Human emotions can now be recognized from speech signals using machine learning methods; however, these methods are challenged by lower recognition accuracies in real applications due to a lack of rich representation ability. Deep belief networks (DBN) can automatically discover multiple levels of representations in speech signals. To make full use of their advantages, this paper presents an ensemble of random deep belief networks (RDBN) method for speech emotion recognition. It first extracts the low-level features of the input speech signal and then uses them to construct many random subspaces. Each random subspace is then provided to a DBN to yield higher-level features as the input of a classifier that outputs an emotion label. All output emotion labels are then fused through majority voting to decide the final emotion label for the input speech signal. Experimental results on benchmark speech emotion databases show that RDBN has better accuracy than the compared methods for speech emotion recognition. PMID:28356908

  1. Speech recognition for embedded automatic positioner for laparoscope

    NASA Astrophysics Data System (ADS)

    Chen, Xiaodong; Yin, Qingyun; Wang, Yi; Yu, Daoyin

    2014-07-01

    In this paper, a novel speech recognition methodology based on the Hidden Markov Model (HMM) is proposed for an embedded Automatic Positioner for Laparoscope (APL), which has a fixed-point ARM processor at its core. The APL system is designed to assist the doctor in laparoscopic surgery by giving a specific doctor vocal control of the laparoscope. Responding to voice commands in real time requires a more efficient speech recognition algorithm for the APL. In order to reduce computation cost without significant loss in recognition accuracy, both arithmetic and algorithmic optimizations are applied in the method presented. First, relying mostly on arithmetic optimizations, a fixed-point front end for speech feature analysis is built to suit the characteristics of the ARM processor. Then a fast likelihood computation algorithm is used to reduce the computational complexity of the HMM-based recognition algorithm. The experimental results show that the method shortens the recognition time to within 0.5 s while keeping the accuracy above 99%, demonstrating its ability to achieve real-time vocal control of the APL.
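
    As a rough illustration of the kind of arithmetic a fixed-point front end relies on, the sketch below shows Q15 multiplication with rounding and saturation. The Q15 format and the helper names are assumptions chosen for the example; the abstract does not state which fixed-point format the authors used.

```python
Q = 15                 # number of fractional bits (Q15 format, an assumption)
SCALE = 1 << Q         # 32768
Q15_MIN, Q15_MAX = -SCALE, SCALE - 1


def saturate(value):
    """Clamp an integer to the signed 16-bit Q15 range."""
    return max(Q15_MIN, min(Q15_MAX, value))


def to_q15(x):
    """Convert a float in roughly [-1, 1) to a Q15 integer."""
    return saturate(int(round(x * SCALE)))


def from_q15(v):
    """Convert a Q15 integer back to a float."""
    return v / SCALE


def q15_mul(a, b):
    """Multiply two Q15 integers: the raw product carries 30 fractional
    bits, so add half an LSB for rounding and shift back down by 15."""
    return saturate((a * b + (1 << (Q - 1))) >> Q)


half = to_q15(0.5)
quarter = q15_mul(half, half)
```

    Here `from_q15(quarter)` gives 0.25. On an integer-only processor the same multiply is one integer product plus a shift, which is the motivation for avoiding floating-point feature extraction.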

  2. Investigation of potential cognitive tests for use with older adults in audiology clinics.

    PubMed

    Vaughan, Nancy; Storzbach, Daniel; Furukawa, Izumi

    2008-01-01

    Cognitive declines in working memory and processing speed are hallmarks of aging. Deficits in speech understanding also are seen in aging individuals. A clinical test to determine whether the cognitive aging changes contribute to aging speech understanding difficulties would be helpful for determining rehabilitation strategies in audiology clinics. To identify a clinical neurocognitive test or battery of tests that could be used in audiology clinics to help explain deficits in speech recognition in some older listeners. A correlational study examining the association between certain cognitive test scores and speech recognition performance. Speeded (time-compressed) speech was used to increase the cognitive processing load. Two hundred twenty-five adults aged 50 through 75 years were participants in this study. Both batteries of tests were administered to all participants in two separate sessions. A selected battery of neurocognitive tests and a time-compressed speech recognition test battery using various rates of speech were administered. Principal component analysis was used to extract the important component factors from each set of tests, and regression models were constructed to examine the association between tests and to identify the neurocognitive test most strongly associated with speech recognition performance. A sequencing working memory test (Letter-Number Sequencing [LNS]) was most strongly associated with rapid speech understanding. The association between the LNS test results and the compressed sentence recognition scores (CSRS) was strong even when age and hearing loss were controlled. The LNS is a sequencing test that provides information about temporal processing at the cognitive level and may prove useful in diagnosis of speech understanding problems, and in the development of aural rehabilitation and training strategies.

  3. Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

    NASA Astrophysics Data System (ADS)

    Maskeliunas, Rytis; Rudzionis, Vytautas

    2011-06-01

    In recent years various commercial speech recognizers have become available. These recognizers make it possible to develop applications incorporating various speech recognition techniques easily and quickly. Commercial recognizers are typically targeted at widely spoken languages with large market potential; however, it may be possible to adapt available commercial recognizers for use in environments where less widely spoken languages are used. Since most commercial recognition engines are closed systems, the only avenue for adaptation is to find ways of selecting proper phonetic transcriptions between the two languages. This paper deals with methods for finding phonetic transcriptions of Lithuanian voice commands to be recognized using English speech engines. The experimental evaluation showed that it is possible to find phonetic transcriptions that enable the recognition of Lithuanian voice commands with a recognition accuracy of over 90%.

  4. Relating dynamic brain states to dynamic machine states: Human and machine solutions to the speech recognition problem

    PubMed Central

    Liu, Xunying; Zhang, Chao; Woodland, Phil; Fonteneau, Elisabeth

    2017-01-01

    There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental ‘machine states’, generated as the ASR analysis progresses over time, to the incremental ‘brain states’, measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain. PMID:28945744

  5. Is Comprehension Necessary for Error Detection? A Conflict-Based Account of Monitoring in Speech Production

    ERIC Educational Resources Information Center

    Nozari, Nazbanou; Dell, Gary S.; Schwartz, Myrna F.

    2011-01-01

    Despite the existence of speech errors, verbal communication is successful because speakers can detect (and correct) their errors. The standard theory of speech-error detection, the perceptual-loop account, posits that the comprehension system monitors production output for errors. Such a comprehension-based monitor, however, cannot explain the…

  6. Differences in Speech Recognition Between Children with Attention Deficits and Typically Developed Children Disappear When Exposed to 65 dB of Auditory Noise

    PubMed Central

    Söderlund, Göran B. W.; Jobs, Elisabeth Nilsson

    2016-01-01

    The most common neuropsychiatric condition in children is attention deficit hyperactivity disorder (ADHD), affecting ∼6–9% of the population. ADHD is distinguished by inattention and hyperactive, impulsive behaviors as well as poor performance in various cognitive tasks, often leading to failures at school. Sensory and perceptual dysfunctions have also been noticed. Prior research has mainly focused on limitations in executive functioning, where differences are often explained by deficits in pre-frontal cortex activation. Less notice has been given to sensory perception and subcortical functioning in ADHD. Recent research has shown that children with an ADHD diagnosis have a deviant auditory brain stem response compared to healthy controls. The aim of the present study was to investigate whether the speech recognition threshold differs between attentive children and children with ADHD symptoms in two environmental sound conditions, with and without external noise. Previous research has shown that children with attention deficits can benefit from white noise exposure during cognitive tasks, and here we investigate whether a noise benefit is present during an auditory perceptual task. For this purpose we used a modified Hagerman’s speech recognition test where children with and without attention deficits performed a binaural speech recognition task to assess the speech recognition threshold in no noise and noise conditions (65 dB). Results showed that the inattentive group displayed a higher speech recognition threshold than typically developed children and that the difference in speech recognition threshold disappeared when exposed to noise at supra threshold level. From this we conclude that inattention can partly be explained by sensory perceptual limitations that can possibly be ameliorated through noise exposure. PMID:26858679

  7. Strategies for distant speech recognition in reverberant environments

    NASA Astrophysics Data System (ADS)

    Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro

    2015-12-01

    Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.

  8. Optimal pattern synthesis for speech recognition based on principal component analysis

    NASA Astrophysics Data System (ADS)

    Korsun, O. N.; Poliyev, A. V.

    2018-02-01

    This work develops and presents an algorithm for building an optimal pattern for automatic speech recognition, which increases the probability of correct recognition. The optimal pattern is formed by decomposing an initial pattern into principal components, which makes it possible to reduce the dimension of the multi-parameter optimization problem. In the next step, training samples are introduced and optimal estimates of the principal component decomposition coefficients are obtained by a numerical parameter optimization algorithm. Finally, we consider experimental results that show the improvement in speech recognition introduced by the proposed optimization algorithm.

  9. Age and measurement time-of-day effects on speech recognition in noise.

    PubMed

    Veneman, Carrie E; Gordon-Salant, Sandra; Matthews, Lois J; Dubno, Judy R

    2013-01-01

    The purpose of this study was to determine the effect of measurement time of day on speech recognition in noise and the extent to which time-of-day effects differ with age. Older adults tend to have more difficulty understanding speech in noise than younger adults, even when hearing is normal. Two possible contributors to this age difference in speech recognition may be measurement time of day and inhibition. Most younger adults are "evening-type," showing peak circadian arousal in the evening, whereas most older adults are "morning-type," with circadian arousal peaking in the morning. Tasks that require inhibition of irrelevant information have been shown to be affected by measurement time of day, with maximum performance attained at one's peak time of day. The authors hypothesized that a change in inhibition will be associated with measurement time of day and therefore affect speech recognition in noise, with better performance in the morning for older adults and in the evening for younger adults. Fifteen younger evening-type adults (20-28 years) and 15 older morning-type adults with normal hearing (66-78 years) listened to the Hearing in Noise Test (HINT) and the Quick Speech in Noise (QuickSIN) test in the morning and evening (peak and off-peak times). Time of day preference was assessed using the Morningness-Eveningness Questionnaire. Sentences and noise were presented binaurally through insert earphones. During morning and evening sessions, participants solved word-association problems within the visual-distraction task (VDT), which was used as an estimate of inhibition. After each session, participants rated perceived mental demand of the tasks using a revised version of the NASA Task Load Index. Younger adults performed significantly better on the speech-in-noise tasks and rated themselves as requiring significantly less mental demand when tested at their peak (evening) than off-peak (morning) time of day. 
    In contrast, time-of-day effects were not observed for the older adults on the speech recognition or rating tasks. Although older adults required significantly more advantageous signal-to-noise ratios than younger adults for equivalent speech-recognition performance, a significantly larger younger versus older age difference in speech recognition was observed in the evening than in the morning. Older adults performed significantly poorer than younger adults on the VDT, but performance was not affected by measurement time of day. VDT performance for misleading distracter items was significantly correlated with HINT and QuickSIN test performance at the peak measurement time of day. Although all participants had normal hearing, speech recognition in noise was significantly poorer for older than younger adults, with larger age-related differences in the evening (an off-peak time for older adults) than in the morning. The significant effect of measurement time of day suggests that this factor may impact the clinical assessment of speech recognition in noise for all individuals. It appears that inhibition, as estimated by a visual distraction task for misleading visual items, is a cognitive mechanism that is related to speech-recognition performance in noise, at least at a listener's peak time of day.

  10. Multitasking During Degraded Speech Recognition in School-Age Children

    PubMed Central

    Ward, Kristina M.; Brehm, Laurel

    2017-01-01

    Multitasking requires individuals to allocate their cognitive resources across different tasks. The purpose of the current study was to assess school-age children’s multitasking abilities during degraded speech recognition. Children (8 to 12 years old) completed a dual-task paradigm including a sentence recognition (primary) task containing speech that was either unprocessed or noise-band vocoded with 8, 6, or 4 spectral channels and a visual monitoring (secondary) task. Children’s accuracy and reaction time on the visual monitoring task was quantified during the dual-task paradigm in each condition of the primary task and compared with single-task performance. Children experienced dual-task costs in the 6- and 4-channel conditions of the primary speech recognition task with decreased accuracy on the visual monitoring task relative to baseline performance. In all conditions, children’s dual-task performance on the visual monitoring task was strongly predicted by their single-task (baseline) performance on the task. Results suggest that children’s proficiency with the secondary task contributes to the magnitude of dual-task costs while multitasking during degraded speech recognition. PMID:28105890

  11. Multitasking During Degraded Speech Recognition in School-Age Children.

    PubMed

    Grieco-Calub, Tina M; Ward, Kristina M; Brehm, Laurel

    2017-01-01

    Multitasking requires individuals to allocate their cognitive resources across different tasks. The purpose of the current study was to assess school-age children's multitasking abilities during degraded speech recognition. Children (8 to 12 years old) completed a dual-task paradigm including a sentence recognition (primary) task containing speech that was either unprocessed or noise-band vocoded with 8, 6, or 4 spectral channels and a visual monitoring (secondary) task. Children's accuracy and reaction time on the visual monitoring task was quantified during the dual-task paradigm in each condition of the primary task and compared with single-task performance. Children experienced dual-task costs in the 6- and 4-channel conditions of the primary speech recognition task with decreased accuracy on the visual monitoring task relative to baseline performance. In all conditions, children's dual-task performance on the visual monitoring task was strongly predicted by their single-task (baseline) performance on the task. Results suggest that children's proficiency with the secondary task contributes to the magnitude of dual-task costs while multitasking during degraded speech recognition.

  12. Processing Electromyographic Signals to Recognize Words

    NASA Technical Reports Server (NTRS)

    Jorgensen, C. C.; Lee, D. D.

    2009-01-01

    A recently invented speech-recognition method applies to words that are articulated by means of the tongue and throat muscles but are otherwise not voiced or, at most, are spoken sotto voce. This method could satisfy a need for speech recognition under circumstances in which normal audible speech is difficult, poses a hazard, is disturbing to listeners, or compromises privacy. The method could also be used to augment traditional speech recognition by providing an additional source of information about articulator activity. The method can be characterized as intermediate between (1) conventional speech recognition through processing of voice sounds and (2) a method, not yet developed, of processing electroencephalographic signals to extract unspoken words directly from thoughts. This method involves computational processing of digitized electromyographic (EMG) signals from muscle innervation acquired by surface electrodes under a subject's chin near the tongue and on the side of the subject's throat near the larynx. After preprocessing, digitization, and feature extraction, EMG signals are processed by a neural-network pattern classifier, implemented in software, that performs the bulk of the recognition task as described.

  13. Relations Between Self-reported Executive Functioning and Speech Perception Skills in Adult Cochlear Implant Users.

    PubMed

    Moberly, Aaron C; Patel, Tirth R; Castellanos, Irina

    2018-02-01

    As a result of their hearing loss, adults with cochlear implants (CIs) would self-report poorer executive functioning (EF) skills than normal-hearing (NH) peers, and these EF skills would be associated with performance on speech recognition tasks. EF refers to a group of high order neurocognitive skills responsible for behavioral and emotional regulation during goal-directed activity, and EF has been found to be poorer in children with CIs than their NH age-matched peers. Moreover, there is increasing evidence that neurocognitive skills, including some EF skills, contribute to the ability to recognize speech through a CI. Thirty postlingually deafened adults with CIs and 42 age-matched NH adults were enrolled. Participants and their spouses or significant others (informants) completed well-validated self-reports or informant-reports of EF, the Behavior Rating Inventory of Executive Function - Adult (BRIEF-A). CI users' speech recognition skills were assessed in quiet using several measures of sentence recognition. NH peers were tested for recognition of noise-vocoded versions of the same speech stimuli. CI users self-reported difficulty on EF tasks of shifting and task monitoring. In CI users, measures of speech recognition correlated with several self-reported EF skills. The present findings provide further evidence that neurocognitive factors, including specific EF skills, may decline in association with hearing loss, and that some of these EF skills contribute to speech processing under degraded listening conditions.

  14. Spontaneous Speech Collection for the CSR Corpus

    DTIC Science & Technology

    1992-01-01

    Menlo Park, California 94025 1. ABSTRACT As part of a pilot data collection for DARPA’s Continuous Speech Recognition (CSR) speech corpus, SRI...International experimented with the collection of spontaneous speech material. The bulk of the CSR pilot data was read versions of news articles from...variable. 2. INTRODUCTION The CSR (Continuous Speech Recognition) Corpus collection can be considered the successor to the Resource Management

  15. Eyes and ears: Using eye tracking and pupillometry to understand challenges to speech recognition.

    PubMed

    Van Engen, Kristin J; McLaughlin, Drew J

    2018-05-04

    Although human speech recognition is often experienced as relatively effortless, a number of common challenges can render the task more difficult. Such challenges may originate in talkers (e.g., unfamiliar accents, varying speech styles), the environment (e.g. noise), or in listeners themselves (e.g., hearing loss, aging, different native language backgrounds). Each of these challenges can reduce the intelligibility of spoken language, but even when intelligibility remains high, they can place greater processing demands on listeners. Noisy conditions, for example, can lead to poorer recall for speech, even when it has been correctly understood. Speech intelligibility measures, memory tasks, and subjective reports of listener difficulty all provide critical information about the effects of such challenges on speech recognition. Eye tracking and pupillometry complement these methods by providing objective physiological measures of online cognitive processing during listening. Eye tracking records the moment-to-moment direction of listeners' visual attention, which is closely time-locked to unfolding speech signals, and pupillometry measures the moment-to-moment size of listeners' pupils, which dilate in response to increased cognitive load. In this paper, we review the uses of these two methods for studying challenges to speech recognition. Copyright © 2018. Published by Elsevier B.V.

  16. Robust relationship between reading span and speech recognition in noise

    PubMed Central

    Souza, Pamela; Arehart, Kathryn

    2015-01-01

    Objective: Working memory refers to a cognitive system that manages information processing and temporary storage. Recent work has demonstrated that individual differences in working memory capacity measured using a reading span task are related to ability to recognize speech in noise. In this project, we investigated whether the specific implementation of the reading span task influenced the strength of the relationship between working memory capacity and speech recognition. Design: The relationship between speech recognition and working memory capacity was examined for two different working memory tests that varied in approach, using a within-subject design. Data consisted of audiometric results along with the two different working memory tests; one speech-in-noise test; and a reading comprehension test. Study sample: The test group included 94 older adults with varying hearing loss and 30 younger adults with normal hearing. Results: Listeners with poorer working memory capacity had more difficulty understanding speech in noise after accounting for age and degree of hearing loss. That relationship did not differ significantly between the two different implementations of reading span. Conclusions: Our findings suggest that different implementations of a verbal reading span task do not affect the strength of the relationship between working memory capacity and speech recognition. PMID:25975360

  17. Robust relationship between reading span and speech recognition in noise.

    PubMed

    Souza, Pamela; Arehart, Kathryn

    2015-01-01

    Working memory refers to a cognitive system that manages information processing and temporary storage. Recent work has demonstrated that individual differences in working memory capacity measured using a reading span task are related to ability to recognize speech in noise. In this project, we investigated whether the specific implementation of the reading span task influenced the strength of the relationship between working memory capacity and speech recognition. The relationship between speech recognition and working memory capacity was examined for two different working memory tests that varied in approach, using a within-subject design. Data consisted of audiometric results along with the two different working memory tests; one speech-in-noise test; and a reading comprehension test. The test group included 94 older adults with varying hearing loss and 30 younger adults with normal hearing. Listeners with poorer working memory capacity had more difficulty understanding speech in noise after accounting for age and degree of hearing loss. That relationship did not differ significantly between the two different implementations of reading span. Our findings suggest that different implementations of a verbal reading span task do not affect the strength of the relationship between working memory capacity and speech recognition.

  18. Automatic lip reading by using multimodal visual features

    NASA Astrophysics Data System (ADS)

    Takahashi, Shohei; Ohya, Jun

    2013-12-01

    Speech recognition has long been researched, but it does not work well in noisy places such as cars or trains. In addition, people who are hearing-impaired or have difficulty hearing cannot benefit from speech recognition. To recognize speech automatically, visual information is also important. People understand speech not only from audio information but also from visual information such as temporal changes in lip shape. A vision-based speech recognition method could work well in noisy places and could also be useful for people with hearing disabilities. In this paper, we propose an automatic lip-reading method that recognizes speech using multimodal visual information, without using any audio information as conventional speech recognition does. First, the ASM (Active Shape Model) is used to track and detect the face and lips in a video sequence. Second, the shape, optical flow and spatial frequencies of the lip features are extracted from the lips detected by the ASM. Next, the extracted multimodal features are ordered chronologically, and a Support Vector Machine is used to learn and classify the spoken words. Experiments on classifying several words show promising results for the proposed method.

  19. [Perception of emotional intonation of noisy speech signal with different acoustic parameters by adults of different age and gender].

    PubMed

    Dmitrieva, E S; Gel'man, V Ia

    2011-01-01

    The listener-distinctive features of recognition of different emotional intonations (positive, negative and neutral) of male and female speakers in the presence or absence of background noise were studied in 49 adults aged 20-79 years. In all the listeners noise produced the most pronounced decrease in recognition accuracy for positive emotional intonation ("joy") as compared to other intonations, whereas it did not influence the recognition accuracy of "anger" in 65-79-year-old listeners. The higher emotion recognition rates of a noisy signal were observed for speech emotional intonations expressed by female speakers. Acoustic characteristics of noisy and clear speech signals underlying perception of speech emotional prosody were found for adult listeners of different age and gender.

  20. Evaluating deep learning architectures for Speech Emotion Recognition.

    PubMed

    Fayek, Haytham M; Lech, Margaret; Cavedon, Lawrence

    2017-08-01

    Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances. Copyright © 2017 Elsevier Ltd. All rights reserved.

  1. Developing and Evaluating an Oral Skills Training Website Supported by Automatic Speech Recognition Technology

    ERIC Educational Resources Information Center

    Chen, Howard Hao-Jan

    2011-01-01

    Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…

  2. Vocal Tract Representation in the Recognition of Cerebral Palsied Speech

    ERIC Educational Resources Information Center

    Rudzicz, Frank; Hirst, Graeme; van Lieshout, Pascal

    2012-01-01

    Purpose: In this study, the authors explored articulatory information as a means of improving the recognition of dysarthric speech by machine. Method: Data were derived chiefly from the TORGO database of dysarthric articulation (Rudzicz, Namasivayam, & Wolff, 2011) in which motions of various points in the vocal tract are measured during speech.…

  3. Micro-Based Speech Recognition: Instructional Innovation for Handicapped Learners.

    ERIC Educational Resources Information Center

    Horn, Carin E.; Scott, Brian L.

    A new voice based learning system (VBLS), which allows the handicapped user to interact with a microcomputer by voice commands, is described. Speech or voice recognition is the computerized process of identifying a spoken word or phrase, including those resulting from speech impediments. This new technology is helpful to the severely physically…

  4. Speech Recognition Software for Language Learning: Toward an Evaluation of Validity and Student Perceptions

    ERIC Educational Resources Information Center

    Cordier, Deborah

    2009-01-01

    A renewed focus on foreign language (FL) learning and speech for communication has resulted in computer-assisted language learning (CALL) software developed with Automatic Speech Recognition (ASR). ASR features for FL pronunciation (Lafford, 2004) are functional components of CALL designs used for FL teaching and learning. The ASR features…

  5. Noise Robust Speech Recognition Applied to Voice-Driven Wheelchair

    NASA Astrophysics Data System (ADS)

    Sasou, Akira; Kojima, Hiroaki

    2009-12-01

    Conventional voice-driven wheelchairs usually employ headset microphones that are capable of achieving sufficient recognition accuracy, even in the presence of surrounding noise. However, such interfaces require users to wear sensors such as a headset microphone, which can be an impediment, especially for the hand disabled. Conversely, it is also well known that the speech recognition accuracy drastically degrades when the microphone is placed far from the user. In this paper, we develop a noise robust speech recognition system for a voice-driven wheelchair. This system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors. We verified the effectiveness of our system in experiments in different environments, and confirmed that our system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors.

  6. Separable spectro-temporal Gabor filter bank features: Reducing the complexity of robust features for automatic speech recognition.

    PubMed

    Schädler, Marc René; Kollmeier, Birger

    2015-04-01

    To test if simultaneous spectral and temporal processing is required to extract robust features for automatic speech recognition (ASR), the robust spectro-temporal two-dimensional-Gabor filter bank (GBFB) front-end from Schädler, Meyer, and Kollmeier [J. Acoust. Soc. Am. 131, 4134-4151 (2012)] was de-composed into a spectral one-dimensional-Gabor filter bank and a temporal one-dimensional-Gabor filter bank. A feature set that is extracted with these separate spectral and temporal modulation filter banks was introduced, the separate Gabor filter bank (SGBFB) features, and evaluated on the CHiME (Computational Hearing in Multisource Environments) keywords-in-noise recognition task. From the perspective of robust ASR, the results showed that spectral and temporal processing can be performed independently and are not required to interact with each other. Using SGBFB features permitted the signal-to-noise ratio (SNR) to be lowered by 1.2 dB while still performing as well as the GBFB-based reference system, which corresponds to a relative improvement of the word error rate by 12.8%. Additionally, the real time factor of the spectro-temporal processing could be reduced by more than an order of magnitude. Compared to human listeners, the SNR needed to be 13 dB higher when using Mel-frequency cepstral coefficient features, 11 dB higher when using GBFB features, and 9 dB higher when using SGBFB features to achieve the same recognition performance.

  7. Patterns of Post-Stroke Brain Damage that Predict Speech Production Errors in Apraxia of Speech and Aphasia Dissociate

    PubMed Central

    Basilakos, Alexandra; Rorden, Chris; Bonilha, Leonardo; Moser, Dana; Fridriksson, Julius

    2015-01-01

    Background and Purpose: Acquired apraxia of speech (AOS) is a motor speech disorder caused by brain damage. AOS often co-occurs with aphasia, a language disorder in which patients may also demonstrate speech production errors. The overlap of speech production deficits in both disorders has raised questions regarding if AOS emerges from a unique pattern of brain damage or as a sub-element of the aphasic syndrome. The purpose of this study was to determine whether speech production errors in AOS and aphasia are associated with distinctive patterns of brain injury. Methods: Forty-three patients with history of a single left-hemisphere stroke underwent comprehensive speech and language testing. The Apraxia of Speech Rating Scale was used to rate speech errors specific to AOS versus speech errors that can also be associated with AOS and/or aphasia. Localized brain damage was identified using structural MRI, and voxel-based lesion-impairment mapping was used to evaluate the relationship between speech errors specific to AOS, those that can occur in AOS and/or aphasia, and brain damage. Results: The pattern of brain damage associated with AOS was most strongly associated with damage to cortical motor regions, with additional involvement of somatosensory areas. Speech production deficits that could be attributed to AOS and/or aphasia were associated with damage to the temporal lobe and the inferior pre-central frontal regions. Conclusion: AOS likely occurs in conjunction with aphasia due to the proximity of the brain areas supporting speech and language, but the neurobiological substrate for each disorder differs. PMID:25908457

  8. The Effect of Remote Masking on the Reception of Speech by Young School-Age Children.

    PubMed

    Youngdahl, Carla L; Healy, Eric W; Yoho, Sarah E; Apoux, Frédéric; Holt, Rachael Frush

    2018-02-15

    Psychoacoustic data indicate that infants and children are less likely than adults to focus on a spectral region containing an anticipated signal and are more susceptible to remote masking of a signal. These detection tasks suggest that infants and children, unlike adults, do not listen selectively. However, less is known about children's ability to listen selectively during speech recognition. Accordingly, the current study examines remote masking during speech recognition in children and adults. Adults and 7- and 5-year-old children performed sentence recognition in the presence of various spectrally remote maskers. Intelligibility was determined for each remote-masker condition, and performance was compared across age groups. It was found that speech recognition for 5-year-olds was reduced in the presence of spectrally remote noise, whereas the maskers had no effect on the 7-year-olds or adults. Maskers of different bandwidth and remoteness had similar effects. In accord with psychoacoustic data, young children do not appear to focus on a spectral region of interest and ignore other regions during speech recognition. This tendency may help account for their typically poorer speech perception in noise. This study also appears to capture an important developmental stage, during which a substantial refinement in spectral listening occurs.

  9. Conflict monitoring in speech processing: An fMRI study of error detection in speech production and perception.

    PubMed

    Gauvin, Hanna S; De Baene, Wouter; Brass, Marcel; Hartsuiker, Robert J

    2016-02-01

    To minimize the number of errors in speech, and thereby facilitate communication, speech is monitored before articulation. It is, however, unclear at which level during speech production monitoring takes place, and what mechanisms are used to detect and correct errors. The present study investigated whether internal verbal monitoring takes place through the speech perception system, as proposed by perception-based theories of speech monitoring, or whether mechanisms independent of perception are applied, as proposed by production-based theories of speech monitoring. With the use of fMRI during a tongue twister task we observed that error detection in internal speech during noise-masked overt speech production and error detection in speech perception both recruit the same neural network, which includes pre-supplementary motor area (pre-SMA), dorsal anterior cingulate cortex (dACC), anterior insula (AI), and inferior frontal gyrus (IFG). Although production and perception recruit similar areas, as proposed by perception-based accounts, we did not find activation in superior temporal areas (which are typically associated with speech perception) during internal speech monitoring in speech production as hypothesized by these accounts. On the contrary, results are highly compatible with a domain general approach to speech monitoring, by which internal speech monitoring takes place through detection of conflict between response options, which is subsequently resolved by a domain general executive center (e.g., the ACC). Copyright © 2015 Elsevier Inc. All rights reserved.

  10. Speech Recognition in Noise by Children with and without Dyslexia: How is it Related to Reading?

    PubMed

    Nittrouer, Susan; Krieg, Letitia M; Lowenstein, Joanna H

    2018-06-01

    Developmental dyslexia is commonly viewed as a phonological deficit that makes it difficult to decode written language. But children with dyslexia typically exhibit other problems, as well, including poor speech recognition in noise. The purpose of this study was to examine whether the speech-in-noise problems of children with dyslexia are related to their reading problems, and if so, if a common underlying factor might explain both. The specific hypothesis examined was that a spectral processing disorder results in these children receiving smeared signals, which could explain both the diminished sensitivity to phonological structure - leading to reading problems - and the speech recognition in noise difficulties. The alternative hypothesis tested in this study was that children with dyslexia simply have broadly based language deficits. Ninety-seven children between the ages of 7 years; 10 months and 12 years; 9 months participated: 46 with dyslexia and 51 without dyslexia. Children were tested on two dependent measures: word reading and recognition in noise with two types of sentence materials: as unprocessed (UP) signals, and as spectrally smeared (SM) signals. Data were collected for four predictor variables: phonological awareness, vocabulary, grammatical knowledge, and digit span. Children with dyslexia showed deficits on both dependent and all predictor variables. Their scores for speech recognition in noise were poorer than those of children without dyslexia for both the UP and SM signals, but by equivalent amounts across signal conditions indicating that they were not disproportionately hindered by spectral distortion. Correlation analyses on scores from children with dyslexia showed that reading ability and speech-in-noise recognition were only mildly correlated, and each skill was related to different underlying abilities. 
    No substantial evidence was found to support the suggestion that the reading and speech recognition in noise problems of children with dyslexia arise from a single factor that could be defined as a spectral processing disorder. The reading and speech recognition in noise deficits of these children appeared to be largely independent. Copyright © 2018 Elsevier Ltd. All rights reserved.

  11. Schizophrenia alters intra-network functional connectivity in the caudate for detecting speech under informational speech masking conditions.

    PubMed

    Zheng, Yingjun; Wu, Chao; Li, Juanhua; Li, Ruikeng; Peng, Hongjun; She, Shenglin; Ning, Yuping; Li, Liang

    2018-04-04

    Speech recognition under noisy "cocktail-party" environments involves multiple perceptual/cognitive processes, including target detection, selective attention, irrelevant signal inhibition, sensory/working memory, and speech production. Compared to healthy listeners, people with schizophrenia are more vulnerable to masking stimuli and perform worse in speech recognition under speech-on-speech masking conditions. Although the schizophrenia-related speech-recognition impairment under "cocktail-party" conditions is associated with deficits of various perceptual/cognitive processes, it is crucial to know whether the brain substrates critically underlying speech detection against informational speech masking are impaired in people with schizophrenia. Using functional magnetic resonance imaging (fMRI), this study investigated differences between people with schizophrenia (n = 19, mean age = 33 ± 10 years) and their matched healthy controls (n = 15, mean age = 30 ± 9 years) in intra-network functional connectivity (FC) specifically associated with target-speech detection under speech-on-speech-masking conditions. The target-speech detection performance under the speech-on-speech-masking condition in participants with schizophrenia was significantly worse than that in matched healthy participants (healthy controls). Moreover, in healthy controls, but not participants with schizophrenia, the strength of intra-network FC within the bilateral caudate was positively correlated with the speech-detection performance under the speech-masking conditions. Compared to controls, patients showed an altered spatial activity pattern and decreased intra-network FC in the caudate. In people with schizophrenia, the reduced speech-detection performance under speech-on-speech masking conditions is associated with reduced intra-caudate functional connectivity, which normally contributes to detecting target speech against speech masking via its functions of suppressing masking-speech signals.

  12. Characterizing Articulation in Apraxic Speech Using Real-Time Magnetic Resonance Imaging.

    PubMed

    Hagedorn, Christina; Proctor, Michael; Goldstein, Louis; Wilson, Stephen M; Miller, Bruce; Gorno-Tempini, Maria Luisa; Narayanan, Shrikanth S

    2017-04-14

    Real-time magnetic resonance imaging (MRI) and accompanying analytical methods are shown to capture and quantify salient aspects of apraxic speech, substantiating and expanding upon evidence provided by clinical observation and acoustic and kinematic data. Analysis of apraxic speech errors within a dynamic systems framework is provided and the nature of pathomechanisms of apraxic speech discussed. One adult male speaker with apraxia of speech was imaged using real-time MRI while producing spontaneous speech, repeated naming tasks, and self-paced repetition of word pairs designed to elicit speech errors. Articulatory data were analyzed, and speech errors were detected using time series reflecting articulatory activity in regions of interest. Real-time MRI captured two types of apraxic gestural intrusion errors in a word pair repetition task. Gestural intrusion errors in nonrepetitive speech, multiple silent initiation gestures at the onset of speech, and covert (unphonated) articulation of entire monosyllabic words were also captured. Real-time MRI and accompanying analytical methods capture and quantify many features of apraxic speech that have been previously observed using other modalities while offering high spatial resolution. This patient's apraxia of speech affected the ability to select only the appropriate vocal tract gestures for a target utterance, suppressing others, and to coordinate them in time.

  13. Application of advanced speech technology in manned penetration bombers

    NASA Astrophysics Data System (ADS)

    North, R.; Lea, W.

    1982-03-01

    This report documents research on the potential use of speech technology in a manned penetration bomber aircraft (B-52/G and H). The objectives of the project were to analyze the pilot/copilot crewstation tasks over a three-hour-and-forty-minute mission and determine the tasks that would benefit the most from conversion to speech recognition/generation, determine the technological feasibility of each of the identified tasks, and prioritize these tasks based on these criteria. Secondary objectives of the program were to enunciate research strategies in the application of speech technologies in airborne environments, and develop guidelines for briefing user commands on the potential of using speech technologies in the cockpit. The results of this study indicated that for the B-52 crewmember, speech recognition would be most beneficial for retrieving chart and procedural data that is contained in the flight manuals. The technological feasibility analysis indicated that the checklist and procedural retrieval tasks would be highly feasible for a speech recognition system.

  14. Using genetic algorithms with subjective input from human subjects: implications for fitting hearing aids and cochlear implants.

    PubMed

    Başkent, Deniz; Eiler, Cheryl L; Edwards, Brent

    2007-06-01

    To present a comprehensive analysis of the feasibility of genetic algorithms (GA) for finding the best fit of hearing aids or cochlear implants for individual users in clinical or research settings, where the algorithm is solely driven by subjective human input. Due to varying pathology, the best settings of an auditory device differ for each user. It is also likely that listening preferences vary at the same time. The settings of a device customized for a particular user can only be evaluated by the user. When optimization algorithms are used for fitting purposes, this situation poses a difficulty for a systematic and quantitative evaluation of the suitability of the fitting parameters produced by the algorithm. In the present study, an artificial listening environment was generated by distorting speech using a noiseband vocoder. The settings produced by the GA for this listening problem could objectively be evaluated by measuring speech recognition and comparing the performance to the best vocoder condition where speech was least distorted. Nine normal-hearing subjects participated in the study. The parameters to be optimized were the number of vocoder channels, the shift between the input frequency range and the synthesis frequency range, and the compression-expansion of the input frequency range over the synthesis frequency range. The subjects listened to pairs of sentences processed with the vocoder, and entered a preference for the sentence with better intelligibility. The GA modified the solutions iteratively according to the subject preferences. The program converged when the user ranked the same set of parameters as the best in three consecutive steps. The results produced by the GA were analyzed for quality by measuring speech intelligibility, for test-retest reliability by running the GA three times with each subject, and for convergence properties. 
    Speech recognition scores averaged across subjects were similar for the best vocoder solution and for the solutions produced by the GA. The average number of iterations was 8 and the average convergence time was 25.5 minutes. The settings produced by different GA runs for the same subject were slightly different; however, speech recognition scores measured with these settings were similar. Individual data from subjects showed that in each run, a small number of GA solutions produced poorer speech intelligibility than for the best setting. This was probably a result of the combination of the inherent randomness of the GA, the convergence criterion used in the present study, and possible errors that the users might have made during the paired comparisons. On the other hand, the effect of these errors was probably small compared to the other two factors, as a comparison between subjective preferences and objective measures showed that for many subjects the two were in good agreement. The results showed that the GA was able to produce good solutions by using listener preferences in a relatively short time. For practical applications, the program can be made more robust by running the GA twice or by not using an automatic stopping criterion, and it can be made faster by optimizing the number of the paired comparisons completed in each iteration.
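
The preference-driven loop described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the parameter encoding, the mutation and crossover rules, the population handling, and the simulated listener are all hypothetical. The only detail taken from the abstract is the stopping rule (the same parameter set is ranked best in three consecutive steps).

```python
import random

# Hypothetical encoding: (vocoder channels, frequency shift, compression-expansion).

def make_simulated_listener(target):
    """Stand-in for a human subject: prefers the setting closer to `target`."""
    def score(p):
        return -sum((x - t) ** 2 for x, t in zip(p, target))

    def prefer(a, b):
        # One paired comparison: return the preferred of two settings.
        return a if score(a) >= score(b) else b

    return prefer, score

def best_by_pairs(pop, prefer):
    """Find the preferred setting using only paired comparisons."""
    best = pop[0]
    for cand in pop[1:]:
        best = prefer(best, cand)
    return best

def crossover(a, b, rng):
    # Take each parameter from one of the two parents at random.
    return tuple(rng.choice(pair) for pair in zip(a, b))

def mutate(p, rng):
    # Small random perturbation of each parameter (assumed step sizes).
    channels, shift, comp = p
    return (max(1, channels + rng.choice([-1, 0, 1])),
            shift + rng.gauss(0, 0.5),
            comp + rng.gauss(0, 0.05))

def run_ga(init_pop, prefer, rng, patience=3, max_iters=50):
    """Iterate until the same setting wins `patience` consecutive steps."""
    pop = list(init_pop)
    best, streak, iters = None, 0, 0
    while iters < max_iters:
        iters += 1
        winner = best_by_pairs(pop, prefer)
        streak = streak + 1 if winner == best else 1
        best = winner
        if streak >= patience:
            break
        # Keep the current winner and refill the population with its offspring.
        pop = [best] + [mutate(crossover(best, rng.choice(pop), rng), rng)
                        for _ in range(len(pop) - 1)]
    return best, iters
```

Because the current winner is always carried into the next population, a consistent listener can never end up with a worse setting than the best initial one. A real listener answers inconsistently at times, which is one of the error sources the abstract mentions.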

  15. Foot Structure in Japanese Speech Errors: Normal vs. Pathological

    ERIC Educational Resources Information Center

    Miyakoda, Haruko

    2008-01-01

    Although many studies of speech errors have been presented in the literature, most have focused on errors occurring at either the segmental or feature level. Few, if any, studies have dealt with the prosodic structure of errors. This paper aims to fill this gap by taking up the issue of prosodic structure in Japanese speech errors, with a focus on…

  16. The Frame Constraint on Experimentally Elicited Speech Errors in Japanese

    ERIC Educational Resources Information Center

    Saito, Akie; Inoue, Tomoyoshi

    2017-01-01

    The so-called syllable position effect in speech errors has been interpreted as reflecting constraints posed by the frame structure of a given language, which is separately operating from linguistic content during speech production. The effect refers to the phenomenon that when a speech error occurs, replaced and replacing sounds tend to be in the…

  17. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    NASA Astrophysics Data System (ADS)

    Selouani, Sid-Ahmed; O'Shaughnessy, Douglas

    2003-12-01

    Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axis issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms that of the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to [value not available: see full text] dB. We also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

  18. How linguistic closure and verbal working memory relate to speech recognition in noise--a review.

    PubMed

    Besser, Jana; Koelewijn, Thomas; Zekveld, Adriana A; Kramer, Sophia E; Festen, Joost M

    2013-06-01

    The ability to recognize masked speech, commonly measured with a speech reception threshold (SRT) test, is associated with cognitive processing abilities. Two cognitive factors frequently assessed in speech recognition research are the capacity of working memory (WM), measured by means of a reading span (Rspan) or listening span (Lspan) test, and the ability to read masked text (linguistic closure), measured by the text reception threshold (TRT). The current article provides a review of recent hearing research that examined the relationship of TRT and WM span to SRTs in various maskers. Furthermore, modality differences in WM capacity assessed with the Rspan compared to the Lspan test were examined and related to speech recognition abilities in an experimental study with young adults with normal hearing (NH). Span scores were strongly associated with each other, but were higher in the auditory modality. The results of the reviewed studies suggest that TRT and WM span are related to each other, but differ in their relationships with SRT performance. In NH adults of middle age or older, both TRT and Rspan were associated with SRTs in speech maskers, whereas TRT better predicted speech recognition in fluctuating nonspeech maskers. The associations with SRTs in steady-state noise were inconclusive for both measures. WM span was positively related to benefit from contextual information in speech recognition, but better TRTs related to less interference from unrelated cues. Data for individuals with impaired hearing are limited, but larger WM span seems to give a general advantage in various listening situations.

  19. How Linguistic Closure and Verbal Working Memory Relate to Speech Recognition in Noise—A Review

    PubMed Central

    Koelewijn, Thomas; Zekveld, Adriana A.; Kramer, Sophia E.; Festen, Joost M.

    2013-01-01

    The ability to recognize masked speech, commonly measured with a speech reception threshold (SRT) test, is associated with cognitive processing abilities. Two cognitive factors frequently assessed in speech recognition research are the capacity of working memory (WM), measured by means of a reading span (Rspan) or listening span (Lspan) test, and the ability to read masked text (linguistic closure), measured by the text reception threshold (TRT). The current article provides a review of recent hearing research that examined the relationship of TRT and WM span to SRTs in various maskers. Furthermore, modality differences in WM capacity assessed with the Rspan compared to the Lspan test were examined and related to speech recognition abilities in an experimental study with young adults with normal hearing (NH). Span scores were strongly associated with each other, but were higher in the auditory modality. The results of the reviewed studies suggest that TRT and WM span are related to each other, but differ in their relationships with SRT performance. In NH adults of middle age or older, both TRT and Rspan were associated with SRTs in speech maskers, whereas TRT better predicted speech recognition in fluctuating nonspeech maskers. The associations with SRTs in steady-state noise were inconclusive for both measures. WM span was positively related to benefit from contextual information in speech recognition, but better TRTs related to less interference from unrelated cues. Data for individuals with impaired hearing are limited, but larger WM span seems to give a general advantage in various listening situations. PMID:23945955

  20. The Effect of Dynamic Pitch on Speech Recognition in Temporally Modulated Noise

    ERIC Educational Resources Information Center

    Shen, Jung; Souza, Pamela E.

    2017-01-01

    Purpose: This study investigated the effect of dynamic pitch in target speech on older and younger listeners' speech recognition in temporally modulated noise. First, we examined whether the benefit from dynamic-pitch cues depends on the temporal modulation of noise. Second, we tested whether older listeners can benefit from dynamic-pitch cues for…

  1. Introduction and Overview of the Vicens-Reddy Speech Recognition System.

    ERIC Educational Resources Information Center

    Kameny, Iris; Ritea, H.

    The Vicens-Reddy System is unique in the sense that it approaches the problem of speech recognition as a whole, rather than treating particular aspects of the problems as in previous attempts. For example, where earlier systems treated only segmentation of speech into phoneme groups, or detected phonemes in a given context, the Vicens-Reddy System…

  2. Accommodation and Compliance Series: Employees with Arthritis

    MedlinePlus

    ... handed keyboard, an articulating keyboard tray, speech recognition software, a trackball, and office equipment for a workstation ... space heater, additional window insulation, and speech recognition software. An insurance clerk with arthritis from systemic lupus ...

  3. Patterns of poststroke brain damage that predict speech production errors in apraxia of speech and aphasia dissociate.

    PubMed

    Basilakos, Alexandra; Rorden, Chris; Bonilha, Leonardo; Moser, Dana; Fridriksson, Julius

    2015-06-01

    Acquired apraxia of speech (AOS) is a motor speech disorder caused by brain damage. AOS often co-occurs with aphasia, a language disorder in which patients may also demonstrate speech production errors. The overlap of speech production deficits in both disorders has raised questions on whether AOS emerges from a unique pattern of brain damage or as a subelement of the aphasic syndrome. The purpose of this study was to determine whether speech production errors in AOS and aphasia are associated with distinctive patterns of brain injury. Forty-three patients with history of a single left-hemisphere stroke underwent comprehensive speech and language testing. The AOS Rating Scale was used to rate speech errors specific to AOS versus speech errors that can also be associated with both AOS and aphasia. Localized brain damage was identified using structural magnetic resonance imaging, and voxel-based lesion-impairment mapping was used to evaluate the relationship between speech errors specific to AOS, those that can occur in AOS or aphasia, and brain damage. The pattern of brain damage associated with AOS was most strongly associated with damage to cortical motor regions, with additional involvement of somatosensory areas. Speech production deficits that could be attributed to AOS or aphasia were associated with damage to the temporal lobe and the inferior precentral frontal regions. AOS likely occurs in conjunction with aphasia because of the proximity of the brain areas supporting speech and language, but the neurobiological substrate for each disorder differs. © 2015 American Heart Association, Inc.

  4. [Research on Barrier-free Home Environment System Based on Speech Recognition].

    PubMed

    Zhu, Husheng; Yu, Hongliu; Shi, Ping; Fang, Youfang; Jian, Zhuo

    2015-10-01

    The number of people with physical disabilities is increasing year by year, and population aging is becoming more pronounced. To improve their quality of life, an accessible home environment control system was developed that lets patients with serious disabilities control home electrical devices by voice. The control system includes a central control platform, a speech recognition module, a terminal operation module, etc. The system combines speech recognition control technology and wireless information transmission technology with embedded mobile computing technology, and interconnects the lamps, electronic locks, alarms, TV and other electrical devices in the home environment as a whole system through a wireless network node. The experimental results showed that the speech recognition success rate was more than 84% in the home environment.

  5. Noise-robust speech recognition through auditory feature detection and spike sequence decoding.

    PubMed

    Schafer, Phillip B; Jin, Dezhe Z

    2014-03-01

    Speech recognition in noisy conditions is a major challenge for computer systems, but the human brain performs it routinely and accurately. Automatic speech recognition (ASR) systems that are inspired by neuroscience can potentially bridge the performance gap between humans and machines. We present a system for noise-robust isolated word recognition that works by decoding sequences of spikes from a population of simulated auditory feature-detecting neurons. Each neuron is trained to respond selectively to a brief spectrotemporal pattern, or feature, drawn from the simulated auditory nerve response to speech. The neural population conveys the time-dependent structure of a sound by its sequence of spikes. We compare two methods for decoding the spike sequences--one using a hidden Markov model-based recognizer, the other using a novel template-based recognition scheme. In the latter case, words are recognized by comparing their spike sequences to template sequences obtained from clean training data, using a similarity measure based on the length of the longest common sub-sequence. Using isolated spoken digits from the AURORA-2 database, we show that our combined system outperforms a state-of-the-art robust speech recognizer at low signal-to-noise ratios. Both the spike-based encoding scheme and the template-based decoding offer gains in noise robustness over traditional speech recognition methods. Our system highlights potential advantages of spike-based acoustic coding and provides a biologically motivated framework for robust ASR development.
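
The template-based decoding idea in this abstract, comparing a word's spike sequence with stored template sequences via the length of their longest common subsequence, can be sketched as below. This is a minimal illustration, not the authors' method in detail: representing a spike sequence as a list of neuron labels, normalizing by the longer sequence length, and the helper names `lcs_length`, `similarity`, and `classify` are all assumptions made for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """LCS length scaled to [0, 1]; the normalization is an assumption."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

def classify(spikes, templates):
    """Return the word whose template sequence is most similar to `spikes`."""
    return max(templates, key=lambda word: similarity(spikes, templates[word]))

# Hypothetical templates: each word maps to the ordered labels of the
# feature-detecting neurons that fired in clean training data.
templates = {
    "one": [3, 1, 4, 1, 5],
    "two": [2, 7, 1, 8, 2],
}
noisy = [3, 1, 4, 5]  # one spike dropped by noise
print(classify(noisy, templates))
```

A subsequence match tolerates dropped or inserted spikes as long as the order of the remaining spikes is preserved, which is one plausible reason such a measure would be useful for noisy input.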

  6. Speech-recognition interfaces for music information retrieval

    NASA Astrophysics Data System (ADS)

    Goto, Masataka

    2005-09-01

    This paper describes two hands-free music information retrieval (MIR) systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. Our MIR-based jukebox systems employ two different speech-recognition interfaces for MIR, speech completion and speech spotter, which exploit intentionally controlled nonverbal speech information in original ways. The first is a music retrieval system with the speech-completion interface that is suitable for music stores and car-driving situations. When a user only remembers part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface that can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces. (Video clips: http://staff.aist.go.jp/m.goto/MIR/speech-if.html)

  7. Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.

    ERIC Educational Resources Information Center

    Makhoul, John I.

    The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. A fixed vocabulary of 55 syllables is used in the recognition system which contains 11 stops and fricatives and five tense vowels. The results of an experiment on speaker…

  8. Dynamic relation between working memory capacity and speech recognition in noise during the first 6 months of hearing aid use.

    PubMed

    Ng, Elaine H N; Classon, Elisabet; Larsby, Birgitta; Arlinger, Stig; Lunner, Thomas; Rudner, Mary; Rönnberg, Jerker

    2014-11-23

    The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced. © The Author(s) 2014.

  9. Recognition of Speech from the Television with Use of a Wireless Technology Designed for Cochlear Implants.

    PubMed

    Duke, Mila Morais; Wolfe, Jace; Schafer, Erin

    2016-05-01

    Cochlear implant (CI) recipients often experience difficulty understanding speech in noise and speech that originates from a distance. Many CI recipients also experience difficulty understanding speech originating from a television. Use of hearing assistance technology (HAT) may improve speech recognition in noise and for signals that originate from more than a few feet from the listener; however, there are no published studies evaluating the potential benefits of a wireless HAT designed to deliver audio signals from a television directly to a CI sound processor. The objective of this study was to compare speech recognition in quiet and in noise of CI recipients with the use of their CI alone and with the use of their CI and a wireless HAT (Cochlear Wireless TV Streamer). A two-way repeated measures design was used to evaluate performance differences obtained in quiet and in competing noise (65 dBA) with the CI sound processor alone and with the sound processor coupled to the Cochlear Wireless TV Streamer. Sixteen users of Cochlear Nucleus 24 Freedom, CI512, and CI422 implants were included in the study. Participants were evaluated in four conditions including use of the sound processor alone and use of the sound processor with the wireless streamer in quiet and in the presence of competing noise at 65 dBA. Speech recognition was evaluated in each condition with two full lists of Computer-Assisted Speech Perception Testing and Training Sentence-Level Test sentences presented from a light-emitting diode television. Speech recognition in noise was significantly better with use of the wireless streamer compared to participants' performance with their CI sound processor alone. There was also a nonsignificant trend toward better performance in quiet with use of the TV Streamer. Performance was significantly poorer when evaluated in noise compared to performance in quiet when the TV Streamer was not used. 
    Use of the Cochlear Wireless TV Streamer designed to stream audio from a television directly to a CI sound processor provides better speech recognition in quiet and in noise when compared to performance obtained with use of the CI sound processor alone. American Academy of Audiology.

  10. Institute for the Study of Human Capabilities Summary Descriptions of Research for the Period September 1988 through June 1989

    DTIC Science & Technology

    1989-06-01

    12 1.7 Application of the Modified Speech Transmission Index to Monaural and Binaural Speech Recognition in Normal and Impaired...describe all of the data from both groups. 1.7 Application of the Modified Speech Transmission Index to Monaural and Binaural Speech Recognition in...were obtained for materials presented to each ear separately (monaurally) and to both ears ( binaurally ). Results from the normal listeners are accurately

  11. Emotion recognition from speech: tools and challenges

    NASA Astrophysics Data System (ADS)

    Al-Talabani, Abdulbasit; Sellahewa, Harin; Jassim, Sabah A.

    2015-05-01

    Human emotion recognition from speech is studied frequently because of its importance in many applications, e.g. human-computer interaction. There is wide diversity and a lack of agreement about the basic emotions or emotion-related states on the one hand, and about where the emotion-related information lies in the speech signal on the other. This diversity motivates our investigation into extracting Meta-features using the PCA approach, or using a non-adaptive random projection (RP), which significantly reduces the large-dimensional speech feature vectors that may contain a wide range of emotion-related information. Subsets of Meta-features are fused to increase the performance of the recognition model, which adopts the score-based LDC classifier. We shall demonstrate that our scheme outperforms state-of-the-art results when tested on non-prompted databases or acted databases (i.e. when subjects act specific emotions while uttering a sentence). However, the huge gap between the accuracy rates achieved on the different types of speech datasets raises questions about the way emotions modulate speech. In particular, we shall argue that emotion recognition from speech should not be dealt with as a classification problem. We shall demonstrate the presence of a spectrum of different emotions in the same speech portion, especially in the non-prompted datasets, which tend to be more "natural" than the acted datasets, where the subjects attempt to suppress all but one emotion.

  12. Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN

    PubMed Central

    Zhu, Lianzhang; Chen, Leiming; Zhao, Dehai

    2017-01-01

    Accurate emotion recognition from speech is important for applications like smart health care, smart entertainment, and other smart services. High accuracy emotion recognition from Chinese speech is challenging due to the complexities of the Chinese language. In this paper, we explore how to improve the accuracy of speech emotion recognition, including speech signal feature extraction and emotion classification methods. Five types of features are extracted from a speech sample: mel frequency cepstrum coefficient (MFCC), pitch, formant, short-term zero-crossing rate and short-term energy. By comparing statistical features with deep features extracted by a Deep Belief Network (DBN), we attempt to find the best features to identify the emotion status for speech. We propose a novel classification method that combines DBN and SVM (support vector machine) instead of using only one of them. In addition, a conjugate gradient method is applied to train DBN in order to speed up the training process. Gender-dependent experiments are conducted using an emotional speech database created by the Chinese Academy of Sciences. The results show that DBN features can reflect emotion status better than artificial features, and our new classification approach achieves an accuracy of 95.8%, which is higher than using either DBN or SVM separately. Results also show that DBN can work very well for small training databases if it is properly designed. PMID:28737705
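    Two of the five feature types listed above, short-term energy and short-term zero-crossing rate, are simple enough to sketch directly. The following is an illustrative sketch only: the frame length, hop size, and the convention of treating zero-valued samples as non-negative are assumptions, not details taken from the paper.

```python
def split_frames(signal, frame_len, hop):
    """Split a sample list into overlapping frames (frame_len >= 2 assumed)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def short_term_energy(frame):
    """Sum of squared sample values within one frame."""
    return sum(s * s for s in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# Toy signal: four alternating samples followed by four constant samples.
signal = [1.0, -1.0, 1.0, -1.0, 0.5, 0.5, 0.5, 0.5]
frames = split_frames(signal, frame_len=4, hop=4)
energies = [short_term_energy(f) for f in frames]
zcrs = [zero_crossing_rate(f) for f in frames]
```

    In a pipeline like the one described, per-frame values of this kind would typically be summarized into statistics over an utterance before being passed to a classifier.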

  13. Speed and Accuracy of Rapid Speech Output by Adolescents with Residual Speech Sound Errors Including Rhotics

    ERIC Educational Resources Information Center

    Preston, Jonathan L.; Edwards, Mary Louise

    2009-01-01

    Children with residual speech sound errors are often underserved clinically, yet there has been a lack of recent research elucidating the specific deficits in this population. Adolescents aged 10-14 with residual speech sound errors (RE) that included rhotics were compared to normally speaking peers on tasks assessing speed and accuracy of speech…

  14. The Role of Supralexical Prosodic Units in Speech Production: Evidence from the Distribution of Speech Errors

    ERIC Educational Resources Information Center

    Choe, Wook Kyung

    2013-01-01

    The current dissertation represents one of the first systematic studies of the distribution of speech errors within supralexical prosodic units. Four experiments were conducted to gain insight into the specific role of these units in speech planning and production. The first experiment focused on errors in adult English. These were found to be…

  15. Hearing Handicap and Speech Recognition Correlate With Self-Reported Listening Effort and Fatigue.

    PubMed

    Alhanbali, Sara; Dawes, Piers; Lloyd, Simon; Munro, Kevin J

    To investigate the correlations between hearing handicap, speech recognition, listening effort, and fatigue. Eighty-four adults with hearing loss (65 to 85 years) completed three self-report questionnaires: the Fatigue Assessment Scale, the Effort Assessment Scale, and the Hearing Handicap Inventory for Elderly. Audiometric assessment included pure-tone audiometry and speech recognition in noise. There was a significant positive correlation between handicap and fatigue (r = 0.39, p < 0.05) and handicap and effort (r = 0.73, p < 0.05). There were significant (but lower) correlations between speech recognition and fatigue (r = 0.22, p < 0.05) or effort (r = 0.32, p< 0.05). There was no significant correlation between hearing level and fatigue or effort. Hearing handicap and speech recognition both correlate with self-reported listening effort and fatigue, which is consistent with a model of listening effort and fatigue where perceived difficulty is related to sustained effort and fatigue for unrewarding tasks over which the listener has low control. A clinical implication is that encouraging clients to recognize and focus on the pleasure and positive experiences of listening may result in greater satisfaction and benefit from hearing aid use.

  16. Development of coffee maker service robot using speech and face recognition systems using POMDP

    NASA Astrophysics Data System (ADS)

    Budiharto, Widodo; Meiliana; Santoso Gunawan, Alexander Agung

    2016-07-01

    Many intelligent service robots have been developed to interact with users naturally. This goal can be achieved by embedding speech and face recognition abilities for specific tasks in the robot. In this research, we propose an Intelligent Coffee Maker Robot whose speech recognition is based on the Indonesian language and powered by statistical dialogue systems. This kind of robot can be used in an office, supermarket, or restaurant. In our scenario, the robot recognizes the user's face and then accepts commands from the user to perform an action, specifically making coffee. Based on our previous work, the accuracy is about 86% for speech recognition and about 93% for face recognition in laboratory experiments. The main problem here is to determine the user's intention regarding the sweetness of the coffee. The robot must infer the user's intention through conversation, relying on unreliable automatic speech recognition in a noisy environment. In this paper, this spoken dialog problem is treated as a partially observable Markov decision process (POMDP). We describe how this formulation establishes a promising framework, supported by empirical results. Dialog simulations are presented that demonstrate significant quantitative outcomes.
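    To illustrate why a POMDP suits dialog under unreliable recognition, the standard belief update can be sketched as below. This is a generic textbook step, not the authors' system: the two sweetness states, the confusion probabilities, and the fixed (action-independent) transition model are hypothetical values chosen for the example.

```python
def belief_update(belief, transition, observation_prob, obs):
    """One POMDP belief update for a fixed action.

    b'(s') is proportional to P(obs | s') * sum_s P(s' | s) * b(s).
    """
    states = list(belief)
    unnormalized = {}
    for s_next in states:
        predicted = sum(transition[s][s_next] * belief[s] for s in states)
        unnormalized[s_next] = observation_prob[s_next][obs] * predicted
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical model: the user's desired sweetness does not change mid-dialog.
transition = {"sweet": {"sweet": 1.0, "plain": 0.0},
              "plain": {"sweet": 0.0, "plain": 1.0}}
# Hypothetical recognizer confusion: P(heard word | true intention).
observation_prob = {"sweet": {"sweet": 0.8, "plain": 0.2},
                    "plain": {"sweet": 0.3, "plain": 0.7}}

prior = {"sweet": 0.5, "plain": 0.5}
posterior = belief_update(prior, transition, observation_prob, "sweet")
```

    With these assumed numbers, hearing "sweet" once raises the belief in the "sweet" intention to about 0.73 rather than to certainty, so a dialog policy could choose to ask a confirming question before acting.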

  17. Financial and workflow analysis of radiology reporting processes in the planning phase of implementation of a speech recognition system

    NASA Astrophysics Data System (ADS)

    Whang, Tom; Ratib, Osman M.; Umamoto, Kathleen; Grant, Edward G.; McCoy, Michael J.

    2002-05-01

    The goal of this study is to determine the financial value and workflow improvements achievable by replacing traditional transcription services with a speech recognition system in a large, university hospital setting. Workflow metrics were measured at two hospitals, one of which exclusively uses a transcription service (UCLA Medical Center), and the other which exclusively uses speech recognition (West Los Angeles VA Hospital). Workflow metrics include time spent per report (the sum of time spent interpreting, dictating, reviewing, and editing), transcription turnaround, and total report turnaround. Compared to traditional transcription, speech recognition resulted in radiologists spending 13-32% more time per report, but it also resulted in reduction of report turnaround time by 22-62% and reduction of marginal cost per report by 94%. The model developed here helps justify the introduction of a speech recognition system by showing that the benefits of reduced operating costs and decreased turnaround time outweigh the cost of increased time spent per report. Whether the ultimate goal is to achieve a financial objective or to improve operational efficiency, it is important to conduct a thorough analysis of workflow before implementation.

  18. Effects of cooperating and conflicting cues on speech intonation recognition by cochlear implant users and normal hearing listeners.

    PubMed

    Peng, Shu-Chen; Lu, Nelson; Chatterjee, Monita

    2009-01-01

    Cochlear implant (CI) recipients have only limited access to fundamental frequency (F0) information, and thus exhibit deficits in speech intonation recognition. For speech intonation, F0 serves as the primary cue, and other potential acoustic cues (e.g. intensity properties) may also contribute. This study examined the effects of cooperating or conflicting acoustic cues on speech intonation recognition by adult CI and normal hearing (NH) listeners with full-spectrum and spectrally degraded speech stimuli. Identification of speech intonation that signifies question and statement contrasts was measured in 13 CI recipients and 4 NH listeners, using resynthesized bi-syllabic words, where F0 and intensity properties were systematically manipulated. The stimulus set was comprised of tokens whose acoustic cues (i.e. F0 contour and intensity patterns) were either cooperating or conflicting. Subjects identified if each stimulus is a 'statement' or a 'question' in a single-interval, 2-alternative forced-choice (2AFC) paradigm. Logistic models were fitted to the data, and estimated coefficients were compared under cooperating and conflicting conditions, between the subject groups (CI vs. NH), and under full-spectrum and spectrally degraded conditions for NH listeners. The results indicated that CI listeners' intonation recognition was enhanced by cooperating F0 contour and intensity cues, but was adversely affected by these cues being conflicting. On the other hand, with full-spectrum stimuli, NH listeners' intonation recognition was not affected by cues being cooperating or conflicting. The effects of cues being cooperating or conflicting were comparable between the CI group and NH listeners with spectrally degraded stimuli. These findings suggest the importance of taking multiple acoustic sources for speech recognition into consideration in aural rehabilitation for CI recipients. Copyright (C) 2009 S. Karger AG, Basel.

  19. Effects of cooperating and conflicting cues on speech intonation recognition by cochlear implant users and normal hearing listeners

    PubMed Central

    Peng, Shu-Chen; Lu, Nelson; Chatterjee, Monita

    2009-01-01

    Cochlear implant (CI) recipients have only limited access to fundamental frequency (F0) information, and thus exhibit deficits in speech intonation recognition. For speech intonation, F0 serves as the primary cue, and other potential acoustic cues (e.g., intensity properties) may also contribute. This study examined the effects of acoustic cues being cooperating or conflicting on speech intonation recognition by adult cochlear implant (CI), and normal-hearing (NH) listeners with full-spectrum and spectrally degraded speech stimuli. Identification of speech intonation that signifies question and statement contrasts was measured in 13 CI recipients and 4 NH listeners, using resynthesized bi-syllabic words, where F0 and intensity properties were systematically manipulated. The stimulus set was comprised of tokens whose acoustic cues, i.e., F0 contour and intensity patterns, were either cooperating or conflicting. Subjects identified if each stimulus is a “statement” or a “question” in a single-interval, two-alternative forced-choice (2AFC) paradigm. Logistic models were fitted to the data, and estimated coefficients were compared under cooperating and conflicting conditions, between the subject groups (CI vs. NH), and under full-spectrum and spectrally degraded conditions for NH listeners. The results indicated that CI listeners’ intonation recognition was enhanced by F0 contour and intensity cues being cooperating, but was adversely affected by these cues being conflicting. On the other hand, with full-spectrum stimuli, NH listeners’ intonation recognition was not affected by cues being cooperating or conflicting. The effects of cues being cooperating or conflicting were comparable between the CI group and NH listeners with spectrally-degraded stimuli. These findings suggest the importance of taking multiple acoustic sources for speech recognition into consideration in aural rehabilitation for CI recipients. PMID:19372651

  20. Improving Mobile Phone Speech Recognition by Personalized Amplification: Application in People with Normal Hearing and Mild-to-Moderate Hearing Loss.

    PubMed

    Kam, Anna Chi Shan; Sung, John Ka Keung; Lee, Tan; Wong, Terence Ka Cheong; van Hasselt, Andrew

    In this study, the authors evaluated the effect of personalized amplification on mobile phone speech recognition in people with and without hearing loss. This prospective study used double-blind, within-subjects, repeated measures, controlled trials to evaluate the effectiveness of applying personalized amplification based on the hearing level captured on the mobile device. The personalized amplification settings were created using modified one-third gain targets. The participants in this study included 100 adults of age between 20 and 78 years (60 with age-adjusted normal hearing and 40 with hearing loss). The performance of the participants with personalized amplification and standard settings was compared using both subjective and speech-perception measures. Speech recognition was measured in quiet and in noise using Cantonese disyllabic words. Subjective ratings on the quality, clarity, and comfortableness of the mobile signals were measured with an 11-point visual analog scale. Subjective preferences of the settings were also obtained by a paired-comparison procedure. The personalized amplification application provided better speech recognition via the mobile phone both in quiet and in noise for people with hearing impairment (improved 8 to 10%) and people with normal hearing (improved 1 to 4%). The improvement in speech recognition was significantly better for people with hearing impairment. When the average device output level was matched, more participants preferred to have the individualized gain than not to have it. The personalized amplification application has the potential to improve speech recognition for people with mild-to-moderate hearing loss, as well as people with normal hearing, in particular when listening in noisy environments.

  1. Study of environmental sound source identification based on hidden Markov model for robust speech recognition

    NASA Astrophysics Data System (ADS)

    Nishiura, Takanobu; Nakamura, Satoshi

    2003-10-01

    Humans communicate with each other through speech by focusing on the target speech among environmental sounds in real acoustic environments. We can easily identify the target sound from other environmental sounds. For hands-free speech recognition, the identification of the target speech from environmental sounds is imperative. This mechanism may also be important for a self-moving robot to sense the acoustic environments and communicate with humans. Therefore, this paper first proposes hidden Markov model (HMM)-based environmental sound source identification. Environmental sounds are modeled by three states of HMMs and evaluated using 92 kinds of environmental sounds. The identification accuracy was 95.4%. This paper also proposes a new HMM composition method that composes speech HMMs and an HMM of categorized environmental sounds for robust environmental sound-added speech recognition. As a result of the evaluation experiments, we confirmed that the proposed HMM composition outperforms the conventional HMM composition with speech HMMs and a noise (environmental sound) HMM trained using noise periods prior to the target speech in a captured signal. [Work supported by Ministry of Public Management, Home Affairs, Posts and Telecommunications of Japan.]

  2. Real-time spectrum estimation–based dual-channel speech-enhancement algorithm for cochlear implant

    PubMed Central

    2012-01-01

    Background Improvement of the cochlear implant (CI) front-end signal acquisition is needed to increase speech recognition in noisy environments. To suppress the directional noise, we introduce a speech-enhancement algorithm based on microphone array beamforming and spectral estimation. The experimental results indicate that this method is robust to directional mobile noise and strongly enhances the desired speech, thereby improving the performance of CI devices in a noisy environment. Methods The spectrum estimation and the array beamforming methods were combined to suppress the ambient noise. The directivity coefficient was estimated in the noise-only intervals, and was updated to fit the mobile noise. Results The proposed algorithm was realized in the CI speech strategy. For actual parameters, we use a Maxflat filter to obtain fractional sampling points and the cepstrum method to differentiate the desired speech frame from the noise frame. The broadband adjustment coefficients were added to compensate for the energy loss in the low frequency band. Discussions The approximation of the directivity coefficient is tested and the errors are discussed. We also analyze the algorithm constraint for noise estimation and distortion in CI processing. The performance of the proposed algorithm is analyzed and is further compared with other prevalent methods. Conclusions The hardware platform was constructed for the experiments. The speech-enhancement results showed that our algorithm can suppress the non-stationary noise with high SNR. Excellent performance of the proposed algorithm was obtained in the speech enhancement experiments and mobile testing. Signal distortion results indicate that this algorithm is robust with high SNR improvement and low speech distortion. PMID:23006896

  3. Statistical properties of Chinese phonemic networks

    NASA Astrophysics Data System (ADS)

    Yu, Shuiyuan; Liu, Haitao; Xu, Chunshan

    2011-04-01

    The study of properties of speech sound systems is of great significance in understanding the human cognitive mechanism and the working principles of speech sound systems. Some properties of speech sound systems, such as the listener-oriented feature and the talker-oriented feature, have been unveiled with the statistical study of phonemes in human languages and the research of the interrelations between human articulatory gestures and the corresponding acoustic parameters. With all the phonemes of speech sound systems treated as a coherent whole, our research, which focuses on the dynamic properties of speech sound systems in operation, investigates some statistical parameters of Chinese phoneme networks based on real text and dictionaries. The findings are as follows: phonemic networks have high connectivity degrees and short average distances; the degrees obey normal distribution and the weighted degrees obey power law distribution; vowels enjoy higher priority than consonants in the actual operation of speech sound systems; the phonemic networks have high robustness against targeted attacks and random errors. In addition, for investigating the structural properties of a speech sound system, a statistical study of dictionaries is conducted, which shows the higher frequency of shorter words and syllables and the tendency that the longer a word is, the shorter the syllables composing it are. From these structural properties and dynamic properties one can derive the following conclusion: the static structure of a speech sound system tends to promote communication efficiency and save articulation effort while the dynamic operation of this system gives preference to reliable transmission and easy recognition. In short, a speech sound system is an effective, efficient and reliable communication system optimized in many aspects.

  4. Speech therapy and voice recognition instrument

    NASA Technical Reports Server (NTRS)

    Cohen, J.; Babcock, M. L.

    1972-01-01

    Characteristics of an electronic circuit for examining variations in vocal excitation for diagnostic purposes and in speech recognition for determining voice patterns and pitch changes are described. Operation of the circuit is discussed and a circuit diagram is provided.

  5. Evaluation of the importance of time-frequency contributions to speech intelligibility in noise

    PubMed Central

    Yu, Chengzhu; Wójcicki, Kamil K.; Loizou, Philipos C.; Hansen, John H. L.; Johnson, Michael T.

    2014-01-01

    Recent studies on binary masking techniques make the assumption that each time-frequency (T-F) unit contributes an equal amount to the overall intelligibility of speech. The present study demonstrated that the importance of each T-F unit to speech intelligibility varies in accordance with speech content. Specifically, T-F units are categorized into two classes, speech-present T-F units and speech-absent T-F units. Results indicate that the importance of each speech-present T-F unit to speech intelligibility is highly related to the loudness of its target component, while the importance of each speech-absent T-F unit varies according to the loudness of its masker component. Two types of mask errors are also considered, which include miss and false alarm errors. Consistent with previous work, false alarm errors are shown to be more harmful to speech intelligibility than miss errors when the mixture signal-to-noise ratio (SNR) is below 0 dB. However, the relative importance between the two types of error is conditioned on the SNR level of the input speech signal. Based on these observations, a mask-based objective measure, the loudness weighted hit-false, is proposed for predicting speech intelligibility. The proposed objective measure shows significantly higher correlation with intelligibility compared to two existing mask-based objective measures. PMID:24815280
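    The miss and false-alarm errors discussed above can be made concrete with the plain hit-minus-false-alarm (HIT-FA) score for binary masks. The sketch below is not the paper's loudness weighted hit-false measure, whose exact definition is not given in this abstract; the optional `weights` argument is only a hypothetical stand-in showing where a per-unit importance (such as loudness) could enter.

```python
def hit_minus_fa(ideal, estimated, weights=None):
    """HIT-FA between two flattened binary masks (lists of 0/1 values).

    HIT: (weighted) fraction of ideal-mask 1 units also labeled 1.
    FA:  (weighted) fraction of ideal-mask 0 units wrongly labeled 1.
    """
    if weights is None:
        weights = [1.0] * len(ideal)  # unweighted: every T-F unit counts equally
    hit_num = hit_den = fa_num = fa_den = 0.0
    for i, e, w in zip(ideal, estimated, weights):
        if i == 1:
            hit_den += w
            hit_num += w * (e == 1)
        else:
            fa_den += w
            fa_num += w * (e == 1)
    hit = hit_num / hit_den if hit_den else 0.0
    fa = fa_num / fa_den if fa_den else 0.0
    return hit - fa


ideal = [1, 1, 0, 0]
one_miss = hit_minus_fa(ideal, [1, 0, 0, 0])         # one miss error
one_false_alarm = hit_minus_fa(ideal, [1, 1, 1, 0])  # one false-alarm error
```

    In the unweighted form a single miss and a single false alarm cost the same here, which is exactly the equal-contribution assumption the study questions.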

  6. Multilevel Analysis in Analyzing Speech Data

    ERIC Educational Resources Information Center

    Guddattu, Vasudeva; Krishna, Y.

    2011-01-01

    The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…

  7. A Development of a System Enables Character Input and PC Operation via Voice for a Physically Disabled Person with a Speech Impediment

    NASA Astrophysics Data System (ADS)

    Tanioka, Toshimasa; Egashira, Hiroyuki; Takata, Mayumi; Okazaki, Yasuhisa; Watanabe, Kenzi; Kondo, Hiroki

    We have designed and implemented a voice-operated PC support system for a physically disabled person with a speech impediment. Voice operation is an effective method for a physically disabled person with involuntary movement of the limbs and head. We applied a commercial speech recognition engine to develop our system for practical use. Adopting a commercial engine reduces development cost and will help make our system useful to other people with speech impediments. We customized the commercial speech recognition engine so that it can recognize the utterances of a person with a speech impediment. We restricted the words that the engine recognizes and separated target words from similar-sounding words to avoid misrecognition. The huge number of words registered in commercial speech recognition engines causes frequent misrecognition of utterances by people with speech impediments, because their utterances are unclear and unstable. We solved this problem by narrowing the input choices down to a small number and by registering their ambiguous pronunciations in addition to the original ones. To realize all character input and all PC operations with a small number of words, we designed multiple input modes with categorized dictionaries and introduced two-step input in each mode except numeral input. The system we have developed is at a practical level. The first author of this paper is physically disabled with a speech impediment. Using this system, he has been able not only to input characters into a PC but also to operate the Windows system smoothly. He uses this system in his daily life, and this paper was written by him with it. At present, the speech recognition is customized to him. It is, however, possible to customize it for other users by changing words and registering new pronunciations according to each user's utterances.

  8. Speech Enhancement Using Gaussian Scale Mixture Models

    PubMed Central

    Hao, Jiucang; Lee, Te-Won; Sejnowski, Terrence J.

    2011-01-01

    This paper presents a novel probabilistic approach to speech enhancement. Instead of a deterministic logarithmic relationship, we assume a probabilistic relationship between the frequency coefficients and the log-spectra. The speech model in the log-spectral domain is a Gaussian mixture model (GMM). The frequency coefficients obey a zero-mean Gaussian whose covariance equals the exponential of the log-spectra. This results in a Gaussian scale mixture model (GSMM) for the speech signal in the frequency domain, since the log-spectra can be regarded as scaling factors. The probabilistic relation between frequency coefficients and log-spectra allows these to be treated as two random variables, both to be estimated from the noisy signals. Expectation-maximization (EM) was used to train the GSMM and Bayesian inference was used to compute the posterior signal distribution. Because exact inference of this full probabilistic model is computationally intractable, we developed two approaches to enhance the efficiency: the Laplace method and a variational approximation. The proposed methods were applied to enhance speech corrupted by Gaussian noise and speech-shaped noise (SSN). For both approximations, signals reconstructed from the estimated frequency coefficients provided a higher signal-to-noise ratio (SNR), and those reconstructed from the estimated log-spectra produced a lower word recognition error rate because the log-spectra fit the inputs to the recognizer better. Our algorithms effectively reduced the SSN, which algorithms based on spectral analysis were not able to suppress. PMID:21359139
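    The model structure described above can be written out as a short sketch. The notation is an assumption made for illustration (frequency index $f$, log-spectrum $s_f$, frequency coefficient $X_f$, mixture index $k$, diagonal mixture covariances); the paper's own symbols and any complex-valued treatment of the coefficients may differ.

```latex
% Log-spectral prior: Gaussian mixture model over s = (s_1, ..., s_F)
p(s) = \sum_{k} \pi_k \prod_{f} \mathcal{N}\!\left(s_f ;\, \mu_{k,f},\, \sigma_{k,f}^{2}\right)

% Frequency coefficients: zero-mean Gaussian with variance exp(s_f)
p(X_f \mid s_f) = \mathcal{N}\!\left(X_f ;\, 0,\, e^{s_f}\right)

% Marginalizing the scale s_f gives a Gaussian scale mixture for X
p(X) = \int p(s) \prod_{f} p(X_f \mid s_f)\, ds
```

    The integral over $s$ has no closed form in general, which is consistent with the abstract's statement that exact inference is intractable and motivates the Laplace and variational approximations.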

  9. Use of Adaptive Digital Signal Processing to Improve Speech Communication for Normally Hearing and Hearing-Impaired Subjects.

    ERIC Educational Resources Information Center

    Harris, Richard W.; And Others

    1988-01-01

    A two-microphone adaptive digital noise cancellation technique improved word-recognition ability for 20 normal and 12 hearing-impaired adults by reducing multitalker speech babble and speech spectrum noise 18-22 dB. Word recognition improvements averaged 37-50 percent for normal and 27-40 percent for hearing-impaired subjects. Improvement was best…

  10. Multilingual Phoneme Models for Rapid Speech Processing System Development

    DTIC Science & Technology

    2006-09-01

    processes are used to develop an Arabic speech recognition system starting from monolingual English models, International Phonetic Association (IPA...clusters. It was found that multilingual bootstrapping methods outperform monolingual English bootstrapping methods on the Arabic evaluation data initially...

  11. An Exploration of the Potential of Automatic Speech Recognition to Assist and Enable Receptive Communication in Higher Education

    ERIC Educational Resources Information Center

    Wald, Mike

    2006-01-01

    The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents students and staff to provide captioning of speech online or in classrooms for deaf or hard of hearing students and assist blind, visually impaired or dyslexic learners to read and search…

  12. Multilayer perceptron, fuzzy sets, and classification

    NASA Technical Reports Server (NTRS)

    Pal, Sankar K.; Mitra, Sushmita

    1992-01-01

    A fuzzy neural network model based on the multilayer perceptron, using the back-propagation algorithm, and capable of fuzzy classification of patterns is described. The input vector consists of membership values to linguistic properties while the output vector is defined in terms of fuzzy class membership values. This allows efficient modeling of fuzzy or uncertain patterns with appropriate weights being assigned to the backpropagated errors depending upon the membership values at the corresponding outputs. During training, the learning rate is gradually decreased in discrete steps until the network converges to a minimum error solution. The effectiveness of the algorithm is demonstrated on a speech recognition problem. The results are compared with those of the conventional MLP, the Bayes classifier, and the other related models.

  13. A novel probabilistic framework for event-based speech recognition

    NASA Astrophysics Data System (ADS)

    Juneja, Amit; Espy-Wilson, Carol

    2003-10-01

    One of the reasons for unsatisfactory performance of the state-of-the-art automatic speech recognition (ASR) systems is the inferior acoustic modeling of low-level acoustic-phonetic information in the speech signal. An acoustic-phonetic approach to ASR, on the other hand, explicitly targets linguistic information in the speech signal, but such a system for continuous speech recognition (CSR) is not known to exist. A probabilistic and statistical framework for CSR based on the idea of the representation of speech sounds by bundles of binary valued articulatory phonetic features is proposed. Multiple probabilistic sequences of linguistically motivated landmarks are obtained using binary classifiers of manner phonetic features-syllabic, sonorant and continuant-and the knowledge-based acoustic parameters (APs) that are acoustic correlates of those features. The landmarks are then used for the extraction of knowledge-based APs for source and place phonetic features and their binary classification. Probabilistic landmark sequences are constrained using manner class language models for isolated or connected word recognition. The proposed method could overcome the disadvantages encountered by the early acoustic-phonetic knowledge-based systems that led the ASR community to switch to systems highly dependent on statistical pattern analysis methods and probabilistic language or grammar models.

  14. Effect of a Bluetooth-implemented hearing aid on speech recognition performance: subjective and objective measurement.

    PubMed

    Kim, Min-Beom; Chung, Won-Ho; Choi, Jeesun; Hong, Sung Hwa; Cho, Yang-Sun; Park, Gyuseok; Lee, Sangmin

    2014-06-01

    The objective was to evaluate speech perception improvement through Bluetooth-implemented hearing aids in hearing-impaired adults. Thirty subjects with bilateral symmetric moderate sensorineural hearing loss participated in this study. A Bluetooth-implemented hearing aid was fitted unilaterally in all study subjects. Objective speech recognition score and subjective satisfaction were measured with a Bluetooth-implemented hearing aid to replace the acoustic connection from either a cellular phone or a loudspeaker system. In each system, participants were assigned to 4 conditions: wireless speech signal transmission into hearing aid (wireless mode) in quiet or noisy environment and conventional speech signal transmission using external microphone of hearing aid (conventional mode) in quiet or noisy environment. Also, participants completed questionnaires to investigate subjective satisfaction. In both the cellular phone and loudspeaker system situations, participants showed improvements in sentence and word recognition scores with wireless mode compared to conventional mode in both quiet and noise conditions (P < .001). Participants also reported subjective improvements, including better sound quality, less noise interference, and better accuracy and naturalness, when using the wireless mode (P < .001). Bluetooth-implemented hearing aids helped to improve subjective and objective speech recognition performances in quiet and noisy environments during the use of electronic audio devices.

  15. Speech recognition in individuals with sensorineural hearing loss.

    PubMed

    de Andrade, Adriana Neves; Iorio, Maria Cecilia Martinelli; Gil, Daniela

    2016-01-01

    Hearing loss can negatively influence the communication performance of individuals, who should be evaluated with suitable material and in situations of listening close to those found in everyday life. To analyze and compare the performance of patients with mild-to-moderate sensorineural hearing loss in speech recognition tests carried out in silence and with noise, according to the variables ear (right and left) and type of stimulus presentation. The study included 19 right-handed individuals with mild-to-moderate symmetrical bilateral sensorineural hearing loss, submitted to the speech recognition test with words in different modalities and speech test with white noise and pictures. There was no significant difference between right and left ears in any of the tests. The mean number of correct responses in the speech recognition test with pictures, live voice, and recorded monosyllables was 97.1%, 85.9%, and 76.1%, respectively, whereas after the introduction of noise, the performance decreased to 72.6% accuracy. The best performances in the Speech Recognition Percentage Index were obtained using monosyllabic stimuli, represented by pictures presented in silence, with no significant differences between the right and left ears. After the introduction of competitive noise, there was a decrease in individuals' performance. Copyright © 2015 Associação Brasileira de Otorrinolaringologia e Cirurgia Cérvico-Facial. Published by Elsevier Editora Ltda. All rights reserved.

  16. Relationships between Structural and Acoustic Properties of Maternal Talk and Children's Early Word Recognition

    ERIC Educational Resources Information Center

    Suttora, Chiara; Salerni, Nicoletta; Zanchi, Paola; Zampini, Laura; Spinelli, Maria; Fasolo, Mirco

    2017-01-01

    This study aimed to investigate specific associations between structural and acoustic characteristics of infant-directed (ID) speech and word recognition. Thirty Italian-acquiring children and their mothers were tested when the children were 1;3. Children's word recognition was measured with the looking-while-listening task. Maternal ID speech was…

  17. Speech Recognition in Adults with Cochlear Implants: The Effects of Working Memory, Phonological Sensitivity, and Aging

    ERIC Educational Resources Information Center

    Moberly, Aaron C.; Harris, Michael S.; Boyce, Lauren; Nittrouer, Susan

    2017-01-01

    Purpose: Models of speech recognition suggest that "top-down" linguistic and cognitive functions, such as use of phonotactic constraints and working memory, facilitate recognition under conditions of degradation, such as in noise. The question addressed in this study was what happens to these functions when a listener who has experienced…

  18. The Effect of Asymmetrical Signal Degradation on Binaural Speech Recognition in Children and Adults.

    ERIC Educational Resources Information Center

    Rothpletz, Ann M.; Tharpe, Anne Marie; Grantham, D. Wesley

    2004-01-01

    To determine the effect of asymmetrical signal degradation on binaural speech recognition, 28 children and 14 adults were administered a sentence recognition task amidst multitalker babble. There were 3 listening conditions: (a) monaural, with mild degradation in 1 ear; (b) binaural, with mild degradation in both ears (symmetric degradation); and…

  19. Learning Models and Real-Time Speech Recognition.

    ERIC Educational Resources Information Center

    Danforth, Douglas G.; And Others

    This report describes the construction and testing of two "psychological" learning models for the purpose of computer recognition of human speech over the telephone. One of the two models was found to be superior in all tests. A regression analysis yielded a 92.3% recognition rate for 14 subjects ranging in age from 6 to 13 years. Tests…

  20. Use of Authentic-Speech Technique for Teaching Sound Recognition to EFL Students

    ERIC Educational Resources Information Center

    Sersen, William J.

    2011-01-01

    The main objective of this research was to test an authentic-speech technique for improving the sound-recognition skills of EFL (English as a foreign language) students at Roi-Et Rajabhat University. The secondary objective was to determine the correlation, if any, between students' self-evaluation of sound-recognition progress and the actual…

  1. What happens to the motor theory of perception when the motor system is damaged?

    PubMed

    Stasenko, Alena; Garcea, Frank E; Mahon, Bradford Z

    2013-09-01

    Motor theories of perception posit that motor information is necessary for successful recognition of actions. Perhaps the most well known of this class of proposals is the motor theory of speech perception, which argues that speech recognition is fundamentally a process of identifying the articulatory gestures (i.e. motor representations) that were used to produce the speech signal. Here we review neuropsychological evidence from patients with damage to the motor system, in the context of motor theories of perception applied to both manual actions and speech. Motor theories of perception predict that patients with motor impairments will have impairments for action recognition. Contrary to that prediction, the available neuropsychological evidence indicates that recognition can be spared despite profound impairments to production. These data falsify strong forms of the motor theory of perception, and frame new questions about the dynamical interactions that govern how information is exchanged between input and output systems.

  2. What happens to the motor theory of perception when the motor system is damaged?

    PubMed Central

    Stasenko, Alena; Garcea, Frank E.; Mahon, Bradford Z.

    2016-01-01

    Motor theories of perception posit that motor information is necessary for successful recognition of actions. Perhaps the most well known of this class of proposals is the motor theory of speech perception, which argues that speech recognition is fundamentally a process of identifying the articulatory gestures (i.e. motor representations) that were used to produce the speech signal. Here we review neuropsychological evidence from patients with damage to the motor system, in the context of motor theories of perception applied to both manual actions and speech. Motor theories of perception predict that patients with motor impairments will have impairments for action recognition. Contrary to that prediction, the available neuropsychological evidence indicates that recognition can be spared despite profound impairments to production. These data falsify strong forms of the motor theory of perception, and frame new questions about the dynamical interactions that govern how information is exchanged between input and output systems. PMID:26823687

  3. Sentence Recognition Prediction for Hearing-impaired Listeners in Stationary and Fluctuation Noise With FADE

    PubMed Central

    Schädler, Marc René; Warzybok, Anna; Meyer, Bernd T.; Brand, Thomas

    2016-01-01

    To characterize the individual patient’s hearing impairment as obtained with the matrix sentence recognition test, a simulation Framework for Auditory Discrimination Experiments (FADE) is extended here using the Attenuation and Distortion (A+D) approach by Plomp as a blueprint for setting the individual processing parameters. FADE has been shown to predict the outcome of both speech recognition tests and psychoacoustic experiments based on simulations using an automatic speech recognition system requiring only few assumptions. It builds on the closed-set matrix sentence recognition test which is advantageous for testing individual speech recognition in a way comparable across languages. Individual predictions of speech recognition thresholds in stationary and in fluctuating noise were derived using the audiogram and an estimate of the internal level uncertainty for modeling the individual Plomp curves fitted to the data with the Attenuation (A-) and Distortion (D-) parameters of the Plomp approach. The “typical” audiogram shapes from Bisgaard et al with or without a “typical” level uncertainty and the individual data were used for individual predictions. As a result, the individualization of the level uncertainty was found to be more important than the exact shape of the individual audiogram to accurately model the outcome of the German Matrix test in stationary or fluctuating noise for listeners with hearing impairment. The prediction accuracy of the individualized approach also outperforms the (modified) Speech Intelligibility Index approach which is based on the individual threshold data only. PMID:27604782

  4. Assessment of Spectral and Temporal Resolution in Cochlear Implant Users Using Psychoacoustic Discrimination and Speech Cue Categorization.

    PubMed

    Winn, Matthew B; Won, Jong Ho; Moon, Il Joon

    This study was conducted to measure auditory perception by cochlear implant users in the spectral and temporal domains, using tests of either categorization (using speech-based cues) or discrimination (using conventional psychoacoustic tests). The authors hypothesized that traditional nonlinguistic tests assessing spectral and temporal auditory resolution would correspond to speech-based measures assessing specific aspects of phonetic categorization assumed to depend on spectral and temporal auditory resolution. The authors further hypothesized that speech-based categorization performance would ultimately be a superior predictor of speech recognition performance, because of the fundamental nature of speech recognition as categorization. Nineteen cochlear implant listeners and 10 listeners with normal hearing participated in a suite of tasks that included spectral ripple discrimination, temporal modulation detection, and syllable categorization, which was split into a spectral cue-based task (targeting the /ba/-/da/ contrast) and a timing cue-based task (targeting the /b/-/p/ and /d/-/t/ contrasts). Speech sounds were manipulated to contain specific spectral or temporal modulations (formant transitions or voice onset time, respectively) that could be categorized. Categorization responses were quantified using logistic regression to assess perceptual sensitivity to acoustic phonetic cues. Word recognition testing was also conducted for cochlear implant listeners. Cochlear implant users were generally less successful at utilizing both spectral and temporal cues for categorization compared with listeners with normal hearing. For the cochlear implant listener group, spectral ripple discrimination was significantly correlated with the categorization of formant transitions; both were correlated with better word recognition. Temporal modulation detection using 100- and 10-Hz-modulated noise was not correlated either with the cochlear implant subjects' categorization of voice onset time or with word recognition. Word recognition was correlated more closely with categorization of the controlled speech cues than with performance on the psychophysical discrimination tasks. When evaluating people with cochlear implants, controlled speech-based stimuli are feasible to use in tests of auditory cue categorization, to complement traditional measures of auditory discrimination. Stimuli based on specific speech cues correspond to counterpart nonlinguistic measures of discrimination, but potentially show better correspondence with speech perception more generally. The ubiquity of the spectral (formant transition) and temporal (voice onset time) stimulus dimensions across languages highlights the potential to use this testing approach even in cases where English is not the native language.

  5. Assessment of spectral and temporal resolution in cochlear implant users using psychoacoustic discrimination and speech cue categorization

    PubMed Central

    Winn, Matthew B.; Won, Jong Ho; Moon, Il Joon

    2016-01-01

    Objectives: This study was conducted to measure auditory perception by cochlear implant users in the spectral and temporal domains, using tests of either categorization (using speech-based cues) or discrimination (using conventional psychoacoustic tests). We hypothesized that traditional nonlinguistic tests assessing spectral and temporal auditory resolution would correspond to speech-based measures assessing specific aspects of phonetic categorization assumed to depend on spectral and temporal auditory resolution. We further hypothesized that speech-based categorization performance would ultimately be a superior predictor of speech recognition performance, because of the fundamental nature of speech recognition as categorization. Design: Nineteen CI listeners and 10 listeners with normal hearing (NH) participated in a suite of tasks that included spectral ripple discrimination (SRD), temporal modulation detection (TMD), and syllable categorization, which was split into a spectral-cue-based task (targeting the /ba/-/da/ contrast) and a timing-cue-based task (targeting the /b/-/p/ and /d/-/t/ contrasts). Speech sounds were manipulated in order to contain specific spectral or temporal modulations (formant transitions or voice onset time, respectively) that could be categorized. Categorization responses were quantified using logistic regression in order to assess perceptual sensitivity to acoustic phonetic cues. Word recognition testing was also conducted for CI listeners. Results: CI users were generally less successful at utilizing both spectral and temporal cues for categorization compared to listeners with normal hearing. For the CI listener group, SRD was significantly correlated with the categorization of formant transitions; both were correlated with better word recognition. TMD using 100 Hz and 10 Hz modulated noise was not correlated with the CI subjects’ categorization of VOT, nor with word recognition. Word recognition was correlated more closely with categorization of the controlled speech cues than with performance on the psychophysical discrimination tasks. Conclusions: When evaluating people with cochlear implants, controlled speech-based stimuli are feasible to use in tests of auditory cue categorization, to complement traditional measures of auditory discrimination. Stimuli based on specific speech cues correspond to counterpart non-linguistic measures of discrimination, but potentially show better correspondence with speech perception more generally. The ubiquity of the spectral (formant transition) and temporal (VOT) stimulus dimensions across languages highlights the potential to use this testing approach even in cases where English is not the native language. PMID:27438871

  6. Speech recognition in advanced rotorcraft - Using speech controls to reduce manual control overload

    NASA Technical Reports Server (NTRS)

    Vidulich, Michael A.; Bortolussi, Michael R.

    1988-01-01

    An experiment has been conducted to ascertain the usefulness of helicopter pilot speech controls and their effect on time-sharing performance, under the impetus of multiple-resource theories of attention which predict that time-sharing should be more efficient with mixed manual and speech controls than with all-manual ones. The test simulation involved an advanced, single-pilot scout/attack helicopter. Performance and subjective workload levels obtained supported the claimed utility of speech recognition-based controls; specifically, time-sharing performance was improved while preparing a data-burst transmission of information during helicopter hover.

  7. Investigating Perceptual Biases, Data Reliability, and Data Discovery in a Methodology for Collecting Speech Errors From Audio Recordings.

    PubMed

    Alderete, John; Davies, Monica

    2018-04-01

    This work describes a methodology of collecting speech errors from audio recordings and investigates how some of its assumptions affect data quality and composition. Speech errors of all types (sound, lexical, syntactic, etc.) were collected by eight data collectors from audio recordings of unscripted English speech. Analysis of these errors showed that: (i) different listeners find different errors in the same audio recordings, but (ii) the frequencies of error patterns are similar across listeners; (iii) errors collected "online" using on the spot observational techniques are more likely to be affected by perceptual biases than "offline" errors collected from audio recordings; and (iv) datasets built from audio recordings can be explored and extended in a number of ways that traditional corpus studies cannot be.

  8. Incorporating Speech Recognition into a Natural User Interface

    NASA Technical Reports Server (NTRS)

    Chapa, Nicholas

    2017-01-01

    The Augmented/ Virtual Reality (AVR) Lab has been working to study the applicability of recent virtual and augmented reality hardware and software to KSC operations. This includes the Oculus Rift, HTC Vive, Microsoft HoloLens, and Unity game engine. My project in this lab is to integrate voice recognition and voice commands into an easy to modify system that can be added to an existing portion of a Natural User Interface (NUI). A NUI is an intuitive and simple to use interface incorporating visual, touch, and speech recognition. The inclusion of speech recognition capability will allow users to perform actions or make inquiries using only their voice. The simplicity of needing only to speak to control an on-screen object or enact some digital action means that any user can quickly become accustomed to using this system. Multiple programs were tested for use in a speech command and recognition system. Sphinx4 translates speech to text using a Hidden Markov Model (HMM) based Language Model, an Acoustic Model, and a word Dictionary running on Java. PocketSphinx had similar functionality to Sphinx4 but instead ran on C. However, neither of these programs were ideal as building a Java or C wrapper slowed performance. The most ideal speech recognition system tested was the Unity Engine Grammar Recognizer. A Context Free Grammar (CFG) structure is written in an XML file to specify the structure of phrases and words that will be recognized by Unity Grammar Recognizer. Using Speech Recognition Grammar Specification (SRGS) 1.0 makes modifying the recognized combinations of words and phrases very simple and quick to do. With SRGS 1.0, semantic information can also be added to the XML file, which allows for even more control over how spoken words and phrases are interpreted by Unity. Additionally, using a CFG with SRGS 1.0 produces a Finite State Machine (FSM) functionality limiting the potential for incorrectly heard words or phrases. 
    The purpose of my project was to investigate options for a Speech Recognition System. To that end I attempted to integrate Sphinx4 into a user interface. Sphinx4 had great accuracy and is the only free program able to perform offline speech dictation. However, it had a limited dictionary of words that could be recognized, single-syllable words were almost impossible for it to hear, and since it ran on Java it could not be integrated into the Unity-based NUI. PocketSphinx ran much faster than Sphinx4, which would have made it ideal as a plugin to the Unity NUI. Unfortunately, creating a C# wrapper for the C code made the program unusable with Unity because the wrapper slowed code execution and class files became unreachable. Unity Grammar Recognizer is the ideal speech recognition interface: it is flexible in recognizing multiple variations of the same command. It is also the most accurate program in recognizing speech because it uses an XML grammar to specify speech structure instead of relying solely on a Dictionary and Language model. The Unity Grammar Recognizer will be used with the NUI for these reasons and because it is written in C#, which further simplifies the incorporation.
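
    The idea that a fixed grammar limits what can be misheard may be easier to see in a small sketch. The following plain-Python analogy is not Unity's Grammar Recognizer API and not SRGS syntax; the slot names, vocabulary, and the function parse_command are all hypothetical. It accepts only word sequences that follow the grammar and tags each word with a semantic slot, which loosely mirrors the finite-state behavior and semantic information mentioned in the abstract.

```python
# Hypothetical command grammar: an ordered list of (slot name, allowed words).
# A real SRGS grammar would be an XML file; this is only an analogy.
COMMAND_SLOTS = [
    ("action", {"move", "rotate", "select"}),
    ("object", {"cube", "panel", "camera"}),
    ("direction", {"left", "right", "up", "down"}),
]


def parse_command(utterance):
    """Return a dict of slot -> word if the utterance fits the grammar, else None."""
    words = utterance.lower().split()
    # The grammar accepts exactly one word per slot, in order.
    if len(words) != len(COMMAND_SLOTS):
        return None
    result = {}
    for word, (slot, allowed) in zip(words, COMMAND_SLOTS):
        if word not in allowed:
            # Out-of-grammar words are rejected rather than guessed at.
            return None
        result[slot] = word
    return result


if __name__ == "__main__":
    print(parse_command("Move cube left"))
    print(parse_command("move tube left"))
```

    Because every position admits only a handful of words, a phrase that strays from the grammar is rejected outright instead of being mapped to the nearest entry in a large dictionary.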

  9. Perception of speech in reverberant conditions using AM-FM cochlear implant simulation.

    PubMed

    Drgas, Szymon; Blaszak, Magdalena A

    2010-10-01

    This study assessed the effects of speech misidentification and cognitive processing errors in normal-hearing adults listening to degraded auditory input signals simulating cochlear implants in reverberation conditions. Three variables were controlled: number of vocoder channels (six and twelve), instantaneous frequency change rate (none, 50, 400 Hz), and enclosures (different reverberation conditions). The analyses were made on the basis of: (a) nonsense word recognition scores for eight young normal-hearing listeners, (b) 'ease of listening' based on the time of response, and (c) the subjective measure of difficulty. The maximum score of speech intelligibility in cochlear implant simulation was 70% for non-reverberant conditions with a 12-channel vocoder and changes of instantaneous frequency limited to 400 Hz. In the presence of reflections, word misidentification was about 10-20 percentage points higher. There was little difference between the 50 and 400 Hz frequency modulation cut-off for the 12-channel vocoder; however, in the case of six channels this difference was more significant. The results of the experiment suggest that the information other than F0, that is carried by FM, can be sufficient to improve speech intelligibility in the real-world conditions.

  10. A novel speech processing algorithm based on harmonicity cues in cochlear implant

    NASA Astrophysics Data System (ADS)

    Wang, Jian; Chen, Yousheng; Zhang, Zongping; Chen, Yan; Zhang, Weifeng

    2017-08-01

    This paper proposed a novel speech processing algorithm for cochlear implants, which used harmonicity cues to enhance tonal information in Mandarin Chinese speech recognition. The input speech was filtered by a 4-channel band-pass filter bank. The frequency ranges for the four bands were: 300-621, 621-1285, 1285-2657, and 2657-5499 Hz. In each pass band, temporal envelope and periodicity cues (TEPCs) below 400 Hz were extracted by full-wave rectification and low-pass filtering. The TEPCs were modulated by a sinusoidal carrier, the frequency of which was the fundamental frequency (F0) or the harmonic of F0 closest to the center frequency of each band. Signals from each band were combined to obtain the output speech. Mandarin tone, word, and sentence recognition in quiet listening conditions were tested for the extensively used continuous interleaved sampling (CIS) strategy and the novel F0-harmonic algorithm. Results showed that the F0-harmonic algorithm performed consistently better than the CIS strategy in Mandarin tone, word, and sentence recognition. In addition, the sentence recognition rate was higher than the word recognition rate, as a result of contextual information in the sentence. Moreover, tones 3 and 4 were recognized better than tones 1 and 2, due to the easily identified features of the former. In conclusion, the F0-harmonic algorithm could enhance tonal information in cochlear implant speech processing due to the use of harmonicity cues, thereby improving Mandarin tone, word, and sentence recognition. Further study will focus on testing the F0-harmonic algorithm in noisy listening conditions.
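
    The carrier-selection step in this abstract can be illustrated with a short sketch. The band edges below are taken from the abstract; everything else is an assumption made for illustration: the function names are hypothetical, the band center is taken as the arithmetic midpoint (the abstract does not define it), and F0 is passed in as a known value rather than estimated from speech.

```python
# Band edges in Hz, as listed in the abstract.
BAND_EDGES_HZ = [(300, 621), (621, 1285), (1285, 2657), (2657, 5499)]


def band_center(low_hz, high_hz):
    """Arithmetic midpoint of a band (assumed; the abstract does not define 'center')."""
    return (low_hz + high_hz) / 2.0


def nearest_harmonic(f0_hz, target_hz):
    """Return the integer multiple of f0_hz closest to target_hz (never below F0)."""
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    k = max(1, round(target_hz / f0_hz))
    return k * f0_hz


def carrier_frequencies(f0_hz, bands=BAND_EDGES_HZ):
    """Pick one sinusoidal carrier frequency per band for a given F0."""
    return [nearest_harmonic(f0_hz, band_center(lo, hi)) for lo, hi in bands]


if __name__ == "__main__":
    # For F0 = 200 Hz the band midpoints are 460.5, 953, 1971 and 4078 Hz.
    print(carrier_frequencies(200.0))  # [400.0, 1000.0, 2000.0, 4000.0]
```

    Each band's envelope would then modulate a sinusoid at the selected frequency. Note that with this simple rule the chosen harmonic is not guaranteed to fall inside the band when F0 is high relative to the band width.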

  11. Speech recognition in one- and two-talker maskers in school-age children and adults: Development of perceptual masking and glimpsing

    PubMed Central

    Buss, Emily; Leibold, Lori J.; Porter, Heather L.; Grose, John H.

    2017-01-01

    Children perform more poorly than adults on a wide range of masked speech perception paradigms, but this effect is particularly pronounced when the masker itself is also composed of speech. The present study evaluated two factors that might contribute to this effect: the ability to perceptually isolate the target from masker speech, and the ability to recognize target speech based on sparse cues (glimpsing). Speech reception thresholds (SRTs) were estimated for closed-set, disyllabic word recognition in children (5–16 years) and adults in a one- or two-talker masker. Speech maskers were 60 dB sound pressure level (SPL), and they were either presented alone or in combination with a 50-dB-SPL speech-shaped noise masker. There was an age effect overall, but performance was adult-like at a younger age for the one-talker than the two-talker masker. Noise tended to elevate SRTs, particularly for older children and adults, and when summed with the one-talker masker. Removing time-frequency epochs associated with a poor target-to-masker ratio markedly improved SRTs, with larger effects for younger listeners; the age effect was not eliminated, however. Results were interpreted as indicating that development of speech-in-speech recognition is likely impacted by development of both perceptual masking and the ability to recognize speech based on sparse cues. PMID:28464682

  12. Australian children with cleft palate achieve age-appropriate speech by 5 years of age.

    PubMed

    Chacon, Antonia; Parkin, Melissa; Broome, Kate; Purcell, Alison

    2017-12-01

    Children with cleft palate demonstrate atypical speech sound development, which can influence their intelligibility, literacy and learning. There is limited documentation regarding how speech sound errors change over time in cleft palate speech and the effect that these errors have upon mono- versus polysyllabic word production. The objective of this study was to examine the phonetic and phonological speech skills of children with cleft palate at ages 3 and 5. A cross-sectional observational design was used. Eligible participants were aged 3 or 5 years with a repaired cleft palate. The Diagnostic Evaluation of Articulation and Phonology (DEAP) Articulation subtest and a non-standardised list of mono- and polysyllabic words were administered once for each child. The Profile of Phonology (PROPH) was used to analyse each child's speech. N = 51 children with cleft palate participated in the study. Three-year-old children with cleft palate produced significantly more speech errors than their typically-developing peers, but no difference was apparent at 5 years. The 5-year-olds demonstrated greater phonetic and phonological accuracy than the 3-year-old children. Polysyllabic words were more affected by errors than monosyllables in the 3-year-old group only. Children with cleft palate are prone to phonetic and phonological speech errors in their preschool years. Most of these speech errors approximate typically-developing children by 5 years. At 3 years, word shape has an influence upon phonological speech accuracy. Speech pathology intervention is indicated to support the intelligibility of these children from their earliest stages of development. Copyright © 2017 Elsevier B.V. All rights reserved.

  13. Working memory capacity may influence perceived effort during aided speech recognition in noise.

    PubMed

    Rudner, Mary; Lunner, Thomas; Behrens, Thomas; Thorén, Elisabet Sundewall; Rönnberg, Jerker

    2012-09-01

    Recently there has been interest in using subjective ratings as a measure of perceived effort during speech recognition in noise. Perceived effort may be an indicator of cognitive load. Thus, subjective effort ratings during speech recognition in noise may covary both with signal-to-noise ratio (SNR) and individual cognitive capacity. The present study investigated the relation between subjective ratings of the effort involved in listening to speech in noise, speech recognition performance, and individual working memory (WM) capacity in hearing-impaired hearing aid users. In two experiments, participants with hearing loss rated perceived effort during aided speech perception in noise. Noise type and SNR were manipulated in both experiments, and in the second experiment hearing aid compression release settings were also manipulated. Speech recognition performance was measured along with WM capacity. There were 46 participants in all with bilateral mild to moderate sloping hearing loss. In Experiment 1 there were 16 native Danish speakers (eight women and eight men) with a mean age of 63.5 yr (SD = 12.1) and average pure tone (PT) threshold of 47.6 dB (SD = 9.8). In Experiment 2 there were 30 native Swedish speakers (19 women and 11 men) with a mean age of 70 yr (SD = 7.8) and average PT threshold of 45.8 dB (SD = 6.6). A visual analog scale (VAS) was used for effort rating in both experiments. In Experiment 1, effort was rated at individually adapted SNRs while in Experiment 2 it was rated at fixed SNRs. Speech recognition in noise performance was measured using adaptive procedures in both experiments with Dantale II sentences in Experiment 1 and Hagerman sentences in Experiment 2. WM capacity was measured using a letter-monitoring task in Experiment 1 and the reading span task in Experiment 2.
In both experiments, there was a strong and significant relation between rated effort and SNR that was independent of individual WM capacity, whereas the relation between rated effort and noise type seemed to be influenced by individual WM capacity. Experiment 2 showed that hearing aid compression setting influenced rated effort. Subjective ratings of the effort involved in speech recognition in noise reflect SNRs, and individual cognitive capacity seems to influence relative rating of noise type. American Academy of Audiology.

  14. Blind speech separation system for humanoid robot with FastICA for audio filtering and separation

    NASA Astrophysics Data System (ADS)

    Budiharto, Widodo; Santoso Gunawan, Alexander Agung

    2016-07-01

    Nowadays, there are many developments in building intelligent humanoid robots, mainly in order to handle voice and image. In this research, we propose a blind speech separation system using FastICA for audio filtering and separation that can be used in education or entertainment. Our main problem is to separate multiple speech sources and also to filter out irrelevant noise. After the speech separation step, the results will be integrated with our previous speech and face recognition system, which is based on the Bioloid GP robot with a Raspberry Pi 2 as controller. The experimental results show that the accuracy of our blind speech separation system is about 88% in command and query recognition cases.

  15. Automatic speech recognition in air-ground data link

    NASA Technical Reports Server (NTRS)

    Armstrong, Herbert B.

    1989-01-01

    In the present air traffic system, information presented to the transport aircraft cockpit crew may originate from a variety of sources and may be presented to the crew in visual or aural form, either through cockpit instrument displays or, most often, through voice communication. Voice radio communications are the most error prone method for air-ground data link. Voice messages can be misstated or misunderstood and radio frequency congestion can delay or obscure important messages. To prevent proliferation, a multiplexed data link display can be designed to present information from multiple data link sources on a shared cockpit display unit (CDU) or multi-function display (MFD) or some future combination of flight management and data link information. An aural data link which incorporates an automatic speech recognition (ASR) system for crew response offers several advantages over visual displays. The possibility of applying ASR to the air-ground data link was investigated. The first step was to review current efforts in ASR applications in the cockpit and in air traffic control and evaluate their possible data link application. Next, a series of preliminary research questions is to be developed for possible future collaboration.

  16. New Ideas for Speech Recognition and Related Technologies

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Holzrichter, J F

    The ideas relating to the use of organ motion sensors for the purposes of speech recognition were first described by the author in spring 1994. During the past year, a series of productive collaborations between the author, Tom McEwan and Larry Ng ensued and have led to demonstrations, new sensor ideas, and algorithmic descriptions of a large number of speech recognition concepts. This document summarizes the basic concepts of recognizing speech once organ motions have been obtained. Micro power radars and their uses for the measurement of body organ motions, such as those of the heart and lungs, have been demonstrated by Tom McEwan over the past two years. McEwan and I conducted a series of experiments, using these instruments, on vocal organ motions beginning in late spring, during which we observed motions of vocal folds (i.e., cords), tongue, jaw, and related organs that are very useful for speech recognition and other purposes. These will be reviewed in a separate paper. Since late summer 1994, Lawrence Ng and I have worked to make many of the initial recognition ideas more rigorous and to investigate the applications of these new ideas to new speech recognition algorithms, to speech coding, and to speech synthesis. I introduce some of those ideas in section IV of this document, and we describe them more completely in the document following this one, UCRL-UR-120311. For the design and operation of micro-power radars and their application to body organ motions, the reader may contact Tom McEwan directly. The capability for using EM sensors (i.e., radar units) to measure body organ motions and positions has been available for decades. Impediments to their use appear to have been size, excessive power, lack of resolution, and lack of understanding of the value of organ motion measurements, especially as applied to speech related technologies.
However, with the invention of very low power, portable systems as demonstrated by McEwan at LLNL, researchers have begun to think differently about practical applications of such radars. In particular, his demonstrations of heart and lung motions have opened up many new areas of application for human and animal measurements.

  17. Human phoneme recognition depending on speech-intrinsic variability.

    PubMed

    Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger

    2010-11-01

    The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).

  18. A voice-input voice-output communication aid for people with severe speech impairment.

    PubMed

    Hawley, Mark S; Cunningham, Stuart P; Green, Phil D; Enderby, Pam; Palmer, Rebecca; Sehgal, Siddharth; O'Neill, Peter

    2013-01-01

    A new form of augmentative and alternative communication (AAC) device for people with severe speech impairment, the voice-input voice-output communication aid (VIVOCA), is described. The VIVOCA recognizes the disordered speech of the user and builds messages, which are converted into synthetic speech. System development was carried out employing user-centered design and development methods, which identified and refined key requirements for the device. A novel methodology for building small vocabulary, speaker-dependent automatic speech recognizers with reduced amounts of training data, was applied. Experiments showed that this method is successful in generating good recognition performance (mean accuracy 96%) on highly disordered speech, even when recognition perplexity is increased. The selected message-building technique traded off various factors including speed of message construction and range of available message outputs. The VIVOCA was evaluated in a field trial by individuals with moderate to severe dysarthria and confirmed that they can make use of the device to produce intelligible speech output from disordered speech input. The trial highlighted some issues which limit the performance and usability of the device when applied in real usage situations, with mean recognition accuracy of 67% in these circumstances. These limitations will be addressed in future work.

  19. The effect of sensorineural hearing loss and tinnitus on speech recognition over air and bone conduction military communications headsets.

    PubMed

    Manning, Candice; Mermagen, Timothy; Scharine, Angelique

    2017-06-01

    Military personnel are at risk for hearing loss due to noise exposure during deployment (USACHPPM, 2008). Despite mandated use of hearing protection, hearing loss and tinnitus are prevalent due to reluctance to use hearing protection. Bone conduction headsets can offer good speech intelligibility for normal hearing (NH) listeners while allowing the ears to remain open in quiet environments and the use of hearing protection when needed. Those who suffer from tinnitus, the experience of perceiving a sound not produced by an external source, often show degraded speech recognition; however, it is unclear whether this is a result of decreased hearing sensitivity or increased distractibility (Moon et al., 2015). It has been suggested that the vibratory stimulation of a bone conduction headset might ameliorate the effects of tinnitus on speech perception; however, there is currently no research to support or refute this claim (Hoare et al., 2014). Speech recognition of words presented over air conduction and bone conduction headsets was measured for three groups of listeners: NH, sensorineural hearing impaired, and/or tinnitus sufferers. Three levels of speech-to-noise (SNR = 0, -6, -12 dB) were created by embedding speech items in pink noise. Better speech recognition performance was observed with the bone conduction headset regardless of hearing profile, and speech intelligibility was a function of SNR. Discussion will include study limitations and the implications of these findings for those serving in the military. Published by Elsevier B.V.
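
    The abstract above states that speech items were embedded in pink noise at SNRs of 0, -6 and -12 dB. As a rough illustration only, the sketch below shows one common way to scale a noise signal so that a mixture reaches a target SNR. The function names (`mean_power`, `mix_at_snr`, `measured_snr_db`) are hypothetical, the signals are plain Python lists, and nothing here is taken from the study's actual procedure.

```python
import math


def mean_power(samples):
    """Average power (mean of squared samples) of a signal."""
    return sum(v * v for v in samples) / len(samples)


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so speech-to-noise power ratio equals target_snr_db, then add.

    Returns (mixture, scaled_noise). Assumes equal-length, non-silent inputs.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # SNR_dB = 10*log10(P_speech / P_noise)  =>  P_noise = P_speech / 10^(SNR/10)
    wanted_noise_power = mean_power(speech) / (10 ** (target_snr_db / 10))
    gain = math.sqrt(wanted_noise_power / mean_power(noise))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, noise):
    """SNR in dB between a speech signal and a noise signal."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))
```

    Only the noise is scaled, so the speech level stays fixed across SNR conditions; a study could equally hold the noise fixed and scale the speech instead.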

  20. Heft Lemisphere: Exchanges Predominate in Segmental Speech Errors

    ERIC Educational Resources Information Center

    Nooteboom, Sieb G.; Quene, Hugo

    2013-01-01

    In most collections of segmental speech errors, exchanges are less frequent than anticipations and perseverations. However, it has been suggested that in inner speech exchanges might be more frequent than either anticipations or perseverations, because many half-way repaired errors (Yew...uhh...New York) are classified as repaired anticipations,…

  1. Speaker normalization for Chinese vowel recognition in cochlear implants.

    PubMed

    Luo, Xin; Fu, Qian-Jie

    2005-07-01

    Because of the limited spectro-temporal resolution associated with cochlear implants, implant patients often have greater difficulty with multitalker speech recognition. The present study investigated whether multitalker speech recognition can be improved by applying speaker normalization techniques to cochlear implant speech processing. Multitalker Chinese vowel recognition was tested with normal-hearing Chinese-speaking subjects listening to a 4-channel cochlear implant simulation, with and without speaker normalization. For each subject, speaker normalization was referenced to the speaker that produced the best recognition performance under conditions without speaker normalization. To match the remaining speakers to this "optimal" output pattern, the overall frequency range of the analysis filter bank was adjusted for each speaker according to the ratio of the mean third formant frequency values between the specific speaker and the reference speaker. Results showed that speaker normalization provided a small but significant improvement in subjects' overall recognition performance. After speaker normalization, subjects' patterns of recognition performance across speakers changed, demonstrating the potential for speaker-dependent effects with the proposed normalization technique.
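
    The normalization described above adjusts the analysis filter bank's frequency range by a ratio of mean third formant (F3) values. The sketch below is one plausible reading of that step, not the authors' implementation: it assumes the band edges are simply multiplied by speaker F3 / reference F3, and the function name and the example frequencies are hypothetical.

```python
def normalized_band_edges(band_edges_hz, speaker_mean_f3_hz, reference_mean_f3_hz):
    """Rescale filter-bank band edges by the speaker-to-reference mean F3 ratio.

    Assumption: a speaker whose formants sit higher than the reference
    speaker's gets a proportionally higher analysis range, so that the
    formants fall into the same channels as the reference speaker's.
    """
    if speaker_mean_f3_hz <= 0 or reference_mean_f3_hz <= 0:
        raise ValueError("mean F3 values must be positive")
    ratio = speaker_mean_f3_hz / reference_mean_f3_hz
    return [edge * ratio for edge in band_edges_hz]


# Hypothetical 4-channel analysis bank (five band edges) and F3 values.
reference_edges = [100.0, 500.0, 1200.0, 2800.0, 6000.0]
adjusted_edges = normalized_band_edges(reference_edges, 3000.0, 2500.0)
```

    With these made-up values the ratio is 1.2, so every band edge moves up by 20%; the reference speaker (ratio 1.0) keeps the original edges.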

  2. Speech Perception, Word Recognition and the Structure of the Lexicon. Research on Speech Perception Progress Report No. 10.

    ERIC Educational Resources Information Center

    Pisoni, David B.; And Others

    The results of three projects concerned with auditory word recognition and the structure of the lexicon are reported in this paper. The first project described was designed to test experimentally several specific predictions derived from MACS, a simulation model of the Cohort Theory of word recognition. The second project description provides the…

  3. The Development of the Orthographic Consistency Effect in Speech Recognition: From Sublexical to Lexical Involvement

    ERIC Educational Resources Information Center

    Ventura, Paulo; Morais, Jose; Kolinsky, Regine

    2007-01-01

    The influence of orthography on children's on-line auditory word recognition was studied from the end of Grade 2 to the end of Grade 4, by examining the orthographic consistency effect [Ziegler, J. C., & Ferrand, L. (1998). Orthography shapes the perception of speech: The consistency effect in auditory recognition. "Psychonomic Bulletin & Review",…

  4. Evaluation of auditory functions for Royal Canadian Mounted Police officers.

    PubMed

    Vaillancourt, Véronique; Laroche, Chantal; Giguère, Christian; Beaulieu, Marc-André; Legault, Jean-Pierre

    2011-06-01

    Auditory fitness for duty (AFFD) testing is an important element in an assessment of workers' ability to perform job tasks safely and effectively. Functional hearing is particularly critical to job performance in law enforcement. Most often, assessment is based on pure-tone detection thresholds; however, its validity can be questioned and challenged in court. In an attempt to move beyond the pure-tone audiogram, some organizations like the Royal Canadian Mounted Police (RCMP) are incorporating additional testing to supplement audiometric data in their AFFD protocols, such as measurements of speech recognition in quiet and/or in noise, and sound localization. This article reports on the assessment of RCMP officers wearing hearing aids in speech recognition and sound localization tasks. The purpose was to quantify individual performance in different domains of hearing identified as necessary components of fitness for duty, and to document the type of hearing aids prescribed in the field and their benefit for functional hearing. The data are to help RCMP in making more informed decisions regarding AFFD in officers wearing hearing aids. The proposed new AFFD protocol included unaided and aided measures of speech recognition in quiet and in noise using the Hearing in Noise Test (HINT) and sound localization in the left/right (L/R) and front/back (F/B) horizontal planes. Sixty-four officers were identified and selected by the RCMP to take part in this study on the basis of hearing thresholds exceeding current audiometrically based criteria. This article reports the results of 57 officers wearing hearing aids. Based on individual results, 49% of officers were reclassified from nonoperational status to operational with limitations on fine hearing duties, given their unaided and/or aided performance. 
Group data revealed that hearing aids (1) improved speech recognition thresholds on the HINT, the effects being most prominent in Quiet and in conditions of spatial separation between target and noise (Noise Right and Noise Left) and least considerable in Noise Front; (2) neither significantly improved nor impeded L/R localization; and (3) substantially increased F/B errors in localization in a number of cases. Additional analyses also pointed to the poor ability of threshold data to predict functional abilities for speech in noise (r² = 0.26 to 0.33) and sound localization (r² = 0.03 to 0.28). Only speech in quiet (r² = 0.68 to 0.85) is predicted adequately from threshold data. Combined with previous findings, results indicate that the use of hearing aids can considerably affect F/B localization abilities in a number of individuals. Moreover, speech understanding in noise and sound localization abilities were poorly predicted from pure-tone thresholds, demonstrating the need to specifically test these abilities, both unaided and aided, when assessing AFFD. Finally, further work is needed to develop empirically based hearing criteria for the RCMP and identify best practices in hearing aid fittings for optimal functional hearing abilities. American Academy of Audiology.

  5. Do What I Say! Voice Recognition Makes Major Advances.

    ERIC Educational Resources Information Center

    Ruley, C. Dorsey

    1994-01-01

    Explains voice recognition technology applications in the workplace, schools, and libraries. Highlights include a voice-controlled work station using the DragonDictate system that can be used with dyslexic students, converting text to speech, and converting speech to text. (LRW)

  6. Emotional recognition from the speech signal for a virtual education agent

    NASA Astrophysics Data System (ADS)

    Tickle, A.; Raghu, S.; Elshaw, M.

    2013-06-01

    This paper explores the extraction of features from the speech wave to perform intelligent emotion recognition. A feature extraction tool (openSmile) was used to obtain a baseline set of 998 acoustic features from a set of emotional speech recordings from a microphone. The initial features were reduced to the most important ones so recognition of emotions using a supervised neural network could be performed. Given that the future use of virtual education agents lies with making the agents more interactive, developing agents with the capability to recognise and adapt to the emotional state of humans is an important step.

  7. Robotics control using isolated word recognition of voice input

    NASA Technical Reports Server (NTRS)

    Weiner, J. M.

    1977-01-01

    A speech input/output system is presented that can be used to communicate with a task oriented system. Human speech commands and synthesized voice output extend conventional information exchange capabilities between man and machine by utilizing audio input and output channels. The speech input facility is comprised of a hardware feature extractor and a microprocessor implemented isolated word or phrase recognition system. The recognizer offers a medium sized (100 commands), syntactically constrained vocabulary, and exhibits close to real time performance. The major portion of the recognition processing required is accomplished through software, minimizing the complexity of the hardware feature extractor.

  8. Measuring the effects of spectral smearing and enhancement on speech recognition in noise for adults and children

    PubMed Central

    Nittrouer, Susan; Tarr, Eric; Wucinich, Taylor; Moberly, Aaron C.; Lowenstein, Joanna H.

    2015-01-01

    Broadened auditory filters associated with sensorineural hearing loss have clearly been shown to diminish speech recognition in noise for adults, but far less is known about potential effects for children. This study examined speech recognition in noise for adults and children using simulated auditory filters of different widths. Specifically, 5 groups (20 listeners each) of adults or children (5 and 7 yrs), were asked to recognize sentences in speech-shaped noise. Seven-year-olds listened at 0 dB signal-to-noise ratio (SNR) only; 5-yr-olds listened at +3 or 0 dB SNR; and adults listened at 0 or −3 dB SNR. Sentence materials were processed both to smear the speech spectrum (i.e., simulate broadened filters), and to enhance the spectrum (i.e., simulate narrowed filters). Results showed: (1) Spectral smearing diminished recognition for listeners of all ages; (2) spectral enhancement did not improve recognition, and in fact diminished it somewhat; and (3) interactions were observed between smearing and SNR, but only for adults. That interaction made age effects difficult to gauge. Nonetheless, it was concluded that efforts to diagnose the extent of broadening of auditory filters and to develop techniques to correct this condition could benefit patients with hearing loss, especially children. PMID:25920851

  9. Histogram equalization with Bayesian estimation for noise robust speech recognition.

    PubMed

    Suh, Youngjoo; Kim, Hoirin

    2018-02-01

    The histogram equalization approach is an efficient feature normalization technique for noise robust automatic speech recognition. However, it suffers from performance degradation when some fundamental conditions are not satisfied in the test environment. To remedy these limitations of the original histogram equalization methods, class-based histogram equalization approach has been proposed. Although this approach showed substantial performance improvement under noise environments, it still suffers from performance degradation due to the overfitting problem when test data are insufficient. To address this issue, the proposed histogram equalization technique employs the Bayesian estimation method in the test cumulative distribution function estimation. It was reported in a previous study conducted on the Aurora-4 task that the proposed approach provided substantial performance gains in speech recognition systems based on the acoustic modeling of the Gaussian mixture model-hidden Markov model. In this work, the proposed approach was examined in speech recognition systems with deep neural network-hidden Markov model (DNN-HMM), the current mainstream speech recognition approach where it also showed meaningful performance improvement over the conventional maximum likelihood estimation-based method. The fusion of the proposed features with the mel-frequency cepstral coefficients provided additional performance gains in DNN-HMM systems, which otherwise suffer from performance degradation in the clean test condition.
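
    For readers unfamiliar with histogram equalization as a feature normalization step, the sketch below shows the basic order-statistic form: each test value is mapped through the empirical test CDF onto the matching quantile of a reference (training) distribution. This is a minimal illustration under stated assumptions, applied to a single feature dimension with a hypothetical function name; it is not the class-based or Bayesian-estimation variant the abstract proposes.

```python
def histogram_equalize(test_values, reference_values):
    """Map each test value to the reference quantile matching its test-CDF rank."""
    if not test_values or not reference_values:
        raise ValueError("both inputs must be non-empty")
    reference_sorted = sorted(reference_values)
    n_test = len(test_values)
    n_ref = len(reference_sorted)
    # Indices of the test values in ascending order of value.
    order = sorted(range(n_test), key=lambda i: test_values[i])
    equalized = [0.0] * n_test
    for rank, index in enumerate(order):
        # Empirical CDF estimate at this value (mid-rank convention).
        cdf = (rank + 0.5) / n_test
        # Inverse reference CDF, approximated by an order statistic.
        ref_index = min(int(cdf * n_ref), n_ref - 1)
        equalized[index] = reference_sorted[ref_index]
    return equalized
```

    The empirical test CDF is the part that becomes unreliable when test data are scarce, which is the limitation the abstract says the Bayesian estimate is meant to address.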

  10. A longitudinal study of the bilateral benefit in children with bilateral cochlear implants.

    PubMed

    Asp, Filip; Mäki-Torkko, Elina; Karltorp, Eva; Harder, Henrik; Hergils, Leif; Eskilsson, Gunnar; Stenfelt, Stefan

    2015-02-01

    To study the development of the bilateral benefit in children using bilateral cochlear implants by measurements of speech recognition and sound localization. Bilateral and unilateral speech recognition in quiet, in multi-source noise, and horizontal sound localization was measured at three occasions during a two-year period, without controlling for age or implant experience. Longitudinal and cross-sectional analyses were performed. Results were compared to cross-sectional data from children with normal hearing. Seventy-eight children aged 5.1-11.9 years, with a mean bilateral cochlear implant experience of 3.3 years and a mean age of 7.8 years, at inclusion in the study. Thirty children with normal hearing aged 4.8-9.0 years provided normative data. For children with cochlear implants, bilateral and unilateral speech recognition in quiet was comparable whereas a bilateral benefit for speech recognition in noise and sound localization was found at all three test occasions. Absolute performance was lower than in children with normal hearing. Early bilateral implantation facilitated sound localization. A bilateral benefit for speech recognition in noise and sound localization continues to exist over time for children with bilateral cochlear implants, but no relative improvement is found after three years of bilateral cochlear implant experience.

  11. Phonological mismatch makes aided speech recognition in noise cognitively taxing.

    PubMed

    Rudner, Mary; Foo, Catharina; Rönnberg, Jerker; Lunner, Thomas

    2007-12-01

    The working memory framework for Ease of Language Understanding predicts that speech processing becomes more effortful, thus requiring more explicit cognitive resources, when there is mismatch between speech input and phonological representations in long-term memory. To test this prediction, we changed the compression release settings in the hearing instruments of experienced users and allowed them to train for 9 weeks with the new settings. After training, aided speech recognition in noise was tested with both the trained settings and orthogonal settings. We postulated that training would lead to acclimatization to the trained setting, which in turn would involve establishment of new phonological representations in long-term memory. Further, we postulated that after training, testing with orthogonal settings would give rise to phonological mismatch, associated with more explicit cognitive processing. Thirty-two participants (mean=70.3 years, SD=7.7) with bilateral sensorineural hearing loss (pure-tone average=46.0 dB HL, SD=6.5), bilaterally fitted for more than 1 year with digital, two-channel, nonlinear signal processing hearing instruments and chosen from the patient population at the Linköping University Hospital were randomly assigned to 9 weeks training with new, fast (40 ms) or slow (640 ms), compression release settings in both channels. Aided speech recognition in noise performance was tested according to a design with three within-group factors: test occasion (T1, T2), test setting (fast, slow), and type of noise (unmodulated, modulated) and one between-group factor: experience setting (fast, slow) for two types of speech materials-the highly constrained Hagerman sentences and the less-predictable Hearing in Noise Test (HINT). Complex cognitive capacity was measured using the reading span and letter monitoring tests. 
PREDICTION: We predicted that speech recognition in noise at T2 with mismatched experience and test settings would be associated with more explicit cognitive processing and thus stronger correlations with complex cognitive measures, as well as poorer performance if complex cognitive capacity was exceeded. Under mismatch conditions, stronger correlations were found between performance on speech recognition with the Hagerman sentences and reading span, along with poorer speech recognition for participants with low reading span scores. No consistent mismatch effect was found with HINT. The mismatch prediction generated by the working memory framework for Ease of Language Understanding is supported for speech recognition in noise with the highly constrained Hagerman sentences but not the less-predictable HINT.

  12. Improving language models for radiology speech recognition.

    PubMed

    Paulett, John M; Langlotz, Curtis P

    2009-02-01

    Speech recognition systems have become increasingly popular as a means to produce radiology reports, for reasons both of efficiency and of cost. However, the suboptimal recognition accuracy of these systems can affect the productivity of the radiologists creating the text reports. We analyzed a database of over two million de-identified radiology reports to determine the strongest determinants of word frequency. Our results showed that body site and imaging modality had a similar influence on the frequency of words and of three-word phrases as did the identity of the speaker. These findings suggest that the accuracy of speech recognition systems could be significantly enhanced by further tailoring their language models to body site and imaging modality, which are readily available at the time of report creation.

  13. Connected word recognition using a cascaded neuro-computational model

    NASA Astrophysics Data System (ADS)

    Hoya, Tetsuya; van Leeuwen, Cees

    2016-10-01

    We propose a novel framework for processing a continuous speech stream that contains a varying number of words, as well as non-speech periods. Speech samples are segmented into word-tokens and non-speech periods. An augmented version of an earlier-proposed, cascaded neuro-computational model is used for recognising individual words within the stream. Simulation studies using both a multi-speaker-dependent and speaker-independent digit string database show that the proposed method yields a recognition performance comparable to that obtained by a benchmark approach using hidden Markov models with embedded training.

  14. Retrospective Analysis of Clinical Performance of an Estonian Speech Recognition System for Radiology: Effects of Different Acoustic and Language Models.

    PubMed

    Paats, A; Alumäe, T; Meister, E; Fridolin, I

    2018-04-30

    The aim of this study was to analyze retrospectively the influence of different acoustic and language models in order to determine the most important effects on the clinical performance of an Estonian language-based non-commercial radiology-oriented automatic speech recognition (ASR) system. An ASR system was developed for the Estonian language in the radiology domain by utilizing open-source software components (Kaldi toolkit, Thrax). The ASR system was trained with the real radiology text reports and dictations collected during development phases. The final version of the ASR system was tested by 11 radiologists who dictated 219 reports in total, in a spontaneous manner in a real clinical environment. The audio files collected in the final phase were used to measure the performance of different versions of the ASR system retrospectively. ASR system versions were evaluated by word error rate (WER) for each speaker and modality and by WER difference for the first and the last version of the ASR system. Total average WER throughout all material improved from 18.4% for the first version (v1) to 5.8% for the last version (v8), which corresponds to a relative improvement of 68.5%. WER improvement was strongly related to modality and radiologist. In summary, the performance of the final ASR system version was close to optimal, delivering similar results for all modalities and being independent of the user, the complexity of the radiology reports, user experience, and speech characteristics.
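
    The 68.5% figure above follows from the two reported WER values: (18.4 - 5.8) / 18.4 ≈ 0.685. The sketch below reproduces that arithmetic and, for context, computes WER with the usual word-level edit distance. It is an illustrative sketch with hypothetical function names, not the scoring code used in the study.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref_words)


def relative_wer_reduction(first_wer, last_wer):
    """Relative improvement in percent when WER drops from first_wer to last_wer."""
    return 100.0 * (first_wer - last_wer) / first_wer


# Values reported in the abstract: v1 = 18.4% WER, v8 = 5.8% WER.
reduction = round(relative_wer_reduction(18.4, 5.8), 1)
```

    Here `reduction` evaluates to 68.5, matching the relative improvement stated in the abstract.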

  15. Measuring listening effort: driving simulator vs. simple dual-task paradigm

    PubMed Central

    Wu, Yu-Hsiang; Aksan, Nazan; Rizzo, Matthew; Stangl, Elizabeth; Zhang, Xuyang; Bentler, Ruth

    2014-01-01

    Objectives The dual-task paradigm has been widely used to measure listening effort. The primary objectives of the study were to (1) investigate the effect of hearing aid amplification and a hearing aid directional technology on listening effort measured by a complicated, more real world dual-task paradigm, and (2) compare the results obtained with this paradigm to a simpler laboratory-style dual-task paradigm. Design The listening effort of adults with hearing impairment was measured using two dual-task paradigms, wherein participants performed a speech recognition task simultaneously with either a driving task in a simulator or a visual reaction-time task in a sound-treated booth. The speech materials and road noises for the speech recognition task were recorded in a van traveling on the highway in three hearing aid conditions: unaided, aided with omni directional processing (OMNI), and aided with directional processing (DIR). The change in the driving task or the visual reaction-time task performance across the conditions quantified the change in listening effort. Results Compared to the driving-only condition, driving performance declined significantly with the addition of the speech recognition task. Although the speech recognition score was higher in the OMNI and DIR conditions than in the unaided condition, driving performance was similar across these three conditions, suggesting that listening effort was not affected by amplification and directional processing. Results from the simple dual-task paradigm showed a similar trend: hearing aid technologies improved speech recognition performance, but did not affect performance in the visual reaction-time task (i.e., reduce listening effort). The correlation between listening effort measured using the driving paradigm and the visual reaction-time task paradigm was significant. 
The finding showing that our older (56 to 85 years old) participants’ better speech recognition performance did not result in reduced listening effort was not consistent with literature that evaluated younger (approximately 20 years old), normal hearing adults. Because of this, a follow-up study was conducted. In the follow-up study, the visual reaction-time dual-task experiment using the same speech materials and road noises was repeated on younger adults with normal hearing. Contrary to findings with older participants, the results indicated that the directional technology significantly improved performance in both speech recognition and visual reaction-time tasks. Conclusions Adding a speech listening task to driving undermined driving performance. Hearing aid technologies significantly improved speech recognition while driving, but did not significantly reduce listening effort. Listening effort measured by dual-task experiments using a simulated real-world driving task and a conventional laboratory-style task was generally consistent. For a given listening environment, the benefit of hearing aid technologies on listening effort measured from younger adults with normal hearing may not be fully translated to older listeners with hearing impairment. PMID:25083599

  16. Speech recognition: Acoustic-phonetic knowledge acquisition and representation

    NASA Astrophysics Data System (ADS)

    Zue, Victor W.

    1988-09-01

    The long-term research goal is to develop and implement speaker-independent continuous speech recognition systems. It is believed that the proper utilization of speech-specific knowledge is essential for such advanced systems. This research is thus directed toward the acquisition, quantification, and representation of acoustic-phonetic and lexical knowledge, and the application of this knowledge to speech recognition algorithms. In addition, we are exploring new speech recognition alternatives based on artificial intelligence and connectionist techniques. We developed a statistical model for predicting the acoustic realization of stop consonants in various positions in the syllable template. A unification-based grammatical formalism was developed for incorporating this model into the lexical access algorithm. We provided an information-theoretic justification for the hierarchical structure of the syllable template. We analyzed segmental duration for vowels and fricatives in continuous speech. Based on contextual information, we developed durational models for vowels and fricatives that account for over 70 percent of the variance, using data from multiple, unknown speakers. We rigorously evaluated the ability of human spectrogram readers to identify stop consonants spoken by many talkers and in a variety of phonetic contexts. Incorporating the declarative knowledge used by the readers, we developed a knowledge-based system for stop identification. We achieved system performance comparable to that of the readers.

  17. Across-site patterns of modulation detection: Relation to speech recognition

    PubMed Central

    Garadat, Soha N.; Zwolan, Teresa A.; Pfingst, Bryan E.

    2012-01-01

    The aim of this study was to identify across-site patterns of modulation detection thresholds (MDTs) in subjects with cochlear implants and to determine if removal of sites with the poorest MDTs from speech processor programs would result in improved speech recognition. Five hundred millisecond trains of symmetric-biphasic pulses were modulated sinusoidally at 10 Hz and presented at a rate of 900 pps using monopolar stimulation. Subjects were asked to discriminate a modulated pulse train from an unmodulated pulse train for all electrodes in quiet and in the presence of an interleaved unmodulated masker presented on the adjacent site. Across-site patterns of masked MDTs were then used to construct two 10-channel MAPs such that one MAP consisted of sites with the best masked MDTs and the other MAP consisted of sites with the worst masked MDTs. Subjects’ speech recognition skills were compared when they used these two different MAPs. Results showed that MDTs were variable across sites and were elevated in the presence of a masker by various amounts across sites. Better speech recognition was observed when the processor MAP consisted of sites with best masked MDTs, suggesting that temporal modulation sensitivity has important contributions to speech recognition with a cochlear implant. PMID:22559376
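
    The stimulus parameters in this record imply 450 pulses per 500 ms train and five full modulation cycles at 10 Hz. A minimal sketch of such a train follows; it assumes that modulation is applied as a sinusoidal scaling of a per-pulse level, and the modulation depth, base level, and function name are hypothetical values chosen for illustration. The study's actual stimulation hardware and the pulse parameter that was modulated are not described here.

```python
import math


def modulated_pulse_levels(duration_s=0.5, pulse_rate_pps=900,
                           mod_freq_hz=10.0, mod_depth=0.2, base_level=1.0):
    """Per-pulse levels of a sinusoidally modulated pulse train.

    mod_depth and base_level are illustrative assumptions; a depth of 0
    gives the unmodulated comparison train.
    """
    n_pulses = int(round(duration_s * pulse_rate_pps))
    levels = []
    for k in range(n_pulses):
        t = k / pulse_rate_pps  # onset time of pulse k in seconds
        envelope = 1.0 + mod_depth * math.sin(2.0 * math.pi * mod_freq_hz * t)
        levels.append(base_level * envelope)
    return levels


modulated = modulated_pulse_levels()
unmodulated = modulated_pulse_levels(mod_depth=0.0)
print(len(modulated))  # 450 pulses in a 500 ms train at 900 pps
```

    A modulation detection threshold would then be the smallest `mod_depth` at which a listener reliably tells `modulated` from `unmodulated`.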

  18. Sentence Recognition Prediction for Hearing-impaired Listeners in Stationary and Fluctuation Noise With FADE: Empowering the Attenuation and Distortion Concept by Plomp With a Quantitative Processing Model.

    PubMed

    Kollmeier, Birger; Schädler, Marc René; Warzybok, Anna; Meyer, Bernd T; Brand, Thomas

    2016-09-07

    To characterize the individual patient's hearing impairment as obtained with the matrix sentence recognition test, a simulation Framework for Auditory Discrimination Experiments (FADE) is extended here using the Attenuation and Distortion (A+D) approach by Plomp as a blueprint for setting the individual processing parameters. FADE has been shown to predict the outcome of both speech recognition tests and psychoacoustic experiments based on simulations using an automatic speech recognition system requiring only few assumptions. It builds on the closed-set matrix sentence recognition test which is advantageous for testing individual speech recognition in a way comparable across languages. Individual predictions of speech recognition thresholds in stationary and in fluctuating noise were derived using the audiogram and an estimate of the internal level uncertainty for modeling the individual Plomp curves fitted to the data with the Attenuation (A-) and Distortion (D-) parameters of the Plomp approach. The "typical" audiogram shapes from Bisgaard et al with or without a "typical" level uncertainty and the individual data were used for individual predictions. As a result, the individualization of the level uncertainty was found to be more important than the exact shape of the individual audiogram to accurately model the outcome of the German Matrix test in stationary or fluctuating noise for listeners with hearing impairment. The prediction accuracy of the individualized approach also outperforms the (modified) Speech Intelligibility Index approach which is based on the individual threshold data only. © The Author(s) 2016.

  19. Repeated Speech Errors: Evidence for Learning

    ERIC Educational Resources Information Center

    Humphreys, Karin R.; Menzies, Heather; Lake, Johanna K.

    2010-01-01

    Three experiments elicited phonological speech errors using the SLIP procedure to investigate whether there is a tendency for speech errors on specific words to reoccur, and whether this effect can be attributed to implicit learning of an incorrect mapping from lemma to phonology for that word. In Experiment 1, when speakers made a phonological…

  20. Cascading Influences on the Production of Speech: Evidence from Articulation

    ERIC Educational Resources Information Center

    McMillan, Corey T.; Corley, Martin

    2010-01-01

    Recent investigations have supported the suggestion that phonological speech errors may reflect the simultaneous activation of more than one phonemic representation. This presents a challenge for speech error evidence which is based on the assumption of well-formedness, because we may continue to perceive well-formed errors, even when they are not…

  1. Science 101: How Does Speech-Recognition Software Work?

    ERIC Educational Resources Information Center

    Robertson, Bill

    2016-01-01

    This column provides background science information for elementary teachers. Many innovations with computer software begin with analysis of how humans do a task. This article takes a look at how humans recognize spoken words and explains the origins of speech-recognition software.

  2. Watch what you say, your computer might be listening: A review of automated speech recognition

    NASA Technical Reports Server (NTRS)

    Degennaro, Stephen V.

    1991-01-01

    Spoken language is the most convenient and natural means by which people interact with each other and is, therefore, a promising candidate for human-machine interactions. Speech also offers an additional channel for hands-busy applications, complementing the use of motor output channels for control. Current speech recognition systems vary considerably across a number of important characteristics, including vocabulary size, speaking mode, training requirements for new speakers, robustness to acoustic environments, and accuracy. Algorithmically, these systems range from rule-based techniques through more probabilistic or self-learning approaches such as hidden Markov modeling and neural networks. This tutorial begins with a brief summary of the relevant features of current speech recognition systems and the strengths and weaknesses of the various algorithmic approaches.

  3. Some Neurocognitive Correlates of Noise-Vocoded Speech Perception in Children With Normal Hearing: A Replication and Extension of Eisenberg et al. (2002).

    PubMed

    Roman, Adrienne S; Pisoni, David B; Kronenberger, William G; Faulkner, Kathleen F

    Noise-vocoded speech is a valuable research tool for testing experimental hypotheses about the effects of spectral degradation on speech recognition in adults with normal hearing (NH). However, very little research has utilized noise-vocoded speech with children with NH. Earlier studies with children with NH focused primarily on the amount of spectral information needed for speech recognition without assessing the contribution of neurocognitive processes to speech perception and spoken word recognition. In this study, we first replicated the seminal findings reported by Eisenberg et al. (2002), who investigated effects of lexical density and word frequency on noise-vocoded speech perception in a small group of children with NH. We then extended the research to investigate relations between noise-vocoded speech recognition abilities and five neurocognitive measures: auditory attention (AA) and response set, talker discrimination, and verbal and nonverbal short-term working memory. Thirty-one children with NH between 5 and 13 years of age were assessed on their ability to perceive lexically controlled words in isolation and in sentences that were noise-vocoded to four spectral channels. Children were also administered vocabulary assessments (Peabody Picture Vocabulary test-4th Edition and Expressive Vocabulary test-2nd Edition) and measures of AA (NEPSY AA and response set and a talker discrimination task) and short-term memory (visual digit and symbol spans). Consistent with the findings reported in the original Eisenberg et al. (2002) study, we found that children perceived noise-vocoded lexically easy words better than lexically hard words. Words in sentences were also recognized better than the same words presented in isolation. No significant correlations were observed between noise-vocoded speech recognition scores and the Peabody Picture Vocabulary test-4th Edition using language quotients to control for age effects. 
However, children who scored higher on the Expressive Vocabulary test-2nd Edition recognized lexically easy words better than lexically hard words in sentences. Older children perceived noise-vocoded speech better than younger children. Finally, we found that measures of AA and short-term memory capacity were significantly correlated with a child's ability to perceive noise-vocoded isolated words and sentences. First, we successfully replicated the major findings from the Eisenberg et al. (2002) study. Because familiarity, phonological distinctiveness and lexical competition affect word recognition, these findings provide additional support for the proposal that several foundational elementary neurocognitive processes underlie the perception of spectrally degraded speech. Second, we found strong and significant correlations between performance on neurocognitive measures and children's ability to recognize words and sentences noise-vocoded to four spectral channels. These findings extend earlier research suggesting that perception of spectrally degraded speech reflects early peripheral auditory processes, as well as additional contributions of executive function, specifically, selective attention and short-term memory processes in spoken word recognition. The present findings suggest that AA and short-term memory support robust spoken word recognition in children with NH even under compromised and challenging listening conditions. These results are relevant to research carried out with listeners who have hearing loss, because they are routinely required to encode, process, and understand spectrally degraded acoustic signals.

  4. Some Neurocognitive Correlates of Noise-Vocoded Speech Perception in Children with Normal Hearing: A Replication and Extension of Eisenberg et al., 2002

    PubMed Central

    Roman, Adrienne S.; Pisoni, David B.; Kronenberger, William G.; Faulkner, Kathleen F.

    2016-01-01

    Objectives Noise-vocoded speech is a valuable research tool for testing experimental hypotheses about the effects of spectral-degradation on speech recognition in adults with normal hearing (NH). However, very little research has utilized noise-vocoded speech with children with NH. Earlier studies with children with NH focused primarily on the amount of spectral information needed for speech recognition without assessing the contribution of neurocognitive processes to speech perception and spoken word recognition. In this study, we first replicated the seminal findings reported by Eisenberg et al. (2002) who investigated effects of lexical density and word frequency on noise-vocoded speech perception in a small group of children with NH. We then extended the research to investigate relations between noise-vocoded speech recognition abilities and five neurocognitive measures: auditory attention and response set, talker discrimination and verbal and nonverbal short-term working memory. Design Thirty-one children with NH between 5 and 13 years of age were assessed on their ability to perceive lexically controlled words in isolation and in sentences that were noise-vocoded to four spectral channels. Children were also administered vocabulary assessments (PPVT-4 and EVT-2) and measures of auditory attention (NEPSY Auditory Attention (AA) and Response Set (RS) and a talker discrimination task (TD)) and short-term memory (visual digit and symbol spans). Results Consistent with the findings reported in the original Eisenberg et al. (2002) study, we found that children perceived noise-vocoded lexically easy words better than lexically hard words. Words in sentences were also recognized better than the same words presented in isolation. No significant correlations were observed between noise-vocoded speech recognition scores and the PPVT-4 using language quotients to control for age effects. 
However, children who scored higher on the EVT-2 recognized lexically easy words better than lexically hard words in sentences. Older children perceived noise-vocoded speech better than younger children. Finally, we found that measures of auditory attention and short-term memory capacity were significantly correlated with a child’s ability to perceive noise-vocoded isolated words and sentences. Conclusions First, we successfully replicated the major findings from the Eisenberg et al. (2002) study. Because familiarity, phonological distinctiveness and lexical competition affect word recognition, these findings provide additional support for the proposal that several foundational elementary neurocognitive processes underlie the perception of spectrally-degraded speech. Second, we found strong and significant correlations between performance on neurocognitive measures and children’s ability to recognize words and sentences noise-vocoded to four spectral channels. These findings extend earlier research suggesting that perception of spectrally-degraded speech reflects early peripheral auditory processes as well as additional contributions of executive function, specifically, selective attention and short-term memory processes in spoken word recognition. The present findings suggest that auditory attention and short-term memory support robust spoken word recognition in children with NH even under compromised and challenging listening conditions. These results are relevant to research carried out with listeners who have hearing loss, since they are routinely required to encode, process and understand spectrally-degraded acoustic signals. PMID:28045787

  5. Perception of Sung Speech in Bimodal Cochlear Implant Users.

    PubMed

    Crew, Joseph D; Galvin, John J; Fu, Qian-Jie

    2016-11-11

    Combined use of a hearing aid (HA) and cochlear implant (CI) has been shown to improve CI users' speech and music performance. However, different hearing devices, test stimuli, and listening tasks may interact and obscure bimodal benefits. In this study, speech and music perception were measured in bimodal listeners for CI-only, HA-only, and CI + HA conditions, using the Sung Speech Corpus, a database of monosyllabic words produced at different fundamental frequencies. Sentence recognition was measured using sung speech in which pitch was held constant or varied across words, as well as for spoken speech. Melodic contour identification (MCI) was measured using sung speech in which the words were held constant or varied across notes. Results showed that sentence recognition was poorer with sung speech relative to spoken, with little difference between sung speech with a constant or variable pitch; mean performance was better with CI-only relative to HA-only, and best with CI + HA. MCI performance was better with constant words versus variable words; mean performance was better with HA-only than with CI-only and was best with CI + HA. Relative to CI-only, a strong bimodal benefit was observed for speech and music perception. Relative to the better ear, bimodal benefits remained strong for sentence recognition but were marginal for MCI. While variations in pitch and timbre may negatively affect CI users' speech and music perception, bimodal listening may partially compensate for these deficits. © The Author(s) 2016.

  6. Sperry Univac speech communications technology

    NASA Technical Reports Server (NTRS)

    Medress, Mark F.

    1977-01-01

    Technology and systems for effective verbal communication with computers were developed. A continuous speech recognition system for verbal input, a word spotting system to locate key words in conversational speech, prosodic tools to aid speech analysis, and a prerecorded voice response system for speech output are described.

  7. Assistive Technology and Adults with Learning Disabilities: A Blueprint for Exploration and Advancement.

    ERIC Educational Resources Information Center

    Raskind, Marshall

    1993-01-01

    This article describes assistive technologies for persons with learning disabilities, including word processing, spell checking, proofreading programs, outlining/"brainstorming" programs, abbreviation expanders, speech recognition, speech synthesis/screen review, optical character recognition systems, personal data managers, free-form databases,…

  8. Speech Recognition for A Digital Video Library.

    ERIC Educational Resources Information Center

    Witbrock, Michael J.; Hauptmann, Alexander G.

    1998-01-01

    Production of the meta-data supporting the Informedia Digital Video Library interface is automated using techniques derived from artificial intelligence research. Speech recognition and natural-language processing, information retrieval, and image analysis are applied to produce an interface that helps users locate information and navigate more…

  9. Speech as a pilot input medium

    NASA Technical Reports Server (NTRS)

    Plummer, R. P.; Coler, C. R.

    1977-01-01

    The speech recognition system under development is a trainable pattern classifier based on a maximum-likelihood technique. An adjustable uncertainty threshold allows the rejection of borderline cases for which the probability of misclassification is high. The syntax of the spoken command language may be used as an aid to recognition, and the system adapts to changes in pronunciation if feedback from the user is available. Words must be separated by 0.25-second gaps. The system runs in real time on a mini-computer (PDP 11/10) and was tested on 120,000 speech samples from 10- and 100-word vocabularies. The results of these tests were 99.9% correct recognition for a vocabulary consisting of the ten digits, and 99.6% recognition for a 100-word vocabulary of flight commands, with a 5% rejection rate in each case. With no rejection, the recognition accuracies for the same vocabularies were 99.5% and 98.6%, respectively.
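
    The idea of a maximum-likelihood classifier with an adjustable rejection threshold can be sketched as follows. This is an illustration only: the record does not describe the feature set or the likelihood model, so the sketch assumes one diagonal Gaussian per word, equal priors, and a threshold on the winning word's posterior. All names and the default threshold are hypothetical.

```python
import math


def fit_gaussians(training):
    """training maps each word to a list of feature vectors.

    Returns a per-word (means, variances) pair for a diagonal Gaussian.
    """
    models = {}
    for word, vectors in training.items():
        n, dim = len(vectors), len(vectors[0])
        means = [sum(v[d] for v in vectors) / n for d in range(dim)]
        # A small variance floor keeps the log-likelihood finite.
        variances = [max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, 1e-3)
                     for d in range(dim)]
        models[word] = (means, variances)
    return models


def log_likelihood(x, model):
    means, variances = model
    return sum(-0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
               for xi, mu, var in zip(x, means, variances))


def classify(x, models, min_posterior=0.9):
    """Return the most likely word, or None when the decision is borderline."""
    scores = {word: log_likelihood(x, m) for word, m in models.items()}
    best = max(scores, key=scores.get)
    # Posterior of the best word under equal priors, computed stably.
    total = sum(math.exp(s - scores[best]) for s in scores.values())
    posterior = 1.0 / total
    return best if posterior >= min_posterior else None


models = fit_gaussians({"one": [[0.0], [1.0]], "two": [[2.0], [3.0]]})
print(classify([0.4], models))  # one
print(classify([1.5], models))  # None (equidistant, so rejected)
```

    Raising `min_posterior` rejects more borderline inputs in exchange for fewer misclassifications, which mirrors the trade-off between the 5% rejection and no-rejection accuracies reported in the record.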

  10. Error Consistency in Acquired Apraxia of Speech with Aphasia: Effects of the Analysis Unit

    ERIC Educational Resources Information Center

    Haley, Katarina L.; Cunningham, Kevin T.; Eaton, Catherine Torrington; Jacks, Adam

    2018-01-01

    Purpose: Diagnostic recommendations for acquired apraxia of speech (AOS) have been contradictory concerning whether speech sound errors are consistent or variable. Studies have reported divergent findings that, on face value, could argue either for or against error consistency as a diagnostic criterion. The purpose of this study was to explain…

  11. Listeners Experience Linguistic Masking Release in Noise-Vocoded Speech-in-Speech Recognition

    ERIC Educational Resources Information Center

    Viswanathan, Navin; Kokkinakis, Kostas; Williams, Brittany T.

    2018-01-01

    Purpose: The purpose of this study was to evaluate whether listeners with normal hearing perceiving noise-vocoded speech-in-speech demonstrate better intelligibility of target speech when the background speech was mismatched in language (linguistic release from masking [LRM]) and/or location (spatial release from masking [SRM]) relative to the…

  12. Detection of target phonemes in spontaneous and read speech.

    PubMed

    Mehta, G; Cutler, A

    1988-01-01

    Although spontaneous speech occurs more frequently in most listeners' experience than read speech, laboratory studies of human speech recognition typically use carefully controlled materials read from a script. The phonological and prosodic characteristics of spontaneous and read speech differ considerably, however, which suggests that laboratory results may not generalise to the recognition of spontaneous speech. In the present study listeners were presented with both spontaneous and read speech materials, and their response time to detect word-initial target phonemes was measured. Responses were, overall, equally fast in each speech mode. However, analysis of effects previously reported in phoneme detection studies revealed significant differences between speech modes. In read speech but not in spontaneous speech, later targets were detected more rapidly than targets preceded by short words. In contrast, in spontaneous speech but not in read speech, targets were detected more rapidly in accented than in unaccented words and in strong than in weak syllables. An explanation for this pattern is offered in terms of characteristic prosodic differences between spontaneous and read speech. The results support claims from previous work that listeners pay great attention to prosodic information in the process of recognising speech.

  13. Refreshing Refreshable Braille Displays.

    PubMed

    Russomanno, Alexander; O'Modhrain, Sile; Gillespie, R Brent; Rodger, Matthew W M

    2015-01-01

    The increased access to books afforded to blind people via e-publishing has given them long-sought independence for both recreational and educational reading. In most cases, blind readers access materials using speech output. For some content such as highly technical texts, music, and graphics, speech is not an appropriate access modality as it does not promote deep understanding. Therefore blind braille readers often prefer electronic braille displays. But, these are prohibitively expensive. The search is on, therefore, for a low-cost refreshable display that would go beyond current technologies and deliver graphical content as well as text. And many solutions have been proposed, some of which reduce costs by restricting the number of characters that can be displayed, even down to a single braille cell. In this paper, we demonstrate that restricting tactile cues during braille reading leads to poorer performance in a letter recognition task. In particular, we show that lack of sliding contact between the fingertip and the braille reading surface results in more errors and that the number of errors increases as a function of presentation speed. These findings suggest that single cell displays which do not incorporate sliding contact are likely to be less effective for braille reading.

  14. Modeling the effect of channel number and interaction on consonant recognition in a cochlear implant peak-picking strategy.

    PubMed

    Verschuur, Carl

    2009-03-01

    Difficulties in speech recognition experienced by cochlear implant users may be attributed both to information loss caused by signal processing and to information loss associated with the interface between the electrode array and auditory nervous system, including cross-channel interaction. The objective of the work reported here was to attempt to partial out the relative contribution of these different factors to consonant recognition. This was achieved by comparing patterns of consonant feature recognition as a function of channel number and presence/absence of background noise in users of the Nucleus 24 device with normal hearing subjects listening to acoustic models that mimicked processing of that device. Additionally, in the acoustic model experiment, a simulation of cross-channel spread of excitation, or "channel interaction," was varied. Results showed that acoustic model experiments were highly correlated with patterns of performance in better-performing cochlear implant users. Deficits to consonant recognition in this subgroup could be attributed to cochlear implant processing, whereas channel interaction played a much smaller role in determining performance errors. The study also showed that large changes to channel number in the Advanced Combination Encoder signal processing strategy led to no substantial changes in performance.

  15. LANDMARK-BASED SPEECH RECOGNITION: REPORT OF THE 2004 JOHNS HOPKINS SUMMER WORKSHOP.

    PubMed

    Hasegawa-Johnson, Mark; Baker, James; Borys, Sarah; Chen, Ken; Coogan, Emily; Greenberg, Steven; Juneja, Amit; Kirchhoff, Katrin; Livescu, Karen; Mohan, Srividya; Muller, Jennifer; Sonmez, Kemal; Wang, Tianyu

    2005-01-01

    Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.
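
    The log-linear combination step described in this record amounts to a weighted sum of log scores per hypothesis, followed by picking the highest-scoring hypothesis. The sketch below assumes a flat list of candidate hypotheses rather than a real lattice; the score names, weights, and example values are hypothetical and do not come from the workshop systems.

```python
def combined_score(log_scores, weights):
    """Weighted sum of log-domain scores (a log-linear combination)."""
    return sum(weights[name] * value for name, value in log_scores.items())


def rescore(hypotheses, weights):
    """Return the hypothesis with the highest combined score."""
    return max(hypotheses, key=lambda h: combined_score(h["scores"], weights))


# Illustrative candidates with first-pass scores plus a pronunciation-model score.
hypotheses = [
    {"words": "recognize speech",
     "scores": {"acoustic": -10.0, "language": -4.0, "pronunciation": -6.0}},
    {"words": "wreck a nice beach",
     "scores": {"acoustic": -9.5, "language": -5.0, "pronunciation": -2.0}},
]

first_pass = {"acoustic": 1.0, "language": 1.0, "pronunciation": 0.0}
second_pass = {"acoustic": 1.0, "language": 1.0, "pronunciation": 0.5}

print(rescore(hypotheses, first_pass)["words"])   # recognize speech
print(rescore(hypotheses, second_pass)["words"])  # wreck a nice beach
```

    Setting a weight to zero removes that knowledge source, so the first-pass ranking is recovered as a special case; in practice the weights would be tuned on held-out data.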

  16. An Analysis of Individual Differences in Recognizing Monosyllabic Words Under the Speech Intelligibility Index Framework

    PubMed Central

    Shen, Yi; Kern, Allison B.

    2018-01-01

    Individual differences in the recognition of monosyllabic words, either in isolation (NU6 test) or in sentence context (SPIN test), were investigated under the theoretical framework of the speech intelligibility index (SII). An adaptive psychophysical procedure, namely the quick-band-importance-function procedure, was developed to enable the fitting of the SII model to individual listeners. Using this procedure, the band importance function (i.e., the relative weights of speech information across the spectrum) and the link function relating the SII to recognition scores can be simultaneously estimated while requiring only 200 to 300 trials of testing. Octave-frequency band importance functions and link functions were estimated separately for NU6 and SPIN materials from 30 normal-hearing listeners who were naïve to speech recognition experiments. For each type of speech material, considerable individual differences in the spectral weights were observed in some but not all frequency regions. At frequencies where the greatest intersubject variability was found, the spectral weights were correlated between the two speech materials, suggesting that the variability in spectral weights reflected listener-originated factors. PMID:29532711
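
    The two quantities estimated in this record can be sketched in a few lines. The SII is a sum over frequency bands of band importance times band audibility, with importance weights summing to one. The logistic link function, its parameters, and the function names below are assumptions made for illustration; the record does not state the functional form used by the authors.

```python
import math


def speech_intelligibility_index(band_importance, band_audibility):
    """SII as the importance-weighted sum of per-band audibility (0 to 1)."""
    if len(band_importance) != len(band_audibility):
        raise ValueError("one audibility value is needed per band")
    total = sum(band_importance)
    if total <= 0:
        raise ValueError("band importance weights must sum to a positive value")
    # Normalize the weights and clip audibility to the valid range.
    return sum((w / total) * min(max(a, 0.0), 1.0)
               for w, a in zip(band_importance, band_audibility))


def logistic_link(sii, midpoint=0.3, slope=10.0):
    """Hypothetical link from SII to proportion of words recognized."""
    return 1.0 / (1.0 + math.exp(-slope * (sii - midpoint)))


weights = [0.2, 0.3, 0.5]     # illustrative band importance function
audibility = [1.0, 0.5, 0.0]  # illustrative per-band audibility
sii = speech_intelligibility_index(weights, audibility)
print(round(sii, 2))  # 0.35
```

    Individual differences of the kind reported above would appear as listener-specific `weights` and link-function parameters fitted to each listener's trial data.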

  17. Particle Swarm Optimization Based Feature Enhancement and Feature Selection for Improved Emotion Recognition in Speech and Glottal Signals

    PubMed Central

    Muthusamy, Hariharan; Polat, Kemal; Yaacob, Sazali

    2015-01-01

    In recent years, many studies have used speech-related features for speech emotion recognition; however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features and relative wavelet packet energy and entropy features were extracted from the emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were used to evaluate the proposed method. An extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted and the results show that the proposed method significantly improves the speech emotion recognition performance compared to previous works published in the literature. PMID:25799141

  18. Effect of high-frequency spectral components in computer recognition of dysarthric speech based on a Mel-cepstral stochastic model.

    PubMed

    Polur, Prasad D; Miller, Gerald E

    2005-01-01

    Computer speech recognition of individuals with dysarthria, such as cerebral palsy patients, requires a robust technique that can handle conditions of very high variability and limited training data. In this study, a hidden Markov model (HMM) was constructed and conditions investigated that would provide improved performance for a dysarthric speech (isolated word) recognition system intended to act as an assistive/control tool. In particular, we investigated the effect of high-frequency spectral components on the recognition rate of the system to determine if they contributed useful additional information to the system. A small-size vocabulary spoken by three cerebral palsy subjects was chosen. Mel-frequency cepstral coefficients extracted with the use of 15 ms frames served as training input to an ergodic HMM setup. Subsequent results demonstrated that no significant useful information was available to the system for enhancing its ability to discriminate dysarthric speech above 5.5 kHz in the current set of dysarthric data. The level of variability in input dysarthric speech patterns limits the reliability of the system. However, its application as a rehabilitation/control tool to assist dysarthric motor-impaired individuals such as cerebral palsy subjects holds sufficient promise.
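
    The 15 ms framing that precedes cepstral analysis can be sketched as follows. Only the frame duration comes from the abstract; the 16 kHz sample rate, the non-overlapping default, and the function name `frame_signal` are assumptions, and the MFCC computation itself is not shown.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=15.0, hop_ms=None):
    """Split a signal into fixed-length frames; a trailing partial frame is dropped."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    # Assumption: frames do not overlap unless a hop size is given.
    hop = frame_len if hop_ms is None else int(round(sample_rate_hz * hop_ms / 1000.0))
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame and hop lengths must be positive")
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


# One second of silence at an assumed 16 kHz sample rate.
signal = [0.0] * 16000
frames = frame_signal(signal, 16000)
print(len(frames), len(frames[0]))
```

    Each frame would then be passed to a Mel-cepstral front end, and the resulting coefficient vectors would serve as the HMM observations.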

  19. Application of Business Process Management to drive the deployment of a speech recognition system in a healthcare organization.

    PubMed

    González Sánchez, María José; Framiñán Torres, José Manuel; Parra Calderón, Carlos Luis; Del Río Ortega, Juan Antonio; Vigil Martín, Eduardo; Nieto Cervera, Jaime

    2008-01-01

    We present a methodology based on Business Process Management to guide the development of a speech recognition system in a hospital in Spain. The methodology eases the deployment of the system by 1) involving the clinical staff in the process, 2) providing the IT professionals with a description of the process and its requirements, 3) assessing advantages and disadvantages of the speech recognition system, as well as its impact on the organisation, and 4) helping to reorganise the healthcare process before implementing the new technology in order to identify how it can better contribute to the overall objective of the organisation.

  20. [Investigating phonological planning processes in speech production through a speech-error induction technique].

    PubMed

    Nakayama, Masataka; Saito, Satoru

    2015-08-01

    The present study investigated principles of phonological planning, a common serial ordering mechanism for speech production and phonological short-term memory. Nakayama and Saito (2014) have investigated the principles by using a speech-error induction technique, in which participants were exposed to an auditory distractor word immediately before an utterance of a target word. They demonstrated within-word adjacent mora exchanges and serial position effects on error rates. These findings support, respectively, the temporal distance and the edge principles at a within-word level. As this previous study induced errors using word distractors created by exchanging adjacent morae in the target words, it is possible that the speech errors are expressions of lexical intrusions reflecting interactive activation of phonological and lexical/semantic representations. To eliminate this possibility, the present study used nonword distractors that had no lexical or semantic representations. This approach successfully replicated the error patterns identified in the abovementioned study, further confirming that the temporal distance and edge principles are organizing precepts in phonological planning.

  1. Error Biases in Inner and Overt Speech: Evidence from Tongue Twisters

    ERIC Educational Resources Information Center

    Corley, Martin; Brocklehurst, Paul H.; Moat, H. Susannah

    2011-01-01

    To compare the properties of inner and overt speech, Oppenheim and Dell (2008) counted participants' self-reported speech errors when reciting tongue twisters either overtly or silently and found a bias toward substituting phonemes that resulted in words in both conditions, but a bias toward substituting similar phonemes only when speech was…

  2. Psychometric Functions for Shortened Administrations of a Speech Recognition Approach Using Tri-Word Presentations and Phonemic Scoring

    ERIC Educational Resources Information Center

    Gelfand, Stanley A.; Gelfand, Jessica T.

    2012-01-01

    Method: Complete psychometric functions for phoneme and word recognition scores at 8 signal-to-noise ratios from -15 dB to 20 dB were generated for the first 10, 20, and 25, as well as all 50, three-word presentations of the Tri-Word or Computer Assisted Speech Recognition Assessment (CASRA) Test (Gelfand, 1998) based on the results of 12…

  3. Adaptive method of recognition of signals for one and two-frequency signal system in the telephony on the background of speech

    NASA Astrophysics Data System (ADS)

    Kuznetsov, Michael V.

    2006-05-01

    For various automatic telecommunication systems, including the transmission systems of optical communication networks, to work together reliably, the signals of one- or two-frequency service signal systems must be recognized reliably. Analysing the time parameters of a received signal increases the reliability of detecting and recognizing the service signals against a background of speech.

  4. Age-related Effects on Word Recognition: Reliance on Cognitive Control Systems with Structural Declines in Speech-responsive Cortex

    PubMed Central

    Walczak, Adam; Ahlstrom, Jayne; Denslow, Stewart; Horwitz, Amy; Dubno, Judy R.

    2008-01-01

    Speech recognition can be difficult and effortful for older adults, even for those with normal hearing. Declining frontal lobe cognitive control has been hypothesized to cause age-related speech recognition problems. This study examined age-related changes in frontal lobe function for 15 clinically normal hearing adults (21–75 years) when they performed a word recognition task that was made challenging by decreasing word intelligibility. Although there were no age-related changes in word recognition, there were age-related changes in the degree of activity within left middle frontal gyrus (MFG) and anterior cingulate (ACC) regions during word recognition. Older adults engaged left MFG and ACC regions when words were most intelligible compared to younger adults who engaged these regions when words were least intelligible. Declining gray matter volume within temporal lobe regions responsive to word intelligibility significantly predicted left MFG activity, even after controlling for total gray matter volume, suggesting that declining structural integrity of brain regions responsive to speech leads to the recruitment of frontal regions when words are easily understood. Electronic supplementary material The online version of this article (doi:10.1007/s10162-008-0113-3) contains supplementary material, which is available to authorized users. PMID:18274825

  5. An audiovisual emotion recognition system

    NASA Astrophysics Data System (ADS)

    Han, Yi; Wang, Guoyin; Yang, Yong; He, Kun

    2007-12-01

    Human emotions can be expressed through many bio-symbols, speech and facial expression being two of them. Both are regarded as emotional information that plays an important role in human-computer interaction. Based on our previous studies on emotion recognition, an audiovisual emotion recognition system is developed and presented in this paper. The system is designed for real-time use, which is supported by several integrated modules. These modules include speech enhancement for eliminating noise, rapid face detection for locating the face in the background image, example-based shape learning for facial feature alignment, and an optical-flow-based tracking algorithm for facial feature tracking. It is known that irrelevant features and high dimensionality of the data can hurt classifier performance, and rough set-based feature selection is a good method for dimension reduction. Accordingly, 13 of 37 speech features and 10 of 33 facial features are selected to represent emotional information, and 52 audiovisual features are selected due to the synchronization when speech and video are fused together. The experimental results demonstrate that this system performs well in real-time use and has a high recognition rate. Our results also show that multimodule fused recognition will become the trend in emotion recognition.

  6. The Relationship Between Spectral Modulation Detection and Speech Recognition: Adult Versus Pediatric Cochlear Implant Recipients

    PubMed Central

    Noble, Jack H.; Camarata, Stephen M.; Sunderhaus, Linsey W.; Dwyer, Robert T.; Dawant, Benoit M.; Dietrich, Mary S.; Labadie, Robert F.

    2018-01-01

    Adult cochlear implant (CI) recipients demonstrate a reliable relationship between spectral modulation detection and speech understanding. Prior studies documenting this relationship have focused on postlingually deafened adult CI recipients—leaving an open question regarding the relationship between spectral resolution and speech understanding for adults and children with prelingual onset of deafness. Here, we report CI performance on the measures of speech recognition and spectral modulation detection for 578 CI recipients including 477 postlingual adults, 65 prelingual adults, and 36 prelingual pediatric CI users. The results demonstrated a significant correlation between spectral modulation detection and various measures of speech understanding for 542 adult CI recipients. For 36 pediatric CI recipients, however, there was no significant correlation between spectral modulation detection and speech understanding in quiet or in noise nor was spectral modulation detection significantly correlated with listener age or age at implantation. These findings suggest that pediatric CI recipients might not depend upon spectral resolution for speech understanding in the same manner as adult CI recipients. It is possible that pediatric CI users are making use of different cues, such as those contained within the temporal envelope, to achieve high levels of speech understanding. Further investigation is warranted to investigate the relationship between spectral and temporal resolution and speech recognition to describe the underlying mechanisms driving peripheral auditory processing in pediatric CI users. PMID:29716437

  7. Speech Recognition Using Multiple Features and Multiple Recognizers

    DTIC Science & Technology

    1991-12-03

    [Table-of-contents excerpt] 2.1 Introduction; 2.2 Human Speech Communication Process ... How to Setup ASRT; How to Use Interactive Menus ... recognize a word from an acoustic signal. The human ear and brain perform this type of recognition with incredible speed and precision. Even though

  8. [Improvement in Phoneme Discrimination in Noise in Normal Hearing Adults].

    PubMed

    Schumann, A; Garea Garcia, L; Hoppe, U

    2017-02-01

    Objective: The study's aim was to examine whether phoneme discrimination in noise can be trained in normal hearing adults, and what effect such training has on speech recognition in noise. A specific computerised training program was used, consisting of special nonsense syllables with background noise, to train participants' discrimination ability. Material and Methods: 46 normal hearing subjects took part in this study, 28 as training group participants and 18 as control group participants. Only the training group subjects were asked to train over a period of 3 weeks, twice a week for an hour, with a computer-based training program. Speech recognition in noise was measured before and after training for the training group subjects with the Freiburger Einsilber Test. The control group subjects obtained test and retest measures separated by a 2-3 week break. For the training group, follow-up speech recognition was measured 2-3 months after the end of the training. Results: The majority of training group subjects improved their phoneme discrimination significantly. In addition, their speech recognition in noise improved significantly during the training compared to the control group, and remained stable for a period of time. Conclusions: Phoneme discrimination in noise can be trained in normal hearing adults. The improvements have a positive effect on speech recognition in noise, which also persists over a longer period of time. © Georg Thieme Verlag KG Stuttgart · New York.

  9. Recognition and localization of speech by adult cochlear implant recipients wearing a digital hearing aid in the nonimplanted ear (bimodal hearing).

    PubMed

    Potts, Lisa G; Skinner, Margaret W; Litovsky, Ruth A; Strube, Michael J; Kuk, Francis

    2009-06-01

    The use of bilateral amplification is now common clinical practice for hearing aid users but not for cochlear implant recipients. In the past, most cochlear implant recipients were implanted in one ear and wore only a monaural cochlear implant processor. There has been recent interest in benefits arising from bilateral stimulation that may be present for cochlear implant recipients. One option for bilateral stimulation is the use of a cochlear implant in one ear and a hearing aid in the opposite nonimplanted ear (bimodal hearing). This study evaluated the effect of wearing a cochlear implant in one ear and a digital hearing aid in the opposite ear on speech recognition and localization. A repeated-measures correlational study was completed. Nineteen adult Cochlear Nucleus 24 implant recipients participated in the study. The participants were fit with a Widex Senso Vita 38 hearing aid to achieve maximum audibility and comfort within their dynamic range. Soundfield thresholds, loudness growth, speech recognition, localization, and subjective questionnaires were obtained six-eight weeks after the hearing aid fitting. Testing was completed in three conditions: hearing aid only, cochlear implant only, and cochlear implant and hearing aid (bimodal). All tests were repeated four weeks after the first test session. Repeated-measures analysis of variance was used to analyze the data. Significant effects were further examined using pairwise comparison of means or in the case of continuous moderators, regression analyses. The speech-recognition and localization tasks were unique, in that a speech stimulus presented from a variety of roaming azimuths (140 degree loudspeaker array) was used. Performance in the bimodal condition was significantly better for speech recognition and localization compared to the cochlear implant-only and hearing aid-only conditions. 
    Performance was also different between these conditions when the location (i.e., side of the loudspeaker array that presented the word) was analyzed. In the bimodal condition, the speech-recognition and localization tasks were equal regardless of which side of the loudspeaker array presented the word, while performance was significantly poorer for the monaural conditions (hearing aid only and cochlear implant only) when the words were presented on the side with no stimulation. Binaural loudness summation of 1-3 dB was seen in soundfield thresholds and loudness growth in the bimodal condition. Measures of the audibility of sound with the hearing aid, including unaided thresholds, soundfield thresholds, and the Speech Intelligibility Index, were significant moderators of speech recognition and localization. Based on the questionnaire responses, participants showed a strong preference for bimodal stimulation. These findings suggest that a well-fit digital hearing aid worn in conjunction with a cochlear implant is beneficial to speech recognition and localization. The dynamic test procedures used in this study illustrate the importance of bilateral hearing for locating, identifying, and switching attention between multiple speakers. It is recommended that unilateral cochlear implant recipients, with measurable unaided hearing thresholds, be fit with a hearing aid.

  10. The Effectiveness of Clear Speech as a Masker

    ERIC Educational Resources Information Center

    Calandruccio, Lauren; Van Engen, Kristin; Dhar, Sumitrajit; Bradlow, Ann R.

    2010-01-01

    Purpose: It is established that speaking clearly is an effective means of enhancing intelligibility. Because any signal-processing scheme modeled after known acoustic-phonetic features of clear speech will likely affect both target and competing speech, it is important to understand how speech recognition is affected when a competing speech signal…

  11. Automatic speech recognition and training for severely dysarthric users of assistive technology: the STARDUST project.

    PubMed

    Parker, Mark; Cunningham, Stuart; Enderby, Pam; Hawley, Mark; Green, Phil

    2006-01-01

    The STARDUST project developed robust computer speech recognizers for use by eight people with severe dysarthria and concomitant physical disability to access assistive technologies. Independent computer speech recognizers trained with normal speech are of limited functional use by those with severe dysarthria due to limited and inconsistent proximity to "normal" articulatory patterns. Severe dysarthric output may also be characterized by a small mass of distinguishable phonetic tokens making the acoustic differentiation of target words difficult. Speaker dependent computer speech recognition using Hidden Markov Models was achieved by the identification of robust phonetic elements within the individual speaker output patterns. A new system of speech training using computer generated visual and auditory feedback reduced the inconsistent production of key phonetic tokens over time.

  12. Visual speech information: a help or hindrance in perceptual processing of dysarthric speech.

    PubMed

    Borrie, Stephanie A

    2015-03-01

    This study investigated the influence of visual speech information on perceptual processing of neurologically degraded speech. Fifty listeners identified spastic dysarthric speech under both audio (A) and audiovisual (AV) conditions. Condition comparisons revealed that the addition of visual speech information enhanced processing of the neurologically degraded input in terms of (a) acuity (percent phonemes correct) of vowels and consonants and (b) recognition (percent words correct) of predictive and nonpredictive phrases. Listeners exploited stress-based segmentation strategies more readily in AV conditions, suggesting that the perceptual benefit associated with adding visual speech information to the auditory signal (the AV advantage) has both segmental and suprasegmental origins. Results also revealed that the magnitude of the AV advantage can be predicted, to some degree, by the extent to which an individual utilizes syllabic stress cues to inform word recognition in AV conditions. Findings inform the development of a listener-specific model of speech perception that applies to processing of dysarthric speech in everyday communication contexts.

  13. Processing F0 with cochlear implants: Modulation frequency discrimination and speech intonation recognition.

    PubMed

    Chatterjee, Monita; Peng, Shu-Chen

    2008-01-01

    Fundamental frequency (F0) processing by cochlear implant (CI) listeners was measured using a psychophysical task and a speech intonation recognition task. Listeners' Weber fractions for modulation frequency discrimination were measured using an adaptive, 3-interval, forced-choice paradigm: stimuli were presented through a custom research interface. In the speech intonation recognition task, listeners were asked to indicate whether resynthesized bisyllabic words, when presented in the free field through the listeners' everyday speech processor, were question-like or statement-like. The resynthesized tokens were systematically manipulated to have different initial-F0s to represent male vs. female voices, and different F0 contours (i.e., falling, flat, and rising). Although the CI listeners showed considerable variation in performance on both tasks, significant correlations were observed between the CI listeners' sensitivity to modulation frequency in the psychophysical task and their performance in intonation recognition. Consistent with their greater reliance on temporal cues, the CI listeners' performance in the intonation recognition task was significantly poorer with the higher initial-F0 stimuli than with the lower initial-F0 stimuli. Similar results were obtained with normal hearing listeners attending to noiseband-vocoded CI simulations with reduced spectral resolution.

  14. Processing F0 with Cochlear Implants: Modulation Frequency Discrimination and Speech Intonation Recognition

    PubMed Central

    Chatterjee, Monita; Peng, Shu-Chen

    2008-01-01

    Fundamental frequency (F0) processing by cochlear implant (CI) listeners was measured using a psychophysical task and a speech intonation recognition task. Listeners’ Weber fractions for modulation frequency discrimination were measured using an adaptive, 3-interval, forced-choice paradigm: stimuli were presented through a custom research interface. In the speech intonation recognition task, listeners were asked to indicate whether resynthesized bisyllabic words, when presented in the free field through the listeners’ everyday speech processor, were question-like or statement-like. The resynthesized tokens were systematically manipulated to have different initial F0s to represent male vs. female voices, and different F0 contours (i.e., falling, flat, and rising). Although the CI listeners showed considerable variation in performance on both tasks, significant correlations were observed between the CI listeners’ sensitivity to modulation frequency in the psychophysical task and their performance in intonation recognition. Consistent with their greater reliance on temporal cues, the CI listeners’ performance in the intonation recognition task was significantly poorer with the higher initial-F0 stimuli than with the lower initial-F0 stimuli. Similar results were obtained with normal hearing listeners attending to noiseband-vocoded CI simulations with reduced spectral resolution. PMID:18093766

  15. Benefits of adaptive FM systems on speech recognition in noise for listeners who use hearing aids.

    PubMed

    Thibodeau, Linda

    2010-06-01

    To compare the benefits of adaptive FM and fixed FM systems through measurement of speech recognition in noise with adults and students in clinical and real-world settings. Five adults and 5 students with moderate-to-severe hearing loss completed objective and subjective speech recognition in noise measures with the 2 types of FM processing. Sentence recognition was evaluated in a classroom for 5 competing noise levels ranging from 54 to 80 dBA while the FM microphone was positioned 6 in. from the signal loudspeaker to receive input at 84 dB SPL. The subjective measures included 2 classroom activities and 6 auditory lessons in a noisy, public aquarium. On the objective measures, adaptive FM processing resulted in significantly better speech recognition in noise than fixed FM processing for 68- and 73-dBA noise levels. On the subjective measures, all individuals preferred adaptive over fixed processing for half of the activities. Adaptive processing was also preferred by most (8-9) individuals for the remaining 4 activities. The adaptive FM processing resulted in significant improvements at the higher noise levels and was preferred by the majority of participants in most of the conditions.

  16. Working Memory and Speech Recognition in Noise Under Ecologically Relevant Listening Conditions: Effects of Visual Cues and Noise Type Among Adults With Hearing Loss.

    PubMed

    Miller, Christi W; Stewart, Erin K; Wu, Yu-Hsiang; Bishop, Christopher; Bentler, Ruth A; Tremblay, Kelly

    2017-08-16

    This study evaluated the relationship between working memory (WM) and speech recognition in noise with different noise types as well as in the presence of visual cues. Seventy-six adults with bilateral, mild to moderately severe sensorineural hearing loss (mean age: 69 years) participated. Using a cross-sectional design, 2 measures of WM were taken: a reading span measure, and Word Auditory Recognition and Recall Measure (Smith, Pichora-Fuller, & Alexander, 2016). Speech recognition was measured with the Multi-Modal Lexical Sentence Test for Adults (Kirk et al., 2012) in steady-state noise and 4-talker babble, with and without visual cues. Testing was under unaided conditions. A linear mixed model revealed visual cues and pure-tone average as the only significant predictors of Multi-Modal Lexical Sentence Test outcomes. Neither WM measure nor noise type showed a significant effect. The contribution of WM in explaining unaided speech recognition in noise was negligible and not influenced by noise type or visual cues. We anticipate that with audibility partially restored by hearing aids, the effects of WM will increase. For clinical practice to be affected, more significant effect sizes are needed.

  17. Neuronal Spoken Word Recognition: The Time Course of Processing Variation in the Speech Signal

    ERIC Educational Resources Information Center

    Schild, Ulrike; Roder, Brigitte; Friedrich, Claudia K.

    2012-01-01

    Recent neurobiological studies revealed evidence for lexical representations that are not specified for the coronal place of articulation (PLACE; Friedrich, Eulitz, & Lahiri, 2006; Friedrich, Lahiri, & Eulitz, 2008). Here we tested when these types of underspecified representations influence neuronal speech recognition. In a unimodal…

  18. Recognizing Speech under a Processing Load: Dissociating Energetic from Informational Factors

    ERIC Educational Resources Information Center

    Mattys, Sven L.; Brooks, Joanna; Cooke, Martin

    2009-01-01

    Effects of perceptual and cognitive loads on spoken-word recognition have so far largely escaped investigation. This study lays the foundations of a psycholinguistic approach to speech recognition in adverse conditions that draws upon the distinction between energetic masking, i.e., listening environments leading to signal degradation, and…

  19. Bilingual Computerized Speech Recognition Screening for Depression Symptoms

    ERIC Educational Resources Information Center

    Gonzalez, Gerardo; Carter, Colby; Blanes, Erika

    2007-01-01

    The Voice-Interactive Depression Assessment System (VIDAS) is a computerized speech recognition application for screening depression based on the Center for Epidemiological Studies--Depression scale in English and Spanish. Study 1 included 50 English and 47 Spanish speakers. Study 2 involved 108 English and 109 Spanish speakers. Participants…

  20. Multichannel Compression, Temporal Cues, and Audibility.

    ERIC Educational Resources Information Center

    Souza, Pamela E.; Turner, Christopher W.

    1998-01-01

    The effect of the reduction of the temporal envelope produced by multichannel compression on recognition was examined in 16 listeners with hearing loss, with particular focus on audibility of the speech signal. Multichannel compression improved speech recognition when superior audibility was provided by a two-channel compression system over linear…

  1. Informational masking of speech in children: Effects of ipsilateral and contralateral distracters

    NASA Astrophysics Data System (ADS)

    Wightman, Frederic L.; Kistler, Doris J.

    2005-11-01

    Using a closed-set speech recognition paradigm thought to be heavily influenced by informational masking, auditory selective attention was measured in 38 children (ages 4-16 years) and 8 adults (ages 20-30 years). The task required attention to a monaural target speech message that was presented with a time-synchronized distracter message in the same ear. In some conditions a second distracter message or a speech-shaped noise was presented to the other ear. Compared to adults, children required higher target/distracter ratios to reach comparable performance levels, reflecting more informational masking in these listeners. Informational masking in most conditions was confirmed by the fact that a large proportion of the errors made by the listeners were contained in the distracter message(s). There was a monotonic age effect, such that even the children in the oldest age group (13.6-16 years) demonstrated poorer performance than adults. For both children and adults, presentation of an additional distracter in the contralateral ear significantly reduced performance, even when the distracter messages were produced by a talker of different sex than the target talker. The results are consistent with earlier reports from pure-tone masking studies that informational masking effects are much larger in children than in adults.

  2. [Sigmund Freud, Rudolf Meringer and Carl Mayer: slips of the tongue and mis-readings. The history of a controversy].

    PubMed

    Hinterhuber, Hartmann

    2007-01-01

    In both his The Psychopathology of Everyday Life and his Lectures Sigmund Freud derived the terms unconscious, preconscious and conscious, particularly from slips in speech, slips in reading and forgetfulness. In these slips, Freud recognised parallels to dreams. In the work mentioned, he analysed these in depth as part of mental motivation. In the papers referred to, Sigmund Freud paid tribute to Rudolf Meringer and Carl Mayer's study which was published in 1895. Meringer and Mayer showed as phenomena reversals and rearrangement of whole words, syllables or sounds, along with pre-tones or anticipations and echoes, word contaminations and word substitutions as responsible for slips of the tongue. The present work demonstrates how passionately these three scientists have contributed to the controversy of their standpoints. For modern psycholinguistics and the psychology of language, speech errors are always an expression of a momentary malfunction of the human speech production system: for the cognitive process of speech production slips of the tongue offer an insight into speech processing. Pre-tones and echoes, serialization errors, as Meringer and Mayer recognised, represent the vast majority of slips of the tongue. They do not reveal any hidden point. But with lexical-semantic slips of the tongue the question of mental motivation is admissible. This short paper is a sign of appreciation and gratitude: firstly, a modest birthday gift for Sigmund Freud, secondly homage to Carl Mayer, who influenced generations of neurologists in his 40 years of chairing the Psychiatric-Neurological Clinic in Innsbruck, so that Hans Ganner rightly spoke of a "Carl Mayer School". But lastly, this short study is also, and especially, a late recognition of Rudolf Meringer, the great Austrian linguist. The view an individual has concerning mental processes and the "topology of the psychic apparatus" is decisive as to the power of determination attached to the unconscious.

  3. The effect of hearing aid technologies on listening in an automobile.

    PubMed

    Wu, Yu-Hsiang; Stangl, Elizabeth; Bentler, Ruth A; Stanziola, Rachel W

    2013-06-01

    Communication while traveling in an automobile often is very difficult for hearing aid users. This is because the automobile/road noise level is usually high, and listeners/drivers often do not have access to visual cues. Since the talker of interest usually is not located in front of the listener/driver, conventional directional processing that places the directivity beam toward the listener's front may not be helpful and, in fact, could have a negative impact on speech recognition (when compared to omnidirectional processing). Recently, technologies have become available in commercial hearing aids that are designed to improve speech recognition and/or listening effort in noisy conditions where talkers are located behind or beside the listener. These technologies include (1) a directional microphone system that uses a backward-facing directivity pattern (Back-DIR processing), (2) a technology that transmits audio signals from the ear with the better signal-to-noise ratio (SNR) to the ear with the poorer SNR (Side-Transmission processing), and (3) a signal processing scheme that suppresses the noise at the ear with the poorer SNR (Side-Suppression processing). The purpose of the current study was to determine the effect of (1) conventional directional microphones and (2) newer signal processing schemes (Back-DIR, Side-Transmission, and Side-Suppression) on listener's speech recognition performance and preference for communication in a traveling automobile. A single-blinded, repeated-measures design was used. Twenty-five adults with bilateral symmetrical sensorineural hearing loss aged 44 through 84 yr participated in the study. The automobile/road noise and sentences of the Connected Speech Test (CST) were recorded through hearing aids in a standard van moving at a speed of 70 mph on a paved highway. The hearing aids were programmed to omnidirectional microphone, conventional adaptive directional microphone, and the three newer schemes. CST sentences were presented from the side and back of the hearing aids, which were placed on the ears of a manikin. The recorded stimuli were presented to listeners via earphones in a sound-treated booth to assess speech recognition performance and preference with each programmed condition. Compared to omnidirectional microphones, conventional adaptive directional processing had a detrimental effect on speech recognition when speech was presented from the back or side of the listener. Back-DIR and Side-Transmission processing improved speech recognition performance (relative to both omnidirectional and adaptive directional processing) when speech was from the back and side, respectively. The performance with Side-Suppression processing was better than with adaptive directional processing when speech was from the side. The participants' preferences for a given processing scheme were generally consistent with speech recognition results. The finding that performance with adaptive directional processing was poorer than with omnidirectional microphones demonstrates the importance of selecting the correct microphone technology for different listening situations. The results also suggest the feasibility of using hearing aid technologies to provide a better listening experience for hearing aid users in automobiles. American Academy of Audiology.

  4. Speech Errors across the Lifespan

    ERIC Educational Resources Information Center

    Vousden, Janet I.; Maylor, Elizabeth A.

    2006-01-01

    Dell, Burger, and Svec (1997) proposed that the proportion of speech errors classified as anticipations (e.g., "moot and mouth") can be predicted solely from the overall error rate, such that the greater the error rate, the lower the anticipatory proportion (AP) of errors. We report a study examining whether this effect applies to changes in error…
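
    The anticipatory proportion mentioned above can be illustrated with a minimal sketch. It assumes the usual definition, in which AP is the share of anticipation and perseveration errors that are anticipations; the function name and the counts are hypothetical and are not taken from the study.

```python
def anticipatory_proportion(anticipations: int, perseverations: int) -> float:
    """Return anticipations / (anticipations + perseverations).

    Assumed definition of AP: only anticipation and perseveration
    errors enter the denominator; other error types are ignored.
    """
    total = anticipations + perseverations
    if total == 0:
        raise ValueError("need at least one anticipation or perseveration error")
    return anticipations / total


# Hypothetical counts, for illustration only (not data from the study).
print(anticipatory_proportion(30, 10))
```

    Under the proposal described in the abstract, a speaker or condition with a higher overall error rate would be expected to show a lower value from this calculation.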

  5. Not all sounds sound the same: Parkinson's disease affects differently emotion processing in music and in speech prosody.

    PubMed

    Lima, César F; Garrett, Carolina; Castro, São Luís

    2013-01-01

    Does emotion processing in music and speech prosody recruit common neurocognitive mechanisms? To examine this question, we implemented a cross-domain comparative design in Parkinson's disease (PD). Twenty-four patients and 25 controls performed emotion recognition tasks for music and spoken sentences. In music, patients had impaired recognition of happiness and peacefulness, and intact recognition of sadness and fear; this pattern was independent of general cognitive and perceptual abilities. In speech, patients had a small global impairment, which was significantly mediated by executive dysfunction. Hence, PD affected differently musical and prosodic emotions. This dissociation indicates that the mechanisms underlying the two domains are partly independent.

  6. Speech-on-speech masking with variable access to the linguistic content of the masker speech for native and nonnative English speakers.

    PubMed

    Calandruccio, Lauren; Bradlow, Ann R; Dhar, Sumitrajit

    2014-04-01

    Masking release for an English sentence-recognition task in the presence of foreign-accented English speech compared with native-accented English speech was reported in Calandruccio et al (2010a). The masking release appeared to increase as the masker intelligibility decreased. However, it could not be ruled out that spectral differences between the speech maskers were influencing the significant differences observed. The purpose of the current experiment was to minimize spectral differences between speech maskers to determine how various amounts of linguistic information within competing speech affect masking release. A mixed-model design with within-subject (four two-talker speech maskers) and between-subject (listener group) factors was conducted. Speech maskers included native-accented English speech and high-intelligibility, moderate-intelligibility, and low-intelligibility Mandarin-accented English. Normalizing the long-term average speech spectra of the maskers to each other minimized spectral differences between the masker conditions. Three listener groups were tested, including monolingual English speakers with normal hearing, nonnative English speakers with normal hearing, and monolingual English speakers with hearing loss. The nonnative English speakers were from various native language backgrounds, not including Mandarin (or any other Chinese dialect). Listeners with hearing loss had symmetric mild sloping to moderate sensorineural hearing loss. Listeners were asked to repeat back sentences that were presented in the presence of four different two-talker speech maskers. Responses were scored based on the key words within the sentences (100 key words per masker condition). A mixed-model regression analysis was used to analyze the difference in performance scores between the masker conditions and listener groups. Monolingual English speakers with normal hearing benefited when the competing speech signal was foreign accented compared with native accented, allowing for improved speech recognition. Various levels of intelligibility across the foreign-accented speech maskers did not influence results. Neither the nonnative English-speaking listeners with normal hearing nor the monolingual English speakers with hearing loss benefited from masking release when the masker was changed from native-accented to foreign-accented English. Slight modifications between the target and the masker speech allowed monolingual English speakers with normal hearing to improve their recognition of native-accented English, even when the competing speech was highly intelligible. Further research is needed to determine which modifications within the competing speech signal caused the Mandarin-accented English to be less effective with respect to masking. Determining the influences within the competing speech that make it less effective as a masker or determining why monolingual normal-hearing listeners can take advantage of these differences could help improve speech recognition for those with hearing loss in the future. American Academy of Audiology.

  7. Application of speech recognition and synthesis in the general aviation cockpit

    NASA Technical Reports Server (NTRS)

    North, R. A.; Mountford, S. J.; Bergeron, H.

    1984-01-01

    Interactive speech recognition/synthesis technology is assessed as a method for the alleviation of single-pilot IFR flight workloads. Attention was given during this series of evaluations to the conditions typical of general aviation twin-engine aircraft cockpits, covering several commonly encountered IFR flight condition scenarios. The most beneficial speech command tasks are noted to be in the data retrieval domain, which would allow the pilot access to uplinked data, checklists, and performance charts. Data entry tasks also appear to benefit from this technology.

  8. Preschool Speech Error Patterns Predict Articulation and Phonological Awareness Outcomes in Children with Histories of Speech Sound Disorders

    ERIC Educational Resources Information Center

    Preston, Jonathan L.; Hull, Margaret; Edwards, Mary Louise

    2013-01-01

    Purpose: To determine if speech error patterns in preschoolers with speech sound disorders (SSDs) predict articulation and phonological awareness (PA) outcomes almost 4 years later. Method: Twenty-five children with histories of preschool SSDs (and normal receptive language) were tested at an average age of 4;6 (years;months) and were followed up…

  9. The Influence of Psycholinguistic Variables on Articulatory Errors in Naming in Progressive Motor Speech Degeneration

    ERIC Educational Resources Information Center

    Code, Chris; Tree, Jeremy; Ball, Martin

    2011-01-01

    We describe an analysis of speech errors on a confrontation naming task in a man with progressive speech degeneration of 10-year duration from Pick's disease. C.S. had a progressive non-fluent aphasia together with a motor speech impairment and early assessment indicated some naming impairments. There was also an absence of significant…

  10. Noise-immune multisensor transduction of speech

    NASA Astrophysics Data System (ADS)

    Viswanathan, Vishu R.; Henry, Claudia M.; Derr, Alan G.; Roucos, Salim; Schwartz, Richard M.

    1986-08-01

    Two types of configurations of multiple sensors were developed, tested and evaluated in speech recognition application for robust performance in high levels of acoustic background noise: One type combines the individual sensor signals to provide a single speech signal input, and the other provides several parallel inputs. For single-input systems, several configurations of multiple sensors were developed and tested. Results from formal speech intelligibility and quality tests in simulated fighter aircraft cockpit noise show that each of the two-sensor configurations tested outperforms the constituent individual sensors in high noise. Also presented are results comparing the performance of two-sensor configurations and individual sensors in speaker-dependent, isolated-word speech recognition tests performed using a commercial recognizer (Verbex 4000) in simulated fighter aircraft cockpit noise.

  11. Evaluation of speech recognizers for use in advanced combat helicopter crew station research and development

    NASA Technical Reports Server (NTRS)

    Simpson, Carol A.

    1990-01-01

    The U.S. Army Crew Station Research and Development Facility uses vintage 1984 speech recognizers. An evaluation was performed of newer off-the-shelf speech recognition devices to determine whether newer technology performance and capabilities are substantially better than those of the Army's current speech recognizers. The Phonetic Discrimination (PD-100) Test was used to compare recognizer performance in two ambient noise conditions: quiet office and helicopter noise. Test tokens were spoken by males and females and in isolated-word and connected-word mode. Better overall recognition accuracy was obtained from the newer recognizers. Recognizer capabilities needed to support the development of human factors design requirements for speech command systems in advanced combat helicopters are listed.

  12. "Who" is saying "what"? Brain-based decoding of human voice and speech.

    PubMed

    Formisano, Elia; De Martino, Federico; Bonte, Milene; Goebel, Rainer

    2008-11-07

    Can we decipher speech content ("what" is being said) and speaker identity ("who" is saying it) from observations of brain activity of a listener? Here, we combine functional magnetic resonance imaging with a data-mining algorithm and retrieve what and whom a person is listening to from the neural fingerprints that speech and voice signals elicit in the listener's auditory cortex. These cortical fingerprints are spatially distributed and insensitive to acoustic variations of the input so as to permit the brain-based recognition of learned speech from unknown speakers and of learned voices from previously unheard utterances. Our findings unravel the detailed cortical layout and computational properties of the neural populations at the basis of human speech recognition and speaker identification.

  13. Effects of noise on speech recognition: Challenges for communication by service members.

    PubMed

    Le Prell, Colleen G; Clavier, Odile H

    2017-06-01

    Speech communication often takes place in noisy environments; this is an urgent issue for military personnel who must communicate in high-noise environments. The effects of noise on speech recognition vary significantly according to the sources of noise, the number and types of talkers, and the listener's hearing ability. In this review, speech communication is first described as it relates to current standards of hearing assessment for military and civilian populations. The next section categorizes types of noise (also called maskers) according to their temporal characteristics (steady or fluctuating) and perceptive effects (energetic or informational masking). Next, speech recognition difficulties experienced by listeners with hearing loss and by older listeners are summarized, and questions on the possible causes of speech-in-noise difficulty are discussed, including recent suggestions of "hidden hearing loss". The final section describes tests used by military and civilian researchers, audiologists, and hearing technicians to assess performance of an individual in recognizing speech in background noise, as well as metrics that predict performance based on a listener and background noise profile. This article provides readers with an overview of the challenges associated with speech communication in noisy backgrounds, as well as its assessment and potential impact on functional performance, and provides guidance for important new research directions relevant not only to military personnel, but also to employees who work in high noise environments.

  14. Design of a robust baseband LPC coder for speech transmission over 9.6 kbit/s noisy channels

    NASA Astrophysics Data System (ADS)

    Viswanathan, V. R.; Russell, W. H.; Higgins, A. L.

    1982-04-01

    This paper describes the design of a baseband Linear Predictive Coder (LPC) which transmits speech over 9.6 kbit/sec synchronous channels with random bit errors of up to 1%. Presented are the results of our investigation of a number of aspects of the baseband LPC coder with the goal of maximizing the quality of the transmitted speech. Important among these aspects are: bandwidth of the baseband, coding of the baseband residual, high-frequency regeneration, and error protection of important transmission parameters. The paper discusses these and other issues, presents the results of speech-quality tests conducted during the various stages of optimization, and describes the details of the optimized speech coder. This optimized speech coding algorithm has been implemented as a real-time full-duplex system on an array processor. Informal listening tests of the real-time coder have shown that the coder produces good speech quality in the absence of channel bit errors and introduces only a slight degradation in quality for channel bit error rates of up to 1%.

  15. Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation

    PubMed Central

    Banks, Briony; Gowen, Emma; Munro, Kevin J.; Adank, Patti

    2015-01-01

    Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker’s facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants’ eye gaze was recorded to verify that they looked at the speaker’s face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation. PMID:26283946

  16. Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.

    PubMed

    Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti

    2015-01-01

    Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.

  17. Speech Recognition of Bimodal Cochlear Implant Recipients Using a Wireless Audio Streaming Accessory for the Telephone.

    PubMed

    Wolfe, Jace; Morais, Mila; Schafer, Erin

    2016-02-01

    The goals of the present investigation were (1) to evaluate recognition of recorded speech presented over a mobile telephone for a group of adult bimodal cochlear implant users, and (2) to measure the potential benefits of wireless hearing assistance technology (HAT) for mobile telephone speech recognition using bimodal stimulation (i.e., a cochlear implant in one ear and a hearing aid on the other ear). A three-by-two-way repeated measures design was used to evaluate mobile telephone sentence-recognition performance differences obtained in quiet and in noise with and without the wireless HAT accessory coupled to the hearing aid alone, CI sound processor alone, and in the bimodal condition. Outpatient cochlear implant clinic. Sixteen bimodal users with Nucleus 24, Freedom, CI512, or CI422 cochlear implants participated in this study. Performance was measured with and without the use of a wireless HAT for the telephone used with the hearing aid alone, CI alone, and bimodal condition. CNC word recognition in quiet and in noise with and without the use of a wireless HAT telephone accessory in the hearing aid alone, CI alone, and bimodal conditions. Results suggested that the bimodal condition gave significantly better speech recognition on the mobile telephone with the wireless HAT. A wireless HAT for the mobile telephone provides bimodal users with significant improvement in word recognition in quiet and in noise over the mobile telephone.

  18. Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques

    NASA Astrophysics Data System (ADS)

    Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.

    2009-12-01

    This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.

  19. Automatic speech recognition in air traffic control

    NASA Technical Reports Server (NTRS)

    Karlsson, Joakim

    1990-01-01

    Automatic Speech Recognition (ASR) technology and its application to the Air Traffic Control system are described. The advantages of applying ASR to Air Traffic Control, as well as criteria for choosing a suitable ASR system are presented. Results from previous research and directions for future work at the Flight Transportation Laboratory are outlined.

  20. Transcribe Your Class: Using Speech Recognition to Improve Access for At-Risk Students

    ERIC Educational Resources Information Center

    Bain, Keith; Lund-Lucas, Eunice; Stevens, Janice

    2012-01-01

    Through a project supported by Canada's Social Development Partnerships Program, a team of leading National Disability Organizations, universities, and industry partners are piloting a prototype Hosted Transcription Service that uses speech recognition to automatically create multimedia transcripts that can be used by students for study purposes.…

  1. Automatic Speech Recognition: Reliability and Pedagogical Implications for Teaching Pronunciation

    ERIC Educational Resources Information Center

    Kim, In-Seok

    2006-01-01

    This study examines the reliability of automatic speech recognition (ASR) software used to teach English pronunciation, focusing on one particular piece of software, "FluSpeak," as a typical example. Thirty-six Korean English as a Foreign Language (EFL) college students participated in an experiment in which they listened to 15 sentences…

  2. Speech Recognition Technology for Disabilities Education

    ERIC Educational Resources Information Center

    Tang, K. Wendy; Kamoua, Ridha; Sutan, Victor; Farooq, Omer; Eng, Gilbert; Chu, Wei Chern; Hou, Guofeng

    2005-01-01

    Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace or reduce the reliance on standard keyboard and mouse input. This can especially assist dyslexic students who have problems with character or word use and manipulation in a textual…

  3. Automatic Speech Recognition Technology as an Effective Means for Teaching Pronunciation

    ERIC Educational Resources Information Center

    Elimat, Amal Khalil; AbuSeileek, Ali Farhan

    2014-01-01

    This study aimed to explore the effect of using automatic speech recognition technology (ASR) on the third grade EFL students' performance in pronunciation, whether teaching pronunciation through ASR is better than regular instruction, and the most effective teaching technique (individual work, pair work, or group work) in teaching pronunciation…

  4. An Investigation of the Compensatory Effectiveness of Assistive Technology on Postsecondary Students with Learning Disabilities. Final Report.

    ERIC Educational Resources Information Center

    Murphy, Harry; Higgins, Eleanor

    This final report describes the activities and accomplishments of a 3-year study on the compensatory effectiveness of three assistive technologies, optical character recognition, speech synthesis, and speech recognition, on postsecondary students (N=140) with learning disabilities. These technologies were investigated relative to: (1) immediate…

  5. Effects of Cognitive Load on Speech Recognition

    ERIC Educational Resources Information Center

    Mattys, Sven L.; Wiget, Lukas

    2011-01-01

    The effect of cognitive load (CL) on speech recognition has received little attention despite the prevalence of CL in everyday life, e.g., dual-tasking. To assess the effect of CL on the interaction between lexically-mediated and acoustically-mediated processes, we measured the magnitude of the "Ganong effect" (i.e., lexical bias on phoneme…

  6. Auditory Training with Multiple Talkers and Passage-Based Semantic Cohesion

    ERIC Educational Resources Information Center

    Casserly, Elizabeth D.; Barney, Erin C.

    2017-01-01

    Purpose: Current auditory training methods typically result in improvements to speech recognition abilities in quiet, but learner gains may not extend to other domains in speech (e.g., recognition in noise) or self-assessed benefit. This study examined the potential of training involving multiple talkers and training emphasizing discourse-level…

  7. Implicit Processing of Phonotactic Cues: Evidence from Electrophysiological and Vascular Responses

    ERIC Educational Resources Information Center

    Rossi, Sonja; Jurgenson, Ina B.; Hanulikova, Adriana; Telkemeyer, Silke; Wartenburger, Isabell; Obrig, Hellmuth

    2011-01-01

    Spoken word recognition is achieved via competition between activated lexical candidates that match the incoming speech input. The competition is modulated by prelexical cues that are important for segmenting the auditory speech stream into linguistic units. One such prelexical cue that listeners rely on in spoken word recognition is phonotactics.…

  8. Spoken Word Recognition of Chinese Words in Continuous Speech

    ERIC Educational Resources Information Center

    Yip, Michael C. W.

    2015-01-01

    The present study examined the role that the positional probability of syllables plays in the recognition of spoken words in continuous Cantonese speech. Because some sounds occur more frequently at the beginning or ending position of Cantonese syllables than others, this kind of probabilistic information about syllables may cue the locations…

  9. Phonotactics Constraints and the Spoken Word Recognition of Chinese Words in Speech

    ERIC Educational Resources Information Center

    Yip, Michael C.

    2016-01-01

    Two word-spotting experiments were conducted to examine the question of whether native Cantonese listeners are constrained by phonotactics information in spoken word recognition of Chinese words in speech. Because no legal consonant clusters occurred within an individual Chinese word, this kind of categorical phonotactics information of Chinese…

  10. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    ERIC Educational Resources Information Center

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  11. Motor Speech Disorders Associated with Primary Progressive Aphasia

    PubMed Central

    Duffy, Joseph R.; Strand, Edythe A.; Josephs, Keith A.

    2014-01-01

    Background Primary progressive aphasia (PPA) and conditions that overlap with it can be accompanied by motor speech disorders. Recognition and understanding of motor speech disorders can contribute to a fuller clinical understanding of PPA and its management as well as its localization and underlying pathology. Aims To review the types of motor speech disorders that may occur with PPA, its primary variants, and its overlap syndromes (progressive supranuclear palsy syndrome, corticobasal syndrome, motor neuron disease), as well as with primary progressive apraxia of speech. Main Contribution The review should assist clinicians' and researchers' understanding of the relationship between motor speech disorders and PPA and its major variants. It also highlights the importance of recognizing neurodegenerative apraxia of speech as a condition that can occur with little or no evidence of aphasia. Conclusion Motor speech disorders can occur with PPA. Their recognition can contribute to clinical diagnosis and management of PPA and to understanding and predicting the localization and pathology associated with PPA variants and conditions that can overlap with them. PMID:25309017

  12. Quadcopter Control Using Speech Recognition

    NASA Astrophysics Data System (ADS)

    Malik, H.; Darma, S.; Soekirno, S.

    2018-04-01

    This research compared the success rates of speech recognition systems built on two types of databases, an existing database and a new database, and implemented as motion control for a quadcopter. The speech recognition system used the Mel frequency cepstral coefficient (MFCC) method for feature extraction, and the extracted features were used to train a recursive neural network (RNN). MFCC is one of the most widely used feature extraction methods for speech recognition and has a success rate of 80% - 95%. The existing database was used to measure the success rate of the RNN method. The new database was created in the Indonesian language, and its success rate was then compared with the results from the existing database. Sound input from the microphone was processed on a DSP module with the MFCC method to obtain the characteristic values. The RNN was then trained on the characteristic values and produced a command as its result. The command became a control input to the single board computer (SBC), which produced the movement of the quadcopter. On the SBC, the robot operating system (ROS) was used as the kernel (operating system).
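
    The abstract does not give the details of its MFCC front end. As a rough illustration of one step that MFCC extraction commonly involves, the sketch below converts between hertz and the mel scale using one widely used formula, mel = 2595 * log10(1 + f / 700), and spaces filter center frequencies evenly on that scale. The function names, the number of filters, and the frequency range are assumptions for illustration and are not taken from the paper.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to mels (one common variant of the formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_min: float, f_max: float) -> list[float]:
    """Center frequencies (Hz) of n_filters filters spaced evenly in mels."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# Hypothetical settings: 10 filters between 0 Hz and 8000 Hz.
for center in mel_center_frequencies(10, 0.0, 8000.0):
    print(round(center, 1))
```

    Even spacing in mels gives narrow spacing at low frequencies and wide spacing at high frequencies, which is the property the mel scale is meant to provide. A full MFCC pipeline would also need framing, windowing, a power spectrum, log filterbank energies, and a discrete cosine transform, none of which are shown here.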

  13. Analysis of error type and frequency in apraxia of speech among Portuguese speakers.

    PubMed

    Cera, Maysa Luchesi; Minett, Thaís Soares Cianciarullo; Ortiz, Karin Zazo

    2010-01-01

    Most studies characterizing errors in the speech of patients with apraxia involve the English language. The aim was to analyze the types and frequency of errors produced by patients with apraxia of speech whose mother tongue was Brazilian Portuguese. Twenty adults with apraxia of speech caused by stroke were assessed. The types of error committed by patients were analyzed both quantitatively and qualitatively, and their frequencies were compared. We observed the presence of substitution, omission, trial-and-error, repetition, self-correction, anticipation, addition, reiteration and metathesis, in descending order of frequency. Omission errors were among the most commonly occurring, whereas addition errors were infrequent. These findings differed from those reported in English-speaking patients, probably owing to differences in the methodologies used for classifying error types; the inclusion of speakers with apraxia secondary to aphasia; and the difference between the structure of Portuguese and that of English in terms of syllable onset complexity and its effect on motor control. The frequency of omission and addition errors observed differed from the frequency reported for speakers of English.

  14. Tone classification of syllable-segmented Thai speech based on multilayer perceptron

    NASA Astrophysics Data System (ADS)

    Satravaha, Nuttavudh; Klinkhachorn, Powsiri; Lass, Norman

    2002-05-01

    Thai is a monosyllabic tonal language that uses tone to convey lexical information about the meaning of a syllable. Thus to completely recognize a spoken Thai syllable, a speech recognition system not only has to recognize a base syllable but also must correctly identify a tone. Hence, tone classification of Thai speech is an essential part of a Thai speech recognition system. Thai has five distinctive tones ("mid," "low," "falling," "high," and "rising") and each tone is represented by a single fundamental frequency (F0) pattern. However, several factors, including tonal coarticulation, stress, intonation, and speaker variability, affect the F0 pattern of a syllable in continuous Thai speech. In this study, an efficient method for tone classification of syllable-segmented Thai speech, which incorporates the effects of tonal coarticulation, stress, and intonation, as well as a method to perform automatic syllable segmentation, were developed. Acoustic parameters were used as the main discriminating parameters. The F0 contour of a segmented syllable was normalized by using a z-score transformation before being presented to a tone classifier. The proposed system was evaluated on 920 test utterances spoken by 8 speakers. A recognition rate of 91.36% was achieved by the proposed system.
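
    The z-score normalization step mentioned in this abstract can be sketched as follows. This is a minimal illustration only: the abstract does not say whether the mean and standard deviation were computed per syllable, per utterance, or per speaker, so the per-contour choice below, the function name `zscore_normalize`, and the sample F0 values are all assumptions.

```python
import statistics

def zscore_normalize(f0_contour):
    """Return a z-score-normalized copy of an F0 contour (values in Hz).

    Assumption: statistics are taken over the contour itself; the study
    may have used speaker-level or utterance-level statistics instead.
    """
    mean = statistics.fmean(f0_contour)
    std = statistics.pstdev(f0_contour)
    if std == 0:
        # A perfectly flat contour carries no shape information.
        return [0.0 for _ in f0_contour]
    return [(f0 - mean) / std for f0 in f0_contour]

# Hypothetical rising contour, invented for illustration.
rising = [110.0, 115.0, 125.0, 140.0, 160.0]
normalized = zscore_normalize(rising)
```

    After normalization the contour has zero mean and unit variance, so a classifier sees the shape of the tone rather than the absolute pitch range of the speaker.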

  15. Multisensory speech perception in autism spectrum disorder: From phoneme to whole-word perception.

    PubMed

    Stevenson, Ryan A; Baum, Sarah H; Segers, Magali; Ferber, Susanne; Barense, Morgan D; Wallace, Mark T

    2017-07-01

    Speech perception in noisy environments is boosted when a listener can see the speaker's mouth and integrate the auditory and visual speech information. Autistic children have a diminished capacity to integrate sensory information across modalities, which contributes to core symptoms of autism, such as impairments in social communication. We investigated the abilities of autistic and typically-developing (TD) children to integrate auditory and visual speech stimuli in various signal-to-noise ratios (SNR). Measurements of both whole-word and phoneme recognition were recorded. At the level of whole-word recognition, autistic children exhibited reduced performance in both the auditory and audiovisual modalities. Importantly, autistic children showed reduced behavioral benefit from multisensory integration with whole-word recognition, specifically at low SNRs. At the level of phoneme recognition, autistic children exhibited reduced performance relative to their TD peers in auditory, visual, and audiovisual modalities. However, and in contrast to their performance at the level of whole-word recognition, both autistic and TD children showed benefits from multisensory integration for phoneme recognition. In accordance with the principle of inverse effectiveness, both groups exhibited greater benefit at low SNRs relative to high SNRs. Thus, while autistic children showed typical multisensory benefits during phoneme recognition, these benefits did not translate to typical multisensory benefit of whole-word recognition in noisy environments. We hypothesize that sensory impairments in autistic children raise the SNR threshold needed to extract meaningful information from a given sensory input, resulting in subsequent failure to exhibit behavioral benefits from additional sensory information at the level of whole-word recognition. Autism Res 2017. © 2017 International Society for Autism Research, Wiley Periodicals, Inc. Autism Res 2017, 10: 1280-1290. 

  16. Acoustic diagnosis of pulmonary hypertension: automated speech-recognition-inspired classification algorithm outperforms physicians

    NASA Astrophysics Data System (ADS)

    Kaddoura, Tarek; Vadlamudi, Karunakar; Kumar, Shine; Bobhate, Prashant; Guo, Long; Jain, Shreepal; Elgendi, Mohamed; Coe, James Y.; Kim, Daniel; Taylor, Dylan; Tymchak, Wayne; Schuurmans, Dale; Zemp, Roger J.; Adatia, Ian

    2016-09-01

    We hypothesized that an automated speech-recognition-inspired classification algorithm could differentiate between the heart sounds in subjects with and without pulmonary hypertension (PH) and outperform physicians. Heart sounds, electrocardiograms, and mean pulmonary artery pressures (mPAp) were recorded simultaneously. Heart sound recordings were digitized to train and test speech-recognition-inspired classification algorithms. We used mel-frequency cepstral coefficients to extract features from the heart sounds. Gaussian-mixture models classified the features as PH (mPAp ≥ 25 mmHg) or normal (mPAp < 25 mmHg). Physicians blinded to patient data listened to the same heart sound recordings and attempted a diagnosis. We studied 164 subjects: 86 with mPAp ≥ 25 mmHg (mPAp 41 ± 12 mmHg) and 78 with mPAp < 25 mmHg (mPAp 17 ± 5 mmHg) (p < 0.005). The correct diagnostic rate of the automated speech-recognition-inspired algorithm was 74% compared to 56% by physicians (p = 0.005). The false positive rate for the algorithm was 34% versus 50% (p = 0.04) for clinicians. The false negative rate for the algorithm was 23% versus 68% (p = 0.0002) for physicians. We developed an automated speech-recognition-inspired classification algorithm for the acoustic diagnosis of PH that outperforms physicians and could be used to screen for PH and encourage earlier specialist referral.
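
    The three rates quoted in this abstract can be computed from a confusion matrix as sketched below. The abstract does not state the exact definitions the authors used, so the standard ones are assumed here, and the counts in the example are invented for illustration; they are not the study's data.

```python
def diagnostic_rates(tp, fp, tn, fn):
    """Compute standard screening metrics from confusion-matrix counts.

    tp: PH cases labelled PH        fn: PH cases labelled normal
    tn: normal cases labelled normal  fp: normal cases labelled PH
    """
    total = tp + fp + tn + fn
    return {
        "correct_rate": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Hypothetical counts: 10 subjects with PH, 10 without.
rates = diagnostic_rates(tp=8, fp=4, tn=6, fn=2)
```

    With these made-up counts the false positive rate is taken over the subjects without PH and the false negative rate over the subjects with PH, which is why the two need not sum to the overall error rate.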

  17. Lexical-Access Ability and Cognitive Predictors of Speech Recognition in Noise in Adult Cochlear Implant Users

    PubMed Central

    Smits, Cas; Merkus, Paul; Festen, Joost M.; Goverts, S. Theo

    2017-01-01

    Not all of the variance in speech-recognition performance of cochlear implant (CI) users can be explained by biographic and auditory factors. In normal-hearing listeners, linguistic and cognitive factors determine most of speech-in-noise performance. The current study explored specifically the influence of visually measured lexical-access ability compared with other cognitive factors on speech recognition of 24 postlingually deafened CI users. Speech-recognition performance was measured with monosyllables in quiet (consonant-vowel-consonant [CVC]), sentences-in-noise (SIN), and digit-triplets in noise (DIN). In addition to a composite variable of lexical-access ability (LA), measured with a lexical-decision test (LDT) and word-naming task, vocabulary size, working-memory capacity (Reading Span test [RSpan]), and a visual analogue of the SIN test (text reception threshold test) were measured. The DIN test was used to correct for auditory factors in SIN thresholds by taking the difference between SIN and DIN: SRTdiff. Correlation analyses revealed that duration of hearing loss (dHL) was related to SIN thresholds. Better working-memory capacity was related to SIN and SRTdiff scores. LDT reaction time was positively correlated with SRTdiff scores. No significant relationships were found for CVC or DIN scores with the predictor variables. Regression analyses showed that together with dHL, RSpan explained 55% of the variance in SIN thresholds. When controlling for auditory performance, LA, LDT, and RSpan separately explained, together with dHL, respectively 37%, 36%, and 46% of the variance in SRTdiff outcome. The results suggest that poor verbal working-memory capacity and to a lesser extent poor lexical-access ability limit speech-recognition ability in listeners with a CI. PMID:29205095

  18. A new time-adaptive discrete bionic wavelet transform for enhancing speech from adverse noise environment

    NASA Astrophysics Data System (ADS)

    Palaniswamy, Sumithra; Duraisamy, Prakash; Alam, Mohammad Showkat; Yuan, Xiaohui

    2012-04-01

    Automatic speech processing systems are widely used in everyday life, for example in mobile communication, speech and speaker recognition, and assisting the hearing impaired. In speech communication systems, the quality and intelligibility of speech is of utmost importance for ease and accuracy of information exchange. To obtain a speech signal that is intelligible and more pleasant to listen to, noise reduction is essential. In this paper, a new Time Adaptive Discrete Bionic Wavelet Thresholding (TADBWT) scheme is proposed. The proposed technique uses the Daubechies mother wavelet to achieve better enhancement of speech from additive non-stationary noises that occur in real life, such as street noise and factory noise. Due to the integration of a human auditory system model into the wavelet transform, the bionic wavelet transform (BWT) has great potential for speech enhancement, which may lead to a new path in speech processing. In the proposed technique, the discrete BWT is first applied to noisy speech to derive TADBWT coefficients. Then the adaptive nature of the BWT is captured by introducing a time-varying linear factor that updates the coefficients at each scale over time. This approach has shown better performance than the existing algorithms at lower input SNR due to modified soft level-dependent thresholding on time-adaptive coefficients. The objective and subjective test results confirmed the competency of the TADBWT technique. The effectiveness of the proposed technique is also evaluated for a speaker recognition task in a noisy environment. The recognition results show that the TADBWT technique yields better performance than alternative methods, specifically at lower input SNR.
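
    For orientation, plain soft thresholding with one threshold per decomposition level can be sketched as below. This is the textbook operator, not the modified variant or the time-varying factor proposed in the paper, whose details the abstract does not give; the function names, the level keys, and the coefficient values are hypothetical.

```python
import math

def soft_threshold(coeffs, threshold):
    """Shrink each coefficient toward zero by `threshold`; zero the small ones."""
    out = []
    for c in coeffs:
        magnitude = abs(c) - threshold
        if magnitude <= 0:
            out.append(0.0)
        else:
            out.append(math.copysign(magnitude, c))
    return out

def threshold_by_level(levels, thresholds):
    """Apply a separate threshold to the coefficients of each level.

    levels:     dict mapping level index -> list of wavelet coefficients
    thresholds: dict mapping level index -> threshold for that level
    """
    return {lvl: soft_threshold(c, thresholds[lvl]) for lvl, c in levels.items()}

# Hypothetical coefficients for two decomposition levels.
levels = {1: [3.0, -0.5, -2.0, 0.2], 2: [0.4, -4.0]}
denoised = threshold_by_level(levels, {1: 1.0, 2: 0.5})
```

    Small coefficients, which are assumed to be dominated by noise, are set to zero, while larger ones are kept with reduced magnitude and unchanged sign.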

  19. Relating hearing loss and executive functions to hearing aid users' preference for, and speech recognition with, different combinations of binaural noise reduction and microphone directionality

    PubMed Central

    Neher, Tobias

    2014-01-01

    Knowledge of how executive functions relate to preferred hearing aid (HA) processing is sparse and seemingly inconsistent with related knowledge for speech recognition outcomes. This study thus aimed to find out if (1) performance on a measure of reading span (RS) is related to preferred binaural noise reduction (NR) strength, (2) similar relations exist for two different, non-verbal measures of executive function, (3) pure-tone average hearing loss (PTA), signal-to-noise ratio (SNR), and microphone directionality (DIR) also influence preferred NR strength, and (4) preference and speech recognition outcomes are similar. Sixty elderly HA users took part. Six HA conditions consisting of omnidirectional or cardioid microphones followed by inactive, moderate, or strong binaural NR as well as linear amplification were tested. Outcome was assessed at fixed SNRs using headphone simulations of a frontal target talker in a busy cafeteria. Analyses showed positive effects of active NR and DIR on preference, and negative and positive effects of, respectively, strong NR and DIR on speech recognition. Also, while moderate NR was the most preferred NR setting overall, preference for strong NR increased with SNR. No relation between RS and preference was found. However, larger PTA was related to weaker preference for inactive NR and stronger preference for strong NR for both microphone modes. Equivalent (but weaker) relations between worse performance on one non-verbal measure of executive function and the HA conditions without DIR were found. For speech recognition, there were relations between HA condition, PTA, and RS, but their pattern differed from that for preference. Altogether, these results indicate that, while moderate NR works well in general, a notable proportion of HA users prefer stronger NR. Furthermore, PTA and executive functions can account for some of the variability in preference for, and speech recognition with, different binaural NR and DIR settings. 
PMID:25538547

  20. Cleft audit protocol for speech (CAPS-A): a comprehensive training package for speech analysis.

    PubMed

    Sell, D; John, A; Harding-Bell, A; Sweeney, T; Hegarty, F; Freeman, J

    2009-01-01

    The previous literature has largely focused on speech analysis systems and ignored process issues, such as the nature of adequate speech samples, data acquisition, recording and playback. Although there has been recognition of the need for training on tools used in speech analysis associated with cleft palate, little attention has been paid to this issue. To design, execute, and evaluate a training programme for speech and language therapists on the systematic and reliable use of the Cleft Audit Protocol for Speech-Augmented (CAPS-A), addressing issues of standardized speech samples, data acquisition, recording, playback, and listening guidelines. Thirty-six specialist speech and language therapists undertook the training programme over four days. This consisted of two days' training on the CAPS-A tool followed by a third day, making independent ratings and transcriptions on ten new cases which had been previously recorded during routine audit data collection. This task was repeated on day 4, a minimum of one month later. Ratings were made using the CAPS-A record form with the CAPS-A definition table. An analysis was made of the speech and language therapists' CAPS-A ratings at occasion 1 and occasion 2 and the intra- and inter-rater reliability calculated. Trained therapists showed consistency in individual judgements on specific sections of the tool. Intraclass correlation coefficients were calculated for each section with good agreement on eight of 13 sections. There were only fair levels of agreement on anterior oral cleft speech characteristics, non-cleft errors/immaturities and voice. This was explained, at least in part, by their low prevalence which affects the calculation of the intraclass correlation coefficient statistic. Speech and language therapists benefited from training on the CAPS-A, focusing on specific aspects of speech using definitions of parameters and scalar points, in order to apply the tool systematically and reliably. 
Ratings are enhanced by ensuring a high degree of attention to the nature of the data, standardizing the speech sample, data acquisition, the listening process together with the use of high-quality recording and playback equipment. In addition, a method is proposed for maintaining listening skills following training as part of an individual's continuing education.

  1. Lessons Learned in Part-of-Speech Tagging of Conversational Speech

    DTIC Science & Technology

    2010-10-01

    for conversational speech recognition. In Plenary Meeting and Symposium on Prosody and Speech Processing. Slav Petrov and Dan Klein. 2007. Improved...inference for unlexicalized parsing. In HLT-NAACL. Slav Petrov. 2010. Products of random latent variable grammars. In HLT-NAACL. Brian Roark, Yang Liu

  2. Recognition and Localization of Speech by Adult Cochlear Implant Recipients Wearing a Digital Hearing Aid in the Nonimplanted Ear (Bimodal Hearing)

    PubMed Central

    Potts, Lisa G.; Skinner, Margaret W.; Litovsky, Ruth A.; Strube, Michael J; Kuk, Francis

    2010-01-01

    Background The use of bilateral amplification is now common clinical practice for hearing aid users but not for cochlear implant recipients. In the past, most cochlear implant recipients were implanted in one ear and wore only a monaural cochlear implant processor. There has been recent interest in benefits arising from bilateral stimulation that may be present for cochlear implant recipients. One option for bilateral stimulation is the use of a cochlear implant in one ear and a hearing aid in the opposite nonimplanted ear (bimodal hearing). Purpose This study evaluated the effect of wearing a cochlear implant in one ear and a digital hearing aid in the opposite ear on speech recognition and localization. Research Design A repeated-measures correlational study was completed. Study Sample Nineteen adult Cochlear Nucleus 24 implant recipients participated in the study. Intervention The participants were fit with a Widex Senso Vita 38 hearing aid to achieve maximum audibility and comfort within their dynamic range. Data Collection and Analysis Soundfield thresholds, loudness growth, speech recognition, localization, and subjective questionnaires were obtained six–eight weeks after the hearing aid fitting. Testing was completed in three conditions: hearing aid only, cochlear implant only, and cochlear implant and hearing aid (bimodal). All tests were repeated four weeks after the first test session. Repeated-measures analysis of variance was used to analyze the data. Significant effects were further examined using pairwise comparison of means or in the case of continuous moderators, regression analyses. The speech-recognition and localization tasks were unique, in that a speech stimulus presented from a variety of roaming azimuths (140 degree loudspeaker array) was used. Results Performance in the bimodal condition was significantly better for speech recognition and localization compared to the cochlear implant–only and hearing aid–only conditions. 
Performance was also different between these conditions when the location (i.e., side of the loudspeaker array that presented the word) was analyzed. In the bimodal condition, the speech-recognition and localization tasks were equal regardless of which side of the loudspeaker array presented the word, while performance was significantly poorer for the monaural conditions (hearing aid only and cochlear implant only) when the words were presented on the side with no stimulation. Binaural loudness summation of 1–3 dB was seen in soundfield thresholds and loudness growth in the bimodal condition. Measures of the audibility of sound with the hearing aid, including unaided thresholds, soundfield thresholds, and the Speech Intelligibility Index, were significant moderators of speech recognition and localization. Based on the questionnaire responses, participants showed a strong preference for bimodal stimulation. Conclusions These findings suggest that a well-fit digital hearing aid worn in conjunction with a cochlear implant is beneficial to speech recognition and localization. The dynamic test procedures used in this study illustrate the importance of bilateral hearing for locating, identifying, and switching attention between multiple speakers. It is recommended that unilateral cochlear implant recipients, with measurable unaided hearing thresholds, be fit with a hearing aid. PMID:19594084

  3. Acoustic evidence for phonologically mismatched speech errors.

    PubMed

    Gormley, Andrea

    2015-04-01

    Speech errors are generally said to accommodate to their new phonological context. This accommodation has been validated by several transcription studies. The transcription methodology is not the best choice for detecting errors at this level, however, as this type of error can be difficult to perceive. This paper presents an acoustic analysis of speech errors that uncovers non-accommodated or mismatch errors. A mismatch error is a sub-phonemic error that results in an incorrect surface phonology. This type of error could arise during the processing of phonological rules or they could be made at the motor level of implementation. The results of this work have important implications for both experimental and theoretical research. For experimentalists, it validates the tools used for error induction and the acoustic determination of errors free of the perceptual bias. For theorists, this methodology can be used to test the nature of the processes proposed in language production.

  4. Cleft Audit Protocol for Speech (CAPS-A): A Comprehensive Training Package for Speech Analysis

    ERIC Educational Resources Information Center

    Sell, D.; John, A.; Harding-Bell, A.; Sweeney, T.; Hegarty, F.; Freeman, J.

    2009-01-01

    Background: The previous literature has largely focused on speech analysis systems and ignored process issues, such as the nature of adequate speech samples, data acquisition, recording and playback. Although there has been recognition of the need for training on tools used in speech analysis associated with cleft palate, little attention has been…

  5. Staff Report to the Senior Department Official on Recognition Compliance Issues. Recommendation Page: American Speech-Language-Hearing Association

    ERIC Educational Resources Information Center

    US Department of Education, 2010

    2010-01-01

    The American Speech-Language-Hearing Association, Council on Academic Accreditation in Audiology and Speech-Language Pathology (CAA) is a national accrediting agency of graduate education programs in audiology or speech-language pathology. The CAA currently accredits or preaccredits 319 programs (247 in speech-language pathology and 72 in…

  6. Prediction Errors but Not Sharpened Signals Simulate Multivoxel fMRI Patterns during Speech Perception

    PubMed Central

    Davis, Matthew H.

    2016-01-01

    Successful perception depends on combining sensory input with prior knowledge. However, the underlying mechanism by which these two sources of information are combined is unknown. In speech perception, as in other domains, two functionally distinct coding schemes have been proposed for how expectations influence representation of sensory evidence. Traditional models suggest that expected features of the speech input are enhanced or sharpened via interactive activation (Sharpened Signals). Conversely, Predictive Coding suggests that expected features are suppressed so that unexpected features of the speech input (Prediction Errors) are processed further. The present work is aimed at distinguishing between these two accounts of how prior knowledge influences speech perception. By combining behavioural, univariate, and multivariate fMRI measures of how sensory detail and prior expectations influence speech perception with computational modelling, we provide evidence in favour of Prediction Error computations. Increased sensory detail and informative expectations have additive behavioural and univariate neural effects because they both improve the accuracy of word report and reduce the BOLD signal in lateral temporal lobe regions. However, sensory detail and informative expectations have interacting effects on speech representations shown by multivariate fMRI in the posterior superior temporal sulcus. When prior knowledge was absent, increased sensory detail enhanced the amount of speech information measured in superior temporal multivoxel patterns, but with informative expectations, increased sensory detail reduced the amount of measured information. Computational simulations of Sharpened Signals and Prediction Errors during speech perception could both explain these behavioural and univariate fMRI observations. However, the multivariate fMRI observations were uniquely simulated by a Prediction Error and not a Sharpened Signal model. 
The interaction between prior expectation and sensory detail provides evidence for a Predictive Coding account of speech perception. Our work establishes methods that can be used to distinguish representations of Prediction Error and Sharpened Signals in other perceptual domains. PMID:27846209

  7. Working Memory and Speech Recognition in Noise Under Ecologically Relevant Listening Conditions: Effects of Visual Cues and Noise Type Among Adults With Hearing Loss

    PubMed Central

    Stewart, Erin K.; Wu, Yu-Hsiang; Bishop, Christopher; Bentler, Ruth A.; Tremblay, Kelly

    2017-01-01

    Purpose This study evaluated the relationship between working memory (WM) and speech recognition in noise with different noise types as well as in the presence of visual cues. Method Seventy-six adults with bilateral, mild to moderately severe sensorineural hearing loss (mean age: 69 years) participated. Using a cross-sectional design, 2 measures of WM were taken: a reading span measure, and Word Auditory Recognition and Recall Measure (Smith, Pichora-Fuller, & Alexander, 2016). Speech recognition was measured with the Multi-Modal Lexical Sentence Test for Adults (Kirk et al., 2012) in steady-state noise and 4-talker babble, with and without visual cues. Testing was under unaided conditions. Results A linear mixed model revealed visual cues and pure-tone average as the only significant predictors of Multi-Modal Lexical Sentence Test outcomes. Neither WM measure nor noise type showed a significant effect. Conclusion The contribution of WM in explaining unaided speech recognition in noise was negligible and not influenced by noise type or visual cues. We anticipate that with audibility partially restored by hearing aids, the effects of WM will increase. For clinical practice to be affected, more significant effect sizes are needed. PMID:28744550

  8. Influence of auditory attention on sentence recognition captured by the neural phase.

    PubMed

    Müller, Jana Annina; Kollmeier, Birger; Debener, Stefan; Brand, Thomas

    2018-03-07

    The aim of this study was to investigate whether attentional influences on speech recognition are reflected in the neural phase entrained by an external modulator. Sentences were presented in 7 Hz sinusoidally modulated noise while the neural response to that modulation frequency was monitored by electroencephalogram (EEG) recordings in 21 participants. We implemented a selective attention paradigm including three different attention conditions while keeping physical stimulus parameters constant. The participants' task was either to repeat the sentence as accurately as possible (speech recognition task), to count the number of decrements implemented in modulated noise (decrement detection task), or to do both (dual task), while the EEG was recorded. Behavioural analysis revealed reduced performance in the dual task condition for decrement detection, possibly reflecting limited cognitive resources. EEG analysis revealed no significant differences in power for the 7 Hz modulation frequency, but an attention-dependent phase difference between tasks. Further phase analysis revealed a significant difference 500 ms after sentence onset between trials with correct and incorrect responses for speech recognition, indicating that speech recognition performance and the neural phase are linked via selective attention mechanisms, at least shortly after sentence onset. However, the neural phase effects identified were small and await further investigation. © 2018 Federation of European Neuroscience Societies and John Wiley & Sons Ltd.
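
    One common way to read out the phase of a response at a single modulation frequency is to project the signal onto a complex exponential at that frequency. The sketch below shows that idea only; the abstract does not describe the study's actual EEG pipeline, and the sampling rate, signal, and function name are assumptions made for illustration.

```python
import cmath
import math

def phase_at_frequency(signal, fs, freq):
    """Estimate the phase (radians) of the `freq` Hz component of `signal`.

    Computes a single-bin discrete Fourier projection. The estimate is
    cleanest when the signal spans a whole number of cycles of `freq`.
    """
    acc = 0j
    for n, x in enumerate(signal):
        acc += x * cmath.exp(-2j * math.pi * freq * n / fs)
    return cmath.phase(acc)

# Synthetic 1 s test signal: a 7 Hz cosine with a known phase offset.
fs = 700.0          # hypothetical sampling rate in Hz
true_phase = 0.5    # radians
signal = [math.cos(2 * math.pi * 7.0 * n / fs + true_phase) for n in range(700)]
estimated = phase_at_frequency(signal, fs, 7.0)
```

    Comparing such phase estimates across conditions requires circular statistics, because phase wraps around at ±π.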

  9. Hybrid simulated annealing and its application to optimization of hidden Markov models for visual speech recognition.

    PubMed

    Lee, Jong-Seok; Park, Cheol Hoon

    2010-08-01

    We propose a novel stochastic optimization algorithm, hybrid simulated annealing (SA), to train hidden Markov models (HMMs) for visual speech recognition. In our algorithm, SA is combined with a local optimization operator that substitutes a better solution for the current one to improve the convergence speed and the quality of solutions. We mathematically prove that the sequence of the objective values converges in probability to the global optimum in the algorithm. The algorithm is applied to train HMMs that are used as visual speech recognizers. While the popular training method of HMMs, the expectation-maximization algorithm, achieves only local optima in the parameter space, the proposed method can perform global optimization of the parameters of HMMs and thereby obtain solutions yielding improved recognition performance. The superiority of the proposed algorithm to the conventional ones is demonstrated via isolated word recognition experiments.
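
    The general idea of combining simulated annealing with a local improvement operator can be sketched on a toy problem as follows. This is not the authors' algorithm: the one-dimensional objective stands in for an HMM training criterion, and the greedy local operator, the geometric cooling schedule, and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def objective(x):
    # Toy multimodal function with several local minima (lower is better).
    return x * x + 10.0 * math.sin(3.0 * x)

def local_improve(x, step=0.05, iters=20):
    """Greedy local operator: move to a neighbor only if it is better."""
    best = x
    for _ in range(iters):
        for cand in (best - step, best + step):
            if objective(cand) < objective(best):
                best = cand
    return best

def hybrid_sa(x0, n_iter=500, t0=5.0, cooling=0.99, seed=0):
    """Simulated annealing in which every proposal is locally improved first."""
    rng = random.Random(seed)
    current = best = x0
    temp = t0
    for _ in range(n_iter):
        # Random perturbation followed by the local optimization operator.
        candidate = local_improve(current + rng.gauss(0.0, 1.0))
        delta = objective(candidate) - objective(current)
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
        if objective(current) < objective(best):
            best = current
        temp *= cooling
    return best

solution = hybrid_sa(4.0)
```

    The occasional acceptance of worse candidates lets the search leave a local minimum, which a purely greedy method such as the local operator alone cannot do.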

  10. Inducing Speech Errors in Dysarthria Using Tongue Twisters

    ERIC Educational Resources Information Center

    Kember, Heather; Connaghan, Kathryn; Patel, Rupal

    2017-01-01

    Although tongue twisters have been widely used to study speech production in healthy speakers, few studies have employed this methodology for individuals with speech impairment. The present study compared tongue twister errors produced by adults with dysarthria and age-matched healthy controls. Eight speakers (four female, four male; mean age =…

  11. Acoustic Evidence for Phonologically Mismatched Speech Errors

    ERIC Educational Resources Information Center

    Gormley, Andrea

    2015-01-01

    Speech errors are generally said to accommodate to their new phonological context. This accommodation has been validated by several transcription studies. The transcription methodology is not the best choice for detecting errors at this level, however, as this type of error can be difficult to perceive. This paper presents an acoustic analysis of…

  12. Speech-Enabled Interfaces for Travel Information Systems with Large Grammars

    NASA Astrophysics Data System (ADS)

    Zhao, Baoli; Allen, Tony; Bargiela, Andrzej

    This paper introduces three grammar-segmentation methods capable of handling the large grammar issues associated with producing a real-time speech-enabled VXML bus travel application for London. Large grammars tend to produce relatively slow recognition interfaces and this work shows how this limitation can be successfully addressed. Comparative experimental results show that the novel last-word recognition based grammar segmentation method described here achieves an optimal balance between recognition rate, speed of processing and naturalness of interaction.

  13. Memristive Computational Architecture of an Echo State Network for Real-Time Speech Emotion Recognition

    DTIC Science & Technology

    2015-05-28

    recognition is simpler and requires less computational resources compared to other inputs such as facial expressions. The Berlin database of Emotional ...Processing Magazine, IEEE, vol. 18, no. 1, pp. 32–80, 2001. [15] K. R. Scherer, T. Johnstone, and G. Klasmeyer, “Vocal expression of emotion ...Network for Real-Time Speech-Emotion Recognition 5a. CONTRACT NUMBER IN-HOUSE 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 62788F 6. AUTHOR(S) Q

  14. Inferring Speaker Affect in Spoken Natural Language Communication

    ERIC Educational Resources Information Center

    Pon-Barry, Heather Roberta

    2013-01-01

    The field of spoken language processing is concerned with creating computer programs that can understand human speech and produce human-like speech. Regarding the problem of understanding human speech, there is currently growing interest in moving beyond speech recognition (the task of transcribing the words in an audio stream) and towards…

  15. Automated Assessment of Speech Fluency for L2 English Learners

    ERIC Educational Resources Information Center

    Yoon, Su-Youn

    2009-01-01

    This dissertation provides an automated method for scoring the speech fluency of second language learners of English (L2 learners) that uses speech recognition technology. Non-standard pronunciation, frequent disfluencies, faulty grammar, and inappropriate lexical choices are crucial characteristics of L2 learners' speech. Due to the ease of…

  16. Use of Computer Speech Technologies To Enhance Learning.

    ERIC Educational Resources Information Center

    Ferrell, Joe

    1999-01-01

    Discusses the design of an innovative learning system that uses new technologies for the man-machine interface, incorporating a combination of Automatic Speech Recognition (ASR) and Text To Speech (TTS) synthesis. Highlights include using speech technologies to mimic the attributes of the ideal tutor and design features. (AEF)

  17. Enhancing Speech Intelligibility: Interactions among Context, Modality, Speech Style, and Masker

    ERIC Educational Resources Information Center

    Van Engen, Kristin J.; Phelps, Jasmine E. B.; Smiljanic, Rajka; Chandrasekaran, Bharath

    2014-01-01

    Purpose: The authors sought to investigate interactions among intelligibility-enhancing speech cues (i.e., semantic context, clearly produced speech, and visual information) across a range of masking conditions. Method: Sentence recognition in noise was assessed for 29 normal-hearing listeners. Testing included semantically normal and anomalous…

  18. The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting

    ERIC Educational Resources Information Center

    Liaw, Meei-Ling

    2014-01-01

    This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…

  19. The Effect of Automatic Speech Recognition Eyespeak Software on Iraqi Students' English Pronunciation: A Pilot Study

    ERIC Educational Resources Information Center

    Sidgi, Lina Fathi Sidig; Shaari, Ahmad Jelani

    2017-01-01

    Technology, such as computer-assisted language learning (CALL), is used in teaching and learning in foreign language classrooms, where it is most needed. One promising emerging technology that supports language learning is automatic speech recognition (ASR). Integrating such technology, especially in the instruction of pronunciation…

  20. Machine Learning Through Signature Trees. Applications to Human Speech.

    ERIC Educational Resources Information Center

    White, George M.

    A signature tree is a binary decision tree used to classify unknown patterns. An attempt was made to develop a computer program for manipulating signature trees as a general research tool for exploring machine learning and pattern recognition. The program was applied to the problem of speech recognition to test its effectiveness for a specific…

  1. Apps, iPads, and Literacy: Examining the Feasibility of Speech Recognition in a First-Grade Classroom

    ERIC Educational Resources Information Center

    Baker, Elizabeth A.

    2017-01-01

    Informed by sociocultural and systems theory tenets, this study used ethnographic research methods to examine the feasibility of using speech recognition (SR) technology to support struggling readers in an early elementary classroom setting. Observations of eight first graders were conducted as they participated in a structured SR-supported…

  2. User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning

    ERIC Educational Resources Information Center

    Ahn, Tae youn; Lee, Sangmin-Michelle

    2016-01-01

    With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…

  3. [The role of temporal fine structure in tone recognition and music perception].

    PubMed

    Zhou, Q; Gu, X; Liu, B

    2017-11-07

    The sound signal can be decomposed into temporal envelope and temporal fine structure information. The temporal envelope information is crucial for speech perception in quiet environment, and the temporal fine structure information plays an important role in speech perception in noise, Mandarin tone recognition and music perception, especially the pitch and melody perception.

  4. EduSpeak[R]: A Speech Recognition and Pronunciation Scoring Toolkit for Computer-Aided Language Learning Applications

    ERIC Educational Resources Information Center

    Franco, Horacio; Bratt, Harry; Rossier, Romain; Rao Gadde, Venkata; Shriberg, Elizabeth; Abrash, Victor; Precoda, Kristin

    2010-01-01

    SRI International's EduSpeak[R] system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to…

  5. Investigating an Innovative Computer Application to Improve L2 Word Recognition from Speech

    ERIC Educational Resources Information Center

    Matthews, Joshua; O'Toole, John Mitchell

    2015-01-01

    The ability to recognise words from the aural modality is a critical aspect of successful second language (L2) listening comprehension. However, little research has been reported on computer-mediated development of L2 word recognition from speech in L2 learning contexts. This report describes the development of an innovative computer application…

  6. Using Automatic Speech Recognition Technology with Elicited Oral Response Testing

    ERIC Educational Resources Information Center

    Cox, Troy L.; Davies, Randall S.

    2012-01-01

    This study examined the use of automatic speech recognition (ASR) scored elicited oral response (EOR) tests to assess the speaking ability of English language learners. It also examined the relationship between ASR-scored EOR and other language proficiency measures and the ability of the ASR to rate speakers without bias to gender or native…

  7. Performance Evaluation of Speech Recognition Systems as a Next-Generation Pilot-Vehicle Interface Technology

    NASA Technical Reports Server (NTRS)

    Arthur, Jarvis J., III; Shelton, Kevin J.; Prinzel, Lawrence J., III; Bailey, Randall E.

    2016-01-01

    During the flight trials known as Gulfstream-V Synthetic Vision Systems Integrated Technology Evaluation (GV-SITE), a Speech Recognition System (SRS) was used by the evaluation pilots. The SRS was intended to be an intuitive interface for display control (rather than knobs, buttons, etc.). This paper describes the performance of the current "state of the art" SRS. The commercially available technology was evaluated as an application for possible inclusion in commercial aircraft flight decks as a crew-to-vehicle interface. Specifically, the technology is to be used as an interface from aircrew to the onboard displays, controls, and flight management tasks. Both a flight test and a laboratory test of an SRS were conducted.

  8. The effect of hearing aid technologies on listening in an automobile

    PubMed Central

    Wu, Yu-Hsiang; Stangl, Elizabeth; Bentler, Ruth A.; Stanziola, Rachel W.

    2014-01-01

    Background Communication while traveling in an automobile often is very difficult for hearing aid users. This is because the automobile/road noise level is usually high, and listeners/drivers often do not have access to visual cues. Since the talker of interest usually is not located in front of the driver/listener, conventional directional processing that places the directivity beam toward the listener’s front may not be helpful, and in fact, could have a negative impact on speech recognition (when compared to omnidirectional processing). Recently, technologies have become available in commercial hearing aids that are designed to improve speech recognition and/or listening effort in noisy conditions where talkers are located behind or beside the listener. These technologies include (1) a directional microphone system that uses a backward-facing directivity pattern (Back-DIR processing), (2) a technology that transmits audio signals from the ear with the better signal-to-noise ratio (SNR) to the ear with the poorer SNR (Side-Transmission processing), and (3) a signal processing scheme that suppresses the noise at the ear with the poorer SNR (Side-Suppression processing). Purpose The purpose of the current study was to determine the effect of (1) conventional directional microphones and (2) newer signal processing schemes (Back-DIR, Side-Transmission, and Side-Suppression) on listeners’ speech recognition performance and preference for communication in a traveling automobile. Research design A single-blinded, repeated-measures design was used. Study Sample Twenty-five adults with bilateral symmetrical sensorineural hearing loss aged 44 through 84 years participated in the study. Data Collection and Analysis The automobile/road noise and sentences of the Connected Speech Test (CST) were recorded through hearing aids in a standard van moving at a speed of 70 miles/hour on a paved highway. 
The hearing aids were programmed to omnidirectional microphone, conventional adaptive directional microphone, and the three newer schemes. CST sentences were presented from the side and back of the hearing aids, which were placed on the ears of a manikin. The recorded stimuli were presented to listeners via earphones in a sound treated booth to assess speech recognition performance and preference with each programmed condition. Results Compared to omnidirectional microphones, conventional adaptive directional processing had a detrimental effect on speech recognition when speech was presented from the back or side of the listener. Back-DIR and Side-Transmission processing improved speech recognition performance (relative to both omnidirectional and adaptive directional processing) when speech was from the back and side, respectively. The performance with Side-Suppression processing was better than with adaptive directional processing when speech was from the side. The participants’ preferences for a given processing scheme were generally consistent with speech recognition results. Conclusions The finding that performance with adaptive directional processing was poorer than with omnidirectional microphones demonstrates the importance of selecting the correct microphone technology for different listening situations. The results also suggest the feasibility of using hearing aid technologies to provide a better listening experience for hearing aid users in automobiles. PMID:23886425

  9. Structure and Processing in Tunisian Arabic: Speech Error Data

    ERIC Educational Resources Information Center

    Hamrouni, Nadia

    2010-01-01

    This dissertation presents experimental research on speech errors in Tunisian Arabic. The nonconcatenative morphology of Arabic shows interesting interactions of phrasal and lexical constraints with morphological structure during language production. The central empirical questions revolve around properties of "exchange errors". These…

  10. "Rate My Therapist": Automated Detection of Empathy in Drug and Alcohol Counseling via Speech and Language Processing.

    PubMed

    Xiao, Bo; Imel, Zac E; Georgiou, Panayiotis G; Atkins, David C; Narayanan, Shrikanth S

    2015-01-01

    The technology for evaluating patient-provider interactions in psychotherapy, observational coding, has not changed in 70 years. It is labor-intensive, error prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally-derived empathy ratings was evaluated against human ratings for each provider. Computationally-derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracies, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies.

  11. Syntax-directed content analysis of videotext: application to a map detection recognition system

    NASA Astrophysics Data System (ADS)

    Aradhye, Hrishikesh; Herson, James A.; Myers, Gregory

    2003-01-01

    Video is an increasingly important and ever-growing source of information to the intelligence and homeland defense analyst. A capability to automatically identify the contents of video imagery would enable the analyst to index relevant foreign and domestic news videos in a convenient and meaningful way. To this end, the proposed system aims to help determine the geographic focus of a news story directly from video imagery by detecting and geographically localizing political maps from news broadcasts, using the results of videotext recognition in lieu of a computationally expensive, scale-independent shape recognizer. Our novel method for the geographic localization of a map is based on the premise that the relative placement of text superimposed on a map roughly corresponds to the geographic coordinates of the locations the text represents. Our scheme extracts and recognizes videotext, and iteratively identifies the geographic area, while allowing for OCR errors and artistic freedom. The fast and reliable recognition of such maps by our system may provide valuable context and supporting evidence for other sources, such as speech recognition transcripts. The concepts of syntax-directed content analysis of videotext presented here can be extended to other content analysis systems.

  12. Relationship between consonant recognition in noise and hearing threshold.

    PubMed

    Yoon, Yang-soo; Allen, Jont B; Gooler, David M

    2012-04-01

    Although poorer understanding of speech in noise by listeners who are hearing-impaired (HI) is known not to be directly related to audiometric hearing threshold, HT (f), grouping HI listeners with HT (f) is widely practiced. In this article, the relationship between consonant recognition and HT (f) is considered over a range of signal-to-noise ratios (SNRs). Confusion matrices (CMs) from 25 HI ears were generated in response to 16 consonant-vowel syllables presented at 6 different SNRs. Individual differences scaling (INDSCAL) was applied to both feature-based matrices and CMs in order to evaluate the relationship between HT (f) and consonant recognition among HI listeners. The results showed no predictive relationship between the percent error scores (Pe) and HT (f) across SNRs. The multiple regression models showed that the HT (f) accounted for 39% of the total variance of the slopes of the Pe. Feature-based INDSCAL analysis showed consistent grouping of listeners across SNRs, but not in terms of HT (f). Systematic relationship between measures was also not defined by CM-based INDSCAL analysis across SNRs. HT (f) did not account for the majority of the variance (39%) in consonant recognition in noise when the complete body of the CM was considered.
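
    The percent error scores (Pe) mentioned above are derived from confusion matrices. As a rough illustration only, the sketch below computes a per-stimulus percent error from a confusion matrix whose rows are presented consonants and whose columns are responses. The function name and the toy counts are assumptions made for this example; the abstract does not give the authors' exact scoring procedure.

```python
from typing import List


def percent_error(confusion: List[List[int]]) -> List[float]:
    """Return the percent error for each presented stimulus (one per row).

    confusion[i][j] is the number of times stimulus i was heard as response j,
    so the diagonal entry confusion[i][i] counts the correct responses.
    """
    scores = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            raise ValueError(f"row {i} has no responses")
        errors = total - row[i]  # everything off the diagonal is an error
        scores.append(100.0 * errors / total)
    return scores


# Toy 2x2 matrix (hypothetical counts, not data from the study).
TOY_CM = [[8, 2],
          [5, 5]]
```

    Computing one such score per SNR condition gives the kind of error-versus-SNR curve whose slope the abstract relates to hearing threshold.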

  13. Automatic speech recognition using a predictive echo state network classifier.

    PubMed

    Skowronski, Mark D; Harris, John G

    2007-04-01

    We have combined an echo state network (ESN) with a competitive state machine framework to create a classification engine called the predictive ESN classifier. We derive the expressions for training the predictive ESN classifier and show that the model was significantly more noise robust compared to a hidden Markov model in noisy speech classification experiments by 8 ± 1 dB signal-to-noise ratio. The simple training algorithm and noise robustness of the predictive ESN classifier make it an attractive classification engine for automatic speech recognition.

  14. A Mis-recognized Medical Vocabulary Correction System for Speech-based Electronic Medical Record

    PubMed Central

    Seo, Hwa Jeong; Kim, Ju Han; Sakabe, Nagamasa

    2002-01-01

    Speech recognition as an input tool for electronic medical records (EMR) enables efficient data entry at the point of care. However, the recognition accuracy for medical vocabulary is much poorer than that for doctor-patient dialogue. We developed a mis-recognized medical vocabulary correction system based on syllable-by-syllable comparison of speech text against a medical vocabulary database. Using specialty medical vocabulary, the algorithm detects and corrects mis-recognized medical terms in narrative text. Our preliminary evaluation showed 94% accuracy in mis-recognized medical vocabulary correction.
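
    One way to picture a syllable-by-syllable comparison is as an edit distance computed over syllables rather than characters. The sketch below is an assumption about how such a correction step could look, not the authors' algorithm: terms are given as pre-split syllable tuples, and the names `syllable_distance`, `correct_term`, and `max_distance` are hypothetical.

```python
from typing import List, Sequence, Tuple

Term = Tuple[str, ...]  # a word represented as a tuple of syllables


def syllable_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where the unit of comparison is a syllable."""
    prev = list(range(len(b) + 1))
    for i, syl_a in enumerate(a, start=1):
        cur = [i]
        for j, syl_b in enumerate(b, start=1):
            cost = 0 if syl_a == syl_b else 1
            cur.append(min(prev[j] + 1,          # delete a syllable
                           cur[j - 1] + 1,       # insert a syllable
                           prev[j - 1] + cost))  # substitute a syllable
        prev = cur
    return prev[-1]


def correct_term(term: Term, vocabulary: List[Term], max_distance: int = 1) -> Term:
    """Replace `term` with the closest vocabulary entry if it is near enough."""
    best = min(vocabulary, key=lambda entry: syllable_distance(term, entry))
    if syllable_distance(term, best) <= max_distance:
        return best
    return term  # too far from every entry: leave the recognized text alone


# Hypothetical mini-vocabulary, already split into syllables.
VOCAB: List[Term] = [("an", "gi", "na"), ("a", "or", "ta")]
```

    With this toy vocabulary, `correct_term(("an", "ki", "na"), VOCAB)` maps the mis-recognized term to `("an", "gi", "na")`, while an unrelated word such as `("ta", "ble")` is returned unchanged. The threshold guards against "correcting" ordinary words into medical ones.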

  15. Robust Speaker Authentication Based on Combined Speech and Voiceprint Recognition

    NASA Astrophysics Data System (ADS)

    Malcangi, Mario

    2009-08-01

    Personal authentication is becoming increasingly important in many applications that have to protect proprietary data. Passwords and personal identification numbers (PINs) prove not to be robust enough to ensure that unauthorized people do not use them. Biometric authentication technology may offer a secure, convenient, accurate solution but sometimes fails due to its intrinsically fuzzy nature. This research aims to demonstrate that combining two basic speech processing methods, voiceprint identification and speech recognition, can provide a very high degree of robustness, especially if fuzzy decision logic is used.

  16. Speech outcomes in Cantonese patients after glossectomy.

    PubMed

    Wong, Ripley Kit; Poon, Esther Sok-Man; Woo, Cynthia Yuen-Man; Chan, Sabina Ching-Shun; Wong, Elsa Siu-Ping; Chu, Ada Wai-Sze

    2007-08-01

    We sought to determine the major factors affecting speech production of Cantonese-speaking glossectomized patients. Error pattern was analyzed. Forty-one Cantonese-speaking subjects who had undergone glossectomy ≥ 6 months previously were recruited. Speech production evaluation included (1) phonetic error analysis in nonsense syllable; (2) speech intelligibility in sentences evaluated by naive listeners; (3) overall speech intelligibility in conversation evaluated by experienced speech therapists. Patients receiving adjuvant radiotherapy had significantly poorer segmental and connected speech production. Total or subtotal glossectomy also resulted in poor speech outcomes. Patients having free flap reconstruction showed the best speech outcomes. Patients without lymph node metastasis had significantly better speech scores when compared with patients with lymph node metastasis. Initial consonant production had the worst scores, while vowel production was the least affected. Speech outcomes of Cantonese-speaking glossectomized patients depended on the severity of the disease. Initial consonants had the greatest effect on speech intelligibility.

  17. Technological evaluation of gesture and speech interfaces for enabling dismounted soldier-robot dialogue

    NASA Astrophysics Data System (ADS)

    Kattoju, Ravi Kiran; Barber, Daniel J.; Abich, Julian; Harris, Jonathan

    2016-05-01

    With increasing necessity for intuitive Soldier-robot communication in military operations and advancements in interactive technologies, autonomous robots have transitioned from assistance tools to functional and operational teammates able to service an array of military operations. Despite improvements in gesture and speech recognition technologies, their effectiveness in supporting Soldier-robot communication is still uncertain. The purpose of the present study was to evaluate the performance of gesture and speech interface technologies to facilitate Soldier-robot communication during a spatial-navigation task with an autonomous robot. Gesture and speech semantically based spatial-navigation commands leveraged existing lexicons for visual and verbal communication from the U.S. Army field manual for visual signaling and a previously established Squad Level Vocabulary (SLV). Speech commands were recorded by a lapel microphone and Microsoft Kinect, and classified by commercial off-the-shelf automatic speech recognition (ASR) software. Visual signals were captured and classified using a custom wireless gesture glove and software. Participants in the experiment commanded a robot to complete a simulated ISR mission in a scaled down urban scenario by delivering a sequence of gesture and speech commands, both individually and simultaneously, to the robot. Performance and reliability of gesture and speech hardware interfaces and recognition tools were analyzed and reported. Analysis of experimental results demonstrated the employed gesture technology has significant potential for enabling bidirectional Soldier-robot team dialogue based on the high classification accuracy and minimal training required to perform gesture commands.

  18. Accuracy of cochlear implant recipients in speech reception in the presence of background music.

    PubMed

    Gfeller, Kate; Turner, Christopher; Oleson, Jacob; Kliethermes, Stephanie; Driscoll, Virginia

    2012-12-01

    This study examined speech recognition abilities of cochlear implant (CI) recipients in the spectrally complex listening condition of 3 contrasting types of background music, and compared performance based upon listener groups: CI recipients using conventional long-electrode devices, Hybrid CI recipients (acoustic plus electric stimulation), and normal-hearing adults. We tested 154 long-electrode CI recipients using varied devices and strategies, 21 Hybrid CI recipients, and 49 normal-hearing adults on closed-set recognition of spondees presented in 3 contrasting forms of background music (piano solo, large symphony orchestra, vocal solo with small combo accompaniment) in an adaptive test. Signal-to-noise ratio thresholds for speech in music were examined in relation to measures of speech recognition in background noise and multitalker babble, pitch perception, and music experience. The signal-to-noise ratio thresholds for speech in music varied as a function of category of background music, group membership (long-electrode, Hybrid, normal-hearing), and age. The thresholds for speech in background music were significantly correlated with measures of pitch perception and thresholds for speech in background noise; auditory status was an important predictor. Evidence suggests that speech reception thresholds in background music change as a function of listener age (with more advanced age being detrimental), structural characteristics of different types of music, and hearing status (residual hearing). These findings have implications for everyday listening conditions such as communicating in social or commercial situations in which there is background music.

  19. Hidden Markov models in automatic speech recognition

    NASA Astrophysics Data System (ADS)

    Wrzoskowicz, Adam

    1993-11-01

    This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMMs designed to operate in a noisy environment. The author also describes a language for human-robot communications which was defined as a complex multilevel network from an HMM model of speech sounds geared towards Polish inflections. The author also added mandatory lexical and syntactic rules to the system for its communications vocabulary.
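
    Recognition "on the basis of distinct stochastic models" is commonly done by scoring an observation sequence under each class model and choosing the most likely one. The sketch below illustrates that idea with the standard forward algorithm for discrete-observation HMMs. It is a generic textbook formulation, not the system described in the article, and the model names and all probabilities are toy values assumed for the example.

```python
from typing import Dict, List, Sequence, Tuple

# One HMM: (start probabilities, transition matrix, emission matrix).
Model = Tuple[List[float], List[List[float]], List[List[float]]]


def forward_likelihood(obs: Sequence[int], model: Model) -> float:
    """P(obs | model) computed with the forward algorithm."""
    if not obs:
        raise ValueError("observation sequence is empty")
    start, trans, emit = model
    n = len(start)
    # Initialisation: probability of starting in each state and emitting obs[0].
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    # Induction: propagate through the transitions, then emit the next symbol.
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


def recognize(obs: Sequence[int], models: Dict[str, Model]) -> str:
    """Return the name of the model that assigns obs the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, models[name]))


# Two toy left-to-right models over a two-symbol alphabet (assumed values).
TOY_MODELS: Dict[str, Model] = {
    "word_a": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "word_b": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}
```

    A practical system would work with log probabilities or scaling to avoid numerical underflow on long utterances, and would use continuous acoustic features rather than a two-symbol alphabet.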

  20. Speech-on-speech masking with variable access to the linguistic content of the masker speech for native and non-native speakers of English

    PubMed Central

    Calandruccio, Lauren; Bradlow, Ann R.; Dhar, Sumitrajit

    2013-01-01

    Background Masking release for an English sentence-recognition task in the presence of foreign-accented English speech compared to native-accented English speech was reported in Calandruccio, Dhar and Bradlow (2010). The masking release appeared to increase as the masker intelligibility decreased. However, it could not be ruled out that spectral differences between the speech maskers were influencing the significant differences observed. Purpose The purpose of the current experiment was to minimize spectral differences between speech maskers to determine how various amounts of linguistic information within competing speech affect masking release. Research Design A mixed model design with within- (four two-talker speech maskers) and between-subject (listener group) factors was conducted. Speech maskers included native-accented English speech, and high-intelligibility, moderate-intelligibility and low-intelligibility Mandarin-accented English. Normalizing the long-term average speech spectra of the maskers to each other minimized spectral differences between the masker conditions. Study Sample Three listener groups were tested including monolingual English speakers with normal hearing, non-native speakers of English with normal hearing, and monolingual speakers of English with hearing loss. The non-native speakers of English were from various native-language backgrounds, not including Mandarin (or any other Chinese dialect). Listeners with hearing loss had symmetrical, mild sloping to moderate sensorineural hearing loss. Data Collection and Analysis Listeners were asked to repeat back sentences that were presented in the presence of four different two-talker speech maskers. Responses were scored based on the keywords within the sentences (100 keywords/masker condition). A mixed-model regression analysis was used to analyze the difference in performance scores between the masker conditions and the listener groups. 
Results Monolingual speakers of English with normal hearing benefited when the competing speech signal was foreign-accented compared to native-accented allowing for improved speech recognition. Various levels of intelligibility across the foreign-accented speech maskers did not influence results. Neither the non-native English listeners with normal hearing, nor the monolingual English speakers with hearing loss benefited from masking release when the masker was changed from native-accented to foreign-accented English. Conclusions Slight modifications between the target and the masker speech allowed monolingual speakers of English with normal hearing to improve their recognition of native-accented English even when the competing speech was highly intelligible. Further research is needed to determine which modifications within the competing speech signal caused the Mandarin-accented English to be less effective with respect to masking. Determining the influences within the competing speech that make it less effective as a masker, or determining why monolingual normal-hearing listeners can take advantage of these differences could help improve speech recognition for those with hearing loss in the future. PMID:25126683

  1. The role of consolidation in learning context-dependent phonotactic patterns in speech and digital sequence production.

    PubMed

    Anderson, Nathaniel D; Dell, Gary S

    2018-04-03

    Speakers implicitly learn novel phonotactic patterns by producing strings of syllables. The learning is revealed in their speech errors. First-order patterns, such as "/f/ must be a syllable onset," can be distinguished from contingent, or second-order, patterns, such as "/f/ must be an onset if the vowel is /a/, but a coda if the vowel is /o/." A meta-analysis of 19 experiments clearly demonstrated that first-order patterns affect speech errors to a very great extent in a single experimental session, but second-order vowel-contingent patterns only affect errors on the second day of testing, suggesting the need for a consolidation period. Two experiments tested an analogue to these studies involving sequences of button pushes, with fingers as "consonants" and thumbs as "vowels." The button-push errors revealed two of the key speech-error findings: first-order patterns are learned quickly, but second-order thumb-contingent patterns are only strongly revealed in the errors on the second day of testing. The influence of computational complexity on the implicit learning of phonotactic patterns in speech production may be a general feature of sequence production.

  2. Fifty years of progress in acoustic phonetics

    NASA Astrophysics Data System (ADS)

    Stevens, Kenneth N.

    2004-10-01

    Three events that occurred 50 or 60 years ago shaped the study of acoustic phonetics, and in the following few decades these events influenced research and applications in speech disorders, speech development, speech synthesis, speech recognition, and other subareas in speech communication. These events were: (1) the source-filter theory of speech production (Chiba and Kajiyama; Fant); (2) the development of the sound spectrograph and its interpretation (Potter, Kopp, and Green; Joos); and (3) the birth of research that related distinctive features to acoustic patterns (Jakobson, Fant, and Halle). Following these events there has been systematic exploration of the articulatory, acoustic, and perceptual bases of phonological categories, and some quantification of the sources of variability in the transformation of this phonological representation of speech into its acoustic manifestations. This effort has been enhanced by studies of how children acquire language in spite of this variability and by research on speech disorders. Gaps in our knowledge of this inherent variability in speech have limited the directions of applications such as synthesis and recognition of speech, and have led to the implementation of data-driven techniques rather than theoretical principles. Some examples of advances in our knowledge, and limitations of this knowledge, are reviewed.

  3. Restoring the missing features of the corrupted speech using linear interpolation methods

    NASA Astrophysics Data System (ADS)

    Rassem, Taha H.; Makbol, Nasrin M.; Hasan, Ali Muttaleb; Zaki, Siti Syazni Mohd; Girija, P. N.

    2017-10-01

    One of the main challenges in Automatic Speech Recognition (ASR) is noise. The performance of an ASR system drops significantly if the speech is corrupted by noise. In the spectrogram representation of a speech signal, deleting the low Signal-to-Noise Ratio (SNR) elements leaves an incomplete spectrogram. One direction is for the speech recognizer to modify the spectrogram in order to restore the missing elements. Another direction is to restore the elements lost by deleting low-SNR elements before recognition is performed. This can be done using different spectrogram reconstruction methods. In this paper, the geometrical spectrogram reconstruction methods suggested by some researchers are implemented as a toolbox. In these geometrical reconstruction methods, linear interpolation along time or along frequency is used to predict the missing elements between adjacent observed elements in the spectrogram. Moreover, a new linear interpolation method using time and frequency together is presented. The CMU Sphinx III software is used in the experiments to test the performance of the linear interpolation reconstruction method. The experiments are done under different conditions, such as different window lengths and different utterance lengths. The speech corpus used in the experiments consists of 20 male and 20 female speakers, each with two different utterances. As a result, 80% recognition accuracy is achieved at an SNR of 25%.
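
    The interpolation idea in this abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' toolbox: the spectrogram is assumed to be a list of rows (one row per frequency bin, one column per time frame) with `None` marking deleted low-SNR elements, and all function names are hypothetical. The abstract does not say how time and frequency are combined, so the averaging in `along_both` is an assumption.

```python
from typing import List, Optional

Row = List[Optional[float]]


def fill_1d(values: Row) -> Row:
    """Linearly interpolate runs of None that lie between two observed values.
    Gaps at either edge have no neighbour on one side and stay None."""
    out = list(values)
    last = None  # index of the most recent observed element
    for i, v in enumerate(values):
        if v is None:
            continue
        if last is not None and i - last > 1:
            step = (v - values[last]) / (i - last)
            for j in range(last + 1, i):
                out[j] = values[last] + step * (j - last)
        last = i
    return out


def along_time(spec: List[Row]) -> List[Row]:
    """Interpolate across time frames within each frequency bin (row)."""
    return [fill_1d(row) for row in spec]


def along_frequency(spec: List[Row]) -> List[Row]:
    """Interpolate across frequency bins within each time frame (column)."""
    columns = [fill_1d(list(col)) for col in zip(*spec)]
    return [list(row) for row in zip(*columns)]


def along_both(spec: List[Row]) -> List[Row]:
    """Assumed combination rule: average the two estimates where both exist,
    otherwise keep whichever one is available."""
    merged_rows = []
    for row_t, row_f in zip(along_time(spec), along_frequency(spec)):
        merged = []
        for a, b in zip(row_t, row_f):
            if a is not None and b is not None:
                merged.append((a + b) / 2)
            else:
                merged.append(a if a is not None else b)
        merged_rows.append(merged)
    return merged_rows
```

    For example, `fill_1d([1.0, None, None, 4.0])` fills the gap with evenly spaced values between the two observed neighbours, while a gap at the start or end of a row is left untouched because it is not "between adjacent observed elements".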

  4. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    NASA Technical Reports Server (NTRS)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

    A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exist in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers a fast rate of data/text entry and a small overall size, and can be lightweight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks when faced with constraints in computational resources.

  5. Is comprehension necessary for error detection? A conflict-based account of monitoring in speech production

    PubMed Central

    Nozari, Nazbanou; Dell, Gary S.; Schwartz, Myrna F.

    2011-01-01

    Despite the existence of speech errors, verbal communication is successful because speakers can detect (and correct) their errors. The standard theory of speech-error detection, the perceptual-loop account, posits that the comprehension system monitors production output for errors. Such a comprehension-based monitor, however, cannot explain the double dissociation between comprehension and error-detection ability observed in the aphasic patients. We propose a new theory of speech-error detection which is instead based on the production process itself. The theory borrows from studies of forced-choice-response tasks the notion that error detection is accomplished by monitoring response conflict via a frontal brain structure, such as the anterior cingulate cortex. We adapt this idea to the two-step model of word production, and test the model-derived predictions on a sample of aphasic patients. Our results show a strong correlation between patients’ error-detection ability and the model’s characterization of their production skills, and no significant correlation between error detection and comprehension measures, thus supporting a production-based monitor, generally, and the implemented conflict-based monitor in particular. The successful application of the conflict-based theory to error-detection in linguistic, as well as non-linguistic domains points to a domain-general monitoring system. PMID:21652015

  6. Evaluation of Mandarin Chinese Speech Recognition in Adults with Cochlear Implants Using the Spectral Ripple Discrimination Test

    PubMed Central

    Dai, Chuanfu; Zhao, Zeqi; Zhang, Duo; Lei, Guanxiong

    2018-01-01

    Background The aim of this study was to explore the value of the spectral ripple discrimination test in speech recognition evaluation among a deaf (post-lingual) Mandarin-speaking population in China following cochlear implantation. Material/Methods The study included 23 Mandarin-speaking adult subjects with normal hearing (normal-hearing group) and 17 deaf adults who were former Mandarin-speakers, with cochlear implants (cochlear implantation group). The normal-hearing subjects were divided into men (n=10) and women (n=13). The spectral ripple discrimination thresholds between the groups were compared. The correlation between spectral ripple discrimination thresholds and Mandarin speech recognition rates in the cochlear implantation group was studied. Results Spectral ripple discrimination thresholds did not correlate with age (r=−0.19; p=0.22), and there was no significant difference in spectral ripple discrimination thresholds between the male and female groups (p=0.654). Spectral ripple discrimination thresholds of deaf adults with cochlear implants were significantly correlated with monosyllabic recognition rates (r=0.84; p=0.000). Conclusions In a Mandarin Chinese speaking population, spectral ripple discrimination thresholds of normal-hearing individuals were unaffected by both gender and age. Spectral ripple discrimination thresholds were correlated with Mandarin monosyllabic recognition rates in post-lingual deaf Mandarin-speaking adults with cochlear implants. The spectral ripple discrimination test is a promising method for speech recognition evaluation in adults following cochlear implantation in China. PMID:29806954

  7. Evaluation of Mandarin Chinese Speech Recognition in Adults with Cochlear Implants Using the Spectral Ripple Discrimination Test.

    PubMed

    Dai, Chuanfu; Zhao, Zeqi; Shen, Weidong; Zhang, Duo; Lei, Guanxiong; Qiao, Yuehua; Yang, Shiming

    2018-05-28

    BACKGROUND The aim of this study was to explore the value of the spectral ripple discrimination test in speech recognition evaluation among a deaf (post-lingual) Mandarin-speaking population in China following cochlear implantation. MATERIAL AND METHODS The study included 23 Mandarin-speaking adult subjects with normal hearing (normal-hearing group) and 17 deaf adults who were former Mandarin-speakers, with cochlear implants (cochlear implantation group). The normal-hearing subjects were divided into men (n=10) and women (n=13). The spectral ripple discrimination thresholds between the groups were compared. The correlation between spectral ripple discrimination thresholds and Mandarin speech recognition rates in the cochlear implantation group was studied. RESULTS Spectral ripple discrimination thresholds did not correlate with age (r=-0.19; p=0.22), and there was no significant difference in spectral ripple discrimination thresholds between the male and female groups (p=0.654). Spectral ripple discrimination thresholds of deaf adults with cochlear implants were significantly correlated with monosyllabic recognition rates (r=0.84; p=0.000). CONCLUSIONS In a Mandarin Chinese speaking population, spectral ripple discrimination thresholds of normal-hearing individuals were unaffected by both gender and age. Spectral ripple discrimination thresholds were correlated with Mandarin monosyllabic recognition rates in post-lingual deaf Mandarin-speaking adults with cochlear implants. The spectral ripple discrimination test is a promising method for speech recognition evaluation in adults following cochlear implantation in China.

  8. An innovative multimodal virtual platform for communication with devices in a natural way

    NASA Astrophysics Data System (ADS)

    Kinkar, Chhayarani R.; Golash, Richa; Upadhyay, Akhilesh R.

    2012-03-01

    As technology advances, people are increasingly interested in communicating with machines and computers naturally. Doing so makes machines more compact and portable by eliminating remotes, keyboards, etc., and helps people live in an environment free from electromagnetic waves. This idea has made 'recognition of natural modality in human computer interaction' an appealing and promising research field. At the same time, it has been observed that using a single mode of interaction limits the full use of commands as well as data flow. This paper proposes a multimodal platform in which, out of many natural modalities such as eye gaze, speech, voice, and face, human gestures are combined with the human voice to minimize the mean square error. This relaxes the strict environmental conditions needed for accurate and robust interaction with a single mode. Gestures complement speech: gestures are ideal for direct object manipulation, while natural language is used for descriptive tasks. Human-computer interaction requires two broad stages, recognition and interpretation. Recognizing and interpreting a natural modality as complex binary instructions is difficult because it integrates the real world with a virtual environment. The main idea of the paper is to develop an efficient model for fusing data from heterogeneous sensors, namely a camera and a microphone. Our analysis shows that efficiency increases when heterogeneous data (image and voice) are combined at the feature level using artificial intelligence. The long-term goal is to design a robust system for users who are physically impaired or have limited technical knowledge.
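
    Feature-level fusion, as mentioned in this abstract, can be illustrated with a short sketch. The abstract does not describe the actual features or the fusion model, so everything below is an assumption: each modality is taken to be a plain list of numbers, each vector is scaled to unit length, and the two are concatenated before any classifier sees them. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(features: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit length so that neither modality
    dominates the fused vector purely because of its numeric range."""
    norm = math.sqrt(sum(x * x for x in features))
    if norm == 0.0:
        return [0.0 for _ in features]
    return [x / norm for x in features]


def fuse_features(gesture: Sequence[float], voice: Sequence[float]) -> List[float]:
    """Feature-level (early) fusion: normalize each modality, then concatenate."""
    return l2_normalize(gesture) + l2_normalize(voice)


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    print(fuse_features([3.0, 4.0], [0.0, 2.0]))
```

    The fused vector would then be passed to a single recognizer, in contrast to decision-level fusion, where each modality is recognized separately and only the outputs are combined.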

  9. Music Training Can Improve Music and Speech Perception in Pediatric Mandarin-Speaking Cochlear Implant Users.

    PubMed

    Cheng, Xiaoting; Liu, Yangwenyi; Shu, Yilai; Tao, Duo-Duo; Wang, Bing; Yuan, Yasheng; Galvin, John J; Fu, Qian-Jie; Chen, Bing

    2018-01-01

    Due to limited spectral resolution, cochlear implants (CIs) do not convey pitch information very well. Pitch cues are important for perception of music and tonal language; it is possible that music training may improve performance in both listening tasks. In this study, we investigated music training outcomes in terms of perception of music, lexical tones, and sentences in 22 young (4.8 to 9.3 years old), prelingually deaf Mandarin-speaking CI users. Music perception was measured using a melodic contour identification (MCI) task. Speech perception was measured for lexical tones and sentences presented in quiet. Subjects received 8 weeks of MCI training using pitch ranges not used for testing. Music and speech perception were measured at 2, 4, and 8 weeks after training was begun; follow-up measures were made 4 weeks after training was stopped. Mean baseline performance was 33.2%, 76.9%, and 45.8% correct for MCI, lexical tone recognition, and sentence recognition, respectively. After 8 weeks of MCI training, mean performance significantly improved by 22.9, 14.4, and 14.5 percentage points for MCI, lexical tone recognition, and sentence recognition, respectively (p < .05 in all cases). Four weeks after training was stopped, there was no significant change in posttraining music and speech performance. The results suggest that music training can significantly improve pediatric Mandarin-speaking CI users' music and speech perception.

  10. Semantic and phonetic enhancements for speech-in-noise recognition by native and non-native listeners.

    PubMed

    Bradlow, Ann R; Alexander, Jennifer A

    2007-04-01

    Previous research has shown that speech recognition differences between native and proficient non-native listeners emerge under suboptimal conditions. Current evidence has suggested that the key deficit that underlies this disproportionate effect of unfavorable listening conditions for non-native listeners is their less effective use of compensatory information at higher levels of processing to recover from information loss at the phoneme identification level. The present study investigated whether this non-native disadvantage could be overcome if enhancements at various levels of processing were presented in combination. Native and non-native listeners were presented with English sentences in which the final word varied in predictability and which were produced in either plain or clear speech. Results showed that, relative to the low-predictability-plain-speech baseline condition, non-native listener final word recognition improved only when both semantic and acoustic enhancements were available (high-predictability-clear-speech). In contrast, the native listeners benefited from each source of enhancement separately and in combination. These results suggest that native and non-native listeners apply similar strategies for speech-in-noise perception: The crucial difference is in the signal clarity required for contextual information to be effective, rather than in an inability of non-native listeners to take advantage of this contextual information per se.

  11. Dynamic action units slip in speech production errors

    PubMed Central

    Goldstein, Louis; Pouplier, Marianne; Chen, Larissa; Saltzman, Elliot; Byrd, Dani

    2008-01-01

    In the past, the nature of the compositional units proposed for spoken language has largely diverged from the types of control units pursued in the domains of other skilled motor tasks. A classic source of evidence as to the units structuring speech has been patterns observed in speech errors – “slips of the tongue”. The present study reports, for the first time, on kinematic data from tongue and lip movements during speech errors elicited in the laboratory using a repetition task. Our data are consistent with the hypothesis that speech production results from the assembly of dynamically defined action units – gestures – in a linguistically structured environment. The experimental results support both the presence of gestural units and the dynamical properties of these units and their coordination. This study of speech articulation shows that it is possible to develop a principled account of spoken language within a more general theory of action. PMID:16822494

  12. Liberated Learning: Analysis of University Students' Perceptions and Experiences with Continuous Automated Speech Recognition

    ERIC Educational Resources Information Center

    Ryba, Ken; McIvor, Tom; Shakir, Maha; Paez, Di

    2006-01-01

    This study examined continuous automated speech recognition in the university lecture theatre. The participants were both native speakers of English (L1) and English as a second language students (L2) enrolled in an information systems course (Total N=160). After an initial training period, an L2 lecturer in information systems delivered three…

  13. Investigating an Application of Speech-to-Text Recognition: A Study on Visual Attention and Learning Behaviour

    ERIC Educational Resources Information Center

    Huang, Y-M.; Liu, C-J.; Shadiev, Rustam; Shen, M-H.; Hwang, W-Y.

    2015-01-01

    One major drawback of previous research on speech-to-text recognition (STR) is that most findings showing the effectiveness of STR for learning were based upon subjective evidence. Very few studies have used eye-tracking techniques to investigate visual attention of students on STR-generated text. Furthermore, not much attention was paid to…

  14. The Compensatory Effectiveness of Optical Character Recognition/Speech Synthesis on Reading Comprehension of Postsecondary Students with Learning Disabilities.

    ERIC Educational Resources Information Center

    Higgins, Eleanor L.; Raskind, Marshall H.

    1997-01-01

    Thirty-seven college students with learning disabilities were given a reading comprehension task under the following conditions: (1) using an optical character recognition/speech synthesis system; (2) having the text read aloud by a human reader; or (3) reading silently without assistance. Findings indicated that the greater the disability, the…

  15. The Use of an Autonomous Pedagogical Agent and Automatic Speech Recognition for Teaching Sight Words to Students with Autism Spectrum Disorder

    ERIC Educational Resources Information Center

    Saadatzi, Mohammad Nasser; Pennington, Robert C.; Welch, Karla C.; Graham, James H.; Scott, Renee E.

    2017-01-01

    In the current study, we examined the effects of an instructional package comprised of an autonomous pedagogical agent, automatic speech recognition, and constant time delay during the instruction of reading sight words aloud to young adults with autism spectrum disorder. We used a concurrent multiple baseline across participants design to…

  16. Review of Speech-to-Text Recognition Technology for Enhancing Learning

    ERIC Educational Resources Information Center

    Shadiev, Rustam; Hwang, Wu-Yuin; Chen, Nian-Shing; Huang, Yueh-Min

    2014-01-01

    This paper reviewed literature from 1999 to 2014 inclusively on how Speech-to-Text Recognition (STR) technology has been applied to enhance learning. The first aim of this review is to understand how STR technology has been used to support learning over the past fifteen years, and the second is to analyze all research evidence to understand how…

  17. A Freely-Available Authoring System for Browser-Based CALL with Speech Recognition

    ERIC Educational Resources Information Center

    O'Brien, Myles

    2017-01-01

    A system for authoring browser-based CALL material incorporating Google speech recognition has been developed and made freely available for download. The system provides a teacher with a simple way to set up CALL material, including an optional image, sound or video, which will elicit spoken (and/or typed) answers from the user and check them…

  18. Computer-Mediated Input, Output and Feedback in the Development of L2 Word Recognition from Speech

    ERIC Educational Resources Information Center

    Matthews, Joshua; Cheng, Junyu; O'Toole, John Mitchell

    2015-01-01

    This paper reports on the impact of computer-mediated input, output and feedback on the development of second language (L2) word recognition from speech (WRS). A quasi-experimental pre-test/treatment/post-test research design was used involving three intact tertiary level English as a Second Language (ESL) classes. Classes were either assigned to…

  19. Applications of Speech-to-Text Recognition and Computer-Aided Translation for Facilitating Cross-Cultural Learning through a Learning Activity: Issues and Their Solutions

    ERIC Educational Resources Information Center

    Shadiev, Rustam; Wu, Ting-Ting; Sun, Ai; Huang, Yueh-Min

    2018-01-01

    In this study, 21 university students, who represented thirteen nationalities, participated in an online cross-cultural learning activity. The participants were engaged in interactions and exchanges carried out on Facebook® and Skype® platforms, and their multilingual communications were supported by speech-to-text recognition (STR) and…

  20. Investigating the Effectiveness of Speech-To-Text Recognition Applications on Learning Performance, Attention, and Meditation

    ERIC Educational Resources Information Center

    Shadiev, Rustam; Huang, Yueh-Min; Hwang, Jan-Pan

    2017-01-01

    In this study, the effectiveness of the application of speech-to-text recognition (STR) technology on enhancing learning and concentration in a calm state of mind, hereafter referred to as meditation (An intentional and self-regulated focusing of attention in order to relax and calm the mind), was investigated. This effectiveness was further…

  1. Speech Characteristics and Intelligibility in Adults with Mild and Moderate Intellectual Disabilities

    PubMed Central

    Coppens-Hofman, Marjolein C.; Terband, Hayo; Snik, Ad F.M.; Maassen, Ben A.M.

    2017-01-01

    Purpose Adults with intellectual disabilities (ID) often show reduced speech intelligibility, which affects their social interaction skills. This study aims to establish the main predictors of this reduced intelligibility in order to ultimately optimise management. Method Spontaneous speech and picture naming tasks were recorded in 36 adults with mild or moderate ID. Twenty-five naïve listeners rated the intelligibility of the spontaneous speech samples. Performance on the picture-naming task was analysed by means of a phonological error analysis based on expert transcriptions. Results The transcription analyses showed that the phonemic and syllabic inventories of the speakers were complete. However, multiple errors at the phonemic and syllabic level were found. The frequencies of specific types of errors were related to intelligibility and quality ratings. Conclusions The development of the phonemic and syllabic repertoire appears to be completed in adults with mild-to-moderate ID. The charted speech difficulties can be interpreted to indicate speech motor control and planning difficulties. These findings may aid the development of diagnostic tests and speech therapies aimed at improving speech intelligibility in this specific group. PMID:28118637

  2. Non-native Listeners’ Recognition of High-Variability Speech Using PRESTO

    PubMed Central

    Tamati, Terrin N.; Pisoni, David B.

    2015-01-01

    Background Natural variability in speech is a significant challenge to robust successful spoken word recognition. In everyday listening environments, listeners must quickly adapt and adjust to multiple sources of variability in both the signal and listening environments. High-variability speech may be particularly difficult to understand for non-native listeners, who have less experience with the second language (L2) phonological system and less detailed knowledge of sociolinguistic variation of the L2. Purpose The purpose of this study was to investigate the effects of high-variability sentences on non-native speech recognition and to explore the underlying sources of individual differences in speech recognition abilities of non-native listeners. Research Design Participants completed two sentence recognition tasks involving high-variability and low-variability sentences. They also completed a battery of behavioral tasks and self-report questionnaires designed to assess their indexical processing skills, vocabulary knowledge, and several core neurocognitive abilities. Study Sample Native speakers of Mandarin (n = 25) living in the United States recruited from the Indiana University community participated in the current study. A native comparison group consisted of scores obtained from native speakers of English (n = 21) in the Indiana University community taken from an earlier study. Data Collection and Analysis Speech recognition in high-variability listening conditions was assessed with a sentence recognition task using sentences from PRESTO (Perceptually Robust English Sentence Test Open-Set) mixed in 6-talker multitalker babble. Speech recognition in low-variability listening conditions was assessed using sentences from HINT (Hearing In Noise Test) mixed in 6-talker multitalker babble. Indexical processing skills were measured using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Vocabulary knowledge was assessed with the WordFam word familiarity test, and executive functioning was assessed with the BRIEF-A (Behavioral Rating Inventory of Executive Function – Adult Version) self-report questionnaire. Scores from the non-native listeners on behavioral tasks and self-report questionnaires were compared with scores obtained from native listeners tested in a previous study and were examined for individual differences. Results Non-native keyword recognition scores were significantly lower on PRESTO sentences than on HINT sentences. Non-native listeners’ keyword recognition scores were also lower than native listeners’ scores on both sentence recognition tasks. Differences in performance on the sentence recognition tasks between non-native and native listeners were larger on PRESTO than on HINT, although group differences varied by signal-to-noise ratio. The non-native and native groups also differed in the ability to categorize talkers by region of origin and in vocabulary knowledge. Individual non-native word recognition accuracy on PRESTO sentences in multitalker babble at more favorable signal-to-noise ratios was found to be related to several BRIEF-A subscales and composite scores. However, non-native performance on PRESTO was not related to regional dialect categorization, talker and gender discrimination, or vocabulary knowledge. Conclusions High-variability sentences in multitalker babble were particularly challenging for non-native listeners. Difficulty under high-variability testing conditions was related to lack of experience with the L2, especially L2 sociolinguistic information, compared with native listeners. Individual differences among the non-native listeners were related to weaknesses in core neurocognitive abilities affecting behavioral control in everyday life. PMID:25405842

  3. Perception of Filtered Speech by Children with Developmental Dyslexia and Children with Specific Language Impairments

    PubMed Central

    Goswami, Usha; Cumming, Ruth; Chait, Maria; Huss, Martina; Mead, Natasha; Wilson, Angela M.; Barnes, Lisa; Fosker, Tim

    2016-01-01

    Here we use two filtered speech tasks to investigate children’s processing of slow (<4 Hz) versus faster (∼33 Hz) temporal modulations in speech. We compare groups of children with either developmental dyslexia (Experiment 1) or speech and language impairments (SLIs, Experiment 2) to groups of typically-developing (TD) children age-matched to each disorder group. Ten nursery rhymes were filtered so that their modulation frequencies were either low-pass filtered (<4 Hz) or band-pass filtered (22 – 40 Hz). Recognition of the filtered nursery rhymes was tested in a picture recognition multiple choice paradigm. Children with dyslexia aged 10 years showed equivalent recognition overall to TD controls for both the low-pass and band-pass filtered stimuli, but showed significantly impaired acoustic learning during the experiment from low-pass filtered targets. Children with oral SLIs aged 9 years showed significantly poorer recognition of band pass filtered targets compared to their TD controls, and showed comparable acoustic learning effects to TD children during the experiment. The SLI samples were also divided into children with and without phonological difficulties. The children with both SLI and phonological difficulties were impaired in recognizing both kinds of filtered speech. These data are suggestive of impaired temporal sampling of the speech signal at different modulation rates by children with different kinds of developmental language disorder. Both SLI and dyslexic samples showed impaired discrimination of amplitude rise times. Implications of these findings for a temporal sampling framework for understanding developmental language disorders are discussed. PMID:27303348

  4. Recognizing speech under a processing load: dissociating energetic from informational factors.

    PubMed

    Mattys, Sven L; Brooks, Joanna; Cooke, Martin

    2009-11-01

    Effects of perceptual and cognitive loads on spoken-word recognition have so far largely escaped investigation. This study lays the foundations of a psycholinguistic approach to speech recognition in adverse conditions that draws upon the distinction between energetic masking, i.e., listening environments leading to signal degradation, and informational masking, i.e., listening environments leading to depletion of higher-order, domain-general processing resources, independent of signal degradation. We show that severe energetic masking, such as that produced by background speech or noise, curtails reliance on lexical-semantic knowledge and increases relative reliance on salient acoustic detail. In contrast, informational masking, induced by a resource-depleting competing task (divided attention or a memory load), results in the opposite pattern. Based on this clear dissociation, we propose a model of speech recognition that addresses not only the mapping between sensory input and lexical representations, as traditionally advocated, but also the way in which this mapping interfaces with general cognition and non-linguistic processes.

  5. Audiovisual speech perception development at varying levels of perceptual processing

    PubMed Central

    Lalonde, Kaylah; Holt, Rachael Frush

    2016-01-01

    This study used the auditory evaluation framework [Erber (1982). Auditory Training (Alexander Graham Bell Association, Washington, DC)] to characterize the influence of visual speech on audiovisual (AV) speech perception in adults and children at multiple levels of perceptual processing. Six- to eight-year-old children and adults completed auditory and AV speech perception tasks at three levels of perceptual processing (detection, discrimination, and recognition). The tasks differed in the level of perceptual processing required to complete them. Adults and children demonstrated visual speech influence at all levels of perceptual processing. Whereas children demonstrated the same visual speech influence at each level of perceptual processing, adults demonstrated greater visual speech influence on tasks requiring higher levels of perceptual processing. These results support previous research demonstrating multiple mechanisms of AV speech processing (general perceptual and speech-specific mechanisms) with independent maturational time courses. The results suggest that adults rely on both general perceptual mechanisms that apply to all levels of perceptual processing and speech-specific mechanisms that apply when making phonetic decisions and/or accessing the lexicon. Six- to eight-year-old children seem to rely only on general perceptual mechanisms across levels. As expected, developmental differences in AV benefit on this and other recognition tasks likely reflect immature speech-specific mechanisms and phonetic processing in children. PMID:27106318

  6. Audiovisual speech perception development at varying levels of perceptual processing.

    PubMed

    Lalonde, Kaylah; Holt, Rachael Frush

    2016-04-01

    This study used the auditory evaluation framework [Erber (1982). Auditory Training (Alexander Graham Bell Association, Washington, DC)] to characterize the influence of visual speech on audiovisual (AV) speech perception in adults and children at multiple levels of perceptual processing. Six- to eight-year-old children and adults completed auditory and AV speech perception tasks at three levels of perceptual processing (detection, discrimination, and recognition). The tasks differed in the level of perceptual processing required to complete them. Adults and children demonstrated visual speech influence at all levels of perceptual processing. Whereas children demonstrated the same visual speech influence at each level of perceptual processing, adults demonstrated greater visual speech influence on tasks requiring higher levels of perceptual processing. These results support previous research demonstrating multiple mechanisms of AV speech processing (general perceptual and speech-specific mechanisms) with independent maturational time courses. The results suggest that adults rely on both general perceptual mechanisms that apply to all levels of perceptual processing and speech-specific mechanisms that apply when making phonetic decisions and/or accessing the lexicon. Six- to eight-year-old children seem to rely only on general perceptual mechanisms across levels. As expected, developmental differences in AV benefit on this and other recognition tasks likely reflect immature speech-specific mechanisms and phonetic processing in children.

  7. Do Older Listeners With Hearing Loss Benefit From Dynamic Pitch for Speech Recognition in Noise?

    PubMed

    Shen, Jing; Souza, Pamela E

    2017-10-12

    Dynamic pitch, the variation in the fundamental frequency of speech, aids older listeners' speech perception in noise. It is unclear, however, whether some older listeners with hearing loss benefit from strengthened dynamic pitch cues for recognizing speech in certain noise scenarios and how this relative benefit may be associated with individual factors. We first examined older individuals' relative benefit between natural and strong dynamic pitches for better speech recognition in noise. Further, we reported the individual factors of the 2 groups of listeners who benefit differently from natural and strong dynamic pitches. Speech reception thresholds of 13 older listeners with mild-moderate hearing loss were measured using target speech with 3 levels of dynamic pitch strength. Individuals' ability to benefit from dynamic pitch was defined as the speech reception threshold difference between speeches with and without dynamic pitch cues. The relative benefit of natural versus strong dynamic pitch varied across individuals. However, this relative benefit remained consistent for the same individuals across those background noises with temporal modulation. Those listeners who benefited more from strong dynamic pitch reported better subjective speech perception abilities. Strong dynamic pitch may be more beneficial than natural dynamic pitch for some older listeners to recognize speech better in noise, particularly when the noise has temporal modulation.

  8. The Cortical Organization of Speech Processing: Feedback Control and Predictive Coding the Context of a Dual-Stream Model

    ERIC Educational Resources Information Center

    Hickok, Gregory

    2012-01-01

    Speech recognition is an active process that involves some form of predictive coding. This statement is relatively uncontroversial. What is less clear is the source of the prediction. The dual-stream model of speech processing suggests that there are two possible sources of predictive coding in speech perception: the motor speech system and the…

  9. Error biases in inner and overt speech: evidence from tongue twisters.

    PubMed

    Corley, Martin; Brocklehurst, Paul H; Moat, H Susannah

    2011-01-01

    To compare the properties of inner and overt speech, Oppenheim and Dell (2008) counted participants' self-reported speech errors when reciting tongue twisters either overtly or silently and found a bias toward substituting phonemes that resulted in words in both conditions, but a bias toward substituting similar phonemes only when speech was overt. Here, we report 3 experiments revisiting their conclusion that inner speech remains underspecified at the subphonemic level, which they simulated within an activation-feedback framework. In 2 experiments, participants recited tongue twisters that could result in the errorful substitutions of similar or dissimilar phonemes to form real words or nonwords. Both experiments included an auditory masking condition, to gauge the possible impact of loss of auditory feedback on the accuracy of self-reporting of speech errors. In Experiment 1, the stimuli were composed entirely from real words, whereas, in Experiment 2, half the tokens used were nonwords. Although masking did not have any effects, participants were more likely to report substitutions of similar phonemes in both experiments, in inner as well as overt speech. This pattern of results was confirmed in a 3rd experiment using the real-word materials from Oppenheim and Dell (in press). In addition to these findings, a lexical bias effect found in Experiments 1 and 3 disappeared in Experiment 2. Our findings support a view in which plans for inner speech are indeed specified at the feature level, even when there is no intention to articulate words overtly, and in which editing of the plan for errors is implicated. (PsycINFO Database Record (c) 2010 APA, all rights reserved).

  10. STANFORD ARTIFICIAL INTELLIGENCE PROJECT.

    DTIC Science & Technology

    ARTIFICIAL INTELLIGENCE, GAME THEORY, DECISION MAKING, BIONICS, AUTOMATA, SPEECH RECOGNITION, GEOMETRIC FORMS, LEARNING MACHINES, MATHEMATICAL MODELS, PATTERN RECOGNITION, SERVOMECHANISMS, SIMULATION, BIBLIOGRAPHIES.

  11. Effects of WDRC release time and number of channels on output SNR and speech recognition

    PubMed Central

    Alexander, Joshua M.; Masterson, Katie

    2014-01-01

    Objectives: The purpose of this study was to investigate the joint effects that wide dynamic range compression (WDRC) release time (RT) and number of channels have on recognition of sentences in the presence of steady and modulated maskers at different signal-to-noise ratios (SNRs). How the different combinations of WDRC parameters affect output SNR and the role this plays in the observed findings was also investigated. Design: Twenty-four listeners with mild to moderate sensorineural hearing loss identified sentences mixed with steady or modulated maskers at 3 SNRs (−5, 0, +5 dB) that had been processed using a hearing aid simulator with 6 combinations of RT (40 and 640 ms) and number of channels (4, 8, and 16). Compression parameters were set using the Desired Sensation Level v5.0a prescriptive fitting method. For each condition, amplified speech and masker levels and the resultant long-term output SNR were measured. Results: Speech recognition with WDRC depended on the combination of RT and number of channels, with the greatest effects observed at 0 dB input SNR, in which mean speech recognition scores varied by 10–12% across WDRC manipulations. Overall, effect sizes were generally small. Across both masker types and the three SNRs tested, the best speech recognition was obtained with 8 channels, regardless of RT. Increased speech levels, which favor audibility, were associated with the short RT and with an increase in the number of channels. These same conditions also increased masker levels by an even greater amount, for a net decrease in the long-term output SNR. Changes in long-term SNR across WDRC conditions were found to be strongly associated with changes in the temporal envelope shape as quantified by the Envelope Difference Index; however, neither of these factors fully explained the observed differences in speech recognition. Conclusions: A primary finding of this study was that the number of channels had a modest effect when analyzed at each level of RT, with results suggesting that selecting 8 channels for a given RT might be the safest choice. Effects were smaller for RT, with results suggesting that short RT was slightly better when only 4 channels were used and that long RT was better when 16 channels were used. Individual differences in how listeners were influenced by audibility, output SNR, temporal distortion, and spectral distortion may have contributed to the size of the effects found in this study. Because only general suppositions could be made for how each of these factors may have influenced the overall results of this study, future research would benefit from exploring the predictive value of these and other factors in selecting the processing parameters that maximize speech recognition for individuals. PMID:25470368

  12. A Task-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing Hierarchy.

    PubMed

    Kell, Alexander J E; Yamins, Daniel L K; Shook, Erica N; Norman-Haignere, Sam V; McDermott, Josh H

    2018-05-02

    A core goal of auditory neuroscience is to build quantitative models that predict cortical responses to natural sounds. Reasoning that a complete model of auditory cortex must solve ecologically relevant tasks, we optimized hierarchical neural networks for speech and music recognition. The best-performing network contained separate music and speech pathways following early shared processing, potentially replicating human cortical organization. The network performed both tasks as well as humans and exhibited human-like errors despite not being optimized to do so, suggesting common constraints on network and human performance. The network predicted fMRI voxel responses substantially better than traditional spectrotemporal filter models throughout auditory cortex. It also provided a quantitative signature of cortical representational hierarchy: primary and non-primary responses were best predicted by intermediate and late network layers, respectively. The results suggest that task optimization provides a powerful set of tools for modeling sensory systems. Copyright © 2018 Elsevier Inc. All rights reserved.

  13. Improving medical imaging report turnaround times: the role of technology.

    PubMed

    Marquez, Luis O; Stewart, Howard

    2005-01-01

    At Southern Ohio Medical Center (SOMC), the medical imaging department and the radiologists expressed a strong desire to improve workflow. The improved workflow was a major motivating factor toward implementing a new RIS and speech recognition technology. The need to monitor workflow in a real-time fashion and to evaluate productivity and resources necessitated that a new solution be found. A decision was made to roll out both the new RIS product and speech recognition to maximize the resources to interface and implement the new solution. Prior to implementation of the new RIS, the medical imaging department operated in a conventional electronic-order-entry to paper request manner. The paper request followed the study through exam completion to the radiologist. SOMC entered into a contract with its PACS vendor to participate in beta testing and clinical trials for a new RIS product for the US market. Backup plans were created in the event the product failed to function as planned, either during the beta testing period or during clinical trials. The last piece of the technology puzzle to improve report turnaround time was voice recognition technology. Speech recognition enhanced the RIS technology as soon as it was implemented. The results show that the project has been a success. The new RIS, combined with speech recognition and the PACS, makes for a very effective solution to patient, exam, and results management in the medical imaging department.

  14. Mimicking Aphasic Semantic Errors in Normal Speech Production: Evidence from a Novel Experimental Paradigm

    ERIC Educational Resources Information Center

    Hodgson, Catherine; Lambon Ralph, Matthew A.

    2008-01-01

    Semantic errors are commonly found in semantic dementia (SD) and some forms of stroke aphasia and provide insights into semantic processing and speech production. Low error rates are found in standard picture naming tasks in normal controls. In order to increase error rates and thus provide an experimental model of aphasic performance, this study…

  15. The Downside of Greater Lexical Influences: Selectively Poorer Speech Perception in Noise

    PubMed Central

    Xie, Zilong; Tessmer, Rachel; Chandrasekaran, Bharath

    2017-01-01

    Purpose: Although lexical information influences phoneme perception, the extent to which reliance on lexical information enhances speech processing in challenging listening environments is unclear. We examined the extent to which individual differences in lexical influences on phonemic processing impact speech processing in maskers containing varying degrees of linguistic information (2-talker babble or pink noise). Method: Twenty-nine monolingual English speakers were instructed to ignore the lexical status of spoken syllables (e.g., gift vs. kift) and to only categorize the initial phonemes (/g/ vs. /k/). The same participants then performed speech recognition tasks in the presence of 2-talker babble or pink noise in audio-only and audiovisual conditions. Results: Individuals who demonstrated greater lexical influences on phonemic processing experienced greater speech processing difficulties in 2-talker babble than in pink noise. These selective difficulties were present across audio-only and audiovisual conditions. Conclusion: Individuals with greater reliance on lexical processes during speech perception exhibit impaired speech recognition in listening conditions in which competing talkers introduce audible linguistic interferences. Future studies should examine the locus of lexical influences/interferences on phonemic processing and speech-in-speech processing. PMID:28586824

  16. Phonological Awareness and Types of Sound Errors in Preschoolers with Speech Sound Disorders

    ERIC Educational Resources Information Center

    Preston, Jonathan; Edwards, Mary Louise

    2010-01-01

    Purpose: Some children with speech sound disorders (SSD) have difficulty with literacy-related skills, particularly phonological awareness (PA). This study investigates the PA skills of preschoolers with SSD by using a regression model to evaluate the degree to which PA can be concurrently predicted by types of speech sound errors. Method:…

  17. Prepopulated radiology report templates: a prospective analysis of error rate and turnaround time.

    PubMed

    Hawkins, C M; Hall, S; Hardin, J; Salisbury, S; Towbin, A J

    2012-08-01

    Current speech recognition software allows exam-specific standard reports to be prepopulated into the dictation field based on the radiology information system procedure code. While it is thought that prepopulating reports can decrease the time required to dictate a study and the overall number of errors in the final report, this hypothesis has not been studied in a clinical setting. A prospective study was performed. During the first week, radiologists dictated all studies using prepopulated standard reports. During the second week, all studies were dictated after prepopulated reports had been disabled. Final radiology reports were evaluated for 11 different types of errors. Each error within a report was classified individually. The median time required to dictate an exam was compared between the 2 weeks. There were 12,387 reports dictated during the study, of which, 1,173 randomly distributed reports were analyzed for errors. There was no difference in the number of errors per report between the 2 weeks; however, radiologists overwhelmingly preferred using a standard report both weeks. Grammatical errors were by far the most common error type, followed by missense errors and errors of omission. There was no significant difference in the median dictation time when comparing studies performed each week. The use of prepopulated reports does not alone affect the error rate or dictation time of radiology reports. While it is a useful feature for radiologists, it must be coupled with other strategies in order to decrease errors.

  18. Computational validation of the motor contribution to speech perception.

    PubMed

    Badino, Leonardo; D'Ausilio, Alessandro; Fadiga, Luciano; Metta, Giorgio

    2014-07-01

    Action perception and recognition are core abilities fundamental for human social interaction. A parieto-frontal network (the mirror neuron system) matches visually presented biological motion information onto observers' motor representations. This process of matching the actions of others onto our own sensorimotor repertoire is thought to be important for action recognition, providing a non-mediated "motor perception" based on a bidirectional flow of information along the mirror parieto-frontal circuits. State-of-the-art machine learning strategies for hand action identification have shown better performances when sensorimotor data, as opposed to visual information only, are available during learning. As speech is a particular type of action (with acoustic targets), it is expected to activate a mirror neuron mechanism. Indeed, in speech perception, motor centers have been shown to be causally involved in the discrimination of speech sounds. In this paper, we review recent neurophysiological and machine learning-based studies showing (a) the specific contribution of the motor system to speech perception and (b) that automatic phone recognition is significantly improved when motor data are used during training of classifiers (as opposed to learning from purely auditory data). Copyright © 2014 Cognitive Science Society, Inc.

  19. Effects of Familiarity and Feeding on Newborn Speech-Voice Recognition

    ERIC Educational Resources Information Center

    Valiante, A. Grace; Barr, Ronald G.; Zelazo, Philip R.; Brant, Rollin; Young, Simon N.

    2013-01-01

    Newborn infants preferentially orient to familiar over unfamiliar speech sounds. They are also better at remembering unfamiliar speech sounds for short periods of time if learning and retention occur after a feed than before. It is unknown whether short-term memory for speech is enhanced when the sound is familiar (versus unfamiliar) and, if so,…

  20. Speech Recognition Scores of White and Black Student-Teacher Listeners for Black and White First Grade Speakers. Final Technical Report.

    ERIC Educational Resources Information Center

    Nober, E. Harris; Seymour, Harry N.

    In order to investigate the possible consequences of dialectical differences in the classroom setting relative to the low income black and white first grade child and the prospective white middle-class teacher, 25 black and 25 white university listeners yielded speech recognition scores for 48 black and 48 white five-year-old urban school-children…
