Integrating hidden Markov model and PRAAT: a toolbox for robust automatic speech transcription
NASA Astrophysics Data System (ADS)
Kabir, A.; Barker, J.; Giurgiu, M.
2010-09-01
An automatic time-aligned phone transcription toolbox of English speech corpora has been developed. Especially the toolbox would be very useful to generate robust automatic transcription and able to produce phone level transcription using speaker independent models as well as speaker dependent models without manual intervention. The system is based on standard Hidden Markov Models (HMM) approach and it was successfully experimented over a large audiovisual speech corpus namely GRID corpus. One of the most powerful features of the toolbox is the increased flexibility in speech processing where the speech community would be able to import the automatic transcription generated by HMM Toolkit (HTK) into a popular transcription software, PRAAT, and vice-versa. The toolbox has been evaluated through statistical analysis on GRID data which shows that automatic transcription deviates by an average of 20 ms with respect to manual transcription.
Automatic speech recognition using a predictive echo state network classifier.
Skowronski, Mark D; Harris, John G
2007-04-01
We have combined an echo state network (ESN) with a competitive state machine framework to create a classification engine called the predictive ESN classifier. We derive the expressions for training the predictive ESN classifier and show that the model was significantly more noise robust compared to a hidden Markov model in noisy speech classification experiments by 8+/-1 dB signal-to-noise ratio. The simple training algorithm and noise robustness of the predictive ESN classifier make it an attractive classification engine for automatic speech recognition.
Automatic detection of Parkinson's disease in running speech spoken in three different languages.
Orozco-Arroyave, J R; Hönig, F; Arias-Londoño, J D; Vargas-Bonilla, J F; Daqrouq, K; Skodda, S; Rusz, J; Nöth, E
2016-01-01
The aim of this study is the analysis of continuous speech signals of people with Parkinson's disease (PD) considering recordings in different languages (Spanish, German, and Czech). A method for the characterization of the speech signals, based on the automatic segmentation of utterances into voiced and unvoiced frames, is addressed here. The energy content of the unvoiced sounds is modeled using 12 Mel-frequency cepstral coefficients and 25 bands scaled according to the Bark scale. Four speech tasks comprising isolated words, rapid repetition of the syllables /pa/-/ta/-/ka/, sentences, and read texts are evaluated. The method proves to be more accurate than classical approaches in the automatic classification of speech of people with PD and healthy controls. The accuracies range from 85% to 99% depending on the language and the speech task. Cross-language experiments are also performed confirming the robustness and generalization capability of the method, with accuracies ranging from 60% to 99%. This work comprises a step forward for the development of computer aided tools for the automatic assessment of dysarthric speech signals in multiple languages.
Parker, Mark; Cunningham, Stuart; Enderby, Pam; Hawley, Mark; Green, Phil
2006-01-01
The STARDUST project developed robust computer speech recognizers for use by eight people with severe dysarthria and concomitant physical disability to access assistive technologies. Independent computer speech recognizers trained with normal speech are of limited functional use by those with severe dysarthria due to limited and inconsistent proximity to "normal" articulatory patterns. Severe dysarthric output may also be characterized by a small mass of distinguishable phonetic tokens making the acoustic differentiation of target words difficult. Speaker dependent computer speech recognition using Hidden Markov Models was achieved by the identification of robust phonetic elements within the individual speaker output patterns. A new system of speech training using computer generated visual and auditory feedback reduced the inconsistent production of key phonetic tokens over time.
Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor
NASA Astrophysics Data System (ADS)
Heracleous, Panikos; Kaino, Tomomi; Saruwatari, Hiroshi; Shikano, Kiyohiro
2006-12-01
We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20 k dictation task a[InlineEquation not available: see fulltext.] word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.
ERIC Educational Resources Information Center
Esch, Barbara E.; Carr, James E.; Grow, Laura L.
2009-01-01
Evidence to support stimulus-stimulus pairing (SSP) in speech acquisition is less than robust, calling into question the ability of SSP to reliably establish automatically reinforcing properties of speech and limiting the procedure's clinical utility for increasing vocalizations. We evaluated the effects of a modified SSP procedure on…
An efficient robust sound classification algorithm for hearing aids.
Nordqvist, Peter; Leijon, Arne
2004-06-01
An efficient robust sound classification algorithm based on hidden Markov models is presented. The system would enable a hearing aid to automatically change its behavior for differing listening environments according to the user's preferences. This work attempts to distinguish between three listening environment categories: speech in traffic noise, speech in babble, and clean speech, regardless of the signal-to-noise ratio. The classifier uses only the modulation characteristics of the signal. The classifier ignores the absolute sound pressure level and the absolute spectrum shape, resulting in an algorithm that is robust against irrelevant acoustic variations. The measured classification hit rate was 96.7%-99.5% when the classifier was tested with sounds representing one of the three environment categories included in the classifier. False-alarm rates were 0.2%-1.7% in these tests. The algorithm is robust and efficient and consumes a small amount of instructions and memory. It is fully possible to implement the classifier in a DSP-based hearing instrument.
Noise-robust speech recognition through auditory feature detection and spike sequence decoding.
Schafer, Phillip B; Jin, Dezhe Z
2014-03-01
Speech recognition in noisy conditions is a major challenge for computer systems, but the human brain performs it routinely and accurately. Automatic speech recognition (ASR) systems that are inspired by neuroscience can potentially bridge the performance gap between humans and machines. We present a system for noise-robust isolated word recognition that works by decoding sequences of spikes from a population of simulated auditory feature-detecting neurons. Each neuron is trained to respond selectively to a brief spectrotemporal pattern, or feature, drawn from the simulated auditory nerve response to speech. The neural population conveys the time-dependent structure of a sound by its sequence of spikes. We compare two methods for decoding the spike sequences--one using a hidden Markov model-based recognizer, the other using a novel template-based recognition scheme. In the latter case, words are recognized by comparing their spike sequences to template sequences obtained from clean training data, using a similarity measure based on the length of the longest common sub-sequence. Using isolated spoken digits from the AURORA-2 database, we show that our combined system outperforms a state-of-the-art robust speech recognizer at low signal-to-noise ratios. Both the spike-based encoding scheme and the template-based decoding offer gains in noise robustness over traditional speech recognition methods. Our system highlights potential advantages of spike-based acoustic coding and provides a biologically motivated framework for robust ASR development.
Multi-stream LSTM-HMM decoding and histogram equalization for noise robust keyword spotting.
Wöllmer, Martin; Marchi, Erik; Squartini, Stefano; Schuller, Björn
2011-09-01
Highly spontaneous, conversational, and potentially emotional and noisy speech is known to be a challenge for today's automatic speech recognition (ASR) systems, which highlights the need for advanced algorithms that improve speech features and models. Histogram Equalization is an efficient method to reduce the mismatch between clean and noisy conditions by normalizing all moments of the probability distribution of the feature vector components. In this article, we propose to combine histogram equalization and multi-condition training for robust keyword detection in noisy speech. To better cope with conversational speaking styles, we show how contextual information can be effectively exploited in a multi-stream ASR framework that dynamically models context-sensitive phoneme estimates generated by a long short-term memory neural network. The proposed techniques are evaluated on the SEMAINE database-a corpus containing emotionally colored conversations with a cognitive system for "Sensitive Artificial Listening".
[The endpoint detection of cough signal in continuous speech].
Yang, Guoqing; Mo, Hongqiang; Li, Wen; Lian, Lianfang; Zheng, Zeguang
2010-06-01
The endpoint detection of cough signal in continuous speech has been researched in order to improve the efficiency and veracity of manual recognition or computer-based automatic recognition. First, using the short time zero crossing ratio(ZCR) for identifying the suspicious coughs and getting the threshold of short time energy based on acoustic characteristics of cough. Then, the short time energy is combined with short time ZCR in order to implement the endpoint detection of cough in continuous speech. To evaluate the effect of the method, first, the virtual number of coughs in each recording was identified by two experienced doctors using the graphical user interface (GUI). Second, the recordings were analyzed by automatic endpoint detection program under Matlab7.0. Finally, the comparison between these two results showed: The error rate of undetected cough is 2.18%, and 98.13% of noise, silence and speech were removed. The way of setting short time energy threshold is robust. The endpoint detection program can remove most speech and noise, thus maintaining a lower rate of error.
Acoustic Event Detection and Classification
NASA Astrophysics Data System (ADS)
Temko, Andrey; Nadeu, Climent; Macho, Dušan; Malkin, Robert; Zieger, Christian; Omologo, Maurizio
The human activity that takes place in meeting rooms or classrooms is reflected in a rich variety of acoustic events (AE), produced either by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity. Indeed, speech is usually the most informative sound, but other kinds of AEs may also carry useful information, for example, clapping or laughing inside a speech, a strong yawn in the middle of a lecture, a chair moving or a door slam when the meeting has just started. Additionally, detection and classification of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition.
Robust Recognition of Loud and Lombard speech in the Fighter Cockpit Environment
1988-08-01
the latter as inter-speaker variability. According to Zue [Z85j, inter-speaker variabilities can be attributed to sociolinguistic background, dialect...34 Journal of the Acoustical Society of America , Vol 50, 1971. [At74I B. S. Atal, "Linear prediction for speaker identification," Journal of the Acoustical...Society of America , Vol 55, 1974. [B771 B. Beek, E. P. Neuberg, and D. C. Hodge, "An Assessment of the Technology of Automatic Speech Recognition for
Xiao, Bo; Imel, Zac E; Georgiou, Panayiotis G; Atkins, David C; Narayanan, Shrikanth S
2015-01-01
The technology for evaluating patient-provider interactions in psychotherapy-observational coding-has not changed in 70 years. It is labor-intensive, error prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally-derived empathy ratings were evaluated against human ratings for each provider. Computationally-derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracies, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies.
Schädler, Marc René; Kollmeier, Birger
2015-04-01
To test if simultaneous spectral and temporal processing is required to extract robust features for automatic speech recognition (ASR), the robust spectro-temporal two-dimensional-Gabor filter bank (GBFB) front-end from Schädler, Meyer, and Kollmeier [J. Acoust. Soc. Am. 131, 4134-4151 (2012)] was de-composed into a spectral one-dimensional-Gabor filter bank and a temporal one-dimensional-Gabor filter bank. A feature set that is extracted with these separate spectral and temporal modulation filter banks was introduced, the separate Gabor filter bank (SGBFB) features, and evaluated on the CHiME (Computational Hearing in Multisource Environments) keywords-in-noise recognition task. From the perspective of robust ASR, the results showed that spectral and temporal processing can be performed independently and are not required to interact with each other. Using SGBFB features permitted the signal-to-noise ratio (SNR) to be lowered by 1.2 dB while still performing as well as the GBFB-based reference system, which corresponds to a relative improvement of the word error rate by 12.8%. Additionally, the real time factor of the spectro-temporal processing could be reduced by more than an order of magnitude. Compared to human listeners, the SNR needed to be 13 dB higher when using Mel-frequency cepstral coefficient features, 11 dB higher when using GBFB features, and 9 dB higher when using SGBFB features to achieve the same recognition performance.
Xiao, Bo; Imel, Zac E.; Georgiou, Panayiotis G.; Atkins, David C.; Narayanan, Shrikanth S.
2015-01-01
The technology for evaluating patient-provider interactions in psychotherapy–observational coding–has not changed in 70 years. It is labor-intensive, error prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally-derived empathy ratings were evaluated against human ratings for each provider. Computationally-derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracies, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies. PMID:26630392
Automatic initial and final segmentation in cleft palate speech of Mandarin speakers
Liu, Yin; Yin, Heng; Zhang, Junpeng; Zhang, Jing; Zhang, Jiang
2017-01-01
The speech unit segmentation is an important pre-processing step in the analysis of cleft palate speech. In Mandarin, one syllable is composed of two parts: initial and final. In cleft palate speech, the resonance disorders occur at the finals and the voiced initials, while the articulation disorders occur at the unvoiced initials. Thus, the initials and finals are the minimum speech units, which could reflect the characteristics of cleft palate speech disorders. In this work, an automatic initial/final segmentation method is proposed. It is an important preprocessing step in cleft palate speech signal processing. The tested cleft palate speech utterances are collected from the Cleft Palate Speech Treatment Center in the Hospital of Stomatology, Sichuan University, which has the largest cleft palate patients in China. The cleft palate speech data includes 824 speech segments, and the control samples contain 228 speech segments. The syllables are extracted from the speech utterances firstly. The proposed syllable extraction method avoids the training stage, and achieves a good performance for both voiced and unvoiced speech. Then, the syllables are classified into with “quasi-unvoiced” or with “quasi-voiced” initials. Respective initial/final segmentation methods are proposed to these two types of syllables. Moreover, a two-step segmentation method is proposed. The rough locations of syllable and initial/final boundaries are refined in the second segmentation step, in order to improve the robustness of segmentation accuracy. The experiments show that the initial/final segmentation accuracies for syllables with quasi-unvoiced initials are higher than quasi-voiced initials. For the cleft palate speech, the mean time error is 4.4ms for syllables with quasi-unvoiced initials, and 25.7ms for syllables with quasi-voiced initials, and the correct segmentation accuracy P30 for all the syllables is 91.69%. For the control samples, P30 for all the syllables is 91.24%. PMID:28926572
Automatic initial and final segmentation in cleft palate speech of Mandarin speakers.
He, Ling; Liu, Yin; Yin, Heng; Zhang, Junpeng; Zhang, Jing; Zhang, Jiang
2017-01-01
The speech unit segmentation is an important pre-processing step in the analysis of cleft palate speech. In Mandarin, one syllable is composed of two parts: initial and final. In cleft palate speech, the resonance disorders occur at the finals and the voiced initials, while the articulation disorders occur at the unvoiced initials. Thus, the initials and finals are the minimum speech units, which could reflect the characteristics of cleft palate speech disorders. In this work, an automatic initial/final segmentation method is proposed. It is an important preprocessing step in cleft palate speech signal processing. The tested cleft palate speech utterances are collected from the Cleft Palate Speech Treatment Center in the Hospital of Stomatology, Sichuan University, which has the largest cleft palate patients in China. The cleft palate speech data includes 824 speech segments, and the control samples contain 228 speech segments. The syllables are extracted from the speech utterances firstly. The proposed syllable extraction method avoids the training stage, and achieves a good performance for both voiced and unvoiced speech. Then, the syllables are classified into with "quasi-unvoiced" or with "quasi-voiced" initials. Respective initial/final segmentation methods are proposed to these two types of syllables. Moreover, a two-step segmentation method is proposed. The rough locations of syllable and initial/final boundaries are refined in the second segmentation step, in order to improve the robustness of segmentation accuracy. The experiments show that the initial/final segmentation accuracies for syllables with quasi-unvoiced initials are higher than quasi-voiced initials. For the cleft palate speech, the mean time error is 4.4ms for syllables with quasi-unvoiced initials, and 25.7ms for syllables with quasi-voiced initials, and the correct segmentation accuracy P30 for all the syllables is 91.69%. For the control samples, P30 for all the syllables is 91.24%.
Histogram equalization with Bayesian estimation for noise robust speech recognition.
Suh, Youngjoo; Kim, Hoirin
2018-02-01
The histogram equalization approach is an efficient feature normalization technique for noise robust automatic speech recognition. However, it suffers from performance degradation when some fundamental conditions are not satisfied in the test environment. To remedy these limitations of the original histogram equalization methods, class-based histogram equalization approach has been proposed. Although this approach showed substantial performance improvement under noise environments, it still suffers from performance degradation due to the overfitting problem when test data are insufficient. To address this issue, the proposed histogram equalization technique employs the Bayesian estimation method in the test cumulative distribution function estimation. It was reported in a previous study conducted on the Aurora-4 task that the proposed approach provided substantial performance gains in speech recognition systems based on the acoustic modeling of the Gaussian mixture model-hidden Markov model. In this work, the proposed approach was examined in speech recognition systems with deep neural network-hidden Markov model (DNN-HMM), the current mainstream speech recognition approach where it also showed meaningful performance improvement over the conventional maximum likelihood estimation-based method. The fusion of the proposed features with the mel-frequency cepstral coefficients provided additional performance gains in DNN-HMM systems, which otherwise suffer from performance degradation in the clean test condition.
Goldrick, Matthew; Keshet, Joseph; Gustafson, Erin; Heller, Jordana; Needle, Jeremy
2016-04-01
Traces of the cognitive mechanisms underlying speaking can be found within subtle variations in how we pronounce sounds. While speech errors have traditionally been seen as categorical substitutions of one sound for another, acoustic/articulatory analyses show they partially reflect the intended sound. When "pig" is mispronounced as "big," the resulting /b/ sound differs from correct productions of "big," moving towards intended "pig"-revealing the role of graded sound representations in speech production. Investigating the origins of such phenomena requires detailed estimation of speech sound distributions; this has been hampered by reliance on subjective, labor-intensive manual annotation. Computational methods can address these issues by providing for objective, automatic measurements. We develop a novel high-precision computational approach, based on a set of machine learning algorithms, for measurement of elicited speech. The algorithms are trained on existing manually labeled data to detect and locate linguistically relevant acoustic properties with high accuracy. Our approach is robust, is designed to handle mis-productions, and overall matches the performance of expert coders. It allows us to analyze a very large dataset of speech errors (containing far more errors than the total in the existing literature), illuminating properties of speech sound distributions previously impossible to reliably observe. We argue that this provides novel evidence that two sources both contribute to deviations in speech errors: planning processes specifying the targets of articulation and articulatory processes specifying the motor movements that execute this plan. These findings illustrate how a much richer picture of speech provides an opportunity to gain novel insights into language processing. Copyright © 2016 Elsevier B.V. All rights reserved.
Estimating psycho-physiological state of a human by speech analysis
NASA Astrophysics Data System (ADS)
Ronzhin, A. L.
2005-05-01
Adverse effects of intoxication, fatigue and boredom could degrade performance of highly trained operators of complex technical systems with potentially catastrophic consequences. Existing physiological fitness for duty tests are time consuming, costly, invasive, and highly unpopular. Known non-physiological tests constitute a secondary task and interfere with the busy workload of the tested operator. Various attempts to assess the current status of the operator by processing of "normal operational data" often lead to excessive amount of computations, poorly justified metrics, and ambiguity of results. At the same time, speech analysis presents a natural, non-invasive approach based upon well-established efficient data processing. In addition, it supports both behavioral and physiological biometric. This paper presents an approach facilitating robust speech analysis/understanding process in spite of natural speech variability and background noise. Automatic speech recognition is suggested as a technique for the detection of changes in the psycho-physiological state of a human that typically manifest themselves by changes of characteristics of voice tract and semantic-syntactic connectivity of conversation. Preliminary tests have confirmed that the statistically significant correlation between the error rate of automatic speech recognition and the extent of alcohol intoxication does exist. In addition, the obtained data allowed exploring some interesting correlations and establishing some quantitative models. It is proposed to utilize this approach as a part of fitness for duty test and compare its efficiency with analyses of iris, face geometry, thermography and other popular non-invasive biometric techniques.
ERIC Educational Resources Information Center
Young, Victoria; Mihailidis, Alex
2010-01-01
Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…
Kharlamov, Viktor; Campbell, Kenneth; Kazanina, Nina
2011-11-01
Speech sounds are not always perceived in accordance with their acoustic-phonetic content. For example, an early and automatic process of perceptual repair, which ensures conformity of speech inputs to the listener's native language phonology, applies to individual input segments that do not exist in the native inventory or to sound sequences that are illicit according to the native phonotactic restrictions on sound co-occurrences. The present study with Russian and Canadian English speakers shows that listeners may perceive phonetically distinct and licit sound sequences as equivalent when the native language system provides robust evidence for mapping multiple phonetic forms onto a single phonological representation. In Russian, due to an optional but productive t-deletion process that affects /stn/ clusters, the surface forms [sn] and [stn] may be phonologically equivalent and map to a single phonological form /stn/. In contrast, [sn] and [stn] clusters are usually phonologically distinct in (Canadian) English. Behavioral data from identification and discrimination tasks indicated that [sn] and [stn] clusters were more confusable for Russian than for English speakers. The EEG experiment employed an oddball paradigm with nonwords [asna] and [astna] used as the standard and deviant stimuli. A reliable mismatch negativity response was elicited approximately 100 msec postchange in the English group but not in the Russian group. These findings point to a perceptual repair mechanism that is engaged automatically at a prelexical level to ensure immediate encoding of speech inputs in phonological terms, which in turn enables efficient access to the meaning of a spoken utterance.
Human phoneme recognition depending on speech-intrinsic variability.
Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger
2010-11-01
The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).
Speech systems research at Texas Instruments
NASA Technical Reports Server (NTRS)
Doddington, George R.
1977-01-01
An assessment of automatic speech processing technology is presented. Fundamental problems in the development and the deployment of automatic speech processing systems are defined and a technology forecast for speech systems is presented.
ERIC Educational Resources Information Center
Wigmore, Angela; Hunter, Gordon; Pflugel, Eckhard; Denholm-Price, James; Binelli, Vincent
2009-01-01
Speech technology--especially automatic speech recognition--has now advanced to a level where it can be of great benefit both to able-bodied people and those with various disabilities. In this paper we describe an application "TalkMaths" which, using the output from a commonly-used conventional automatic speech recognition system,…
Shin, Young Hoon; Seo, Jiwon
2016-01-01
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing. PMID:27801867
Shin, Young Hoon; Seo, Jiwon
2016-10-29
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.
Rapid tuning shifts in human auditory cortex enhance speech intelligibility
Holdgraf, Christopher R.; de Heer, Wendy; Pasley, Brian; Rieger, Jochem; Crone, Nathan; Lin, Jack J.; Knight, Robert T.; Theunissen, Frédéric E.
2016-01-01
Experience shapes our perception of the world on a moment-to-moment basis. This robust perceptual effect of experience parallels a change in the neural representation of stimulus features, though the nature of this representation and its plasticity are not well-understood. Spectrotemporal receptive field (STRF) mapping describes the neural response to acoustic features, and has been used to study contextual effects on auditory receptive fields in animal models. We performed a STRF plasticity analysis on electrophysiological data from recordings obtained directly from the human auditory cortex. Here, we report rapid, automatic plasticity of the spectrotemporal response of recorded neural ensembles, driven by previous experience with acoustic and linguistic information, and with a neurophysiological effect in the sub-second range. This plasticity reflects increased sensitivity to spectrotemporal features, enhancing the extraction of more speech-like features from a degraded stimulus and providing the physiological basis for the observed ‘perceptual enhancement' in understanding speech. PMID:27996965
Lancioni, Giulio E; Singh, Nirbhay N; O'Reilly, Mark F; Green, Vanessa A; Alberti, Gloria; Boccasini, Adele; Smaldone, Angela; Oliva, Doretta; Bosco, Andrea
2014-08-01
Assessing automatic feedback technologies to promote safe travel and speech loudness control in two men with multiple disabilities, respectively. The men were involved in two single-case studies. In Study I, the technology involved a microprocessor, two photocells, and a verbal feedback device. The man received verbal alerting/feedback when the photocells spotted an obstacle in front of him. In Study II, the technology involved a sound-detecting unit connected to a throat and an airborne microphone, and to a vibration device. Vibration occurred when the man's speech loudness exceeded a preset level. The man included in Study I succeeded in using the automatic feedback in substitution of caregivers' alerting/feedback for safe travel. The man of Study II used the automatic feedback to successfully reduce his speech loudness. Automatic feedback can be highly effective in helping persons with multiple disabilities improve their travel and speech performance.
Automatic detection of articulation disorders in children with cleft lip and palate.
Maier, Andreas; Hönig, Florian; Bocklet, Tobias; Nöth, Elmar; Stelzle, Florian; Nkenke, Emeka; Schuster, Maria
2009-11-01
Speech of children with cleft lip and palate (CLP) is sometimes still disordered even after adequate surgical and nonsurgical therapies. Such speech shows complex articulation disorders, which are usually assessed perceptually, consuming time and manpower. Hence, there is a need for an easy to apply and reliable automatic method. To create a reference for an automatic system, speech data of 58 children with CLP were assessed perceptually by experienced speech therapists for characteristic phonetic disorders at the phoneme level. The first part of the article aims to detect such characteristics by a semiautomatic procedure and the second to evaluate a fully automatic, thus simple, procedure. The methods are based on a combination of speech processing algorithms. The semiautomatic method achieves moderate to good agreement (kappa approximately 0.6) for the detection of all phonetic disorders. On a speaker level, significant correlations between the perceptual evaluation and the automatic system of 0.89 are obtained. The fully automatic system yields a correlation on the speaker level of 0.81 to the perceptual evaluation. This correlation is in the range of the inter-rater correlation of the listeners. The automatic speech evaluation is able to detect phonetic disorders at an experts'level without any additional human postprocessing.
NASA Astrophysics Data System (ADS)
Scharenborg, Odette; ten Bosch, Louis; Boves, Lou; Norris, Dennis
2003-12-01
This letter evaluates potential benefits of combining human speech recognition (HSR) and automatic speech recognition by building a joint model of an automatic phone recognizer (APR) and a computational model of HSR, viz., Shortlist [Norris, Cognition 52, 189-234 (1994)]. Experiments based on ``real-life'' speech highlight critical limitations posed by some of the simplifying assumptions made in models of human speech recognition. These limitations could be overcome by avoiding hard phone decisions at the output side of the APR, and by using a match between the input and the internal lexicon that flexibly copes with deviations from canonical phonemic representations.
Zekveld, Adriana A; Kramer, Sophia E; Kessens, Judith M; Vlaming, Marcel S M G; Houtgast, Tammo
2009-04-01
The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system, improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that speech comprehension in noise by young listeners with normal hearing improves when presenting partly incorrect, automatically generated subtitles. The current study focused on the effects of age, hearing loss, visual working memory capacity, and linguistic skills on the benefit obtained from automatically generated subtitles during listening to speech in noise. In order to investigate the effects of age and hearing loss, three groups of participants were included: 22 young persons with normal hearing (YNH, mean age = 21 years), 22 middle-aged adults with normal hearing (MA-NH, mean age = 55 years) and 30 middle-aged adults with hearing impairment (MA-HI, mean age = 57 years). The benefit from automatic subtitling was measured by Speech Reception Threshold (SRT) tests (Plomp & Mimpen, 1979). Both unimodal auditory and bimodal audiovisual SRT tests were performed. In the audiovisual tests, the subtitles were presented simultaneously with the speech, whereas in the auditory test, only speech was presented. The difference between the auditory and audiovisual SRT was defined as the audiovisual benefit. Participants additionally rated the listening effort. We examined the influences of ASR accuracy level and text delay on the audiovisual benefit and the listening effort using a repeated measures General Linear Model analysis. In a correlation analysis, we evaluated the relationships between age, auditory SRT, visual working memory capacity and the audiovisual benefit and listening effort. The automatically generated subtitles improved speech comprehension in noise for all ASR accuracies and delays covered by the current study. Higher ASR accuracy levels resulted in more benefit obtained from the subtitles. Speech comprehension improved even for relatively low ASR accuracy levels; for example, participants obtained about 2 dB SNR audiovisual benefit for ASR accuracies around 74%. Delaying the presentation of the text reduced the benefit and increased the listening effort. Participants with relatively low unimodal speech comprehension obtained greater benefit from the subtitles than participants with better unimodal speech comprehension. We observed an age-related decline in the working-memory capacity of the listeners with normal hearing. A higher age and a lower working memory capacity were associated with increased effort required to use the subtitles to improve speech comprehension. Participants were able to use partly incorrect and delayed subtitles to increase their comprehension of speech in noise, regardless of age and hearing loss. This supports the further development and evaluation of an assistive listening system that displays automatically recognized speech to aid speech comprehension by listeners with hearing impairment.
Native language shapes automatic neural processing of speech.
Intartaglia, Bastien; White-Schwoch, Travis; Meunier, Christine; Roman, Stéphane; Kraus, Nina; Schön, Daniele
2016-08-01
The development of the phoneme inventory is driven by the acoustic-phonetic properties of one's native language. Neural representation of speech is known to be shaped by language experience, as indexed by cortical responses, and recent studies suggest that subcortical processing also exhibits this attunement to native language. However, most work to date has focused on the differences between tonal and non-tonal languages that use pitch variations to convey phonemic categories. The aim of this cross-language study is to determine whether subcortical encoding of speech sounds is sensitive to language experience by comparing native speakers of two non-tonal languages (French and English). We hypothesized that neural representations would be more robust and fine-grained for speech sounds that belong to the native phonemic inventory of the listener, and especially for the dimensions that are phonetically relevant to the listener such as high frequency components. We recorded neural responses of American English and French native speakers, listening to natural syllables of both languages. Results showed that, independently of the stimulus, American participants exhibited greater neural representation of the fundamental frequency compared to French participants, consistent with the importance of the fundamental frequency to convey stress patterns in English. Furthermore, participants showed more robust encoding and more precise spectral representations of the first formant when listening to the syllable of their native language as compared to non-native language. These results align with the hypothesis that language experience shapes sensory processing of speech and that this plasticity occurs as a function of what is meaningful to a listener. Copyright © 2016 Elsevier Ltd. All rights reserved.
Automatic Speech Recognition from Neural Signals: A Focused Review.
Herff, Christian; Schultz, Tanja
2016-01-01
Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to either loud environments, bothering bystanders or incapabilities to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable to not speak but to simply envision oneself to say words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefor better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques used from neural signals, we discuss the Brain-to-text system.
Military applications of automatic speech recognition and future requirements
NASA Technical Reports Server (NTRS)
Beek, Bruno; Cupples, Edward J.
1977-01-01
An updated summary of the state-of-the-art of automatic speech recognition and its relevance to military applications is provided. A number of potential systems for military applications are under development. These include: (1) digital narrowband communication systems; (2) automatic speech verification; (3) on-line cartographic processing unit; (4) word recognition for militarized tactical data system; and (5) voice recognition and synthesis for aircraft cockpit.
Automatic speech recognition technology development at ITT Defense Communications Division
NASA Technical Reports Server (NTRS)
White, George M.
1977-01-01
An assessment of the applications of automatic speech recognition to defense communication systems is presented. Future research efforts include investigations into the following areas: (1) dynamic programming; (2) recognition of speech degraded by noise; (3) speaker independent recognition; (4) large vocabulary recognition; (5) word spotting and continuous speech recognition; and (6) isolated word recognition.
Automatic Method of Pause Measurement for Normal and Dysarthric Speech
ERIC Educational Resources Information Center
Rosen, Kristin; Murdoch, Bruce; Folker, Joanne; Vogel, Adam; Cahill, Louise; Delatycki, Martin; Corben, Louise
2010-01-01
This study proposes an automatic method for the detection of pauses and identification of pause types in conversational speech for the purpose of measuring the effects of Friedreich's Ataxia (FRDA) on speech. Speech samples of [approximately] 3 minutes were recorded from 13 speakers with FRDA and 18 healthy controls. Pauses were measured from the…
Rapid and automatic speech-specific learning mechanism in human neocortex.
Kimppa, Lilli; Kujala, Teija; Leminen, Alina; Vainio, Martti; Shtyrov, Yury
2015-09-01
A unique feature of human communication system is our ability to rapidly acquire new words and build large vocabularies. However, its neurobiological foundations remain largely unknown. In an electrophysiological study optimally designed to probe this rapid formation of new word memory circuits, we employed acoustically controlled novel word-forms incorporating native and non-native speech sounds, while manipulating the subjects' attention on the input. We found a robust index of neurolexical memory-trace formation: a rapid enhancement of the brain's activation elicited by novel words during a short (~30min) perceptual exposure, underpinned by fronto-temporal cortical networks, and, importantly, correlated with behavioural learning outcomes. Crucially, this neural memory trace build-up took place regardless of focused attention on the input or any pre-existing or learnt semantics. Furthermore, it was found only for stimuli with native-language phonology, but not for acoustically closely matching non-native words. These findings demonstrate a specialised cortical mechanism for rapid, automatic and phonology-dependent formation of neural word memory circuits. Copyright © 2015. Published by Elsevier Inc.
ERIC Educational Resources Information Center
Fontan, Lionel; Ferrané, Isabelle; Farinas, Jérôme; Pinquier, Julien; Tardieu, Julien; Magnen, Cynthia; Gaillard, Pascal; Aumont, Xavier; Füllgrabe, Christian
2017-01-01
Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist…
Speech Processing and Recognition (SPaRe)
2011-01-01
results in the areas of automatic speech recognition (ASR), speech processing, machine translation (MT), natural language processing ( NLP ), and...Processing ( NLP ), Information Retrieval (IR) 16. SECURITY CLASSIFICATION OF: UNCLASSIFED 17. LIMITATION OF ABSTRACT 18. NUMBER OF PAGES 19a. NAME...Figure 9, the IOC was only expected to provide document submission and search; automatic speech recognition (ASR) for English, Spanish, Arabic , and
ERIC Educational Resources Information Center
Chen, Howard Hao-Jan
2011-01-01
Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…
Leveraging Automatic Speech Recognition Errors to Detect Challenging Speech Segments in TED Talks
ERIC Educational Resources Information Center
Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya
2016-01-01
This study investigates the use of Automatic Speech Recognition (ASR) systems to epitomize second language (L2) listeners' problems in perception of TED talks. ASR-generated transcripts of videos often involve recognition errors, which may indicate difficult segments for L2 listeners. This paper aims to discover the root-causes of the ASR errors…
ERIC Educational Resources Information Center
Savage, Robert S.; Frederickson, Norah; Goodwin, Roz; Patni, Ulla; Smith, Nicola; Tuersley, Louise
2005-01-01
In this article, we explore the relationship between rapid automatized naming (RAN) and other cognitive processes among below-average, average, and above-average readers and spellers. Nonsense word reading, phonological awareness, RAN, automaticity of balance, speech perception, and verbal short-term and working memory were measured. Factor…
Assessing Children's Home Language Environments Using Automatic Speech Recognition Technology
ERIC Educational Resources Information Center
Greenwood, Charles R.; Thiemann-Bourque, Kathy; Walker, Dale; Buzhardt, Jay; Gilkerson, Jill
2011-01-01
The purpose of this research was to replicate and extend some of the findings of Hart and Risley using automatic speech processing instead of human transcription of language samples. The long-term goal of this work is to make the current approach to speech processing possible by researchers and clinicians working on a daily basis with families and…
ERIC Educational Resources Information Center
Wald, Mike
2006-01-01
The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents students and staff to provide captioning of speech online or in classrooms for deaf or hard of hearing students and assist blind, visually impaired or dyslexic learners to read and search…
NASA Astrophysics Data System (ADS)
Karam, Walid; Mokbel, Chafic; Greige, Hanna; Chollet, Gerard
2006-05-01
A GMM based audio visual speaker verification system is described and an Active Appearance Model with a linear speaker transformation system is used to evaluate the robustness of the verification. An Active Appearance Model (AAM) is used to automatically locate and track a speaker's face in a video recording. A Gaussian Mixture Model (GMM) based classifier (BECARS) is used for face verification. GMM training and testing is accomplished on DCT based extracted features of the detected faces. On the audio side, speech features are extracted and used for speaker verification with the GMM based classifier. Fusion of both audio and video modalities for audio visual speaker verification is compared with face verification and speaker verification systems. To improve the robustness of the multimodal biometric identity verification system, an audio visual imposture system is envisioned. It consists of an automatic voice transformation technique that an impostor may use to assume the identity of an authorized client. Features of the transformed voice are then combined with the corresponding appearance features and fed into the GMM based system BECARS for training. An attempt is made to increase the acceptance rate of the impostor and to analyzing the robustness of the verification system. Experiments are being conducted on the BANCA database, with a prospect of experimenting on the newly developed PDAtabase developed within the scope of the SecurePhone project.
Multilevel Analysis in Analyzing Speech Data
ERIC Educational Resources Information Center
Guddattu, Vasudeva; Krishna, Y.
2011-01-01
The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…
Vitikainen, Anne-Mari; Mäkelä, Elina; Lioumis, Pantelis; Jousmäki, Veikko; Mäkelä, Jyrki P
2015-09-30
The use of navigated repetitive transcranial magnetic stimulation (rTMS) in mapping of speech-related brain areas has recently shown to be useful in preoperative workflow of epilepsy and tumor patients. However, substantial inter- and intraobserver variability and non-optimal replicability of the rTMS results have been reported, and a need for additional development of the methodology is recognized. In TMS motor cortex mappings the evoked responses can be quantitatively monitored by electromyographic recordings; however, no such easily available setup exists for speech mappings. We present an accelerometer-based setup for detection of vocalization-related larynx vibrations combined with an automatic routine for voice onset detection for rTMS speech mapping applying naming. The results produced by the automatic routine were compared with the manually reviewed video-recordings. The new method was applied in the routine navigated rTMS speech mapping for 12 consecutive patients during preoperative workup for epilepsy or tumor surgery. The automatic routine correctly detected 96% of the voice onsets, resulting in 96% sensitivity and 71% specificity. Majority (63%) of the misdetections were related to visible throat movements, extra voices before the response, or delayed naming of the previous stimuli. The no-response errors were correctly detected in 88% of events. The proposed setup for automatic detection of voice onsets provides quantitative additional data for analysis of the rTMS-induced speech response modifications. The objectively defined speech response latencies increase the repeatability, reliability and stratification of the rTMS results. Copyright © 2015 Elsevier B.V. All rights reserved.
ERIC Educational Resources Information Center
Pattamadilok, Chotiga; Nelis, Aubéline; Kolinsky, Régine
2014-01-01
Studies on proficient readers showed that speech processing is affected by knowledge of the orthographic code. Yet, the automaticity of the orthographic influence depends on task demand. Here, we addressed this automaticity issue in normal and dyslexic adult readers by comparing the orthographic effects obtained in two speech processing tasks that…
Nixon, C; Anderson, T; Morris, L; McCavitt, A; McKinley, R; Yeager, D; McDaniel, M
1998-11-01
The intelligibility of female and male speech is equivalent under most ordinary living conditions. However, due to small differences between their acoustic speech signals, called speech spectra, one can be more or less intelligible than the other in certain situations such as high levels of noise. Anecdotal information, supported by some empirical observations, suggests that some of the high intensity noise spectra of military aircraft cockpits may degrade the intelligibility of female speech more than that of male speech. In an applied research study, the intelligibility of female and male speech was measured in several high level aircraft cockpit noise conditions experienced in military aviation. In Part I, (Nixon CW, et al. Aviat Space Environ Med 1998; 69:675-83) female speech intelligibility measured in the spectra and levels of aircraft cockpit noises and with noise-canceling microphones was lower than that of the male speech in all conditions. However, the differences were small and only those at some of the highest noise levels were significant. Although speech intelligibility of both genders was acceptable during normal cruise noises, improvements are required in most of the highest levels of noise created during maximum aircraft operating conditions. These results are discussed in a Part I technical report. This Part II report examines the intelligibility in the same aircraft cockpit noises of vocoded female and male speech and the accuracy with which female and male speech in some of the cockpit noises were understood by automatic speech recognition systems. The intelligibility of vocoded female speech was generally the same as that of vocoded male speech. No significant differences were measured between the recognition accuracy of male and female speech by the automatic speech recognition systems. The intelligibility of female and male speech was equivalent for these conditions.
Haderlein, Tino; Döllinger, Michael; Matoušek, Václav; Nöth, Elmar
2016-10-01
Automatic voice assessment is often performed using sustained vowels. In contrast, speech analysis of read-out texts can be applied to voice and speech assessment. Automatic speech recognition and prosodic analysis were used to find regression formulae between automatic and perceptual assessment of four voice and four speech criteria. The regression was trained with 21 men and 62 women (average age 49.2 years) and tested with another set of 24 men and 49 women (48.3 years), all suffering from chronic hoarseness. They read the text 'Der Nordwind und die Sonne' ('The North Wind and the Sun'). Five voice and speech therapists evaluated the data on 5-point Likert scales. Ten prosodic and recognition accuracy measures (features) were identified which describe all the examined criteria. Inter-rater correlation within the expert group was between r = 0.63 for the criterion 'match of breath and sense units' and r = 0.87 for the overall voice quality. Human-machine correlation was between r = 0.40 for the match of breath and sense units and r = 0.82 for intelligibility. The perceptual ratings of different criteria were highly correlated with each other. Likewise, the feature sets modeling the criteria were very similar. The automatic method is suitable for assessing chronic hoarseness in general and for subgroups of functional and organic dysphonia. In its current version, it is almost as reliable as a randomly picked rater from a group of voice and speech therapists.
Tóth, László; Hoffmann, Ildikó; Gosztolya, Gábor; Vincze, Veronika; Szatlóczki, Gréta; Bánréti, Zoltán; Pákáski, Magdolna; Kálmán, János
2018-01-01
Background: Even today the reliable diagnosis of the prodromal stages of Alzheimer’s disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive de-cline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method which is based on the analysis of spontaneous speech production during performing a memory task. In the future, this can form the basis of an Internet-based interactive screening software for the recognition of MCI. Methods: Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. The provoked spontaneous speech by asking the patients to recall the content of 2 short black and white films (one direct, one delayed), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech sig-nals, first manually (using the Praat software), and then automatically, with an automatic speech recogni-tion (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. Results: The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process – that is, using the ASR-based features in combination with machine learning - was able to separate the two classes with an F1-score of 78.8%. Conclusion: The temporal analysis of spontaneous speech can be exploited in implementing a new, auto-matic detection-based tool for screening MCI for the community. PMID:29165085
Toth, Laszlo; Hoffmann, Ildiko; Gosztolya, Gabor; Vincze, Veronika; Szatloczki, Greta; Banreti, Zoltan; Pakaski, Magdolna; Kalman, Janos
2018-01-01
Even today the reliable diagnosis of the prodromal stages of Alzheimer's disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method which is based on the analysis of spontaneous speech production during performing a memory task. In the future, this can form the basis of an Internet-based interactive screening software for the recognition of MCI. Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. The provoked spontaneous speech by asking the patients to recall the content of 2 short black and white films (one direct, one delayed), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process - that is, using the ASR-based features in combination with machine learning - was able to separate the two classes with an F1-score of 78.8%. The temporal analysis of spontaneous speech can be exploited in implementing a new, automatic detection-based tool for screening MCI for the community. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
On the context dependency of implicit self-esteem in social anxiety disorder.
Hiller, Thomas S; Steffens, Melanie C; Ritter, Viktoria; Stangier, Ulrich
2017-12-01
Cognitive models assume that negative self-evaluations are automatically activated in individuals with Social Anxiety Disorder (SAD) during social situations, increasing their individual level of anxiety. This study examined automatic self-evaluations (i.e., implicit self-esteem) and state anxiety in a group of individuals with SAD (n = 45) and a non-clinical comparison group (NC; n = 46). Participants were randomly assigned to either a speech condition with social threat induction (giving an impromptu speech) or to a no-speech condition without social threat induction. We measured implicit self-esteem with an Implicit Association Test (IAT). Implicit self-esteem differed significantly between SAD and NC groups under the speech condition but not under the no-speech condition. The SAD group showed lower implicit self-esteem than the NC group under the speech-condition. State anxiety was significantly higher under the speech condition than under the no-speech condition in the SAD group but not in the NC group. Mediation analyses supported the idea that for the SAD group, the effect of experimental condition on state anxiety was mediated by implicit self-esteem. The causal relation between implicit self-esteem and state anxiety could not be determined. The findings corroborate hypotheses derived from cognitive models of SAD: Automatic self-evaluations were negatively biased in individuals with SAD facing social threat and showed an inverse relationship to levels of state anxiety. However, automatic self-evaluations in individuals with SAD can be unbiased (similar to NC) in situations without social threat. Copyright © 2017 Elsevier Ltd. All rights reserved.
Automatic Title Generation for Spoken Broadcast News
2001-01-01
degrades much less with speech -recognized transcripts. Meanwhile, even though KNN performance not as well as TF.IDF and NBL in terms of F1 metric, it...test corpus of 1006 broadcast news documents, comparing the results over manual transcription to the results over automatically recognized speech . We...use both F1 and the average number of correct title words in the correct order as metric. Overall, the results show that title generation for speech
Popular song and lyrics synchronization and its application to music information retrieval
NASA Astrophysics Data System (ADS)
Chen, Kai; Gao, Sheng; Zhu, Yongwei; Sun, Qibin
2006-01-01
An automatic synchronization system of the popular song and its lyrics is presented in the paper. The system includes two main components: a) automatically detecting vocal/non-vocal in the audio signal and b) automatically aligning the acoustic signal of the song with its lyric using speech recognition techniques and positioning the boundaries of the lyrics in its acoustic realization at the multiple levels simultaneously (e.g. the word / syllable level and phrase level). The GMM models and a set of HMM-based acoustic model units are carefully designed and trained for the detection and alignment. To eliminate the severe mismatch due to the diversity of musical signal and sparse training data available, the unsupervised adaptation technique such as maximum likelihood linear regression (MLLR) is exploited for tailoring the models to the real environment, which improves robustness of the synchronization system. To further reduce the effect of the missed non-vocal music on alignment, a novel grammar net is build to direct the alignment. As we know, this is the first automatic synchronization system only based on the low-level acoustic feature such as MFCC. We evaluate the system on a Chinese song dataset collecting from 3 popular singers. We obtain 76.1% for the boundary accuracy at the syllable level (BAS) and 81.5% for the boundary accuracy at the phrase level (BAP) using fully automatic vocal/non-vocal detection and alignment. The synchronization system has many applications such as multi-modality (audio and textual) content-based popular song browsing and retrieval. Through the study, we would like to open up the discussion of some challenging problems when developing a robust synchronization system for largescale database.
Automatic lip reading by using multimodal visual features
NASA Astrophysics Data System (ADS)
Takahashi, Shohei; Ohya, Jun
2013-12-01
Since long time ago, speech recognition has been researched, though it does not work well in noisy places such as in the car or in the train. In addition, people with hearing-impaired or difficulties in hearing cannot receive benefits from speech recognition. To recognize the speech automatically, visual information is also important. People understand speeches from not only audio information, but also visual information such as temporal changes in the lip shape. A vision based speech recognition method could work well in noisy places, and could be useful also for people with hearing disabilities. In this paper, we propose an automatic lip-reading method for recognizing the speech by using multimodal visual information without using any audio information such as speech recognition. First, the ASM (Active Shape Model) is used to track and detect the face and lip in a video sequence. Second, the shape, optical flow and spatial frequencies of the lip features are extracted from the lip detected by ASM. Next, the extracted multimodal features are ordered chronologically so that Support Vector Machine is performed in order to learn and classify the spoken words. Experiments for classifying several words show promising results of this proposed method.
An investigation of articulatory setting using real-time magnetic resonance imaging
Ramanarayanan, Vikram; Goldstein, Louis; Byrd, Dani; Narayanan, Shrikanth S.
2013-01-01
This paper presents an automatic procedure to analyze articulatory setting in speech production using real-time magnetic resonance imaging of the moving human vocal tract. The procedure extracts frames corresponding to inter-speech pauses, speech-ready intervals and absolute rest intervals from magnetic resonance imaging sequences of read and spontaneous speech elicited from five healthy speakers of American English and uses automatically extracted image features to quantify vocal tract posture during these intervals. Statistical analyses show significant differences between vocal tract postures adopted during inter-speech pauses and those at absolute rest before speech; the latter also exhibits a greater variability in the adopted postures. In addition, the articulatory settings adopted during inter-speech pauses in read and spontaneous speech are distinct. The results suggest that adopted vocal tract postures differ on average during rest positions, ready positions and inter-speech pauses, and might, in that order, involve an increasing degree of active control by the cognitive speech planning mechanism. PMID:23862826
An acoustic feature-based similarity scoring system for speech rehabilitation assistance.
Syauqy, Dahnial; Wu, Chao-Min; Setyawati, Onny
2016-08-01
The purpose of this study is to develop a tool to assist speech therapy and rehabilitation, which focused on automatic scoring based on the comparison of the patient's speech with another normal speech on several aspects including pitch, vowel, voiced-unvoiced segments, strident fricative and sound intensity. The pitch estimation employed the use of cepstrum-based algorithm for its robustness; the vowel classification used multilayer perceptron (MLP) to classify vowel from pitch and formants; and the strident fricative detection was based on the major peak spectral intensity, location and the pitch existence in the segment. In order to evaluate the performance of the system, this study analyzed eight patient's speech recordings (four males, four females; 4-58-years-old), which had been recorded in previous study in cooperation with Taipei Veterans General Hospital and Taoyuan General Hospital. The experiment result on pitch algorithm showed that the cepstrum method had 5.3% of gross pitch error from a total of 2086 frames. On the vowel classification algorithm, MLP method provided 93% accuracy (men), 87% (women) and 84% (children). In total, the overall results showed that 156 tool's grading results (81%) were consistent compared to 192 audio and visual observations done by four experienced respondents. Implication for Rehabilitation Difficulties in communication may limit the ability of a person to transfer and exchange information. The fact that speech is one of the primary means of communication has encouraged the needs of speech diagnosis and rehabilitation. The advances of technology in computer-assisted speech therapy (CAST) improve the quality, time efficiency of the diagnosis and treatment of the disorders. The present study attempted to develop tool to assist speech therapy and rehabilitation, which provided simple interface to let the assessment be done even by the patient himself without the need of particular knowledge of speech processing while at the same time, also provided further deep analysis of the speech, which can be useful for the speech therapist.
Time-Warp–Invariant Neuronal Processing
Gütig, Robert; Sompolinsky, Haim
2009-01-01
Fluctuations in the temporal durations of sensory signals constitute a major source of variability within natural stimulus ensembles. The neuronal mechanisms through which sensory systems can stabilize perception against such fluctuations are largely unknown. An intriguing instantiation of such robustness occurs in human speech perception, which relies critically on temporal acoustic cues that are embedded in signals with highly variable duration. Across different instances of natural speech, auditory cues can undergo temporal warping that ranges from 2-fold compression to 2-fold dilation without significant perceptual impairment. Here, we report that time-warp–invariant neuronal processing can be subserved by the shunting action of synaptic conductances that automatically rescales the effective integration time of postsynaptic neurons. We propose a novel spike-based learning rule for synaptic conductances that adjusts the degree of synaptic shunting to the temporal processing requirements of a given task. Applying this general biophysical mechanism to the example of speech processing, we propose a neuronal network model for time-warp–invariant word discrimination and demonstrate its excellent performance on a standard benchmark speech-recognition task. Our results demonstrate the important functional role of synaptic conductances in spike-based neuronal information processing and learning. The biophysics of temporal integration at neuronal membranes can endow sensory pathways with powerful time-warp–invariant computational capabilities. PMID:19582146
Kraaijenga, S A C; Oskam, I M; van Son, R J J H; Hamming-Vrieze, O; Hilgers, F J M; van den Brekel, M W M; van der Molen, L
2016-04-01
Assessment of long-term objective and subjective voice, speech, articulation, and quality of life in patients with head and neck cancer (HNC) treated with concurrent chemoradiotherapy (CRT) for advanced, stage IV disease. Twenty-two disease-free survivors, treated with cisplatin-based CRT for inoperable HNC (1999-2004), were evaluated at 10-years post-treatment. A standard Dutch text was recorded. Perceptual analysis of voice, speech, and articulation was conducted by two expert listeners (SLPs). Also an experimental expert system based on automatic speech recognition was used. Patients' perception of voice and speech and related quality of life was assessed with the Voice Handicap Index (VHI) and Speech Handicap Index (SHI) questionnaires. At a median follow-up of 11-years, perceptual evaluation showed abnormal scores in up to 64% of cases, depending on the outcome parameter analyzed. Automatic assessment of voice and speech parameters correlated moderate to strong with perceptual outcome scores. Patient-reported problems with voice (VHI>15) and speech (SHI>6) in daily life were present in 68% and 77% of patients, respectively. Patients treated with IMRT showed significantly less impairment compared to those treated with conventional radiotherapy. More than 10-years after organ-preservation treatment, voice and speech problems are common in this patient cohort, as assessed with perceptual evaluation, automatic speech recognition, and with validated structured questionnaires. There were fewer complaints in patients treated with IMRT than with conventional radiotherapy. Copyright © 2016 Elsevier Ltd. All rights reserved.
Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review
NASA Astrophysics Data System (ADS)
Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH
2017-09-01
This paper reviews the state-of-the-art an automatic speech recognition (ASR) based approach for speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR based solutions are increasingly being researched for speech and language therapy. ASR is a technology that transfers human speech into transcript text by matching with the system's library. This is particularly useful in speech rehabilitation therapies as they provide accurate, real-time evaluation for speech input from an individual with speech disorder. ASR based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback response to their mistakes. However, the accuracy of ASR is dependent on many factors such as, phoneme recognition, speech continuity, speaker and environmental differences as well as our depth of knowledge on human language understanding. Hence, the review examines recent development of ASR technologies and its performance for individuals with speech and language disorders.
Speech fluency profile on different tasks for individuals with Parkinson's disease.
Juste, Fabiola Staróbole; Andrade, Claudia Regina Furquim de
2017-07-20
To characterize the speech fluency profile of patients with Parkinson's disease. Study participants were 40 individuals of both genders aged 40 to 80 years divided into 2 groups: Research Group - RG (20 individuals with diagnosis of Parkinson's disease) and Control Group - CG (20 individuals with no communication or neurological disorders). For all of the participants, three speech samples involving different tasks were collected: monologue, individual reading, and automatic speech. The RG presented a significant larger number of speech disruptions, both stuttering-like and typical dysfluencies, and higher percentage of speech discontinuity in the monologue and individual reading tasks compared with the CG. Both groups presented reduced number of speech disruptions (stuttering-like and typical dysfluencies) in the automatic speech task; the groups presented similar performance in this task. Regarding speech rate, individuals in the RG presented lower number of words and syllables per minute compared with those in the CG in all speech tasks. Participants of the RG presented altered parameters of speech fluency compared with those of the CG; however, this change in fluency cannot be considered a stuttering disorder.
Automatic speech recognition in air traffic control
NASA Technical Reports Server (NTRS)
Karlsson, Joakim
1990-01-01
Automatic Speech Recognition (ASR) technology and its application to the Air Traffic Control system are described. The advantages of applying ASR to Air Traffic Control, as well as criteria for choosing a suitable ASR system are presented. Results from previous research and directions for future work at the Flight Transportation Laboratory are outlined.
Automatic Speech Recognition: Reliability and Pedagogical Implications for Teaching Pronunciation
ERIC Educational Resources Information Center
Kim, In-Seok
2006-01-01
This study examines the reliability of automatic speech recognition (ASR) software used to teach English pronunciation, focusing on one particular piece of software, "FluSpeak, as a typical example." Thirty-six Korean English as a Foreign Language (EFL) college students participated in an experiment in which they listened to 15 sentences…
Automatic Speech Recognition Technology as an Effective Means for Teaching Pronunciation
ERIC Educational Resources Information Center
Elimat, Amal Khalil; AbuSeileek, Ali Farhan
2014-01-01
This study aimed to explore the effect of using automatic speech recognition technology (ASR) on the third grade EFL students' performance in pronunciation, whether teaching pronunciation through ASR is better than regular instruction, and the most effective teaching technique (individual work, pair work, or group work) in teaching pronunciation…
Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.
ERIC Educational Resources Information Center
Makhoul, John I.
The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. A fixed vocabulary of 55 syllables is used in the recognition system which contains 11 stops and fricatives and five tense vowels. The results of an experiment on speaker…
Investigating Prompt Difficulty in an Automatically Scored Speaking Performance Assessment
ERIC Educational Resources Information Center
Cox, Troy L.
2013-01-01
Speaking assessments for second language learners have traditionally been expensive to administer because of the cost of rating the speech samples. To reduce the cost, many researchers are investigating the potential of using automatic speech recognition (ASR) as a means to score examinee responses to open-ended prompts. This study examined the…
Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study
ERIC Educational Resources Information Center
van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer
2016-01-01
The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…
Searchfield, Grant D; Linford, Tania; Kobayashi, Kei; Crowhen, David; Latzel, Matthias
2018-03-01
To compare preference for and performance of manually selected programmes to an automatic sound classifier, the Phonak AutoSense OS. A single blind repeated measures study. Participants were fit with Phonak Virto V90 ITE aids; preferences for different listening programmes were compared across four different sound scenarios (speech in: quiet, noise, loud noise and a car). Following a 4-week trial preferences were reassessed and the users preferred programme was compared to the automatic classifier for sound quality and hearing in noise (HINT test) using a 12 loudspeaker array. Twenty-five participants with symmetrical moderate-severe sensorineural hearing loss. Participant preferences of manual programme for scenarios varied considerably between and within sessions. A HINT Speech Reception Threshold (SRT) advantage was observed for the automatic classifier over participant's manual selection for speech in quiet, loud noise and car noise. Sound quality ratings were similar for both manual and automatic selections. The use of a sound classifier is a viable alternative to manual programme selection.
Howell, Peter; Sackin, Stevie; Glenn, Kazan
2007-01-01
This program of work is intended to develop automatic recognition procedures to locate and assess stuttered dysfluencies. This and the following article together, develop and test recognizers for repetitions and prolongations. The automatic recognizers classify the speech in two stages: In the first, the speech is segmented and in the second the segments are categorized. The units that are segmented are words. Here assessments by human judges on the speech of 12 children who stutter are described using a corresponding procedure. The accuracy of word boundary placement across judges, categorization of the words as fluent, repetition or prolongation, and duration of the different fluency categories are reported. These measures allow reliable instances of repetitions and prolongations to be selected for training and assessing the recognizers in the subsequent paper. PMID:9328878
NASA Astrophysics Data System (ADS)
Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.
2009-12-01
This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.
Vasconcelos, Maria J M; Ventura, Sandra M R; Freitas, Diamantino R S; Tavares, João Manuel R S
2012-03-01
The morphological and dynamic characterisation of the vocal tract during speech production has been gaining greater attention due to the motivation of the latest improvements in magnetic resonance (MR) imaging; namely, with the use of higher magnetic fields, such as 3.0 Tesla. In this work, the automatic study of the vocal tract from 3.0 Tesla MR images was assessed through the application of statistical deformable models. Therefore, the primary goal focused on the analysis of the shape of the vocal tract during the articulation of European Portuguese sounds, followed by the evaluation of the results concerning the automatic segmentation, i.e. identification of the vocal tract in new MR images. In what concerns speech production, this is the first attempt to automatically characterise and reconstruct the vocal tract shape of 3.0 Tesla MR images by using deformable models; particularly, by using active and appearance shape models. The achieved results clearly evidence the adequacy and advantage of the automatic analysis of the 3.0 Tesla MR images of these deformable models in order to extract the vocal tract shape and assess the involved articulatory movements. These achievements are mostly required, for example, for a better knowledge of speech production, mainly of patients suffering from articulatory disorders, and to build enhanced speech synthesizer models.
Hantke, Simone; Weninger, Felix; Kurle, Richard; Ringeval, Fabien; Batliner, Anton; Mousa, Amr El-Desoky; Schuller, Björn
2016-01-01
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i. e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i. e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i. e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient. PMID:27176486
Sridhar, Vivek Kumar Rangarajan; Bangalore, Srinivas; Narayanan, Shrikanth S.
2009-01-01
In this paper, we describe a maximum entropy-based automatic prosody labeling framework that exploits both language and speech information. We apply the proposed framework to both prominence and phrase structure detection within the Tones and Break Indices (ToBI) annotation scheme. Our framework utilizes novel syntactic features in the form of supertags and a quantized acoustic–prosodic feature representation that is similar to linear parameterizations of the prosodic contour. The proposed model is trained discriminatively and is robust in the selection of appropriate features for the task of prosody detection. The proposed maximum entropy acoustic–syntactic model achieves pitch accent and boundary tone detection accuracies of 86.0% and 93.1% on the Boston University Radio News corpus, and, 79.8% and 90.3% on the Boston Directions corpus. The phrase structure detection through prosodic break index labeling provides accuracies of 84% and 87% on the two corpora, respectively. The reported results are significantly better than previously reported results and demonstrate the strength of maximum entropy model in jointly modeling simple lexical, syntactic, and acoustic features for automatic prosody labeling. PMID:19603083
Speech Intelligibility Predicted from Neural Entrainment of the Speech Envelope.
Vanthornhout, Jonas; Decruy, Lien; Wouters, Jan; Simon, Jonathan Z; Francart, Tom
2018-04-01
Speech intelligibility is currently measured by scoring how well a person can identify a speech signal. The results of such behavioral measures reflect neural processing of the speech signal, but are also influenced by language processing, motivation, and memory. Very often, electrophysiological measures of hearing give insight in the neural processing of sound. However, in most methods, non-speech stimuli are used, making it hard to relate the results to behavioral measures of speech intelligibility. The use of natural running speech as a stimulus in electrophysiological measures of hearing is a paradigm shift which allows to bridge the gap between behavioral and electrophysiological measures. Here, by decoding the speech envelope from the electroencephalogram, and correlating it with the stimulus envelope, we demonstrate an electrophysiological measure of neural processing of running speech. We show that behaviorally measured speech intelligibility is strongly correlated with our electrophysiological measure. Our results pave the way towards an objective and automatic way of assessing neural processing of speech presented through auditory prostheses, reducing confounds such as attention and cognitive capabilities. We anticipate that our electrophysiological measure will allow better differential diagnosis of the auditory system, and will allow the development of closed-loop auditory prostheses that automatically adapt to individual users.
ERIC Educational Resources Information Center
Sidgi, Lina Fathi Sidig; Shaari, Ahmad Jelani
2017-01-01
The use of technology, such as computer-assisted language learning (CALL), is used in teaching and learning in the foreign language classrooms where it is most needed. One promising emerging technology that supports language learning is automatic speech recognition (ASR). Integrating such technology, especially in the instruction of pronunciation…
Using Automatic Speech Recognition Technology with Elicited Oral Response Testing
ERIC Educational Resources Information Center
Cox, Troy L.; Davies, Randall S.
2012-01-01
This study examined the use of automatic speech recognition (ASR) scored elicited oral response (EOR) tests to assess the speaking ability of English language learners. It also examined the relationship between ASR-scored EOR and other language proficiency measures and the ability of the ASR to rate speakers without bias to gender or native…
Automatic evaluation of hypernasality based on a cleft palate speech database.
He, Ling; Zhang, Jing; Liu, Qi; Yin, Heng; Lech, Margaret; Huang, Yunzhi
2015-05-01
The hypernasality is one of the most typical characteristics of cleft palate (CP) speech. The evaluation outcome of hypernasality grading decides the necessity of follow-up surgery. Currently, the evaluation of CP speech is carried out by experienced speech therapists. However, the result strongly depends on their clinical experience and subjective judgment. This work aims to propose an automatic evaluation system for hypernasality grading in CP speech. The database tested in this work is collected by the Hospital of Stomatology, Sichuan University, which has the largest number of CP patients in China. Based on the production process of hypernasality, source sound pulse and vocal tract filter features are presented. These features include pitch, the first and second energy amplified frequency bands, cepstrum based features, MFCC, short-time energy in the sub-bands features. These features combined with KNN classier are applied to automatically classify four grades of hypernasality: normal, mild, moderate and severe. The experiment results show that the proposed system achieves a good performance. The classification rates for four hypernasality grades reach up to 80.4%. The sensitivity of proposed features to the gender is also discussed.
Primary Progressive Speech Abulia.
Milano, Nicholas J; Heilman, Kenneth M
2015-01-01
Primary progressive aphasia (PPA) is a neurodegenerative disorder characterized by progressive language impairment. The three variants of PPA include the nonfluent/agrammatic, semantic, and logopenic types. The goal of this report is to describe two patients with a loss of speech initiation that was associated with bilateral medial frontal atrophy. Two patients with progressive speech deficits were evaluated and their examinations revealed a paucity of spontaneous speech; however their naming, repetition, reading, and writing were all normal. The patients had no evidence of agrammatism or apraxia of speech but did have impaired speech fluency. In addition to impaired production of propositional spontaneous speech, these patients had impaired production of automatic speech (e.g., reciting the Lord's Prayer) and singing. Structural brain imaging revealed bilateral medial frontal atrophy in both patients. These patients' language deficits are consistent with a PPA, but they are in the pattern of a dynamic aphasia. Whereas the signs-symptoms of dynamic aphasia have been previously described, to our knowledge these are the first cases associated with predominantly bilateral medial frontal atrophy that impaired both propositional and automatic speech. Thus, this profile may represent a new variant of PPA.
Use of Computer Speech Technologies To Enhance Learning.
ERIC Educational Resources Information Center
Ferrell, Joe
1999-01-01
Discusses the design of an innovative learning system that uses new technologies for the man-machine interface, incorporating a combination of Automatic Speech Recognition (ASR) and Text To Speech (TTS) synthesis. Highlights include using speech technologies to mimic the attributes of the ideal tutor and design features. (AEF)
Visual speech influences speech perception immediately but not automatically.
Mitterer, Holger; Reinisch, Eva
2017-02-01
Two experiments examined the time course of the use of auditory and visual speech cues to spoken word recognition using an eye-tracking paradigm. Results of the first experiment showed that the use of visual speech cues from lipreading is reduced if concurrently presented pictures require a division of attentional resources. This reduction was evident even when listeners' eye gaze was on the speaker rather than the (static) pictures. Experiment 2 used a deictic hand gesture to foster attention to the speaker. At the same time, the visual processing load was reduced by keeping the visual display constant over a fixed number of successive trials. Under these conditions, the visual speech cues from lipreading were used. Moreover, the eye-tracking data indicated that visual information was used immediately and even earlier than auditory information. In combination, these data indicate that visual speech cues are not used automatically, but if they are used, they are used immediately.
Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers
NASA Astrophysics Data System (ADS)
Caballero Morales, Santiago Omar; Cox, Stephen J.
2009-12-01
Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.
Automatic concept extraction from spoken medical reports.
Happe, André; Pouliquen, Bruno; Burgun, Anita; Cuggia, Marc; Le Beux, Pierre
2003-07-01
The objective of this project is to investigate methods whereby a combination of speech recognition and automated indexing methods substitute for current transcription and indexing practices. We based our study on existing speech recognition software programs and on NOMINDEX, a tool that extracts MeSH concepts from medical text in natural language and that is mainly based on a French medical lexicon and on the UMLS. For each document, the process consists of three steps: (1) dictation and digital audio recording, (2) speech recognition, (3) automatic indexing. The evaluation consisted of a comparison between the set of concepts extracted by NOMINDEX after the speech recognition phase and the set of keywords manually extracted from the initial document. The method was evaluated on a set of 28 patient discharge summaries extracted from the MENELAS corpus in French, corresponding to in-patients admitted for coronarography. The overall precision was 73% and the overall recall was 90%. Indexing errors were mainly due to word sense ambiguity and abbreviations. A specific issue was the fact that the standard French translation of MeSH terms lacks diacritics. A preliminary evaluation of speech recognition tools showed that the rate of accurate recognition was higher than 98%. Only 3% of the indexing errors were generated by inadequate speech recognition. We discuss several areas to focus on to improve this prototype. However, the very low rate of indexing errors due to speech recognition errors highlights the potential benefits of combining speech recognition techniques and automatic indexing.
ERIC Educational Resources Information Center
Saadatzi, Mohammad Nasser; Pennington, Robert C.; Welch, Karla C.; Graham, James H.; Scott, Renee E.
2017-01-01
In the current study, we examined the effects of an instructional package comprised of an autonomous pedagogical agent, automatic speech recognition, and constant time delay during the instruction of reading sight words aloud to young adults with autism spectrum disorder. We used a concurrent multiple baseline across participants design to…
Li, Kan; Príncipe, José C.
2018-01-01
This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime. PMID:29666568
Li, Kan; Príncipe, José C
2018-01-01
This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime.
A probabilistic union model with automatic order selection for noisy speech recognition.
Jancovic, P; Ming, J
2001-09-01
A critical issue in exploiting the potential of the sub-band-based approach to robust speech recognition is the method of combining the sub-band observations, for selecting the bands unaffected by noise. A new method for this purpose, i.e., the probabilistic union model, was recently introduced. This model has been shown to be capable of dealing with band-limited corruption, requiring no knowledge about the band position and statistical distribution of the noise. A parameter within the model, which we call its order, gives the best results when it equals the number of noisy bands. Since this information may not be available in practice, in this paper we introduce an automatic algorithm for selecting the order, based on the state duration pattern generated by the hidden Markov model (HMM). The algorithm has been tested on the TIDIGITS database corrupted by various types of additive band-limited noise with unknown noisy bands. The results have shown that the union model equipped with the new algorithm can achieve a recognition performance similar to that achieved when the number of noisy bands is known. The results show a very significant improvement over the traditional full-band model, without requiring prior information on either the position or the number of noisy bands. The principle of the algorithm for selecting the order based on state duration may also be applied to other sub-band combination methods.
Performance of wavelet analysis and neural networks for pathological voices identification
NASA Astrophysics Data System (ADS)
Salhi, Lotfi; Talbi, Mourad; Abid, Sabeur; Cherif, Adnane
2011-09-01
Within the medical environment, diverse techniques exist to assess the state of the voice of the patient. The inspection technique is inconvenient for a number of reasons, such as its high cost, the duration of the inspection, and above all, the fact that it is an invasive technique. This study focuses on a robust, rapid and accurate system for automatic identification of pathological voices. This system employs non-invasive, non-expensive and fully automated method based on hybrid approach: wavelet transform analysis and neural network classifier. First, we present the results obtained in our previous study while using classic feature parameters. These results allow visual identification of pathological voices. Second, quantified parameters drifting from the wavelet analysis are proposed to characterise the speech sample. On the other hand, a system of multilayer neural networks (MNNs) has been developed which carries out the automatic detection of pathological voices. The developed method was evaluated using voice database composed of recorded voice samples (continuous speech) from normophonic or dysphonic speakers. The dysphonic speakers were patients of a National Hospital 'RABTA' of Tunis Tunisia and a University Hospital in Brussels, Belgium. Experimental results indicate a success rate ranging between 75% and 98.61% for discrimination of normal and pathological voices using the proposed parameters and neural network classifier. We also compared the average classification rate based on the MNN, Gaussian mixture model and support vector machines.
The Suitability of Cloud-Based Speech Recognition Engines for Language Learning
ERIC Educational Resources Information Center
Daniels, Paul; Iwago, Koji
2017-01-01
As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with call software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines--Apple's…
NASA Technical Reports Server (NTRS)
Wolf, Jared J.
1977-01-01
The following research was discussed: (1) speech signal processing; (2) automatic speech recognition; (3) continuous speech understanding; (4) speaker recognition; (5) speech compression; (6) subjective and objective evaluation of speech communication system; (7) measurement of the intelligibility and quality of speech when degraded by noise or other masking stimuli; (8) speech synthesis; (9) instructional aids for second-language learning and for training of the deaf; and (10) investigation of speech correlates of psychological stress. Experimental psychology, control systems, and human factors engineering, which are often relevant to the proper design and operation of speech systems are described.
Robust Speaker Authentication Based on Combined Speech and Voiceprint Recognition
NASA Astrophysics Data System (ADS)
Malcangi, Mario
2009-08-01
Personal authentication is becoming increasingly important in many applications that have to protect proprietary data. Passwords and personal identification numbers (PINs) prove not to be robust enough to ensure that unauthorized people do not use them. Biometric authentication technology may offer a secure, convenient, accurate solution but sometimes fails due to its intrinsically fuzzy nature. This research aims to demonstrate that combining two basic speech processing methods, voiceprint identification and speech recognition, can provide a very high degree of robustness, especially if fuzzy decision logic is used.
The Speech multi features fusion perceptual hash algorithm based on tensor decomposition
NASA Astrophysics Data System (ADS)
Huang, Y. B.; Fan, M. H.; Zhang, Q. Y.
2018-03-01
With constant progress in modern speech communication technologies, the speech data is prone to be attacked by the noise or maliciously tampered. In order to make the speech perception hash algorithm has strong robustness and high efficiency, this paper put forward a speech perception hash algorithm based on the tensor decomposition and multi features is proposed. This algorithm analyses the speech perception feature acquires each speech component wavelet packet decomposition. LPCC, LSP and ISP feature of each speech component are extracted to constitute the speech feature tensor. Speech authentication is done by generating the hash values through feature matrix quantification which use mid-value. Experimental results showing that the proposed algorithm is robust for content to maintain operations compared with similar algorithms. It is able to resist the attack of the common background noise. Also, the algorithm is highly efficiency in terms of arithmetic, and is able to meet the real-time requirements of speech communication and complete the speech authentication quickly.
Presentation video retrieval using automatically recovered slide and spoken text
NASA Astrophysics Data System (ADS)
Cooper, Matthew
2013-03-01
Video is becoming a prevalent medium for e-learning. Lecture videos contain text information in both the presentation slides and lecturer's speech. This paper examines the relative utility of automatically recovered text from these sources for lecture video retrieval. To extract the visual information, we automatically detect slides within the videos and apply optical character recognition to obtain their text. Automatic speech recognition is used similarly to extract spoken text from the recorded audio. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Results reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Experiments demonstrate that automatically extracted slide text enables higher precision video retrieval than automatically recovered spoken text.
Zekveld, Adriana A.; Kramer, Sophia E.; Kessens, Judith M.; Vlaming, Marcel S. M. G.; Houtgast, Tammo
2009-01-01
This study examined the subjective benefit obtained from automatically generated captions during telephone-speech comprehension in the presence of babble noise. Short stories were presented by telephone either with or without captions that were generated offline by an automatic speech recognition (ASR) system. To simulate online ASR, the word accuracy (WA) level of the captions was 60% or 70% and the text was presented delayed to the speech. After each test, the hearing impaired participants (n = 20) completed the NASA-Task Load Index and several rating scales evaluating the support from the captions. Participants indicated that using the erroneous text in speech comprehension was difficult and the reported task load did not differ between the audio + text and audio-only conditions. In a follow-up experiment (n = 10), the perceived benefit of presenting captions increased with an increase of WA levels to 80% and 90%, and elimination of the text delay. However, in general, the task load did not decrease when captions were presented. These results suggest that the extra effort required to process the text could have been compensated for by less effort required to comprehend the speech. Future research should aim at reducing the complexity of the task to increase the willingness of hearing impaired persons to use an assistive communication system automatically providing captions. The current results underline the need for obtaining both objective and subjective measures of benefit when evaluating assistive communication systems. PMID:19126551
SNR-adaptive stream weighting for audio-MES ASR.
Lee, Ki-Seung
2008-08-01
Myoelectric signals (MESs) from the speaker's mouth region have been successfully shown to improve the noise robustness of automatic speech recognizers (ASRs), thus promising to extend their usability in implementing noise-robust ASR. In the recognition system presented herein, extracted audio and facial MES features were integrated by a decision fusion method, where the likelihood score of the audio-MES observation vector was given by a linear combination of class-conditional observation log-likelihoods of two classifiers, using appropriate weights. We developed a weighting process adaptive to SNRs. The main objective of the paper involves determining the optimal SNR classification boundaries and constructing a set of optimum stream weights for each SNR class. These two parameters were determined by a method based on a maximum mutual information criterion. Acoustic and facial MES data were collected from five subjects, using a 60-word vocabulary. Four types of acoustic noise including babble, car, aircraft, and white noise were acoustically added to clean speech signals with SNR ranging from -14 to 31 dB. The classification accuracy of the audio ASR was as low as 25.5%. Whereas, the classification accuracy of the MES ASR was 85.2%. The classification accuracy could be further improved by employing the proposed audio-MES weighting method, which was as high as 89.4% in the case of babble noise. A similar result was also found for the other types of noise.
Second Language Learners and Speech Act Comprehension
ERIC Educational Resources Information Center
Holtgraves, Thomas
2007-01-01
Recognizing the specific speech act ( Searle, 1969) that a speaker performs with an utterance is a fundamental feature of pragmatic competence. Past research has demonstrated that native speakers of English automatically recognize speech acts when they comprehend utterances (Holtgraves & Ashley, 2001). The present research examined whether this…
Advancements in robust algorithm formulation for speaker identification of whispered speech
NASA Astrophysics Data System (ADS)
Fan, Xing
Whispered speech is an alternative speech production mode from neutral speech, which is used by talkers intentionally in natural conversational scenarios to protect privacy and to avoid certain content from being overheard/made public. Due to the profound differences between whispered and neutral speech in production mechanism and the absence of whispered adaptation data, the performance of speaker identification systems trained with neutral speech degrades significantly. This dissertation therefore focuses on developing a robust closed-set speaker recognition system for whispered speech by using no or limited whispered adaptation data from non-target speakers. This dissertation proposes the concept of "High''/"Low'' performance whispered data for the purpose of speaker identification. A variety of acoustic properties are identified that contribute to the quality of whispered data. An acoustic analysis is also conducted to compare the phoneme/speaker dependency of the differences between whispered and neutral data in the feature domain. The observations from those acoustic analysis are new in this area and also serve as a guidance for developing robust speaker identification systems for whispered speech. This dissertation further proposes two systems for speaker identification of whispered speech. One system focuses on front-end processing. A two-dimensional feature space is proposed to search for "Low''-quality performance based whispered utterances and separate feature mapping functions are applied to vowels and consonants respectively in order to retain the speaker's information shared between whispered and neutral speech. The other system focuses on speech-mode-independent model training. The proposed method generates pseudo whispered features from neutral features by using the statistical information contained in a whispered Universal Background model (UBM) trained from extra collected whispered data from non-target speakers. Four modeling methods are proposed for the transformation estimation in order to generate the pseudo whispered features. Both of the above two systems demonstrate a significant improvement over the baseline system on the evaluation data. This dissertation has therefore contributed to providing a scientific understanding of the differences between whispered and neutral speech as well as improved front-end processing and modeling method for speaker identification of whispered speech. Such advancements will ultimately contribute to improve the robustness of speech processing systems.
NASA Astrophysics Data System (ADS)
Přibil, Jiří; Přibilová, Anna; Frollo, Ivan
2017-12-01
The paper focuses on two methods of evaluation of successfulness of speech signal enhancement recorded in the open-air magnetic resonance imager during phonation for the 3D human vocal tract modeling. The first approach enables to obtain a comparison based on statistical analysis by ANOVA and hypothesis tests. The second method is based on classification by Gaussian mixture models (GMM). The performed experiments have confirmed that the proposed ANOVA and GMM classifiers for automatic evaluation of the speech quality are functional and produce fully comparable results with the standard evaluation based on the listening test method.
Greene, Beth G; Logan, John S; Pisoni, David B
1986-03-01
We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered.
ERIC Educational Resources Information Center
Loukina, Anastassia; Zechner, Klaus; Yoon, Su-Youn; Zhang, Mo; Tao, Jidong; Wang, Xinhao; Lee, Chong Min; Mulholland, Matthew
2017-01-01
This report presents an overview of the "SpeechRater"? automated scoring engine model building and evaluation process for several item types with a focus on a low-English-proficiency test-taker population. We discuss each stage of speech scoring, including automatic speech recognition, filtering models for nonscorable responses, and…
Quantifying repetitive speech in autism spectrum disorders and language impairment.
van Santen, Jan P H; Sproat, Richard W; Hill, Alison Presmanes
2013-10-01
We report on an automatic technique for quantifying two types of repetitive speech: repetitions of what the child says him/herself (self-repeats) and of what is uttered by an interlocutor (echolalia). We apply this technique to a sample of 111 children between the ages of four and eight: 42 typically developing children (TD), 19 children with specific language impairment (SLI), 25 children with autism spectrum disorders (ASD) plus language impairment (ALI), and 25 children with ASD with normal, non-impaired language (ALN). The results indicate robust differences in echolalia between the TD and ASD groups as a whole (ALN + ALI), and between TD and ALN children. There were no significant differences between ALI and SLI children for echolalia or self-repetitions. The results confirm previous findings that children with ASD repeat the language of others more than other populations of children. On the other hand, self-repetition does not appear to be significantly more frequent in ASD, nor does it matter whether the child's echolalia occurred within one (immediate) or two turns (near-immediate) of the adult's original utterance. Furthermore, non-significant differences between ALN and SLI, between TD and SLI, and between ALI and TD are suggestive that echolalia may not be specific to ALN or to ASD in general. One important innovation of this work is an objective fully automatic technique for assessing the amount of repetition in a transcript of a child's utterances. © 2013 International Society for Autism Research, Wiley Periodicals, Inc.
Quantifying Repetitive Speech in Autism Spectrum Disorders and Language Impairment
van Santen, Jan P. H.; Sproat, Richard W.; Hill, Alison Presmanes
2013-01-01
We report on an automatic technique for quantifying two types of repetitive speech: repetitions of what the child says him/herself (self-repeats) and of what is uttered by an interlocutor (echolalia). We apply this technique to a sample of 111 children between the ages of four and eight: 42 typically developing children (TD), 19 children with specific language impairment (SLI), 25 children with autism spectrum disorders (ASD) plus language impairment (ALI), and 25 children with ASD with normal, non-impaired language (ALN). The results indicate robust differences in echolalia between the TD and ASD groups as a whole (ALN + ALI), and between TD and ALN children. There were no significant differences between ALI and SLI children for echolalia or self-repetitions. The results confirm previous findings that children with ASD repeat the language of others more than other populations of children. On the other hand, self-repetition does not appear to be significantly more frequent in ASD, nor does it matter whether the child’s echolalia occurred within one (immediate) or two turns (near-immediate) of the adult’s original utterance. Furthermore, non-significant differences between ALN and SLI, between TD and SLI, and between ALI and TD are suggestive that echolalia may not be specific to ALN or to ASD in general. One important innovation of this work is an objective fully automatic technique for assessing the amount of repetition in a transcript of a child’s utterances. PMID:23661504
Deep Neural Networks for Speech Separation With Application to Robust Speech Recognition
acoustic -phonetic features. The second objective is integration of spectrotemporal context for improved separation performance. Conditional random fields...will be used to encode contextual constraints. The third objective is to achieve robust ASR in the DNN framework through integrated acoustic modeling
Advances in audio source seperation and multisource audio content retrieval
NASA Astrophysics Data System (ADS)
Vincent, Emmanuel
2012-06-01
Audio source separation aims to extract the signals of individual sound sources from a given recording. In this paper, we review three recent advances which improve the robustness of source separation in real-world challenging scenarios and enable its use for multisource content retrieval tasks, such as automatic speech recognition (ASR) or acoustic event detection (AED) in noisy environments. We present a Flexible Audio Source Separation Toolkit (FASST) and discuss its advantages compared to earlier approaches such as independent component analysis (ICA) and sparse component analysis (SCA). We explain how cues as diverse as harmonicity, spectral envelope, temporal fine structure or spatial location can be jointly exploited by this toolkit. We subsequently present the uncertainty decoding (UD) framework for the integration of audio source separation and audio content retrieval. We show how the uncertainty about the separated source signals can be accurately estimated and propagated to the features. Finally, we explain how this uncertainty can be efficiently exploited by a classifier, both at the training and the decoding stage. We illustrate the resulting performance improvements in terms of speech separation quality and speaker recognition accuracy.
Hands-free human-machine interaction with voice
NASA Astrophysics Data System (ADS)
Juang, B. H.
2004-05-01
Voice is natural communication interface between a human and a machine. The machine, when placed in today's communication networks, may be configured to provide automation to save substantial operating cost, as demonstrated in AT&T's VRCP (Voice Recognition Call Processing), or to facilitate intelligent services, such as virtual personal assistants, to enhance individual productivity. These intelligent services often need to be accessible anytime, anywhere (e.g., in cars when the user is in a hands-busy-eyes-busy situation or during meetings where constantly talking to a microphone is either undersirable or impossible), and thus call for advanced signal processing and automatic speech recognition techniques which support what we call ``hands-free'' human-machine communication. These techniques entail a broad spectrum of technical ideas, ranging from use of directional microphones and acoustic echo cancellatiion to robust speech recognition. In this talk, we highlight a number of key techniques that were developed for hands-free human-machine communication in the mid-1990s after Bell Labs became a unit of Lucent Technologies. A video clip will be played to demonstrate the accomplishement.
Automatic translation among spoken languages
NASA Technical Reports Server (NTRS)
Walter, Sharon M.; Costigan, Kelly
1994-01-01
The Machine Aided Voice Translation (MAVT) system was developed in response to the shortage of experienced military field interrogators with both foreign language proficiency and interrogation skills. Combining speech recognition, machine translation, and speech generation technologies, the MAVT accepts an interrogator's spoken English question and translates it into spoken Spanish. The spoken Spanish response of the potential informant can then be translated into spoken English. Potential military and civilian applications for automatic spoken language translation technology are discussed in this paper.
ERIC Educational Resources Information Center
Cordier, Deborah
2009-01-01
A renewed focus on foreign language (FL) learning and speech for communication has resulted in computer-assisted language learning (CALL) software developed with Automatic Speech Recognition (ASR). ASR features for FL pronunciation (Lafford, 2004) are functional components of CALL designs used for FL teaching and learning. The ASR features…
The Path to Language: Toward Bilingual Education for Deaf Children.
ERIC Educational Resources Information Center
Bouvet, Danielle
Discussion of speech instruction in bilingual education for deaf children refutes the assumption that speech is acquired automatically by hearing children and examines a program in which deaf children are taught alongside hearing children. The first part looks at how speech functions and how children acquire it: including the nature of the…
Cognitive Load in Voice Therapy Carry-Over Exercises
ERIC Educational Resources Information Center
Iwarsson, Jenny; Morris, David Jackson; Balling, Laura Winther
2017-01-01
Purpose: The cognitive load generated by online speech production may vary with the nature of the speech task. This article examines 3 speech tasks used in voice therapy carry-over exercises, in which a patient is required to adopt and automatize new voice behaviors, ultimately in daily spontaneous communication. Method: Twelve subjects produced…
Automatic Speech Acquisition and Recognition for Spacesuit Audio Systems
NASA Technical Reports Server (NTRS)
Ye, Sherry
2015-01-01
NASA has a widely recognized but unmet need for novel human-machine interface technologies that can facilitate communication during astronaut extravehicular activities (EVAs), when loud noises and strong reverberations inside spacesuits make communication challenging. WeVoice, Inc., has developed a multichannel signal-processing method for speech acquisition in noisy and reverberant environments that enables automatic speech recognition (ASR) technology inside spacesuits. The technology reduces noise by exploiting differences between the statistical nature of signals (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, ASR accuracy can be improved to the level at which crewmembers will find the speech interface useful. System components and features include beam forming/multichannel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, and ASR decoding. Arithmetic complexity models were developed and will help designers of real-time ASR systems select proper tasks when confronted with constraints in computational resources. In Phase I of the project, WeVoice validated the technology. The company further refined the technology in Phase II and developed a prototype for testing and use by suited astronauts.
Neuroscience-inspired computational systems for speech recognition under noisy conditions
NASA Astrophysics Data System (ADS)
Schafer, Phillip B.
Humans routinely recognize speech in challenging acoustic environments with background music, engine sounds, competing talkers, and other acoustic noise. However, today's automatic speech recognition (ASR) systems perform poorly in such environments. In this dissertation, I present novel methods for ASR designed to approach human-level performance by emulating the brain's processing of sounds. I exploit recent advances in auditory neuroscience to compute neuron-based representations of speech, and design novel methods for decoding these representations to produce word transcriptions. I begin by considering speech representations modeled on the spectrotemporal receptive fields of auditory neurons. These representations can be tuned to optimize a variety of objective functions, which characterize the response properties of a neural population. I propose an objective function that explicitly optimizes the noise invariance of the neural responses, and find that it gives improved performance on an ASR task in noise compared to other objectives. The method as a whole, however, fails to significantly close the performance gap with humans. I next consider speech representations that make use of spiking model neurons. The neurons in this method are feature detectors that selectively respond to spectrotemporal patterns within short time windows in speech. I consider a number of methods for training the response properties of the neurons. In particular, I present a method using linear support vector machines (SVMs) and show that this method produces spikes that are robust to additive noise. I compute the spectrotemporal receptive fields of the neurons for comparison with previous physiological results. To decode the spike-based speech representations, I propose two methods designed to work on isolated word recordings. The first method uses a classical ASR technique based on the hidden Markov model. The second method is a novel template-based recognition scheme that takes advantage of the neural representation's invariance in noise. The scheme centers on a speech similarity measure based on the longest common subsequence between spike sequences. The combined encoding and decoding scheme outperforms a benchmark system in extremely noisy acoustic conditions. Finally, I consider methods for decoding spike representations of continuous speech. To help guide the alignment of templates to words, I design a syllable detection scheme that robustly marks the locations of syllabic nuclei. The scheme combines SVM-based training with a peak selection algorithm designed to improve noise tolerance. By incorporating syllable information into the ASR system, I obtain strong recognition results in noisy conditions, although the performance in noiseless conditions is below the state of the art. The work presented here constitutes a novel approach to the problem of ASR that can be applied in the many challenging acoustic environments in which we use computer technologies today. The proposed spike-based processing methods can potentially be exploited in effcient hardware implementations and could significantly reduce the computational costs of ASR. The work also provides a framework for understanding the advantages of spike-based acoustic coding in the human brain.
Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications.
VanDam, Mark; Silbert, Noah H
2016-01-01
Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on of identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 judgements of classification of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output.
Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications
2016-01-01
Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on of identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 judgements of classification of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output. PMID:27529813
Massively-Parallel Architectures for Automatic Recognition of Visual Speech Signals
1988-10-12
Secusrity Clamifieation, Nlassively-Parallel Architectures for Automa ic Recognitio of Visua, Speech Signals 12. PERSONAL AUTHOR(S) Terrence J...characteristics of speech from tJhe, visual speech signals. Neural networks have been trained on a database of vowels. The rqw images of faces , aligned and...images of faces , aligned and preprocessed, were used as input to these network which were trained to estimate the corresponding envelope of the
ERIC Educational Resources Information Center
Stinson, Michael; Elliot, Lisa; McKee, Barbara; Coyne, Gina
This report discusses a project that adapted new automatic speech recognition (ASR) technology to provide real-time speech-to-text transcription as a support service for students who are deaf and hard of hearing (D/HH). In this system, as the teacher speaks, a hearing intermediary, or captionist, dictates into the speech recognition system in a…
GREENE, BETH G.; LOGAN, JOHN S.; PISONI, DAVID B.
2012-01-01
We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered. PMID:23225916
Audiovisual Asynchrony Detection in Human Speech
ERIC Educational Resources Information Center
Maier, Joost X.; Di Luca, Massimiliano; Noppeney, Uta
2011-01-01
Combining information from the visual and auditory senses can greatly enhance intelligibility of natural speech. Integration of audiovisual speech signals is robust even when temporal offsets are present between the component signals. In the present study, we characterized the temporal integration window for speech and nonspeech stimuli with…
Ali, Zulfiqar; Alsulaiman, Mansour; Muhammad, Ghulam; Elamvazuthi, Irraivan; Al-Nasheri, Ahmed; Mesallam, Tamer A; Farahat, Mohamed; Malki, Khalid H
2017-05-01
A large population around the world has voice complications. Various approaches for subjective and objective evaluations have been suggested in the literature. The subjective approach strongly depends on the experience and area of expertise of a clinician, and human error cannot be neglected. On the other hand, the objective or automatic approach is noninvasive. Automatic developed systems can provide complementary information that may be helpful for a clinician in the early screening of a voice disorder. At the same time, automatic systems can be deployed in remote areas where a general practitioner can use them and may refer the patient to a specialist to avoid complications that may be life threatening. Many automatic systems for disorder detection have been developed by applying different types of conventional speech features such as the linear prediction coefficients, linear prediction cepstral coefficients, and Mel-frequency cepstral coefficients (MFCCs). This study aims to ascertain whether conventional speech features detect voice pathology reliably, and whether they can be correlated with voice quality. To investigate this, an automatic detection system based on MFCC was developed, and three different voice disorder databases were used in this study. The experimental results suggest that the accuracy of the MFCC-based system varies from database to database. The detection rate for the intra-database ranges from 72% to 95%, and that for the inter-database is from 47% to 82%. The results conclude that conventional speech features are not correlated with voice, and hence are not reliable in pathology detection. Copyright © 2017 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
Reference-free automatic quality assessment of tracheoesophageal speech.
Huang, Andy; Falk, Tiago H; Chan, Wai-Yip; Parsa, Vijay; Doyle, Philip
2009-01-01
Evaluation of the quality of tracheoesophageal (TE) speech using machines instead of human experts can enhance the voice rehabilitation process for patients who have undergone total laryngectomy and voice restoration. Towards the goal of devising a reference-free TE speech quality estimation algorithm, we investigate the efficacy of speech signal features that are used in standard telephone-speech quality assessment algorithms, in conjunction with a recently introduced speech modulation spectrum measure. Tests performed on two TE speech databases demonstrate that the modulation spectral measure and a subset of features in the standard ITU-T P.563 algorithm estimate TE speech quality with better correlation (up to 0.9) than previously proposed features.
Voice-processing technologies--their application in telecommunications.
Wilpon, J G
1995-01-01
As the telecommunications industry evolves over the next decade to provide the products and services that people will desire, several key technologies will become commonplace. Two of these, automatic speech recognition and text-to-speech synthesis, will provide users with more freedom on when, where, and how they access information. While these technologies are currently in their infancy, their capabilities are rapidly increasing and their deployment in today's telephone network is expanding. The economic impact of just one application, the automation of operator services, is well over $100 million per year. Yet there still are many technical challenges that must be resolved before these technologies can be deployed ubiquitously in products and services throughout the worldwide telephone network. These challenges include: (i) High level of accuracy. The technology must be perceived by the user as highly accurate, robust, and reliable. (ii) Easy to use. Speech is only one of several possible input/output modalities for conveying information between a human and a machine, much like a computer terminal or Touch-Tone pad on a telephone. It is not the final product. Therefore, speech technologies must be hidden from the user. That is, the burden of using the technology must be on the technology itself. (iii) Quick prototyping and development of new products and services. The technology must support the creation of new products and services based on speech in an efficient and timely fashion. In this paper I present a vision of the voice-processing industry with a focus on the areas with the broadest base of user penetration: speech recognition, text-to-speech synthesis, natural language processing, and speaker recognition technologies. The current and future applications of these technologies in the telecommunications industry will be examined in terms of their strengths, limitations, and the degree to which user needs have been or have yet to be met. Although noteworthy gains have been made in areas with potentially small user bases and in the more mature speech-coding technologies, these subjects are outside the scope of this paper. Images Fig. 1 PMID:7479815
On the recognition of emotional vocal expressions: motivations for a holistic approach.
Esposito, Anna; Esposito, Antonietta M
2012-10-01
Human beings seem to be able to recognize emotions from speech very well and information communication technology aims to implement machines and agents that can do the same. However, to be able to automatically recognize affective states from speech signals, it is necessary to solve two main technological problems. The former concerns the identification of effective and efficient processing algorithms capable of capturing emotional acoustic features from speech sentences. The latter focuses on finding computational models able to classify, with an approximation as good as human listeners, a given set of emotional states. This paper will survey these topics and provide some insights for a holistic approach to the automatic analysis, recognition and synthesis of affective states.
Noise suppression methods for robust speech processing
NASA Astrophysics Data System (ADS)
Boll, S. F.; Ravindra, H.; Randall, G.; Armantrout, R.; Power, R.
1980-05-01
Robust speech processing in practical operating environments requires effective environmental and processor noise suppression. This report describes the technical findings and accomplishments during this reporting period for the research program funded to develop real time, compressed speech analysis synthesis algorithms whose performance in invariant under signal contamination. Fulfillment of this requirement is necessary to insure reliable secure compressed speech transmission within realistic military command and control environments. Overall contributions resulting from this research program include the understanding of how environmental noise degrades narrow band, coded speech, development of appropriate real time noise suppression algorithms, and development of speech parameter identification methods that consider signal contamination as a fundamental element in the estimation process. This report describes the current research and results in the areas of noise suppression using the dual input adaptive noise cancellation using the short time Fourier transform algorithms, articulation rate change techniques, and a description of an experiment which demonstrated that the spectral subtraction noise suppression algorithm can improve the intelligibility of 2400 bps, LPC 10 coded, helicopter speech by 10.6 point.
Spoken language achieves robustness and evolvability by exploiting degeneracy and neutrality.
Winter, Bodo
2014-10-01
As with biological systems, spoken languages are strikingly robust against perturbations. This paper shows that languages achieve robustness in a way that is highly similar to many biological systems. For example, speech sounds are encoded via multiple acoustically diverse, temporally distributed and functionally redundant cues, characteristics that bear similarities to what biologists call "degeneracy". Speech is furthermore adequately characterized by neutrality, with many different tongue configurations leading to similar acoustic outputs, and different acoustic variants understood as the same by recipients. This highlights the presence of a large neutral network of acoustic neighbors for every speech sound. Such neutrality ensures that a steady backdrop of variation can be maintained without impeding communication, assuring that there is "fodder" for subsequent evolution. Thus, studying linguistic robustness is not only important for understanding how linguistic systems maintain their functioning upon the background of noise, but also for understanding the preconditions for language evolution. © 2014 WILEY Periodicals, Inc.
Robust output tracking control of a laboratory helicopter for automatic landing
NASA Astrophysics Data System (ADS)
Liu, Hao; Lu, Geng; Zhong, Yisheng
2014-11-01
In this paper, robust output tracking control problem of a laboratory helicopter for automatic landing in high seas is investigated. The motion of the helicopter is required to synchronise with that of an oscillating platform, e.g. the deck of a vessel subject to wave-induced motions. A robust linear time-invariant output feedback controller consisting of a nominal controller and a robust compensator is designed. The robust compensator is introduced to restrain the influences of parametric uncertainties, nonlinearities and external disturbances. It is shown that robust stability and robust tracking property can be achieved simultaneously. Experimental results on the laboratory helicopter for automatic landing demonstrate the effectiveness of the designed control approach.
Stop consonant voicing in young children's speech: Evidence from a cross-sectional study
NASA Astrophysics Data System (ADS)
Ganser, Emily
There are intuitive reasons to believe that speech-sound acquisition and language acquisition should be related in development. Surprisingly, only recently has research begun to parse just how the two might be related. This study investigated possible correlations between speech-sound acquisition and language acquisition, as part of a large-scale, longitudinal study of the relationship between different types of phonological development and vocabulary growth in the preschool years. Productions of voiced and voiceless stop-initial words were recorded from 96 children aged 28-39 months. Voice Onset Time (VOT, in ms) for each token context was calculated. A mixed-model logistic regression was calculated which predicted whether the sound was intended to be voiced or voiceless based on its VOT. This model estimated the slopes of the logistic function for each child. This slope was referred to as Robustness of Contrast (based on Holliday, Reidy, Beckman, and Edwards, 2015), defined as being the degree of categorical differentiation between the production of two speech sounds or classes of sounds, in this case, voiced and voiceless stops. Results showed a wide range of slopes for individual children, suggesting that slope-derived Robustness of Contrast could be a viable means of measuring a child's acquisition of the voicing contrast. Robustness of Contrast was then compared to traditional measures of speech and language skills to investigate whether there was any correlation between the production of stop voicing and broader measures of speech and language development. The Robustness of Contrast measure was found to correlate with all individual measures of speech and language, suggesting that it might indeed be predictive of later language skills.
Wallesch, C W; Brunner, R J; Seemüller, E
1983-12-01
Repetitive phenomena in spontaneous speech were investigated in 30 patients with chronic infarctions of the left hemisphere which included Broca's and/or Wernicke's area and/or the basal ganglia. Perseverations, stereotypies, and echolalias occurred with all types of brain lesions, automatisms and recurring utterances only with those patients, whose infarctions involved Wernicke's area and basal ganglia. These patients also showed more echolalic responses. The results are discussed in view of the role of the basal ganglia as motor program generators.
Scaling and universality in the human voice.
Luque, Jordi; Luque, Bartolo; Lacasa, Lucas
2015-04-06
Speech is a distinctive complex feature of human capabilities. In order to understand the physics underlying speech production, in this work, we empirically analyse the statistics of large human speech datasets ranging several languages. We first show that during speech, the energy is unevenly released and power-law distributed, reporting a universal robust Gutenberg-Richter-like law in speech. We further show that such 'earthquakes in speech' show temporal correlations, as the interevent statistics are again power-law distributed. As this feature takes place in the intraphoneme range, we conjecture that the process responsible for this complex phenomenon is not cognitive, but it resides in the physiological (mechanical) mechanisms of speech production. Moreover, we show that these waiting time distributions are scale invariant under a renormalization group transformation, suggesting that the process of speech generation is indeed operating close to a critical point. These results are put in contrast with current paradigms in speech processing, which point towards low dimensional deterministic chaos as the origin of nonlinear traits in speech fluctuations. As these latter fluctuations are indeed the aspects that humanize synthetic speech, these findings may have an impact in future speech synthesis technologies. Results are robust and independent of the communication language or the number of speakers, pointing towards a universal pattern and yet another hint of complexity in human speech. © 2015 The Author(s) Published by the Royal Society. All rights reserved.
The Mechanical Recognition of Speech: Prospects for Use in the Teaching of Languages.
ERIC Educational Resources Information Center
Pulliam, Robert
1970-01-01
This paper begins with a brief account of the development of automatic speech recogniton (ASR) and then proceeds to an examination of ASR systems typical of the kind now in operation. It is stressed that such systems, although highly developed, do not recognize speech in the same sense as the human being does, and that they can not deal with a…
Development and Perceptual Evaluation of Amplitude-Based F0 Control in Electrolarynx Speech
ERIC Educational Resources Information Center
Saikachi, Yoko; Stevens, Kenneth N.; Hillman, Robert E.
2009-01-01
Purpose: Current electrolarynx (EL) devices produce a mechanical speech quality that has been largely attributed to the lack of natural fundamental frequency (F0) variation. In order to improve the quality of EL speech, in the present study the authors aimed to develop and evaluate an automatic F0 control scheme, in which F0 was modulated based on…
NASA Astrophysics Data System (ADS)
Nishiura, Takanobu; Nakamura, Satoshi
2003-10-01
Humans communicate with each other through speech by focusing on the target speech among environmental sounds in real acoustic environments. We can easily identify the target sound from other environmental sounds. For hands-free speech recognition, the identification of the target speech from environmental sounds is imperative. This mechanism may also be important for a self-moving robot to sense the acoustic environments and communicate with humans. Therefore, this paper first proposes hidden Markov model (HMM)-based environmental sound source identification. Environmental sounds are modeled by three states of HMMs and evaluated using 92 kinds of environmental sounds. The identification accuracy was 95.4%. This paper also proposes a new HMM composition method that composes speech HMMs and an HMM of categorized environmental sounds for robust environmental sound-added speech recognition. As a result of the evaluation experiments, we confirmed that the proposed HMM composition outperforms the conventional HMM composition with speech HMMs and a noise (environmental sound) HMM trained using noise periods prior to the target speech in a captured signal. [Work supported by Ministry of Public Management, Home Affairs, Posts and Telecommunications of Japan.
Konig, Alexandra; Satt, Aharon; Sorin, Alex; Hoory, Ran; Derreumaux, Alexandre; David, Renaud; Robert, Phillippe H
2018-01-01
Various types of dementia and Mild Cognitive Impairment (MCI) are manifested as irregularities in human speech and language, which have proven to be strong predictors for the disease presence and progress ion. Therefore, automatic speech analytics provided by a mobile application may be a useful tool in providing additional indicators for assessment and detection of early stage dementia and MCI. 165 participants (subjects with subjective cognitive impairment (SCI), MCI patients, Alzheimer's disease (AD) and mixed dementia (MD) patients) were recorded with a mobile application while performing several short vocal cognitive tasks during a regular consultation. These tasks included verbal fluency, picture description, counting down and a free speech task. The voice recordings were processed in two steps: in the first step, vocal markers were extracted using speech signal processing techniques; in the second, the vocal markers were tested to assess their 'power' to distinguish between SCI, MCI, AD and MD. The second step included training automatic classifiers for detecting MCI and AD, based on machine learning methods, and testing the detection accuracy. The fluency and free speech tasks obtain the highest accuracy rates of classifying AD vs. MD vs. MCI vs. SCI. Using the data, we demonstrated classification accuracy as follows: SCI vs. AD = 92% accuracy; SCI vs. MD = 92% accuracy; SCI vs. MCI = 86% accuracy and MCI vs. AD = 86%. Our results indicate the potential value of vocal analytics and the use of a mobile application for accurate automatic differentiation between SCI, MCI and AD. This tool can provide the clinician with meaningful information for assessment and monitoring of people with MCI and AD based on a non-invasive, simple and low-cost method. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Non-Speech Oro-Motor Exercise Use in Acquired Dysarthria Management: Regimes and Rationales
ERIC Educational Resources Information Center
Mackenzie, Catherine; Muir, Margaret; Allen, Carolyn
2010-01-01
Background: Non-speech oro-motor exercises (NSOMExs) are described in speech and language therapy manuals and are thought to be much used in acquired dysarthria intervention, though there is no robust evidence of an influence on speech outcome. Opinions differ as to whether, and for which dysarthria presentations, NSOMExs are appropriate. Aims:…
Speech recognition for embedded automatic positioner for laparoscope
NASA Astrophysics Data System (ADS)
Chen, Xiaodong; Yin, Qingyun; Wang, Yi; Yu, Daoyin
2014-07-01
In this paper a novel speech recognition methodology based on Hidden Markov Model (HMM) is proposed for embedded Automatic Positioner for Laparoscope (APL), which includes a fixed point ARM processor as the core. The APL system is designed to assist the doctor in laparoscopic surgery, by implementing the specific doctor's vocal control to the laparoscope. Real-time respond to the voice commands asks for more efficient speech recognition algorithm for the APL. In order to reduce computation cost without significant loss in recognition accuracy, both arithmetic and algorithmic optimizations are applied in the method presented. First, depending on arithmetic optimizations most, a fixed point frontend for speech feature analysis is built according to the ARM processor's character. Then the fast likelihood computation algorithm is used to reduce computational complexity of the HMM-based recognition algorithm. The experimental results show that, the method shortens the recognition time within 0.5s, while the accuracy higher than 99%, demonstrating its ability to achieve real-time vocal control to the APL.
Robust Speech Enhancement Using Two-Stage Filtered Minima Controlled Recursive Averaging
NASA Astrophysics Data System (ADS)
Ghourchian, Negar; Selouani, Sid-Ahmed; O'Shaughnessy, Douglas
In this paper we propose an algorithm for estimating noise in highly non-stationary noisy environments, which is a challenging problem in speech enhancement. This method is based on minima-controlled recursive averaging (MCRA) whereby an accurate, robust and efficient noise power spectrum estimation is demonstrated. We propose a two-stage technique to prevent the appearance of musical noise after enhancement. This algorithm filters the noisy speech to achieve a robust signal with minimum distortion in the first stage. Subsequently, it estimates the residual noise using MCRA and removes it with spectral subtraction. The proposed Filtered MCRA (FMCRA) performance is evaluated using objective tests on the Aurora database under various noisy environments. These measures indicate the higher output SNR and lower output residual noise and distortion.
Hidden Markov models in automatic speech recognition
NASA Astrophysics Data System (ADS)
Wrzoskowicz, Adam
1993-11-01
This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMM's designed to operate in a noisy environment. The author also describes a language for human-robot communications which was defined as a complex multilevel network from an HMM model of speech sounds geared towards Polish inflections. The author also added mandatory lexical and syntactic rules to the system for its communications vocabulary.
Analysis of Factors Affecting System Performance in the ASpIRE Challenge
2015-12-13
performance in the ASpIRE (Automatic Speech recognition In Reverberant Environments) challenge. In particular, overall word error rate (WER) of the solver...systems is analyzed as a function of room, distance between talker and microphone, and microphone type. We also analyze speech activity detection...analysis will inform the design of future challenges and provide insight into the efficacy of current solutions addressing noisy reverberant speech
ERIC Educational Resources Information Center
De Felice, Rachele; Deane, Paul
2012-01-01
This study proposes an approach to automatically score the "TOEIC"® Writing e-mail task. We focus on one component of the scoring rubric, which notes whether the test-takers have used particular speech acts such as requests, orders, or commitments. We developed a computational model for automated speech act identification and tested it…
Automated Intelligibility Assessment of Pathological Speech Using Phonological Features
NASA Astrophysics Data System (ADS)
Middag, Catherine; Martens, Jean-Pierre; Van Nuffelen, Gwen; De Bodt, Marc
2009-12-01
It is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. People have therefore put a lot of effort in the design of perceptual intelligibility rating tests. These tests usually have the drawback that they employ unnatural speech material (e.g., nonsense words) and that they cannot fully exclude errors due to listener bias. Therefore, there is a growing interest in the application of objective automatic speech recognition technology to automate the intelligibility assessment. Current research is headed towards the design of automated methods which can be shown to produce ratings that correspond well with those emerging from a well-designed and well-performed perceptual test. In this paper, a novel methodology that is built on previous work (Middag et al., 2008) is presented. It utilizes phonological features, automatic speech alignment based on acoustic models that were trained on normal speech, context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system reveals that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.
Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.
Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger
2011-01-01
The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.
Tuning time-frequency methods for the detection of metered HF speech
NASA Astrophysics Data System (ADS)
Nelson, Douglas J.; Smith, Lawrence H.
2002-12-01
Speech is metered if the stresses occur at a nearly regular rate. Metered speech is common in poetry, and it can occur naturally in speech, if the speaker is spelling a word or reciting words or numbers from a list. In radio communications, the CQ request, call sign and other codes are frequently metered. In tactical communications and air traffic control, location, heading and identification codes may be metered. Moreover metering may be expected to survive even in HF communications, which are corrupted by noise, interference and mistuning. For this environment, speech recognition and conventional machine-based methods are not effective. We describe Time-Frequency methods which have been adapted successfully to the problem of mitigation of HF signal conditions and detection of metered speech. These methods are based on modeled time and frequency correlation properties of nearly harmonic functions. We derive these properties and demonstrate a performance gain over conventional correlation and spectral methods. Finally, in addressing the problem of HF single sideband (SSB) communications, the problems of carrier mistuning, interfering signals, such as manual Morse, and fast automatic gain control (AGC) must be addressed. We demonstrate simple methods which may be used to blindly mitigate mistuning and narrowband interference, and effectively invert the fast automatic gain function.
Fontan, Lionel; Ferrané, Isabelle; Farinas, Jérôme; Pinquier, Julien; Tardieu, Julien; Magnen, Cynthia; Gaillard, Pascal; Aumont, Xavier; Füllgrabe, Christian
2017-09-18
The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist audiologists/hearing-aid dispensers in the fine-tuning of hearing aids. Sixty young participants with normal hearing listened to speech materials mimicking the perceptual consequences of ARHL at different levels of severity. Two intelligibility tests (repetition of words and sentences) and 1 comprehension test (responding to oral commands by moving virtual objects) were administered. Several language models were developed and used by the ASR system in order to fit human performances. Strong significant positive correlations were observed between human and ASR scores, with coefficients up to .99. However, the spectral smearing used to simulate losses in frequency selectivity caused larger declines in ASR performance than in human performance. Both intelligibility and comprehension scores for listeners with simulated ARHL are highly correlated with the performances of an ASR-based system. In the future, it needs to be determined if the ASR system is similarly successful in predicting speech processing in noise and by older people with ARHL.
Is automatic speech-to-text transcription ready for use in psychological experiments?
Ziman, Kirsten; Heusser, Andrew C; Fitzpatrick, Paxton C; Field, Campbell E; Manning, Jeremy R
2018-04-23
Verbal responses are a convenient and naturalistic way for participants to provide data in psychological experiments (Salzinger, The Journal of General Psychology, 61(1),65-94:1959). However, audio recordings of verbal responses typically require additional processing, such as transcribing the recordings into text, as compared with other behavioral response modalities (e.g., typed responses, button presses, etc.). Further, the transcription process is often tedious and time-intensive, requiring human listeners to manually examine each moment of recorded speech. Here we evaluate the performance of a state-of-the-art speech recognition algorithm (Halpern et al., 2016) in transcribing audio data into text during a list-learning experiment. We compare transcripts made by human annotators to the computer-generated transcripts. Both sets of transcripts matched to a high degree and exhibited similar statistical properties, in terms of the participants' recall performance and recall dynamics that the transcripts captured. This proof-of-concept study suggests that speech-to-text engines could provide a cheap, reliable, and rapid means of automatically transcribing speech data in psychological experiments. Further, our findings open the door for verbal response experiments that scale to thousands of participants (e.g., administered online), as well as a new generation of experiments that decode speech on the fly and adapt experimental parameters based on participants' prior responses.
Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems
NASA Technical Reports Server (NTRS)
Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan
2010-01-01
A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It can enjoy a fast rate of data/text entry, small overall size, and can be lightweight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaption, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMMbased ASR components were developed. They can help real-time ASR system designers select proper tasks when in the face of constraints in computational resources.
Parametric Representation of the Speaker's Lips for Multimodal Sign Language and Speech Recognition
NASA Astrophysics Data System (ADS)
Ryumin, D.; Karpov, A. A.
2017-05-01
In this article, we propose a new method for parametric representation of human's lips region. The functional diagram of the method is described and implementation details with the explanation of its key stages and features are given. The results of automatic detection of the regions of interest are illustrated. A speed of the method work using several computers with different performances is reported. This universal method allows applying parametrical representation of the speaker's lipsfor the tasks of biometrics, computer vision, machine learning, and automatic recognition of face, elements of sign languages, and audio-visual speech, including lip-reading.
ERIC Educational Resources Information Center
Messaoud-Galusi, Souhila; Hazan, Valerie; Rosen, Stuart
2011-01-01
Purpose: The claim that speech perception abilities are impaired in dyslexia was investigated in a group of 62 children with dyslexia and 51 average readers matched in age. Method: To test whether there was robust evidence of speech perception deficits in children with dyslexia, speech perception in noise and quiet was measured using 8 different…
Objective speech quality evaluation of real-time speech coders
NASA Astrophysics Data System (ADS)
Viswanathan, V. R.; Russell, W. H.; Huggins, A. W. F.
1984-02-01
This report describes the work performed in two areas: subjective testing of a real-time 16 kbit/s adaptive predictive coder (APC) and objective speech quality evaluation of real-time coders. The speech intelligibility of the APC coder was tested using the Diagnostic Rhyme Test (DRT), and the speech quality was tested using the Diagnostic Acceptability Measure (DAM) test, under eight operating conditions involving channel error, acoustic background noise, and tandem link with two other coders. The test results showed that the DRT and DAM scores of the APC coder equalled or exceeded the corresponding test scores fo the 32 kbit/s CVSD coder. In the area of objective speech quality evaluation, the report describes the development, testing, and validation of a procedure for automatically computing several objective speech quality measures, given only the tape-recordings of the input speech and the corresponding output speech of a real-time speech coder.
Specific acoustic models for spontaneous and dictated style in indonesian speech recognition
NASA Astrophysics Data System (ADS)
Vista, C. B.; Satriawan, C. H.; Lestari, D. P.; Widyantoro, D. H.
2018-03-01
The performance of an automatic speech recognition system is affected by differences in speech style between the data the model is originally trained upon and incoming speech to be recognized. In this paper, the usage of GMM-HMM acoustic models for specific speech styles is investigated. We develop two systems for the experiments; the first employs a speech style classifier to predict the speech style of incoming speech, either spontaneous or dictated, then decodes this speech using an acoustic model specifically trained for that speech style. The second system uses both acoustic models to recognise incoming speech and decides upon a final result by calculating a confidence score of decoding. Results show that training specific acoustic models for spontaneous and dictated speech styles confers a slight recognition advantage as compared to a baseline model trained on a mixture of spontaneous and dictated training data. In addition, the speech style classifier approach of the first system produced slightly more accurate results than the confidence scoring employed in the second system.
ERIC Educational Resources Information Center
Hamade, Rachel; Hewlett, Nigel; Scanlon, Emer
2006-01-01
This study aimed to evaluate a new automatic tracheostoma valve: the Provox FreeHands HME (manufactured by Atos Medical AB, Sweden). Data from four laryngectomee participants using automatic and also manual occlusion were subjected to acoustic and perceptual analysis. The main results were a significant decrease, from the manual to automatic…
Thai Automatic Speech Recognition
2005-01-01
used in an external DARPA evaluation involving medical scenarios between an American Doctor and a naïve monolingual Thai patient. 2. Thai Language... dictionary generation more challenging, and (3) the lack of word segmentation, which calls for automatic segmentation approaches to make n-gram language...requires a dictionary and provides various segmentation algorithms to automatically select suitable segmentations. Here we used a maximal matching
The Relationship Between Speech Production and Speech Perception Deficits in Parkinson's Disease.
De Keyser, Kim; Santens, Patrick; Bockstael, Annelies; Botteldooren, Dick; Talsma, Durk; De Vos, Stefanie; Van Cauwenberghe, Mieke; Verheugen, Femke; Corthals, Paul; De Letter, Miet
2016-10-01
This study investigated the possible relationship between hypokinetic speech production and speech intensity perception in patients with Parkinson's disease (PD). Participants included 14 patients with idiopathic PD and 14 matched healthy controls (HCs) with normal hearing and cognition. First, speech production was objectified through a standardized speech intelligibility assessment, acoustic analysis, and speech intensity measurements. Second, an overall estimation task and an intensity estimation task were addressed to evaluate overall speech perception and speech intensity perception, respectively. Finally, correlation analysis was performed between the speech characteristics of the overall estimation task and the corresponding acoustic analysis. The interaction between speech production and speech intensity perception was investigated by an intensity imitation task. Acoustic analysis and speech intensity measurements demonstrated significant differences in speech production between patients with PD and the HCs. A different pattern in the auditory perception of speech and speech intensity was found in the PD group. Auditory perceptual deficits may influence speech production in patients with PD. The present results suggest a disturbed auditory perception related to an automatic monitoring deficit in PD.
Voice Preprocessor for Digital Voice Applications
1989-09-11
helit tralnsforniers are marketed for use kll ith multitonle MI( LMS and are acceptable for w ice appl ica- tiotns. 3. Automatic Gain (iontro: A...variations of speech spectral tilt to improve the quaiit\\ of the extracted speech parameters. Nlore imnportantly, the onlN analoii circuit "e use is a
Automatic measurement of prosody in behavioral variant FTD.
Nevler, Naomi; Ash, Sharon; Jester, Charles; Irwin, David J; Liberman, Mark; Grossman, Murray
2017-08-15
To help understand speech changes in behavioral variant frontotemporal dementia (bvFTD), we developed and implemented automatic methods of speech analysis for quantification of prosody, and evaluated clinical and anatomical correlations. We analyzed semi-structured, digitized speech samples from 32 patients with bvFTD (21 male, mean age 63 ± 8.5, mean disease duration 4 ± 3.1 years) and 17 matched healthy controls (HC). We automatically extracted fundamental frequency (f0, the physical property of sound most closely correlating with perceived pitch) and computed pitch range on a logarithmic scale (semitone) that controls for individual and sex differences. We correlated f0 range with neuropsychiatric tests, and related f0 range to gray matter (GM) atrophy using 3T T1 MRI. We found significantly reduced f0 range in patients with bvFTD (mean 4.3 ± 1.8 ST) compared to HC (5.8 ± 2.1 ST; p = 0.03). Regression related reduced f0 range in bvFTD to GM atrophy in bilateral inferior and dorsomedial frontal as well as left anterior cingulate and anterior insular regions. Reduced f0 range reflects impaired prosody in bvFTD. This is associated with neuroanatomic networks implicated in language production and social disorders centered in the frontal lobe. These findings support the feasibility of automated speech analysis in frontotemporal dementia and other disorders. © 2017 American Academy of Neurology.
ERIC Educational Resources Information Center
Saltuklaroglu, Tim; Kalinowski, Joseph; Robbins, Mary; Crawcour, Stephen; Bowers, Andrew
2009-01-01
Background: Stuttering is prone to strike during speech initiation more so than at any other point in an utterance. The use of auditory feedback (AAF) has been found to produce robust decreases in the stuttering frequency by creating an electronic rendition of choral speech (i.e., speaking in unison). However, AAF requires users to self-initiate…
Automatically Detecting Likely Edits in Clinical Notes Created Using Automatic Speech Recognition
Lybarger, Kevin; Ostendorf, Mari; Yetisgen, Meliha
2017-01-01
The use of automatic speech recognition (ASR) to create clinical notes has the potential to reduce costs associated with note creation for electronic medical records, but at current system accuracy levels, post-editing by practitioners is needed to ensure note quality. Aiming to reduce the time required to edit ASR transcripts, this paper investigates novel methods for automatic detection of edit regions within the transcripts, including both putative ASR errors but also regions that are targets for cleanup or rephrasing. We create detection models using logistic regression and conditional random field models, exploring a variety of text-based features that consider the structure of clinical notes and exploit the medical context. Different medical text resources are used to improve feature extraction. Experimental results on a large corpus of practitioner-edited clinical notes show that 67% of sentence-level edits and 45% of word-level edits can be detected with a false detection rate of 15%. PMID:29854187
Stelzle, F; Knipfer, C; Schuster, M; Bocklet, T; Nöth, E; Adler, W; Schempf, L; Vieler, P; Riemann, M; Neukam, F W; Nkenke, E
2013-11-01
Oral squamous cell carcinoma (OSCC) and its treatment impair speech intelligibility by alteration of the vocal tract. The aim of this study was to identify the factors of oral cancer treatment that influence speech intelligibility by means of an automatic, standardized speech-recognition system. The study group comprised 71 patients (mean age 59.89, range 35-82 years) with OSCC ranging from stage T1 to T4 (TNM staging). Tumours were located on the tongue (n=23), lower alveolar crest (n=27), and floor of the mouth (n=21). Reconstruction was conducted through local tissue plasty or microvascular transplants. Adjuvant radiotherapy was performed in 49 patients. Speech intelligibility was evaluated before, and at 3, 6, and 12 months after tumour resection, and compared to that of a healthy control group (n=40). Postoperatively, significant influences on speech intelligibility were tumour localization (P=0.010) and resection volume (P=0.019). Additionally, adjuvant radiotherapy (P=0.049) influenced intelligibility at 3 months after surgery. At 6 months after surgery, influences were resection volume (P=0.028) and adjuvant radiotherapy (P=0.034). The influence of tumour localization (P=0.001) and adjuvant radiotherapy (P=0.022) persisted after 12 months. Tumour localization, resection volume, and radiotherapy are crucial factors for speech intelligibility. Radiotherapy significantly impaired word recognition rate (WR) values with a progression of the impairment for up to 12 months after surgery. Copyright © 2013 International Association of Oral and Maxillofacial Surgeons. Published by Elsevier Ltd. All rights reserved.
Walenski, Matthew; Swinney, David
2009-01-01
The central question underlying this study revolves around how children process co-reference relationships—such as those evidenced by pronouns (him) and reflexives (himself)—and how a slowed rate of speech input may critically affect this process. Previous studies of child language processing have demonstrated that typical language developing (TLD) children as young as 4 years of age process co-reference relations in a manner similar to adults on-line. In contrast, off-line measures of pronoun comprehension suggest a developmental delay for pronouns (relative to reflexives). The present study examines dependency relations in TLD children (ages 5–13) and investigates how a slowed rate of speech input affects the unconscious (on-line) and conscious (off-line) parsing of these constructions. For the on-line investigations (using a cross-modal picture priming paradigm), results indicate that at a normal rate of speech TLD children demonstrate adult-like syntactic reflexes. At a slowed rate of speech the typical language developing children displayed a breakdown in automatic syntactic parsing (again, similar to the pattern seen in unimpaired adults). As demonstrated in the literature, our off-line investigations (sentence/picture matching task) revealed that these children performed much better on reflexives than on pronouns at a regular speech rate. However, at the slow speech rate, performance on pronouns was substantially improved, whereas performance on reflexives was not different than at the regular speech rate. We interpret these results in light of a distinction between fast automatic processes (relied upon for on-line processing in real time) and conscious reflective processes (relied upon for off-line processing), such that slowed speech input disrupts the former, yet improves the latter. PMID:19343495
Automatic intelligibility classification of sentence-level pathological speech
Kim, Jangwon; Kumar, Naveen; Tsiartas, Andreas; Li, Ming; Narayanan, Shrikanth S.
2014-01-01
Pathological speech usually refers to the condition of speech distortion resulting from atypicalities in voice and/or in the articulatory mechanisms owing to disease, illness or other physical or biological insult to the production system. Although automatic evaluation of speech intelligibility and quality could come in handy in these scenarios to assist experts in diagnosis and treatment design, the many sources and types of variability often make it a very challenging computational processing problem. In this work we propose novel sentence-level features to capture abnormal variation in the prosodic, voice quality and pronunciation aspects in pathological speech. In addition, we propose a post-classification posterior smoothing scheme which refines the posterior of a test sample based on the posteriors of other test samples. Finally, we perform feature-level fusions and subsystem decision fusion for arriving at a final intelligibility decision. The performances are tested on two pathological speech datasets, the NKI CCRT Speech Corpus (advanced head and neck cancer) and the TORGO database (cerebral palsy or amyotrophic lateral sclerosis), by evaluating classification accuracy without overlapping subjects’ data among training and test partitions. Results show that the feature sets of each of the voice quality subsystem, prosodic subsystem, and pronunciation subsystem, offer significant discriminating power for binary intelligibility classification. We observe that the proposed posterior smoothing in the acoustic space can further reduce classification errors. The smoothed posterior score fusion of subsystems shows the best classification performance (73.5% for unweighted, and 72.8% for weighted, average recalls of the binary classes). PMID:25414544
Cordella, Claire; Dickerson, Bradford C.; Quimby, Megan; Yunusova, Yana; Green, Jordan R.
2016-01-01
Background Primary progressive aphasia (PPA) is a neurodegenerative aphasic syndrome with three distinct clinical variants: non-fluent (nfvPPA), logopenic (lvPPA), and semantic (svPPA). Speech (non-) fluency is a key diagnostic marker used to aid identification of the clinical variants, and researchers have been actively developing diagnostic tools to assess speech fluency. Current approaches reveal coarse differences in fluency between subgroups, but often fail to clearly differentiate nfvPPA from the variably fluent lvPPA. More robust subtype differentiation may be possible with finer-grained measures of fluency. Aims We sought to identify the quantitative measures of speech rate—including articulation rate and pausing measures—that best differentiated PPA subtypes, specifically the non-fluent group (nfvPPA) from the more fluent groups (lvPPA, svPPA). The diagnostic accuracy of the quantitative speech rate variables was compared to that of a speech fluency impairment rating made by clinicians. Methods and Procedures Automatic estimates of pause and speech segment durations and rate measures were derived from connected speech samples of participants with PPA (N=38; 11 nfvPPA, 14 lvPPA, 13 svPPA) and healthy age-matched controls (N=8). Clinician ratings of fluency impairment were made using a previously validated clinician rating scale developed specifically for use in PPA. Receiver operating characteristic (ROC) analyses enabled a quantification of diagnostic accuracy. Outcomes and Results Among the quantitative measures, articulation rate was the most effective for differentiating between nfvPPA and the more fluent lvPPA and svPPA groups. The diagnostic accuracy of both speech and articulation rate measures was markedly better than that of the clinician rating scale, and articulation rate was the best classifier overall. Area under the curve (AUC) values for articulation rate were good to excellent for identifying nfvPPA from both svPPA (AUC=.96) and lvPPA (AUC=.86). Cross-validation of accuracy results for articulation rate showed good generalizability outside the training dataset. Conclusions Results provide empirical support for (1) the efficacy of quantitative assessments of speech fluency and (2) a distinct non-fluent PPA subtype characterized, at least in part, by an underlying disturbance in speech motor control. The trend toward improved classifier performance for quantitative rate measures demonstrates the potential for a more accurate and reliable approach to subtyping in the fluency domain, and suggests that articulation rate may be a useful input variable as part of a multi-dimensional clinical subtyping approach. PMID:28757671
Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel
Kleinschmidt, Dave F.; Jaeger, T. Florian
2016-01-01
Successful speech perception requires that listeners map the acoustic signal to linguistic categories. These mappings are not only probabilistic, but change depending on the situation. For example, one talker’s /p/ might be physically indistinguishable from another talker’s /b/ (cf. lack of invariance). We characterize the computational problem posed by such a subjectively non-stationary world and propose that the speech perception system overcomes this challenge by (1) recognizing previously encountered situations, (2) generalizing to other situations based on previous similar experience, and (3) adapting to novel situations. We formalize this proposal in the ideal adapter framework: (1) to (3) can be understood as inference under uncertainty about the appropriate generative model for the current talker, thereby facilitating robust speech perception despite the lack of invariance. We focus on two critical aspects of the ideal adapter. First, in situations that clearly deviate from previous experience, listeners need to adapt. We develop a distributional (belief-updating) learning model of incremental adaptation. The model provides a good fit against known and novel phonetic adaptation data, including perceptual recalibration and selective adaptation. Second, robust speech recognition requires listeners learn to represent the structured component of cross-situation variability in the speech signal. We discuss how these two aspects of the ideal adapter provide a unifying explanation for adaptation, talker-specificity, and generalization across talkers and groups of talkers (e.g., accents and dialects). The ideal adapter provides a guiding framework for future investigations into speech perception and adaptation, and more broadly language comprehension. PMID:25844873
iFER: facial expression recognition using automatically selected geometric eye and eyebrow features
NASA Astrophysics Data System (ADS)
Oztel, Ismail; Yolcu, Gozde; Oz, Cemil; Kazan, Serap; Bunyak, Filiz
2018-03-01
Facial expressions have an important role in interpersonal communications and estimation of emotional states or intentions. Automatic recognition of facial expressions has led to many practical applications and became one of the important topics in computer vision. We present a facial expression recognition system that relies on geometry-based features extracted from eye and eyebrow regions of the face. The proposed system detects keypoints on frontal face images and forms a feature set using geometric relationships among groups of detected keypoints. Obtained feature set is refined and reduced using the sequential forward selection (SFS) algorithm and fed to a support vector machine classifier to recognize five facial expression classes. The proposed system, iFER (eye-eyebrow only facial expression recognition), is robust to lower face occlusions that may be caused by beards, mustaches, scarves, etc. and lower face motion during speech production. Preliminary experiments on benchmark datasets produced promising results outperforming previous facial expression recognition studies using partial face features, and comparable results to studies using whole face information, only slightly lower by ˜ 2.5 % compared to the best whole face facial recognition system while using only ˜ 1 / 3 of the facial region.
Transcribe Your Class: Using Speech Recognition to Improve Access for At-Risk Students
ERIC Educational Resources Information Center
Bain, Keith; Lund-Lucas, Eunice; Stevens, Janice
2012-01-01
Through a project supported by Canada's Social Development Partnerships Program, a team of leading National Disability Organizations, universities, and industry partners are piloting a prototype Hosted Transcription Service that uses speech recognition to automatically create multimedia transcripts that can be used by students for study purposes.…
The Promise of NLP and Speech Processing Technologies in Language Assessment
ERIC Educational Resources Information Center
Chapelle, Carol A.; Chung, Yoo-Ree
2010-01-01
Advances in natural language processing (NLP) and automatic speech recognition and processing technologies offer new opportunities for language testing. Despite their potential uses on a range of language test item types, relatively little work has been done in this area, and it is therefore not well understood by test developers, researchers or…
Validation of Automated Scoring of Oral Reading
ERIC Educational Resources Information Center
Balogh, Jennifer; Bernstein, Jared; Cheng, Jian; Van Moere, Alistair; Townshend, Brent; Suzuki, Masanori
2012-01-01
A two-part experiment is presented that validates a new measurement tool for scoring oral reading ability. Data collected by the U.S. government in a large-scale literacy assessment of adults were analyzed by a system called VersaReader that uses automatic speech recognition and speech processing technologies to score oral reading fluency. In the…
Voice Interactive Analysis System Study. Final Report, August 28, 1978 through March 23, 1979.
ERIC Educational Resources Information Center
Harry, D. P.; And Others
The Voice Interactive Analysis System study continued research and development of the LISTEN real-time, minicomputer based connected speech recognition system, within NAVTRAEQUIPCEN'S program of developing automatic speech technology in support of training. An attempt was made to identify the most effective features detected by the TTI-500 model…
Audio-video feature correlation: faces and speech
NASA Astrophysics Data System (ADS)
Durand, Gwenael; Montacie, Claude; Caraty, Marie-Jose; Faudemay, Pascal
1999-08-01
This paper presents a study of the correlation of features automatically extracted from the audio stream and the video stream of audiovisual documents. In particular, we were interested in finding out whether speech analysis tools could be combined with face detection methods, and to what extend they should be combined. A generic audio signal partitioning algorithm as first used to detect Silence/Noise/Music/Speech segments in a full length movie. A generic object detection method was applied to the keyframes extracted from the movie in order to detect the presence or absence of faces. The correlation between the presence of a face in the keyframes and of the corresponding voice in the audio stream was studied. A third stream, which is the script of the movie, is warped on the speech channel in order to automatically label faces appearing in the keyframes with the name of the corresponding character. We naturally found that extracted audio and video features were related in many cases, and that significant benefits can be obtained from the joint use of audio and video analysis methods.
The Effect of Feedback Schedule Manipulation on Speech Priming Patterns and Reaction Time
ERIC Educational Resources Information Center
Slocomb, Dana; Spencer, Kristie A.
2009-01-01
Speech priming tasks are frequently used to delineate stages in the speech process such as lexical retrieval and motor programming. These tasks, often measured in reaction time (RT), require fast and accurate responses, reflecting maximized participant performance, to result in robust priming effects. Encouraging speed and accuracy in responding…
Automatic Speech Recognition in Air Traffic Control: a Human Factors Perspective
NASA Technical Reports Server (NTRS)
Karlsson, Joakim
1990-01-01
The introduction of Automatic Speech Recognition (ASR) technology into the Air Traffic Control (ATC) system has the potential to improve overall safety and efficiency. However, because ASR technology is inherently a part of the man-machine interface between the user and the system, the human factors issues involved must be addressed. Here, some of the human factors problems are identified and related methods of investigation are presented. Research at M.I.T.'s Flight Transportation Laboratory is being conducted from a human factors perspective, focusing on intelligent parser design, presentation of feedback, error correction strategy design, and optimal choice of input modalities.
NASA Astrophysics Data System (ADS)
Morishima, Shigeo; Nakamura, Satoshi
2004-12-01
We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. This system also introduces both a face synthesis technique that can generate any viseme lip shape and a face tracking technique that can estimate the original position and rotation of a speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the speech organ's image with the synthesized one, which is made by a 3D wire-frame model that is adaptable to any speaker. Our approach provides translated image synthesis with an extremely small database. The tracking motion of the face from a video image is performed by template matching. In this system, the translation and rotation of the face are detected by using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model by using our GUI tool. By combining these techniques and the translated voice synthesis technique, an automatic multimodal translation can be achieved that is suitable for video mail or automatic dubbing systems into other languages.
Ertmer, David J.; Jung, Jongmin; Kloiber, Diana True
2013-01-01
Background Speech-like utterances containing rapidly combined consonants and vowels eventually dominate the prelinguistic and early word productions of toddlers who are developing typically (TD). It seems reasonable to expect a similar phenomenon in young cochlear implants (CI) recipients. This study sought to determine the number of months of robust hearing experience needed to achieve a majority of speech-like utterances in both of these groups. Methods Speech samples were recorded at 3-month intervals during the first 2 years of CI experience, and between 6- and 24 months of age in TD children. Speech-like utterances were operationally defined as those belonging to the Basic Canonical Syllables (BCS) or Advanced Forms (AF) levels of the Consolidated Stark Assessment of Early Vocal Development-Revised. Results On average, the CI group achieved a majority of speech- like utterances after 12 months, and the TD group after 18 months of robust hearing experience. The CI group produced greater percentages of speech-like utterances at each interval until 24-months, when both groups approximated 80%. Conclusion Auditory deprivation did not limit progress in vocal development as young CI recipients showed more-rapid-than-typical speech development during the first 2 years of device use. Implications for the Infraphonological model of speech development are considered. PMID:23813203
Current trends in small vocabulary speech recognition for equipment control
NASA Astrophysics Data System (ADS)
Doukas, Nikolaos; Bardis, Nikolaos G.
2017-09-01
Speech recognition systems allow human - machine communication to acquire an intuitive nature that approaches the simplicity of inter - human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit by the use of robust voice operated control components, as they would facilitate the interaction with their users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
Learner Attention to Form in ACCESS Task-Based Interaction
ERIC Educational Resources Information Center
Dao, Phung; Iwashita, Noriko; Gatbonton, Elizabeth
2017-01-01
This study explored the potential effects of communicative tasks developed using a reformulation of a task-based language teaching called Automatization in Communicative Contexts of Essential Speech Sequences (ACCESS) that includes automatization of language elements as one of its goals on learner attention to form in task-based interaction. The…
Auditory-Motor Processing of Speech Sounds
Möttönen, Riikka; Dutton, Rebekah; Watkins, Kate E.
2013-01-01
The motor regions that control movements of the articulators activate during listening to speech and contribute to performance in demanding speech recognition and discrimination tasks. Whether the articulatory motor cortex modulates auditory processing of speech sounds is unknown. Here, we aimed to determine whether the articulatory motor cortex affects the auditory mechanisms underlying discrimination of speech sounds in the absence of demanding speech tasks. Using electroencephalography, we recorded responses to changes in sound sequences, while participants watched a silent video. We also disrupted the lip or the hand representation in left motor cortex using transcranial magnetic stimulation. Disruption of the lip representation suppressed responses to changes in speech sounds, but not piano tones. In contrast, disruption of the hand representation had no effect on responses to changes in speech sounds. These findings show that disruptions within, but not outside, the articulatory motor cortex impair automatic auditory discrimination of speech sounds. The findings provide evidence for the importance of auditory-motor processes in efficient neural analysis of speech sounds. PMID:22581846
Alternative Speech Communication System for Persons with Severe Speech Disorders
NASA Astrophysics Data System (ADS)
Selouani, Sid-Ahmed; Sidi Yakoub, Mohammed; O'Shaughnessy, Douglas
2009-12-01
Assistive speech-enabled systems are proposed to help both French and English speaking persons with various speech disorders. The proposed assistive systems use automatic speech recognition (ASR) and speech synthesis in order to enhance the quality of communication. These systems aim at improving the intelligibility of pathologic speech making it as natural as possible and close to the original voice of the speaker. The resynthesized utterances use new basic units, a new concatenating algorithm and a grafting technique to correct the poorly pronounced phonemes. The ASR responses are uttered by the new speech synthesis system in order to convey an intelligible message to listeners. Experiments involving four American speakers with severe dysarthria and two Acadian French speakers with sound substitution disorders (SSDs) are carried out to demonstrate the efficiency of the proposed methods. An improvement of the Perceptual Evaluation of the Speech Quality (PESQ) value of 5% and more than 20% is achieved by the speech synthesis systems that deal with SSD and dysarthria, respectively.
Evaluation of synthesized voice approach callouts /SYNCALL/
NASA Technical Reports Server (NTRS)
Simpson, C. A.
1981-01-01
The two basic approaches to the generation of 'synthesized' speech include a utilization of analog recorded human speech and a construction of speech entirely from algorithms applied to constants describing speech sounds. Given the availability of synthesized speech displays for man-machine systems, research is needed to study suggested applications for speech and design principles for speech displays. The present investigation is concerned with a study for which new performance measures were developed. A number of air carrier approach and landing accidents during low or impaired visibility have been associated with the absence of approach callouts. The study had the purpose to compare a pilot-not-flying (PNF) approach callout system to a system composed of PNF callouts augmented by an automatic synthesized voice callout system (SYNCALL). Pilots were found to favor the use of a SYNCALL system containing certain modifications.
Rusz, J; Cmejla, R; Ruzickova, H; Ruzicka, E
2011-01-01
An assessment of vocal impairment is presented for separating healthy people from persons with early untreated Parkinson's disease (PD). This study's main purpose was to (a) determine whether voice and speech disorder are present from early stages of PD before starting dopaminergic pharmacotherapy, (b) ascertain the specific characteristics of the PD-related vocal impairment, (c) identify PD-related acoustic signatures for the major part of traditional clinically used measurement methods with respect to their automatic assessment, and (d) design new automatic measurement methods of articulation. The varied speech data were collected from 46 Czech native speakers, 23 with PD. Subsequently, 19 representative measurements were pre-selected, and Wald sequential analysis was then applied to assess the efficiency of each measure and the extent of vocal impairment of each subject. It was found that measurement of the fundamental frequency variations applied to two selected tasks was the best method for separating healthy from PD subjects. On the basis of objective acoustic measures, statistical decision-making theory, and validation from practicing speech therapists, it has been demonstrated that 78% of early untreated PD subjects indicate some form of vocal impairment. The speech defects thus uncovered differ individually in various characteristics including phonation, articulation, and prosody.
Automatic processing of tones and speech stimuli in children with specific language impairment.
Uwer, Ruth; Albrecht, Ronald; von Suchodoletz, W
2002-08-01
It is well known from behavioural experiments that children with specific language impairment (SLI) have difficulties discriminating consonant-vowel (CV) syllables such as /ba/, /da/, and /ga/. Mismatch negativity (MMN) is an auditory event-related potential component that represents the outcome of an automatic comparison process. It could, therefore, be a promising tool for assessing central auditory processing deficits for speech and non-speech stimuli in children with SLI. MMN is typically evoked by occasionally occurring 'deviant' stimuli in a sequence of identical 'standard' sounds. In this study MMN was elicited using simple tone stimuli, which differed in frequency (1000 versus 1200 Hz) and duration (175 versus 100 ms) and to digitized CV syllables which differed in place of articulation (/ba/, /da/, and /ga/) in children with expressive and receptive SLI and healthy control children (n=21 in each group, 46 males and 17 females; age range 5 to 10 years). Mean MMN amplitudes between groups were compared. Additionally, the behavioural discrimination performance was assessed. Children with SLI had attenuated MMN amplitudes to speech stimuli, but there was no significant difference between the two diagnostic subgroups. MMN to tone stimuli did not differ between the groups. Children with SLI made more errors in the discrimination task, but discrimination scores did not correlate with MMN amplitudes. The present data suggest that children with SLI show a specific deficit in automatic discrimination of CV syllables differing in place of articulation, whereas the processing of simple tone differences seems to be unimpaired.
Relationship between listeners' nonnative speech recognition and categorization abilities
Atagi, Eriko; Bent, Tessa
2015-01-01
Enhancement of the perceptual encoding of talker characteristics (indexical information) in speech can facilitate listeners' recognition of linguistic content. The present study explored this indexical-linguistic relationship in nonnative speech processing by examining listeners' performance on two tasks: nonnative accent categorization and nonnative speech-in-noise recognition. Results indicated substantial variability across listeners in their performance on both the accent categorization and nonnative speech recognition tasks. Moreover, listeners' accent categorization performance correlated with their nonnative speech-in-noise recognition performance. These results suggest that having more robust indexical representations for nonnative accents may allow listeners to more accurately recognize the linguistic content of nonnative speech. PMID:25618098
Preliminary Analysis of Automatic Speech Recognition and Synthesis Technology.
1983-05-01
16.311 % a. Seale In/Se"l tAL4 lrs e y i s 2 I ROM men "Ig eddiei, m releerla ons leveltc. Ŗ dots ghoeea INDtISTRtAIJ%6LITARY SPEECH SYNTHESIS PRODUCTS...saquence The SC-01 Suech Syntheszer conftains 64 cf, arent poneme~hs which are accessed try A 6-tht code. 1 - the proper sequ.enti omthnatiors of thoe...connected speech input with widely differing emotional states, diverse accents, and substantial nonperiodic background noise input. As noted previously
Robust estimators for speech enhancement in real environments
NASA Astrophysics Data System (ADS)
Sandoval-Ibarra, Yuma; Diaz-Ramirez, Victor H.; Kober, Vitaly
2015-09-01
Common statistical estimators for speech enhancement rely on several assumptions about stationarity of speech signals and noise. These assumptions may not always valid in real-life due to nonstationary characteristics of speech and noise processes. We propose new estimators based on existing estimators by incorporation of computation of rank-order statistics. The proposed estimators are better adapted to non-stationary characteristics of speech signals and noise processes. Through computer simulations we show that the proposed estimators yield a better performance in terms of objective metrics than that of known estimators when speech signals are contaminated with airport, babble, restaurant, and train-station noise.
ERIC Educational Resources Information Center
Mackenzie, C.; Muir, M.; Allen, C.; Jensen, A.
2014-01-01
Background: There has been little robust evaluation of the outcome of speech and language therapy (SLT) intervention for post-stroke dysarthria. Non-speech oro-motor exercises (NSOMExs) are a common component of dysarthria intervention. A feasibility study was designed and executed, with participants randomized into two groups, in one of which…
Noise on, Voicing off: Speech Perception Deficits in Children with Specific Language Impairment
ERIC Educational Resources Information Center
Ziegler, Johannes C.; Pech-Georgel, Catherine; George, Florence; Lorenzi, Christian
2011-01-01
Speech perception of four phonetic categories (voicing, place, manner, and nasality) was investigated in children with specific language impairment (SLI) (n=20) and age-matched controls (n=19) in quiet and various noise conditions using an AXB two-alternative forced-choice paradigm. Children with SLI exhibited robust speech perception deficits in…
Robust Frequency Invariant Beamforming with Low Sidelobe for Speech Enhancement
NASA Astrophysics Data System (ADS)
Zhu, Yiting; Pan, Xiang
2018-01-01
Frequency invariant beamformers (FIBs) are widely used in speech enhancement and source localization. There are two traditional optimization methods for FIB design. The first one is convex optimization, which is simple but the frequency invariant characteristic of the beam pattern is poor with respect to frequency band of five octaves. The least squares (LS) approach using spatial response variation (SRV) constraint is another optimization method. Although, it can provide good frequency invariant property, it usually couldn’t be used in speech enhancement for its lack of weight norm constraint which is related to the robustness of a beamformer. In this paper, a robust wideband beamforming method with a constant beamwidth is proposed. The frequency invariant beam pattern is achieved by resolving an optimization problem of the SRV constraint to cover speech frequency band. With the control of sidelobe level, it is available for the frequency invariant beamformer (FIB) to prevent distortion of interference from the undesirable direction. The approach is completed in time-domain by placing tapped delay lines(TDL) and finite impulse response (FIR) filter at the output of each sensor which is more convenient than the Frost processor. By invoking the weight norm constraint, the robustness of the beamformer is further improved against random errors. Experiment results show that the proposed method has a constant beamwidth and almost the same white noise gain as traditional delay-and-sum (DAS) beamformer.
ERIC Educational Resources Information Center
Gweon, Gahgene; Jain, Mahaveer; McDonough, John; Raj, Bhiksha; Rose, Carolyn P.
2013-01-01
This paper contributes to a theory-grounded methodological foundation for automatic collaborative learning process analysis. It does this by illustrating how insights from the social psychology and sociolinguistics of speech style provide a theoretical framework to inform the design of a computational model. The purpose of that model is to detect…
ERIC Educational Resources Information Center
Prata, David Nadler; Baker, Ryan S. J. d.; Costa, Evandro d. B.; Rose, Carolyn P.; Cui, Yue; de Carvalho, Adriana M. J. B.
2009-01-01
This paper presents a model which can automatically detect a variety of student speech acts as students collaborate within a computer supported collaborative learning environment. In addition, an analysis is presented which gives substantial insight as to how students' learning is associated with students' speech acts, knowledge that will…
2017-03-01
the Center for Technology Enhanced Language Learning (CTELL), a research cell in the Department of Foreign Languages, United States Military Academy...models for automatic speech recognition (ASR), and to, thereby, investigate the utility of ASR in pedagogical technology . The corpus is a sample of...lexical resources, language technology 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT UU 18. NUMBER OF
User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning
ERIC Educational Resources Information Center
Ahn, Tae youn; Lee, Sangmin-Michelle
2016-01-01
With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…
ERIC Educational Resources Information Center
Nash, Hannah M.; Gooch, Debbie; Hulme, Charles; Mahajan, Yatin; McArthur, Genevieve; Steinmetzger, Kurt; Snowling, Margaret J.
2017-01-01
The "automatic letter-sound integration hypothesis" (Blomert, [Blomert, L., 2011]) proposes that dyslexia results from a failure to fully integrate letters and speech sounds into automated audio-visual objects. We tested this hypothesis in a sample of English-speaking children with dyslexic difficulties (N = 13) and samples of…
ERIC Educational Resources Information Center
Franco, Horacio; Bratt, Harry; Rossier, Romain; Rao Gadde, Venkata; Shriberg, Elizabeth; Abrash, Victor; Precoda, Kristin
2010-01-01
SRI International's EduSpeak[R] system is a software development toolkit that enables developers of interactive language education software to use state-of-the-art speech recognition and pronunciation scoring technology. Automatic pronunciation scoring allows the computer to provide feedback on the overall quality of pronunciation and to point to…
Learning L2 Pronunciation with a Mobile Speech Recognizer: French /y/
ERIC Educational Resources Information Center
Liakin, Denis; Cardoso, Walcir; Liakina, Natallia
2015-01-01
This study investigates the acquisition of the L2 French vowel /y/ in a mobile-assisted learning environment, via the use of automatic speech recognition (ASR). Particularly, it addresses the question of whether ASR-based pronunciation instruction using a mobile device can improve the production and perception of French /y/. Forty-two elementary…
ERIC Educational Resources Information Center
Carranza, Mario
2016-01-01
This paper addresses the process of transcribing and annotating spontaneous non-native speech with the aim of compiling a training corpus for the development of Computer Assisted Pronunciation Training (CAPT) applications, enhanced with Automatic Speech Recognition (ASR) technology. To better adapt ASR technology to CAPT tools, the recognition…
Teaching Human Poses Interactively to a Social Robot
Gonzalez-Pacheco, Victor; Malfaz, Maria; Fernandez, Fernando; Salichs, Miguel A.
2013-01-01
The main activity of social robots is to interact with people. In order to do that, the robot must be able to understand what the user is saying or doing. Typically, this capability consists of pre-programmed behaviors or is acquired through controlled learning processes, which are executed before the social interaction begins. This paper presents a software architecture that enables a robot to learn poses in a similar way as people do. That is, hearing its teacher's explanations and acquiring new knowledge in real time. The architecture leans on two main components: an RGB-D (Red-, Green-, Blue- Depth) -based visual system, which gathers the user examples, and an Automatic Speech Recognition (ASR) system, which processes the speech describing those examples. The robot is able to naturally learn the poses the teacher is showing to it by maintaining a natural interaction with the teacher. We evaluate our system with 24 users who teach the robot a predetermined set of poses. The experimental results show that, with a few training examples, the system reaches high accuracy and robustness. This method shows how to combine data from the visual and auditory systems for the acquisition of new knowledge in a natural manner. Such a natural way of training enables robots to learn from users, even if they are not experts in robotics. PMID:24048336
Binaural segregation in multisource reverberant environments.
Roman, Nicoleta; Srinivasan, Soundararajan; Wang, DeLiang
2006-12-01
In a natural environment, speech signals are degraded by both reverberation and concurrent noise sources. While human listening is robust under these conditions using only two ears, current two-microphone algorithms perform poorly. The psychological process of figure-ground segregation suggests that the target signal is perceived as a foreground while the remaining stimuli are perceived as a background. Accordingly, the goal is to estimate an ideal time-frequency (T-F) binary mask, which selects the target if it is stronger than the interference in a local T-F unit. In this paper, a binaural segregation system that extracts the reverberant target signal from multisource reverberant mixtures by utilizing only the location information of target source is proposed. The proposed system combines target cancellation through adaptive filtering and a binary decision rule to estimate the ideal T-F binary mask. The main observation in this work is that the target attenuation in a T-F unit resulting from adaptive filtering is correlated with the relative strength of target to mixture. A comprehensive evaluation shows that the proposed system results in large SNR gains. In addition, comparisons using SNR as well as automatic speech recognition measures show that this system outperforms standard two-microphone beamforming approaches and a recent binaural processor.
Teaching human poses interactively to a social robot.
Gonzalez-Pacheco, Victor; Malfaz, Maria; Fernandez, Fernando; Salichs, Miguel A
2013-09-17
The main activity of social robots is to interact with people. In order to do that, the robot must be able to understand what the user is saying or doing. Typically, this capability consists of pre-programmed behaviors or is acquired through controlled learning processes, which are executed before the social interaction begins. This paper presents a software architecture that enables a robot to learn poses in a similar way as people do. That is, hearing its teacher's explanations and acquiring new knowledge in real time. The architecture leans on two main components: an RGB-D (Red-, Green-, Blue- Depth) -based visual system, which gathers the user examples, and an Automatic Speech Recognition (ASR) system, which processes the speech describing those examples. The robot is able to naturally learn the poses the teacher is showing to it by maintaining a natural interaction with the teacher. We evaluate our system with 24 users who teach the robot a predetermined set of poses. The experimental results show that, with a few training examples, the system reaches high accuracy and robustness. This method shows how to combine data from the visual and auditory systems for the acquisition of new knowledge in a natural manner. Such a natural way of training enables robots to learn from users, even if they are not experts in robotics.
NASA Astrophysics Data System (ADS)
Mapp, Peter
2002-11-01
Although RaSTI is a good indicator of the speech intelligibility capability of auditoria and similar spaces, during the past 2-3 years it has been shown that RaSTI is not a robust predictor of sound system intelligibility performance. Instead, it is now recommended, within both national and international codes and standards, that full STI measurement and analysis be employed. However, new research is reported, that indicates that STI is not as flawless, nor robust as many believe. The paper highlights a number of potential error mechanisms. It is shown that the measurement technique and signal excitation stimulus can have a significant effect on the overall result and accuracy, particularly where DSP-based equipment is employed. It is also shown that in its current state of development, STI is not capable of appropriately accounting for a number of fundamental speech and system attributes, including typical sound system frequency response variations and anomalies. This is particularly shown to be the case when a system is operating under reverberant conditions. Comparisons between actual system measurements and corresponding word score data are reported where errors of up to 50 implications for VA and PA system performance verification will be discussed.
The transition to increased automaticity during finger sequence learning in adult males who stutter.
Smits-Bandstra, Sarah; De Nil, Luc; Rochon, Elizabeth
2006-01-01
The present study compared the automaticity levels of persons who stutter (PWS) and persons who do not stutter (PNS) on a practiced finger sequencing task under dual task conditions. Automaticity was defined as the amount of attention required for task performance. Twelve PWS and 12 control subjects practiced finger tapping sequences under single and then dual task conditions. Control subjects performed the sequencing task significantly faster and less variably under single versus dual task conditions while PWS' performance was consistently slow and variable (comparable to the dual task performance of control subjects) under both conditions. Control subjects were significantly more accurate on a colour recognition distracter task than PWS under dual task conditions. These results suggested that control subjects transitioned to quick, accurate and increasingly automatic performance on the sequencing task after practice, while PWS did not. Because most stuttering treatment programs for adults include practice and automatization of new motor speech skills, findings of this finger sequencing study and future studies of speech sequence learning may have important implications for how to maximize stuttering treatment effectiveness. As a result of this activity, the participant will be able to: (1) Define automaticity and explain the importance of dual task paradigms to investigate automaticity; (2) Relate the proposed relationship between motor learning and automaticity as stated by the authors; (3) Summarize the reviewed literature concerning the performance of PWS on dual tasks; and (4) Explain why the ability to transition to automaticity during motor learning may have important clinical implications for stuttering treatment effectiveness.
Polur, Prasad D; Miller, Gerald E
2006-10-01
Computer speech recognition of individuals with dysarthria, such as cerebral palsy patients requires a robust technique that can handle conditions of very high variability and limited training data. In this study, application of a 10 state ergodic hidden Markov model (HMM)/artificial neural network (ANN) hybrid structure for a dysarthric speech (isolated word) recognition system, intended to act as an assistive tool, was investigated. A small size vocabulary spoken by three cerebral palsy subjects was chosen. The effect of such a structure on the recognition rate of the system was investigated by comparing it with an ergodic hidden Markov model as a control tool. This was done in order to determine if this modified technique contributed to enhanced recognition of dysarthric speech. The speech was sampled at 11 kHz. Mel frequency cepstral coefficients were extracted from them using 15 ms frames and served as training input to the hybrid model setup. The subsequent results demonstrated that the hybrid model structure was quite robust in its ability to handle the large variability and non-conformity of dysarthric speech. The level of variability in input dysarthric speech patterns sometimes limits the reliability of the system. However, its application as a rehabilitation/control tool to assist dysarthric motor impaired individuals holds sufficient promise.
Automatic Classification of Question & Answer Discourse Segments from Teacher's Speech in Classrooms
ERIC Educational Resources Information Center
Blanchard, Nathaniel; D'Mello, Sidney; Olney, Andrew M.; Nystrand, Martin
2015-01-01
Question-answer (Q&A) is fundamental for dialogic instruction, an important pedagogical technique based on the free exchange of ideas and open-ended discussion. Automatically detecting Q&A is key to providing teachers with feedback on appropriate use of dialogic instructional strategies. In line with this, this paper studies the…
Automatic Online Lecture Highlighting Based on Multimedia Analysis
ERIC Educational Resources Information Center
Che, Xiaoyin; Yang, Haojin; Meinel, Christoph
2018-01-01
Textbook highlighting is widely considered to be beneficial for students. In this paper, we propose a comprehensive solution to highlight the online lecture videos in both sentence- and segment-level, just as is done with paper books. The solution is based on automatic analysis of multimedia lecture materials, such as speeches, transcripts, and…
Learning diagnostic models using speech and language measures.
Peintner, Bart; Jarrold, William; Vergyriy, Dimitra; Richey, Colleen; Tempini, Maria Luisa Gorno; Ogar, Jennifer
2008-01-01
We describe results that show the effectiveness of machine learning in the automatic diagnosis of certain neurodegenerative diseases, several of which alter speech and language production. We analyzed audio from 9 control subjects and 30 patients diagnosed with one of three subtypes of Frontotemporal Lobar Degeneration. From this data, we extracted features of the audio signal and the words the patient used, which were obtained using our automated transcription technologies. We then automatically learned models that predict the diagnosis of the patient using these features. Our results show that learned models over these features predict diagnosis with accuracy significantly better than random. Future studies using higher quality recordings will likely improve these results.
DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1
NASA Astrophysics Data System (ADS)
Garofolo, J. S.; Lamel, L. F.; Fisher, W. M.; Fiscus, J. G.; Pallett, D. S.
1993-02-01
The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing 8 major dialect divisions of American English, each speaking 10 phonetically-rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence. The release of TIMIT contains several improvements over the Prototype CD-ROM released in December, 1988: (1) full 630-speaker corpus, (2) checked and corrected transcriptions, (3) word-alignment transcriptions, (4) NIST SPHERE-headered waveform files and header manipulation software, (5) phonemic dictionary, (6) new test and training subsets balanced for dialectal and phonetic coverage, and (7) more extensive documentation.
A Framework for Speech Activity Detection Using Adaptive Auditory Receptive Fields.
Carlin, Michael A; Elhilali, Mounya
2015-12-01
One of the hallmarks of sound processing in the brain is the ability of the nervous system to adapt to changing behavioral demands and surrounding soundscapes. It can dynamically shift sensory and cognitive resources to focus on relevant sounds. Neurophysiological studies indicate that this ability is supported by adaptively retuning the shapes of cortical spectro-temporal receptive fields (STRFs) to enhance features of target sounds while suppressing those of task-irrelevant distractors. Because an important component of human communication is the ability of a listener to dynamically track speech in noisy environments, the solution obtained by auditory neurophysiology implies a useful adaptation strategy for speech activity detection (SAD). SAD is an important first step in a number of automated speech processing systems, and performance is often reduced in highly noisy environments. In this paper, we describe how task-driven adaptation is induced in an ensemble of neurophysiological STRFs, and show how speech-adapted STRFs reorient themselves to enhance spectro-temporal modulations of speech while suppressing those associated with a variety of nonspeech sounds. We then show how an adapted ensemble of STRFs can better detect speech in unseen noisy environments compared to an unadapted ensemble and a noise-robust baseline. Finally, we use a stimulus reconstruction task to demonstrate how the adapted STRF ensemble better captures the spectrotemporal modulations of attended speech in clean and noisy conditions. Our results suggest that a biologically plausible adaptation framework can be applied to speech processing systems to dynamically adapt feature representations for improving noise robustness.
Segmenting words from natural speech: subsegmental variation in segmental cues.
Rytting, C Anton; Brew, Chris; Fosler-Lussier, Eric
2010-06-01
Most computational models of word segmentation are trained and tested on transcripts of speech, rather than the speech itself, and assume that speech is converted into a sequence of symbols prior to word segmentation. We present a way of representing speech corpora that avoids this assumption, and preserves acoustic variation present in speech. We use this new representation to re-evaluate a key computational model of word segmentation. One finding is that high levels of phonetic variability degrade the model's performance. While robustness to phonetic variability may be intrinsically valuable, this finding needs to be complemented by parallel studies of the actual abilities of children to segment phonetically variable speech.
ERIC Educational Resources Information Center
Harkins, Judith E., Ed.; Virvan, Barbara M., Ed.
The conference proceedings contains 23 papers on telephone relay service, real-time captioning, and automatic speech recognition, and a glossary. The keynote address, by Representative Major R. Owens, examines current issues in federal legislation. Other papers have the following titles and authors: "Telephone Relay Service: Rationale and…
NASA Astrophysics Data System (ADS)
He, Di; Lim, Boon Pang; Yang, Xuesong; Hasegawa-Johnson, Mark; Chen, Deming
2018-06-01
Most mainstream Automatic Speech Recognition (ASR) systems consider all feature frames equally important. However, acoustic landmark theory is based on a contradictory idea, that some frames are more important than others. Acoustic landmark theory exploits quantal non-linearities in the articulatory-acoustic and acoustic-perceptual relations to define landmark times at which the speech spectrum abruptly changes or reaches an extremum; frames overlapping landmarks have been demonstrated to be sufficient for speech perception. In this work, we conduct experiments on the TIMIT corpus, with both GMM and DNN based ASR systems and find that frames containing landmarks are more informative for ASR than others. We find that altering the level of emphasis on landmarks by re-weighting acoustic likelihood tends to reduce the phone error rate (PER). Furthermore, by leveraging the landmark as a heuristic, one of our hybrid DNN frame dropping strategies maintained a PER within 0.44% of optimal when scoring less than half (45.8% to be precise) of the frames. This hybrid strategy out-performs other non-heuristic-based methods and demonstrate the potential of landmarks for reducing computation.
Hakvoort, Britt; de Bree, Elise; van der Leij, Aryan; Maassen, Ben; van Setten, Ellie; Maurits, Natasha; van Zuijen, Titia L
2016-12-01
This study assessed whether a categorical speech perception (CP) deficit is associated with dyslexia or familial risk for dyslexia, by exploring a possible cascading relation from speech perception to phonology to reading and by identifying whether speech perception distinguishes familial risk (FR) children with dyslexia (FRD) from those without dyslexia (FRND). Data were collected from 9-year-old FRD (n = 37) and FRND (n = 41) children and age-matched controls (n = 49) on CP identification and discrimination and on the phonological processing measures rapid automatized naming, phoneme awareness, and nonword repetition. The FRD group performed more poorly on CP than the FRND and control groups. Findings on phonological processing align with the literature in that (a) phonological processing related to reading and (b) the FRD group showed the lowest phonological processing outcomes. Furthermore, CP correlated weakly with reading, but this relationship was fully mediated by rapid automatized naming. Although CP phonological skills are related to dyslexia, there was no strong evidence for a cascade from CP to phonology to reading. Deficits in CP at the behavioral level are not directly associated with dyslexia.
An attention-gating recurrent working memory architecture for emergent speech representation
NASA Astrophysics Data System (ADS)
Elshaw, Mark; Moore, Roger K.; Klein, Michael
2010-06-01
This paper describes an attention-gating recurrent self-organising map approach for emergent speech representation. Inspired by evidence from human cognitive processing, the architecture combines two main neural components. The first component, the attention-gating mechanism, uses actor-critic learning to perform selective attention towards speech. Through this selective attention approach, the attention-gating mechanism controls access to working memory processing. The second component, the recurrent self-organising map memory, develops a temporal-distributed representation of speech using phone-like structures. Representing speech in terms of phonetic features in an emergent self-organised fashion, according to research on child cognitive development, recreates the approach found in infants. Using this representational approach, in a fashion similar to infants, should improve the performance of automatic recognition systems through aiding speech segmentation and fast word learning.
Strategies for distant speech recognitionin reverberant environments
NASA Astrophysics Data System (ADS)
Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro
2015-12-01
Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.
Automatic measurement of voice onset time using discriminative structured prediction.
Sonderegger, Morgan; Keshet, Joseph
2012-12-01
A discriminative large-margin algorithm for automatic measurement of voice onset time (VOT) is described, considered as a case of predicting structured output from speech. Manually labeled data are used to train a function that takes as input a speech segment of an arbitrary length containing a voiceless stop, and outputs its VOT. The function is explicitly trained to minimize the difference between predicted and manually measured VOT; it operates on a set of acoustic feature functions designed based on spectral and temporal cues used by human VOT annotators. The algorithm is applied to initial voiceless stops from four corpora, representing different types of speech. Using several evaluation methods, the algorithm's performance is near human intertranscriber reliability, and compares favorably with previous work. Furthermore, the algorithm's performance is minimally affected by training and testing on different corpora, and remains essentially constant as the amount of training data is reduced to 50-250 manually labeled examples, demonstrating the method's practical applicability to new datasets.
López-de-Ipiña, Karmele; Alonso, Jesus-Bernardino; Travieso, Carlos Manuel; Solé-Casals, Jordi; Egiraun, Harkaitz; Faundez-Zanuy, Marcos; Ezeiza, Aitzol; Barroso, Nora; Ecay-Torres, Miriam; Martinez-Lage, Pablo; de Lizardui, Unai Martinez
2013-01-01
The work presented here is part of a larger study to identify novel technologies and biomarkers for early Alzheimer disease (AD) detection and it focuses on evaluating the suitability of a new approach for early AD diagnosis by non-invasive methods. The purpose is to examine in a pilot study the potential of applying intelligent algorithms to speech features obtained from suspected patients in order to contribute to the improvement of diagnosis of AD and its degree of severity. In this sense, Artificial Neural Networks (ANN) have been used for the automatic classification of the two classes (AD and control subjects). Two human issues have been analyzed for feature selection: Spontaneous Speech and Emotional Response. Not only linear features but also non-linear ones, such as Fractal Dimension, have been explored. The approach is non invasive, low cost and without any side effects. Obtained experimental results were very satisfactory and promising for early diagnosis and classification of AD patients. PMID:23698268
NASA Astrophysics Data System (ADS)
Endah, S. N.; Nugraheni, D. M. K.; Adhy, S.; Sutikno
2017-04-01
According to Law No. 32 of 2002 and the Indonesian Broadcasting Commission Regulation No. 02/P/KPI/12/2009 & No. 03/P/KPI/12/2009, stated that broadcast programs should not scold with harsh words, not harass, insult or demean minorities and marginalized groups. However, there are no suitable tools to censor those words automatically. Therefore, researches to develop a system of intelligent software to censor the words automatically are needed. To conduct censor, the system must be able to recognize the words in question. This research proposes the classification of speech divide into two classes using Support Vector Machine (SVM), first class is set of rude words and the second class is set of properly words. The speech pitch values as an input in SVM, it used for the development of the system for the Indonesian rude swear word. The results of the experiment show that SVM is good for this system.
An overview of neural function and feedback control in human communication.
Hood, L J
1998-01-01
The speech and hearing mechanisms depend on accurate sensory information and intact feedback mechanisms to facilitate communication. This article provides a brief overview of some components of the nervous system important for human communication and some electrophysiological methods used to measure cortical function in humans. An overview of automatic control and feedback mechanisms in general and as they pertain to the speech motor system and control of the hearing periphery is also presented, along with a discussion of how the speech and auditory systems interact.
A social feedback loop for speech development and its reduction in autism
Warlaumont, Anne S.; Richards, Jeffrey A.; Gilkerson, Jill; Oller, D. Kimbrough
2014-01-01
We analyze the microstructure of child-adult interaction during naturalistic, daylong, automatically labeled audio recordings (13,836 hours total) of children (8- to 48-month-olds) with and without autism. We find that adult responses are more likely when child vocalizations are speech-related. In turn, a child vocalization is more likely to be speech-related if the previous speech-related child vocalization received an immediate adult response. Taken together, these results are consistent with the idea that there is a social feedback loop between child and caregiver that promotes speech-language development. Although this feedback loop applies in both typical development and autism, children with autism produce proportionally fewer speech-related vocalizations and the responses they receive are less contingent on whether their vocalizations are speech-related. We argue that such differences will diminish the strength of the social feedback loop with cascading effects on speech development over time. Differences related to socioeconomic status are also reported. PMID:24840717
Random Deep Belief Networks for Recognizing Emotions from Speech Signals.
Wen, Guihua; Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang
2017-01-01
Now the human emotions can be recognized from speech signals using machine learning methods; however, they are challenged by the lower recognition accuracies in real applications due to lack of the rich representation ability. Deep belief networks (DBN) can automatically discover the multiple levels of representations in speech signals. To make full of its advantages, this paper presents an ensemble of random deep belief networks (RDBN) method for speech emotion recognition. It firstly extracts the low level features of the input speech signal and then applies them to construct lots of random subspaces. Each random subspace is then provided for DBN to yield the higher level features as the input of the classifier to output an emotion label. All outputted emotion labels are then fused through the majority voting to decide the final emotion label for the input speech signal. The conducted experimental results on benchmark speech emotion databases show that RDBN has better accuracy than the compared methods for speech emotion recognition.
Random Deep Belief Networks for Recognizing Emotions from Speech Signals
Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang
2017-01-01
Now the human emotions can be recognized from speech signals using machine learning methods; however, they are challenged by the lower recognition accuracies in real applications due to lack of the rich representation ability. Deep belief networks (DBN) can automatically discover the multiple levels of representations in speech signals. To make full of its advantages, this paper presents an ensemble of random deep belief networks (RDBN) method for speech emotion recognition. It firstly extracts the low level features of the input speech signal and then applies them to construct lots of random subspaces. Each random subspace is then provided for DBN to yield the higher level features as the input of the classifier to output an emotion label. All outputted emotion labels are then fused through the majority voting to decide the final emotion label for the input speech signal. The conducted experimental results on benchmark speech emotion databases show that RDBN has better accuracy than the compared methods for speech emotion recognition. PMID:28356908
A Construction System for CALL Materials from TV News with Captions
NASA Astrophysics Data System (ADS)
Kobayashi, Satoshi; Tanaka, Takashi; Mori, Kazumasa; Nakagawa, Seiichi
Many language learning materials have been published. In language learning, although repetition training is obviously necessary, it is difficult to maintain the learner's interest/motivation using existing learning materials, because those materials are limited in their scope and contents. In addition, we doubt whether the speech sounds used in most materials are natural in various situations. Nowadays, some TV news programs (CNN, ABC, PBS, NHK, etc.) have closed/open captions corresponding to the announcer's speech. We have developed a system that makes Computer Assisted Language Learning (CALL) materials for both English learning by Japanese and Japanese learning by foreign students from such captioned newscasts. This system computes the synchronization between captions and speech by using HMMs and a forced alignment algorithm. Materials made by the system have following functions: full/partial text caption display, repetition listening, consulting an electronic dictionary, display of the user's/announcer's sound waveform and pitch contour, and automatic construction of a dictation test. Materials have following advantages: materials present polite and natural speech, various and timely topics. Furthermore, the materials have the following possibility: automatic creation of listening/understanding tests, and storage/retrieval of the many materials. In this paper, firstly, we present the organization of the system. Then, we describe results of questionnaires on trial use of the materials. As the result, we got enough accuracy on the synchronization between captions and speech. Speaking totally, we encouraged to research this system.
Cognitive Load in Voice Therapy Carry-Over Exercises.
Iwarsson, Jenny; Morris, David Jackson; Balling, Laura Winther
2017-01-01
The cognitive load generated by online speech production may vary with the nature of the speech task. This article examines 3 speech tasks used in voice therapy carry-over exercises, in which a patient is required to adopt and automatize new voice behaviors, ultimately in daily spontaneous communication. Twelve subjects produced speech in 3 conditions: rote speech (weekdays), sentences in a set form, and semispontaneous speech. Subjects simultaneously performed a secondary visual discrimination task for which response times were measured. On completion of each speech task, subjects rated their experience on a questionnaire. Response times from the secondary, visual task were found to be shortest for the rote speech, longer for the semispontaneous speech, and longest for the sentences within the set framework. Principal components derived from the subjective ratings were found to be linked to response times on the secondary visual task. Acoustic measures reflecting fundamental frequency distribution and vocal fold compression varied across the speech tasks. The results indicate that consideration should be given to the selection of speech tasks during the process leading to automation of revised speech behavior and that self-reports may be a reliable index of cognitive load.
NASA Astrophysics Data System (ADS)
Kayasith, Prakasith; Theeramunkong, Thanaruk
It is a tedious and subjective task to measure severity of a dysarthria by manually evaluating his/her speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess speech quality of a dysarthric speaker with cerebral palsy. With the consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce consistent speech signal for a certain word and distinguished speech signal for different words. As an application, it can be used to assess speech quality and forecast speech recognition rate of speech made by an individual dysarthric speaker before actual exhaustive implementation of an automatic speech recognition system for the speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations had been done by comparing its predicted recognition rates with ones predicted by the standard methods called the articulatory and intelligibility tests based on the two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting recognition rate of dysarthric speech. All experiments had been done on speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.
Robot Command Interface Using an Audio-Visual Speech Recognition System
NASA Astrophysics Data System (ADS)
Ceballos, Alexánder; Gómez, Juan; Prieto, Flavio; Redarce, Tanneguy
In recent years audio-visual speech recognition has emerged as an active field of research thanks to advances in pattern recognition, signal processing and machine vision. Its ultimate goal is to allow human-computer communication using voice, taking into account the visual information contained in the audio-visual speech signal. This document presents a command's automatic recognition system using audio-visual information. The system is expected to control the laparoscopic robot da Vinci. The audio signal is treated using the Mel Frequency Cepstral Coefficients parametrization method. Besides, features based on the points that define the mouth's outer contour according to the MPEG-4 standard are used in order to extract the visual speech information.
Real-time spectrum estimation–based dual-channel speech-enhancement algorithm for cochlear implant
2012-01-01
Background Improvement of the cochlear implant (CI) front-end signal acquisition is needed to increase speech recognition in noisy environments. To suppress the directional noise, we introduce a speech-enhancement algorithm based on microphone array beamforming and spectral estimation. The experimental results indicate that this method is robust to directional mobile noise and strongly enhances the desired speech, thereby improving the performance of CI devices in a noisy environment. Methods The spectrum estimation and the array beamforming methods were combined to suppress the ambient noise. The directivity coefficient was estimated in the noise-only intervals, and was updated to fit for the mobile noise. Results The proposed algorithm was realized in the CI speech strategy. For actual parameters, we use Maxflat filter to obtain fractional sampling points and cepstrum method to differentiate the desired speech frame and the noise frame. The broadband adjustment coefficients were added to compensate the energy loss in the low frequency band. Discussions The approximation of the directivity coefficient is tested and the errors are discussed. We also analyze the algorithm constraint for noise estimation and distortion in CI processing. The performance of the proposed algorithm is analyzed and further be compared with other prevalent methods. Conclusions The hardware platform was constructed for the experiments. The speech-enhancement results showed that our algorithm can suppresses the non-stationary noise with high SNR. Excellent performance of the proposed algorithm was obtained in the speech enhancement experiments and mobile testing. And signal distortion results indicate that this algorithm is robust with high SNR improvement and low speech distortion. PMID:23006896
How should a speech recognizer work?
Scharenborg, Odette; Norris, Dennis; Bosch, Louis; McQueen, James M
2005-11-12
Although researchers studying human speech recognition (HSR) and automatic speech recognition (ASR) share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of communication follows largely from the fact that research in these related fields has focused on the mechanics of how speech can be recognized. In Marr's (1982) terms, emphasis has been on the algorithmic and implementational levels rather than on the computational level. In this article, we provide a computational-level analysis of the task of speech recognition, which reveals the close parallels between research concerned with HSR and ASR. We illustrate this relation by presenting a new computational model of human spoken-word recognition, built using techniques from the field of ASR that, in contrast to current existing models of HSR, recognizes words from real speech input. 2005 Lawrence Erlbaum Associates, Inc.
A social feedback loop for speech development and its reduction in autism.
Warlaumont, Anne S; Richards, Jeffrey A; Gilkerson, Jill; Oller, D Kimbrough
2014-07-01
We analyzed the microstructure of child-adult interaction during naturalistic, daylong, automatically labeled audio recordings (13,836 hr total) of children (8- to 48-month-olds) with and without autism. We found that an adult was more likely to respond when the child's vocalization was speech related rather than not speech related. In turn, a child's vocalization was more likely to be speech related if the child's previous speech-related vocalization had received an immediate adult response rather than no response. Taken together, these results are consistent with the idea that there is a social feedback loop between child and caregiver that promotes speech development. Although this feedback loop applies in both typical development and autism, children with autism produced proportionally fewer speech-related vocalizations, and the responses they received were less contingent on whether their vocalizations were speech related. We argue that such differences will diminish the strength of the social feedback loop and have cascading effects on speech development over time. Differences related to socioeconomic status are also reported. © The Author(s) 2014.
Processing of speech signals for physical and sensory disabilities.
Levitt, H
1995-01-01
Assistive technology involving voice communication is used primarily by people who are deaf, hard of hearing, or who have speech and/or language disabilities. It is also used to a lesser extent by people with visual or motor disabilities. A very wide range of devices has been developed for people with hearing loss. These devices can be categorized not only by the modality of stimulation [i.e., auditory, visual, tactile, or direct electrical stimulation of the auditory nerve (auditory-neural)] but also in terms of the degree of speech processing that is used. At least four such categories can be distinguished: assistive devices (a) that are not designed specifically for speech, (b) that take the average characteristics of speech into account, (c) that process articulatory or phonetic characteristics of speech, and (d) that embody some degree of automatic speech recognition. Assistive devices for people with speech and/or language disabilities typically involve some form of speech synthesis or symbol generation for severe forms of language disability. Speech synthesis is also used in text-to-speech systems for sightless persons. Other applications of assistive technology involving voice communication include voice control of wheelchairs and other devices for people with mobility disabilities. Images Fig. 4 PMID:7479816
Processing of Speech Signals for Physical and Sensory Disabilities
NASA Astrophysics Data System (ADS)
Levitt, Harry
1995-10-01
Assistive technology involving voice communication is used primarily by people who are deaf, hard of hearing, or who have speech and/or language disabilities. It is also used to a lesser extent by people with visual or motor disabilities. A very wide range of devices has been developed for people with hearing loss. These devices can be categorized not only by the modality of stimulation [i.e., auditory, visual, tactile, or direct electrical stimulation of the auditory nerve (auditory-neural)] but also in terms of the degree of speech processing that is used. At least four such categories can be distinguished: assistive devices (a) that are not designed specifically for speech, (b) that take the average characteristics of speech into account, (c) that process articulatory or phonetic characteristics of speech, and (d) that embody some degree of automatic speech recognition. Assistive devices for people with speech and/or language disabilities typically involve some form of speech synthesis or symbol generation for severe forms of language disability. Speech synthesis is also used in text-to-speech systems for sightless persons. Other applications of assistive technology involving voice communication include voice control of wheelchairs and other devices for people with mobility disabilities.
Noise Robust Speech Recognition Applied to Voice-Driven Wheelchair
NASA Astrophysics Data System (ADS)
Sasou, Akira; Kojima, Hiroaki
2009-12-01
Conventional voice-driven wheelchairs usually employ headset microphones that are capable of achieving sufficient recognition accuracy, even in the presence of surrounding noise. However, such interfaces require users to wear sensors such as a headset microphone, which can be an impediment, especially for the hand disabled. Conversely, it is also well known that the speech recognition accuracy drastically degrades when the microphone is placed far from the user. In this paper, we develop a noise robust speech recognition system for a voice-driven wheelchair. This system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors. We verified the effectiveness of our system in experiments in different environments, and confirmed that our system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors.
Joint Spatial-Spectral Feature Space Clustering for Speech Activity Detection from ECoG Signals
Kanas, Vasileios G.; Mporas, Iosif; Benz, Heather L.; Sgarbas, Kyriakos N.; Bezerianos, Anastasios; Crone, Nathan E.
2014-01-01
Brain machine interfaces for speech restoration have been extensively studied for more than two decades. The success of such a system will depend in part on selecting the best brain recording sites and signal features corresponding to speech production. The purpose of this study was to detect speech activity automatically from electrocorticographic signals based on joint spatial-frequency clustering of the ECoG feature space. For this study, the ECoG signals were recorded while a subject performed two different syllable repetition tasks. We found that the optimal frequency resolution to detect speech activity from ECoG signals was 8 Hz, achieving 98.8% accuracy by employing support vector machines (SVM) as a classifier. We also defined the cortical areas that held the most information about the discrimination of speech and non-speech time intervals. Additionally, the results shed light on the distinct cortical areas associated with the two syllable repetition tasks and may contribute to the development of portable ECoG-based communication. PMID:24658248
Niijima, H; Ito, N; Ogino, S; Takatori, T; Iwase, H; Kobayashi, M
2000-11-01
For the purpose of practical use of speech recognition technology for recording of forensic autopsy, a language model of the speech recording system, specialized for the forensic autopsy, was developed. The language model for the forensic autopsy by applying 3-gram model was created, and an acoustic model for Japanese speech recognition by Hidden Markov Model in addition to the above were utilized to customize the speech recognition engine for forensic autopsy. A forensic vocabulary set of over 10,000 words was compiled and some 300,000 sentence patterns were made to create the forensic language model, then properly mixing with a general language model to attain high exactitude. When tried by dictating autopsy findings, this speech recognition system was proved to be about 95% of recognition rate that seems to have reached to the practical usability in view of speech recognition software, though there remains rooms for improving its hardware and application-layer software.
Optimal pattern synthesis for speech recognition based on principal component analysis
NASA Astrophysics Data System (ADS)
Korsun, O. N.; Poliyev, A. V.
2018-02-01
The algorithm for building an optimal pattern for the purpose of automatic speech recognition, which increases the probability of correct recognition, is developed and presented in this work. The optimal pattern forming is based on the decomposition of an initial pattern to principal components, which enables to reduce the dimension of multi-parameter optimization problem. At the next step the training samples are introduced and the optimal estimates for principal components decomposition coefficients are obtained by a numeric parameter optimization algorithm. Finally, we consider the experiment results that show the improvement in speech recognition introduced by the proposed optimization algorithm.
Auditory models for speech analysis
NASA Astrophysics Data System (ADS)
Maybury, Mark T.
This paper reviews the psychophysical basis for auditory models and discusses their application to automatic speech recognition. First an overview of the human auditory system is presented, followed by a review of current knowledge gleaned from neurological and psychoacoustic experimentation. Next, a general framework describes established peripheral auditory models which are based on well-understood properties of the peripheral auditory system. This is followed by a discussion of current enhancements to that models to include nonlinearities and synchrony information as well as other higher auditory functions. Finally, the initial performance of auditory models in the task of speech recognition is examined and additional applications are mentioned.
Effects and modeling of phonetic and acoustic confusions in accented speech.
Fung, Pascale; Liu, Yi
2005-11-01
Accented speech recognition is more challenging than standard speech recognition due to the effects of phonetic and acoustic confusions. Phonetic confusion in accented speech occurs when an expected phone is pronounced as a different one, which leads to erroneous recognition. Acoustic confusion occurs when the pronounced phone is found to lie acoustically between two baseform models and can be equally recognized as either one. We propose that it is necessary to analyze and model these confusions separately in order to improve accented speech recognition without degrading standard speech recognition. Since low phonetic confusion units in accented speech do not give rise to automatic speech recognition errors, we focus on analyzing and reducing phonetic and acoustic confusability under high phonetic confusion conditions. We propose using likelihood ratio test to measure phonetic confusion, and asymmetric acoustic distance to measure acoustic confusion. Only accent-specific phonetic units with low acoustic confusion are used in an augmented pronunciation dictionary, while phonetic units with high acoustic confusion are reconstructed using decision tree merging. Experimental results show that our approach is effective and superior to methods modeling phonetic confusion or acoustic confusion alone in accented speech, with a significant 5.7% absolute WER reduction, without degrading standard speech recognition.
Pathways of the inferior frontal occipital fasciculus in overt speech and reading.
Rollans, Claire; Cheema, Kulpreet; Georgiou, George K; Cummine, Jacqueline
2017-11-19
In this study, we examined the relationship between tractography-based measures of white matter integrity (ex. fractional anisotropy [FA]) from diffusion tensor imaging (DTI) and five reading-related tasks, including rapid automatized naming (RAN) of letters, digits, and objects, and reading of real words and nonwords. Twenty university students with no reported history of reading difficulties were tested on all five tasks and their performance was correlated with diffusion measures extracted through DTI tractography. A secondary analysis using whole-brain Tract-Based Spatial Statistics (TBSS) was also used to find clusters showing significant negative correlations between reaction time and FA. Results showed a significant relationship between the left inferior fronto-occipital fasciculus FA and performance on the RAN of objects task, as well as a strong relationship to nonword reading, which suggests a role for this tract in slower, non-automatic and/or resource-demanding speech tasks. There were no significant relationships between FA and the faster, more automatic speech tasks (RAN of letters and digits, and real word reading). These findings provide evidence for the role of the inferior fronto-occipital fasciculus in tasks that are highly demanding of orthography-phonology translation (e.g., nonword reading) and semantic processing (e.g., RAN object). This demonstrates the importance of the inferior fronto-occipital fasciculus in basic naming and suggests that this tract may be a sensitive predictor of rapid naming performance within the typical population. We discuss the findings in the context of current models of reading and speech production to further characterize the white matter pathways associated with basic reading processes. Copyright © 2017 IBRO. Published by Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Kuznetsov, Michael V.
2006-05-01
For reliable teamwork of various systems of automatic telecommunication including transferring systems of optical communication networks it is necessary authentic recognition of signals for one- or two-frequency service signal system. The analysis of time parameters of an accepted signal allows increasing reliability of detection and recognition of the service signal system on a background of speech.
Open-Source Multi-Language Audio Database for Spoken Language Processing Applications
2012-12-01
Mandarin, and Russian . Approximately 30 hours of speech were collected for each language. Each passage has been carefully transcribed at the...manual and automatic methods. The Russian passages have not yet been marked at the phonetic level. Another phase of the work was to explore...You Tube. 300 passages were collected in each of three languages—English, Mandarin, and Russian . Approximately 30 hours of speech were
Wołk, Agnieszka; Glinkowski, Wojciech
2017-01-01
People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they are dependent on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition are dependent on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer. PMID:29230254
Wołk, Krzysztof; Wołk, Agnieszka; Glinkowski, Wojciech
2017-01-01
People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they are dependent on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition are dependent on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer.
Caballero-Morales, Santiago-Omar
2013-01-01
An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of emotional state of Mexican Spanish speech. PMID:23935410
Bone, Daniel; Lee, Chi-Chun; Black, Matthew P.; Williams, Marian E.; Lee, Sungbok; Levitt, Pat; Narayanan, Shrikanth
2015-01-01
Purpose The purpose of this study was to examine relationships between prosodic speech cues and autism spectrum disorder (ASD) severity, hypothesizing a mutually interactive relationship between the speech characteristics of the psychologist and the child. The authors objectively quantified acoustic-prosodic cues of the psychologist and of the child with ASD during spontaneous interaction, establishing a methodology for future large-sample analysis. Method Speech acoustic-prosodic features were semiautomatically derived from segments of semistructured interviews (Autism Diagnostic Observation Schedule, ADOS; Lord, Rutter, DiLavore, & Risi, 1999; Lord et al., 2012) with 28 children who had previously been diagnosed with ASD. Prosody was quantified in terms of intonation, volume, rate, and voice quality. Research hypotheses were tested via correlation as well as hierarchical and predictive regression between ADOS severity and prosodic cues. Results Automatically extracted speech features demonstrated prosodic characteristics of dyadic interactions. As rated ASD severity increased, both the psychologist and the child demonstrated effects for turn-end pitch slope, and both spoke with atypical voice quality. The psychologist’s acoustic cues predicted the child’s symptom severity better than did the child’s acoustic cues. Conclusion The psychologist, acting as evaluator and interlocutor, was shown to adjust his or her behavior in predictable ways based on the child’s social-communicative impairments. The results support future study of speech prosody of both interaction partners during spontaneous conversation, while using automatic computational methods that allow for scalable analysis on much larger corpora. PMID:24686340
Speech recognition systems on the Cell Broadband Engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Y; Jones, H; Vaidya, S
In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine{trademark} (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousandsmore » of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.« less
Cao, Beiming; Kim, Myungjong; Mau, Ted; Wang, Jun
2017-01-01
Individuals with larynx (vocal folds) impaired have problems in controlling their glottal vibration, producing whispered speech with extreme hoarseness. Standard automatic speech recognition using only acoustic cues is typically ineffective for whispered speech because the corresponding spectral characteristics are distorted. Articulatory cues such as the tongue and lip motion may help in recognizing whispered speech since articulatory motion patterns are generally not affected. In this paper, we investigated whispered speech recognition for patients with reconstructed larynx using articulatory movement data. A data set with both acoustic and articulatory motion data was collected from a patient with surgically reconstructed larynx using an electromagnetic articulograph. Two speech recognition systems, Gaussian mixture model-hidden Markov model (GMM-HMM) and deep neural network-HMM (DNN-HMM), were used in the experiments. Experimental results showed adding either tongue or lip motion data to acoustic features such as mel-frequency cepstral coefficient (MFCC) significantly reduced the phone error rates on both speech recognition systems. Adding both tongue and lip data achieved the best performance. PMID:29423453
The brain is the control center of the body. It controls thoughts, memory, speech, and movement. It regulates the function of many organs. When the brain is healthy, it works quickly and automatically. However, ...
ERIC Educational Resources Information Center
Nip, Ignatius S. B.; Blumenfeld, Henrike K.
2015-01-01
Purpose: Second-language (L2) production requires greater cognitive resources to inhibit the native language and to retrieve less robust lexical representations. The current investigation identifies how proficiency and linguistic complexity, specifically syntactic and lexical factors, influence speech motor control and performance. Method: Speech…
NASA Astrophysics Data System (ADS)
Ghoraani, Behnaz; Krishnan, Sridhar
2009-12-01
The number of people affected by speech problems is increasing as the modern world places increasing demands on the human voice via mobile telephones, voice recognition software, and interpersonal verbal communications. In this paper, we propose a novel methodology for automatic pattern classification of pathological voices. The main contribution of this paper is extraction of meaningful and unique features using Adaptive time-frequency distribution (TFD) and nonnegative matrix factorization (NMF). We construct Adaptive TFD as an effective signal analysis domain to dynamically track the nonstationarity in the speech and utilize NMF as a matrix decomposition (MD) technique to quantify the constructed TFD. The proposed method extracts meaningful and unique features from the joint TFD of the speech, and automatically identifies and measures the abnormality of the signal. Depending on the abnormality measure of each signal, we classify the signal into normal or pathological. The proposed method is applied on the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database which consists of 161 pathological and 51 normal speakers, and an overall classification accuracy of 98.6% was achieved.
Computational Modeling of Emotions and Affect in Social-Cultural Interaction
2013-10-02
acoustic and textual information sources. Second, a cross-lingual study was performed that shed light on how human perception and automatic recognition...speech is produced, a speaker’s pitch and intonational pattern, and word usage. Better feature representation and advanced approaches were used to...recognition performance, and improved our understanding of language/cultural impact on human perception of emotion and automatic classification. • Units
Automatic Barometric Updates from Ground-Based Navigational Aids
1990-03-12
ro fAutomatic Barometric Updates US Department from of Transportation Ground-Based Federal Aviation Administration Navigational Aids Office of Safety...tighter vertical spacing controls , particularly for operations near Terminal Control Areas (TCAs), Airport Radar Service Areas (ARSAs), military climb and...E.F., Ruth, J.C., and Williges, B.H. (1987). Speech Controls and Displays. In Salvendy, G., E. Handbook of Human Factors/Ergonomics, New York, John
Childhood apraxia of speech: A survey of praxis and typical speech characteristics.
Malmenholt, Ann; Lohmander, Anette; McAllister, Anita
2017-07-01
The purpose of this study was to investigate current knowledge of the diagnosis childhood apraxia of speech (CAS) in Sweden and compare speech characteristics and symptoms to those of earlier survey findings in mainly English-speakers. In a web-based questionnaire 178 Swedish speech-language pathologists (SLPs) anonymously answered questions about their perception of typical speech characteristics for CAS. They graded own assessment skills and estimated clinical occurrence. The seven top speech characteristics reported as typical for children with CAS were: inconsistent speech production (85%), sequencing difficulties (71%), oro-motor deficits (63%), vowel errors (62%), voicing errors (61%), consonant cluster deletions (54%), and prosodic disturbance (53%). Motor-programming deficits described as lack of automatization of speech movements were perceived by 82%. All listed characteristics were consistent with the American Speech-Language-Hearing Association (ASHA) consensus-based features, Strand's 10-point checklist, and the diagnostic model proposed by Ozanne. The mode for clinical occurrence was 5%. Number of suspected cases of CAS in the clinical caseload was approximately one new patient/year and SLP. The results support and add to findings from studies of CAS in English-speaking children with similar speech characteristics regarded as typical. Possibly, these findings could contribute to cross-linguistic consensus on CAS characteristics.
Segmental intelligibility of synthetic speech produced by rule.
Logan, J S; Greene, B G; Pisoni, D B
1989-08-01
This paper reports the results of an investigation that employed the modified rhyme test (MRT) to measure the segmental intelligibility of synthetic speech generated automatically by rule. Synthetic speech produced by ten text-to-speech systems was studied and compared to natural speech. A variation of the standard MRT was also used to study the effects of response set size on perceptual confusions. Results indicated that the segmental intelligibility scores formed a continuum. Several systems displayed very high levels of performance that were close to or equal to scores obtained with natural speech; other systems displayed substantially worse performance compared to natural speech. The overall performance of the best system, DECtalk--Paul, was equivalent to the data obtained with natural speech for consonants in syllable-initial position. The findings from this study are discussed in terms of the use of a set of standardized procedures for measuring intelligibility of synthetic speech under controlled laboratory conditions. Recent work investigating the perception of synthetic speech under more severe conditions in which greater demands are made on the listener's processing resources is also considered. The wide range of intelligibility scores obtained in the present study demonstrates important differences in perception and suggests that not all synthetic speech is perceptually equivalent to the listener.
Segmental intelligibility of synthetic speech produced by rule
Logan, John S.; Greene, Beth G.; Pisoni, David B.
2012-01-01
This paper reports the results of an investigation that employed the modified rhyme test (MRT) to measure the segmental intelligibility of synthetic speech generated automatically by rule. Synthetic speech produced by ten text-to-speech systems was studied and compared to natural speech. A variation of the standard MRT was also used to study the effects of response set size on perceptual confusions. Results indicated that the segmental intelligibility scores formed a continuum. Several systems displayed very high levels of performance that were close to or equal to scores obtained with natural speech; other systems displayed substantially worse performance compared to natural speech. The overall performance of the best system, DECtalk—Paul, was equivalent to the data obtained with natural speech for consonants in syllable-initial position. The findings from this study are discussed in terms of the use of a set of standardized procedures for measuring intelligibility of synthetic speech under controlled laboratory conditions. Recent work investigating the perception of synthetic speech under more severe conditions in which greater demands are made on the listener’s processing resources is also considered. The wide range of intelligibility scores obtained in the present study demonstrates important differences in perception and suggests that not all synthetic speech is perceptually equivalent to the listener. PMID:2527884
Speech rate in Parkinson's disease: A controlled study.
Martínez-Sánchez, F; Meilán, J J G; Carro, J; Gómez Íñiguez, C; Millian-Morell, L; Pujante Valverde, I M; López-Alburquerque, T; López, D E
2016-09-01
Speech disturbances will affect most patients with Parkinson's disease (PD) over the course of the disease. The origin and severity of these symptoms are of clinical and diagnostic interest. To evaluate the clinical pattern of speech impairment in PD patients and identify significant differences in speech rate and articulation compared to control subjects. Speech rate and articulation in a reading task were measured using an automatic analytical method. A total of 39 PD patients in the 'on' state and 45 age-and sex-matched asymptomatic controls participated in the study. None of the patients experienced dyskinesias or motor fluctuations during the test. The patients with PD displayed a significant reduction in speech and articulation rates; there were no significant correlations between the studied speech parameters and patient characteristics such as L-dopa dose, duration of the disorder, age, and UPDRS III scores and Hoehn & Yahr scales. Patients with PD show a characteristic pattern of declining speech rate. These results suggest that in PD, disfluencies are the result of the movement disorder affecting the physiology of speech production systems. Copyright © 2014 Sociedad Española de Neurología. Publicado por Elsevier España, S.L.U. All rights reserved.
Modular Neural Networks for Speech Recognition.
1996-08-01
automatic speech rccogni- tion, understanding and translation since the early 1950’ s . Although researchers have demonstrated impressive results with...nodes. It serves only as a data source for the following hidden layer( s ). Finally, the networks output is computed by neurons in the output layer. The...following update rule for weights in the hidden layer: w (,,•+I) ("’) E/V S (W W k- = wj, -- 7 - / v It is easy to generalize the backpropagation
2013-03-20
Wakefield of the University of Michigan as Co-PI. This extended activity produced a large number of products and accomplishments; however, this report...speech communication will be expanded to provide a robust modeling and prediction capability for tasks involving speech production and speech and non...preparations made to move to the newer Cocoa API instead of the previous Carbon API. In the following sections, an extended treatment will be
Stoliarova, L G; Varakin, Iu Ia; Vavilov, S B
1981-01-01
Clinical and tomographic examinations of 40 patients with aphasia developed after an ischemic stroke were carried out. In more than half of them no correlation between the aphasia gravity and character on the one hand, and the size and localization of the ischemic focus (or foci) in the brain on the other was noted. With similar character and gravity of the speech disorder the size and localization of the ischemic foci may be different, ad vice versa. It has been shown that the interrelations between the focal pathology of the brain and the character and gravity of speech disorders are very complicated. One should take into consideration the possibility of individual organization of the speech functions, the degree of the speech activity automatism before the disease, and the state of the cerebrovascular system as a whole.
An experimental version of the MZT (speech-from-text) system with external F(sub 0) control
NASA Astrophysics Data System (ADS)
Nowak, Ignacy
1994-12-01
The version of a Polish speech from text system described in this article was developed using the speech-from-text system. The new system has additional functions which make it possible to enter commands in edited orthographic text to control the phrase component and accentuation parameters. This makes it possible to generate a series of modified intonation contours in the texts spoken by the system. The effects obtained are made easier to control by a graphic illustration of the base frequency pattern in phrases that were last 'spoken' by the system. This version of the system was designed as a test prototype which will help us expand and refine our set of rules for automatic generation of intonation contours, which in turn will enable the fully automated speech-from-text system to generate speech with a more varied and precisely formed fundamental frequency pattern.
Brainstem Correlates of Speech-in-Noise Perception in Children
Anderson, Samira; Skoe, Erika; Chandrasekaran, Bharath; Zecker, Steven; Kraus, Nina
2010-01-01
Children often have difficulty understanding speech in challenging listening environments. In the absence of peripheral hearing loss, these speech perception difficulties may arise from dysfunction at more central levels in the auditory system, including subcortical structures. We examined brainstem encoding of pitch in a speech syllable in 38 school-age children. In children with poor speech-in-noise perception, we find impaired encoding of the fundamental frequency and the second harmonic, two important cues for pitch perception. Pitch, an important factor in speaker identification, aids the listener in tracking a specific voice from a background of voices. These results suggest that the robustness of subcortical neural encoding of pitch features in time-varying signals is an important factor in determining success with speech perception in noise. PMID:20708671
Embedding speech into virtual realities
NASA Technical Reports Server (NTRS)
Bohn, Christian-Arved; Krueger, Wolfgang
1993-01-01
In this work a speaker-independent speech recognition system is presented, which is suitable for implementation in Virtual Reality applications. The use of an artificial neural network in connection with a special compression of the acoustic input leads to a system, which is robust, fast, easy to use and needs no additional hardware, beside a common VR-equipment.
Text-to-audiovisual speech synthesizer for children with learning disabilities.
Mendi, Engin; Bayrak, Coskun
2013-01-01
Learning disabilities affect the ability of children to learn, despite their having normal intelligence. Assistive tools can highly increase functional capabilities of children with learning disorders such as writing, reading, or listening. In this article, we describe a text-to-audiovisual synthesizer that can serve as an assistive tool for such children. The system automatically converts an input text to audiovisual speech, providing synchronization of the head, eye, and lip movements of the three-dimensional face model with appropriate facial expressions and word flow of the text. The proposed system can enhance speech perception and help children having learning deficits to improve their chances of success.
Speaker emotion recognition: from classical classifiers to deep neural networks
NASA Astrophysics Data System (ADS)
Mezghani, Eya; Charfeddine, Maha; Nicolas, Henri; Ben Amar, Chokri
2018-04-01
Speaker emotion recognition is considered among the most challenging tasks in recent years. In fact, automatic systems for security, medicine or education can be improved when considering the speech affective state. In this paper, a twofold approach for speech emotion classification is proposed. At the first side, a relevant set of features is adopted, and then at the second one, numerous supervised training techniques, involving classic methods as well as deep learning, are experimented. Experimental results indicate that deep architecture can improve classification performance on two affective databases, the Berlin Dataset of Emotional Speech and the SAVEE Dataset Surrey Audio-Visual Expressed Emotion.
Nawaz, Tabassam; Mehmood, Zahid; Rashid, Muhammad; Habib, Hafiz Adnan
2018-01-01
Recent research on speech segregation and music fingerprinting has led to improvements in speech segregation and music identification algorithms. Speech and music segregation generally involves the identification of music followed by speech segregation. However, music segregation becomes a challenging task in the presence of noise. This paper proposes a novel method of speech segregation for unlabelled stationary noisy audio signals using the deep belief network (DBN) model. The proposed method successfully segregates a music signal from noisy audio streams. A recurrent neural network (RNN)-based hidden layer segregation model is applied to remove stationary noise. Dictionary-based fisher algorithms are employed for speech classification. The proposed method is tested on three datasets (TIMIT, MIR-1K, and MusicBrainz), and the results indicate the robustness of proposed method for speech segregation. The qualitative and quantitative analysis carried out on three datasets demonstrate the efficiency of the proposed method compared to the state-of-the-art speech segregation and classification-based methods. PMID:29558485
Speech input system for meat inspection and pathological coding used thereby
NASA Astrophysics Data System (ADS)
Abe, Shozo
Meat inspection is one of exclusive and important jobs of veterinarians though it is not well known in general. As the inspection should be conducted skillfully during a series of continuous operations in a slaughter house, development of automatic inspecting systems has been required for a long time. We employed a hand-free speech input system to record the inspecting data because inspecters have to use their both hands to treat the internals of catles and check their health conditions by necked eyes. The data collected by the inspectors are transfered to a speech recognizer and then stored as controlable data of each catle inspected. Control of terms such as pathological conditions to be input and their coding are also important in this speech input system and practical examples are shown.
A novel speech watermarking algorithm by line spectrum pair modification
NASA Astrophysics Data System (ADS)
Zhang, Qian; Yang, Senbin; Chen, Guang; Zhou, Jun
2011-10-01
To explore digital watermarking specifically suitable for the speech domain, this paper experimentally investigates the properties of line spectrum pair (LSP) parameters firstly. The results show that the differences between contiguous LSPs are robust against common signal processing operations and small modifications of LSPs are imperceptible to the human auditory system (HAS). According to these conclusions, three contiguous LSPs of a speech frame are selected to embed a watermark bit. The middle LSP is slightly altered to modify the differences of these LSPs when embedding watermark. Correspondingly, the watermark is extracted by comparing these differences. The proposed algorithm's transparency is adjustable to meet the needs of different applications. The algorithm has good robustness against additive noise, quantization, amplitude scale and MP3 compression attacks, for the bit error rate (BER) is less than 5%. In addition, the algorithm allows a relatively low capacity, which approximates to 50 bps.
Ahrens, Merle-Marie; Veniero, Domenica; Gross, Joachim; Harvey, Monika; Thut, Gregor
2015-01-01
Many behaviourally relevant sensory events such as motion stimuli and speech have an intrinsic spatio-temporal structure. This will engage intentional and most likely unintentional (automatic) prediction mechanisms enhancing the perception of upcoming stimuli in the event stream. Here we sought to probe the anticipatory processes that are automatically driven by rhythmic input streams in terms of their spatial and temporal components. To this end, we employed an apparent visual motion paradigm testing the effects of pre-target motion on lateralized visual target discrimination. The motion stimuli either moved towards or away from peripheral target positions (valid vs. invalid spatial motion cueing) at a rhythmic or arrhythmic pace (valid vs. invalid temporal motion cueing). Crucially, we emphasized automatic motion-induced anticipatory processes by rendering the motion stimuli non-predictive of upcoming target position (by design) and task-irrelevant (by instruction), and by creating instead endogenous (orthogonal) expectations using symbolic cueing. Our data revealed that the apparent motion cues automatically engaged both spatial and temporal anticipatory processes, but that these processes were dissociated. We further found evidence for lateralisation of anticipatory temporal but not spatial processes. This indicates that distinct mechanisms may drive automatic spatial and temporal extrapolation of upcoming events from rhythmic event streams. This contrasts with previous findings that instead suggest an interaction between spatial and temporal attention processes when endogenously driven. Our results further highlight the need for isolating intentional from unintentional processes for better understanding the various anticipatory mechanisms engaged in processing behaviourally relevant stimuli with predictable spatio-temporal structure such as motion and speech. PMID:26623650
The Development of the Speaker Independent ARM Continuous Speech Recognition System
1992-01-01
spokeTi airborne reconnaissance reports u-ing a speech recognition system based on phoneme-level hidden Markov models (HMMs). Previous versions of the ARM...will involve automatic selection from multiple model sets, corresponding to different speaker types, and that the most rudimen- tary partition of a...The vocabulary size for the ARM task is 497 words. These words are related to the phoneme-level symbols corresponding to the models in the model set
Stochastic Modeling as a Means of Automatic Speech Recognition
1975-04-01
companng ihc features of different speech recognition systems, attention is often focused on thc control structures and the methods o’ communication...with no need to use secondary storage . Note that we go from a group of separate knowledge sources to an integrated network representation in...exhaust the available lime or storage . - - - . . 1- .-.-.. mmm^~ i — ■ ■ ’ ■ C haplcr I - IN I ROÜliCl ION Page 13 On the other hand
Multimodal fusion of polynomial classifiers for automatic person recgonition
NASA Astrophysics Data System (ADS)
Broun, Charles C.; Zhang, Xiaozheng
2001-03-01
With the prevalence of the information age, privacy and personalization are forefront in today's society. As such, biometrics are viewed as essential components of current evolving technological systems. Consumers demand unobtrusive and non-invasive approaches. In our previous work, we have demonstrated a speaker verification system that meets these criteria. However, there are additional constraints for fielded systems. The required recognition transactions are often performed in adverse environments and across diverse populations, necessitating robust solutions. There are two significant problem areas in current generation speaker verification systems. The first is the difficulty in acquiring clean audio signals in all environments without encumbering the user with a head- mounted close-talking microphone. Second, unimodal biometric systems do not work with a significant percentage of the population. To combat these issues, multimodal techniques are being investigated to improve system robustness to environmental conditions, as well as improve overall accuracy across the population. We propose a multi modal approach that builds on our current state-of-the-art speaker verification technology. In order to maintain the transparent nature of the speech interface, we focus on optical sensing technology to provide the additional modality-giving us an audio-visual person recognition system. For the audio domain, we use our existing speaker verification system. For the visual domain, we focus on lip motion. This is chosen, rather than static face or iris recognition, because it provides dynamic information about the individual. In addition, the lip dynamics can aid speech recognition to provide liveness testing. The visual processing method makes use of both color and edge information, combined within Markov random field MRF framework, to localize the lips. Geometric features are extracted and input to a polynomial classifier for the person recognition process. A late integration approach, based on a probabilistic model, is employed to combine the two modalities. The system is tested on the XM2VTS database combined with AWGN in the audio domain over a range of signal-to-noise ratios.
A voice-input voice-output communication aid for people with severe speech impairment.
Hawley, Mark S; Cunningham, Stuart P; Green, Phil D; Enderby, Pam; Palmer, Rebecca; Sehgal, Siddharth; O'Neill, Peter
2013-01-01
A new form of augmentative and alternative communication (AAC) device for people with severe speech impairment-the voice-input voice-output communication aid (VIVOCA)-is described. The VIVOCA recognizes the disordered speech of the user and builds messages, which are converted into synthetic speech. System development was carried out employing user-centered design and development methods, which identified and refined key requirements for the device. A novel methodology for building small vocabulary, speaker-dependent automatic speech recognizers with reduced amounts of training data, was applied. Experiments showed that this method is successful in generating good recognition performance (mean accuracy 96%) on highly disordered speech, even when recognition perplexity is increased. The selected message-building technique traded off various factors including speed of message construction and range of available message outputs. The VIVOCA was evaluated in a field trial by individuals with moderate to severe dysarthria and confirmed that they can make use of the device to produce intelligible speech output from disordered speech input. The trial highlighted some issues which limit the performance and usability of the device when applied in real usage situations, with mean recognition accuracy of 67% in these circumstances. These limitations will be addressed in future work.
Noise-immune multisensor transduction of speech
NASA Astrophysics Data System (ADS)
Viswanathan, Vishu R.; Henry, Claudia M.; Derr, Alan G.; Roucos, Salim; Schwartz, Richard M.
1986-08-01
Two types of configurations of multiple sensors were developed, tested and evaluated in speech recognition application for robust performance in high levels of acoustic background noise: One type combines the individual sensor signals to provide a single speech signal input, and the other provides several parallel inputs. For single-input systems, several configurations of multiple sensors were developed and tested. Results from formal speech intelligibility and quality tests in simulated fighter aircraft cockpit noise show that each of the two-sensor configurations tested outperforms the constituent individual sensors in high noise. Also presented are results comparing the performance of two-sensor configurations and individual sensors in speaker-dependent, isolated-word speech recognition tests performed using a commercial recognizer (Verbex 4000) in simulated fighter aircraft cockpit noise.
Speech endpoint detection with non-language speech sounds for generic speech processing applications
NASA Astrophysics Data System (ADS)
McClain, Matthew; Romanowski, Brian
2009-05-01
Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known apriori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden-Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detection certain types of NLSS, such as breaths and clicks, while detection of other types of NLSS such as filled pauses will require future research.
Abrams, Daniel A; Nicol, Trent; White-Schwoch, Travis; Zecker, Steven; Kraus, Nina
2017-05-01
Speech perception relies on a listener's ability to simultaneously resolve multiple temporal features in the speech signal. Little is known regarding neural mechanisms that enable the simultaneous coding of concurrent temporal features in speech. Here we show that two categories of temporal features in speech, the low-frequency speech envelope and periodicity cues, are processed by distinct neural mechanisms within the same population of cortical neurons. We measured population activity in primary auditory cortex of anesthetized guinea pig in response to three variants of a naturally produced sentence. Results show that the envelope of population responses closely tracks the speech envelope, and this cortical activity more closely reflects wider bandwidths of the speech envelope compared to narrow bands. Additionally, neuronal populations represent the fundamental frequency of speech robustly with phase-locked responses. Importantly, these two temporal features of speech are simultaneously observed within neuronal ensembles in auditory cortex in response to clear, conversation, and compressed speech exemplars. Results show that auditory cortical neurons are adept at simultaneously resolving multiple temporal features in extended speech sentences using discrete coding mechanisms. Copyright © 2017 Elsevier B.V. All rights reserved.
FY 1992-1993 RDT&E Descriptive Summaries: DARPA
1991-02-01
combining natural language and user workflow model information. * Determine effectiveness of auditory models as preprocessors for robust speech...for indexing and retrieving design knowledge. * Evaluate ability of message understanding systems to extract crisis -situation data from news wires...energy effects , underwater vehicles, neutrino detection, speech, tailored nuclear weapons, hypervelocity, nanosecond timing, and MAD/RPV. FY 1991 Planned
Visual Speech Contributes to Phonetic Learning in 6-Month-Old Infants
ERIC Educational Resources Information Center
Teinonen, Tuomas; Aslin, Richard N.; Alku, Paavo; Csibra, Gergely
2008-01-01
Previous research has shown that infants match vowel sounds to facial displays of vowel articulation [Kuhl, P. K., & Meltzoff, A. N. (1982). The bimodal perception of speech in infancy. "Science, 218", 1138-1141; Patterson, M. L., & Werker, J. F. (1999). Matching phonetic information in lips and voice is robust in 4.5-month-old infants. "Infant…
Low-dimensional recurrent neural network-based Kalman filter for speech enhancement.
Xia, Youshen; Wang, Jun
2015-07-01
This paper proposes a new recurrent neural network-based Kalman filter for speech enhancement, based on a noise-constrained least squares estimate. The parameters of speech signal modeled as autoregressive process are first estimated by using the proposed recurrent neural network and the speech signal is then recovered from Kalman filtering. The proposed recurrent neural network is globally asymptomatically stable to the noise-constrained estimate. Because the noise-constrained estimate has a robust performance against non-Gaussian noise, the proposed recurrent neural network-based speech enhancement algorithm can minimize the estimation error of Kalman filter parameters in non-Gaussian noise. Furthermore, having a low-dimensional model feature, the proposed neural network-based speech enhancement algorithm has a much faster speed than two existing recurrent neural networks-based speech enhancement algorithms. Simulation results show that the proposed recurrent neural network-based speech enhancement algorithm can produce a good performance with fast computation and noise reduction. Copyright © 2015 Elsevier Ltd. All rights reserved.
Text as a Supplement to Speech in Young and Older Adults a)
Krull, Vidya; Humes, Larry E.
2015-01-01
Objective The purpose of this experiment was to quantify the contribution of visual text to auditory speech recognition in background noise. Specifically, we tested the hypothesis that partially accurate visual text from an automatic speech recognizer could be used successfully to supplement speech understanding in difficult listening conditions in older adults, with normal or impaired hearing. Our working hypotheses were based on what is known regarding audiovisual speech perception in the elderly from speechreading literature. We hypothesized that: 1) combining auditory and visual text information will result in improved recognition accuracy compared to auditory or visual text information alone; 2) benefit from supplementing speech with visual text (auditory and visual enhancement) in young adults will be greater than that in older adults; and 3) individual differences in performance on perceptual measures would be associated with cognitive abilities. Design Fifteen young adults with normal hearing, fifteen older adults with normal hearing, and fifteen older adults with hearing loss participated in this study. All participants completed sentence recognition tasks in auditory-only, text-only, and combined auditory-text conditions. The auditory sentence stimuli were spectrally shaped to restore audibility for the older participants with impaired hearing. All participants also completed various cognitive measures, including measures of working memory, processing speed, verbal comprehension, perceptual and cognitive speed, processing efficiency, inhibition, and the ability to form wholes from parts. Group effects were examined for each of the perceptual and cognitive measures. Audiovisual benefit was calculated relative to performance on auditory-only and visual-text only conditions. Finally, the relationship between perceptual measures and other independent measures were examined using principal-component factor analyses, followed by regression analyses. Results Both young and older adults performed similarly on nine out of ten perceptual measures (auditory, visual, and combined measures). Combining degraded speech with partially correct text from an automatic speech recognizer improved the understanding of speech in both young and older adults, relative to both auditory- and text-only performance. In all subjects, cognition emerged as a key predictor for a general speech-text integration ability. Conclusions These results suggest that neither age nor hearing loss affected the ability of subjects to benefit from text when used to support speech, after ensuring audibility through spectral shaping. These results also suggest that the benefit obtained by supplementing auditory input with partially accurate text is modulated by cognitive ability, specifically lexical and verbal skills. PMID:26458131
Xiao, Bo; Huang, Chewei; Imel, Zac E; Atkins, David C; Georgiou, Panayiotis; Narayanan, Shrikanth S
2016-04-01
Scaling up psychotherapy services such as for addiction counseling is a critical societal need. One challenge is ensuring quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy-a key therapy quality index-from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimated therapy-session level empathy codes using utterance level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality insurance and therapist training.
Xiao, Bo; Huang, Chewei; Imel, Zac E.; Atkins, David C.; Georgiou, Panayiotis; Narayanan, Shrikanth S.
2016-01-01
Scaling up psychotherapy services such as for addiction counseling is a critical societal need. One challenge is ensuring quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy—a key therapy quality index—from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimated therapy-session level empathy codes using utterance level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality insurance and therapist training. PMID:28286867
A hypothetical neurological association between dehumanization and human rights abuses.
Murrow, Gail B; Murrow, Richard
2015-07-01
Dehumanization is anecdotally and historically associated with reduced empathy for the pain of dehumanized individuals and groups and with psychological and legal denial of their human rights and extreme violence against them. We hypothesize that 'empathy' for the pain and suffering of dehumanized social groups is automatically reduced because, as the research we review suggests, an individual's neural mechanisms of pain empathy best respond to (or produce empathy for) the pain of people whom the individual automatically or implicitly associates with her or his own species. This theory has implications for the philosophical conception of 'human' and of 'legal personhood' in human rights jurisprudence. It further has implications for First Amendment free speech jurisprudence, including the doctrine of 'corporate personhood' and consideration of the potential harm caused by dehumanizing hate speech. We suggest that the new, social neuroscience of empathy provides evidence that both the vagaries of the legal definition or legal fiction of 'personhood' and hate speech that explicitly and implicitly dehumanizes may (in their respective capacities to artificially humanize or dehumanize) manipulate the neural mechanisms of pain empathy in ways that could pose more of a true threat to human rights and rights-based democracy than previously appreciated.
Retrieving Tract Variables From Acoustics: A Comparison of Different Machine Learning Strategies.
Mitra, Vikramjit; Nam, Hosung; Espy-Wilson, Carol Y; Saltzman, Elliot; Goldstein, Louis
2010-09-13
Many different studies have claimed that articulatory information can be used to improve the performance of automatic speech recognition systems. Unfortunately, such articulatory information is not readily available in typical speaker-listener situations. Consequently, such information has to be estimated from the acoustic signal in a process which is usually termed "speech-inversion." This study aims to propose and compare various machine learning strategies for speech inversion: Trajectory mixture density networks (TMDNs), feedforward artificial neural networks (FF-ANN), support vector regression (SVR), autoregressive artificial neural network (AR-ANN), and distal supervised learning (DSL). Further, using a database generated by the Haskins Laboratories speech production model, we test the claim that information regarding constrictions produced by the distinct organs of the vocal tract (vocal tract variables) is superior to flesh-point information (articulatory pellet trajectories) for the inversion process.
Exploring expressivity and emotion with artificial voice and speech technologies.
Pauletto, Sandra; Balentine, Bruce; Pidcock, Chris; Jones, Kevin; Bottaci, Leonardo; Aretoulaki, Maria; Wells, Jez; Mundy, Darren P; Balentine, James
2013-10-01
Emotion in audio-voice signals, as synthesized by text-to-speech (TTS) technologies, was investigated to formulate a theory of expression for user interface design. Emotional parameters were specified with markup tags, and the resulting audio was further modulated with post-processing techniques. Software was then developed to link a selected TTS synthesizer with an automatic speech recognition (ASR) engine, producing a chatbot that could speak and listen. Using these two artificial voice subsystems, investigators explored both artistic and psychological implications of artificial speech emotion. Goals of the investigation were interdisciplinary, with interest in musical composition, augmentative and alternative communication (AAC), commercial voice announcement applications, human-computer interaction (HCI), and artificial intelligence (AI). The work-in-progress points towards an emerging interdisciplinary ontology for artificial voices. As one study output, HCI tools are proposed for future collaboration.
Bentsen, Thomas; May, Tobias; Kressner, Abigail A; Dau, Torsten
2018-01-01
Computational speech segregation attempts to automatically separate speech from noise. This is challenging in conditions with interfering talkers and low signal-to-noise ratios. Recent approaches have adopted deep neural networks and successfully demonstrated speech intelligibility improvements. A selection of components may be responsible for the success with these state-of-the-art approaches: the system architecture, a time frame concatenation technique and the learning objective. The aim of this study was to explore the roles and the relative contributions of these components by measuring speech intelligibility in normal-hearing listeners. A substantial improvement of 25.4 percentage points in speech intelligibility scores was found going from a subband-based architecture, in which a Gaussian Mixture Model-based classifier predicts the distributions of speech and noise for each frequency channel, to a state-of-the-art deep neural network-based architecture. Another improvement of 13.9 percentage points was obtained by changing the learning objective from the ideal binary mask, in which individual time-frequency units are labeled as either speech- or noise-dominated, to the ideal ratio mask, where the units are assigned a continuous value between zero and one. Therefore, both components play significant roles and by combining them, speech intelligibility improvements were obtained in a six-talker condition at a low signal-to-noise ratio.
Knipfer, Christian; Riemann, Max; Bocklet, Tobias; Noeth, Elmar; Schuster, Maria; Sokol, Biljana; Eitner, Stephan; Nkenke, Emeka; Stelzle, Florian
2014-01-01
Tooth loss and its prosthetic rehabilitation significantly affect speech intelligibility. However, little is known about the influence of speech deficiencies on oral health-related quality of life (OHRQoL). The aim of this study was to investigate whether speech intelligibility enhancement through prosthetic rehabilitation significantly influences OHRQoL in patients wearing complete maxillary dentures. Speech intelligibility by means of an automatic speech recognition system (ASR) was prospectively evaluated and compared with subjectively assessed Oral Health Impact Profile (OHIP) scores. Speech was recorded in 28 edentulous patients 1 week prior to the fabrication of new complete maxillary dentures and 6 months thereafter. Speech intelligibility was computed based on the word accuracy (WA) by means of an ASR and compared with a matched control group. One week before and 6 months after rehabilitation, patients assessed themselves for OHRQoL. Speech intelligibility improved significantly after 6 months. Subjects reported a significantly higher OHRQoL after maxillary rehabilitation with complete dentures. No significant correlation was found between the OHIP sum score or its subscales to the WA. Speech intelligibility enhancement achieved through the fabrication of new complete maxillary dentures might not be in the forefront of the patients' perception of their quality of life. For the improvement of OHRQoL in patients wearing complete maxillary dentures, food intake and mastication as well as freedom from pain play a more prominent role.
DETECTION AND IDENTIFICATION OF SPEECH SOUNDS USING CORTICAL ACTIVITY PATTERNS
Centanni, T.M.; Sloan, A.M.; Reed, A.C.; Engineer, C.T.; Rennaker, R.; Kilgard, M.P.
2014-01-01
We have developed a classifier capable of locating and identifying speech sounds using activity from rat auditory cortex with an accuracy equivalent to behavioral performance without the need to specify the onset time of the speech sounds. This classifier can identify speech sounds from a large speech set within 40 ms of stimulus presentation. To compare the temporal limits of the classifier to behavior, we developed a novel task that requires rats to identify individual consonant sounds from a stream of distracter consonants. The classifier successfully predicted the ability of rats to accurately identify speech sounds for syllable presentation rates up to 10 syllables per second (up to 17.9 ± 1.5 bits/sec), which is comparable to human performance. Our results demonstrate that the spatiotemporal patterns generated in primary auditory cortex can be used to quickly and accurately identify consonant sounds from a continuous speech stream without prior knowledge of the stimulus onset times. Improved understanding of the neural mechanisms that support robust speech processing in difficult listening conditions could improve the identification and treatment of a variety of speech processing disorders. PMID:24286757
Higgins, Eleanor L; Raskind, Marshall H
2004-12-01
This study was conducted to assess the effectiveness of two programs developed by the Frostig Center Research Department to improve the reading and spelling of students with learning disabilities (LD): a computer Speech Recognition-based Program (SRBP) and a computer and text-based Automaticity Program (AP). Twenty-eight LD students with reading and spelling difficulties (aged 8 to 18) received each program for 17 weeks and were compared with 16 students in a contrast group who did not receive either program. After adjusting for age and IQ, both the SRBP and AP groups showed significant differences over the contrast group in improving word recognition and reading comprehension. Neither program showed significant differences over contrasts in spelling. The SRBP also improved the performance of the target group when compared with the contrast group on phonological elision and nonword reading efficiency tasks. The AP showed significant differences in all process and reading efficiency measures.
Do perceived context pictures automatically activate their phonological code?
Jescheniak, Jörg D; Oppermann, Frank; Hantsch, Ansgar; Wagner, Valentin; Mädebach, Andreas; Schriefers, Herbert
2009-01-01
Morsella and Miozzo (Morsella, E., & Miozzo, M. (2002). Evidence for a cascade model of lexical access in speech production. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28, 555-563) have reported that the to-be-ignored context pictures become phonologically activated when participants name a target picture, and took this finding as support for cascaded models of lexical retrieval in speech production. In a replication and extension of their experiment in German, we failed to obtain priming effects from context pictures phonologically related to a to-be-named target picture. By contrast, corresponding context words (i.e., the names of the respective pictures) and the same context pictures, when used in an identity condition, did reliably facilitate the naming process. This pattern calls into question the generality of the claim advanced by Morsella and Miozzo that perceptual processing of pictures in the context of a naming task automatically leads to the activation of corresponding lexical-phonological codes.
LANDMARK-BASED SPEECH RECOGNITION: REPORT OF THE 2004 JOHNS HOPKINS SUMMER WORKSHOP.
Hasegawa-Johnson, Mark; Baker, James; Borys, Sarah; Chen, Ken; Coogan, Emily; Greenberg, Steven; Juneja, Amit; Kirchhoff, Katrin; Livescu, Karen; Mohan, Srividya; Muller, Jennifer; Sonmez, Kemal; Wang, Tianyu
2005-01-01
Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.
Laroche, Marilyn; Dajani, Hilmi R; Prévost, François; Marcoux, André M
2013-01-01
This study investigated speech auditory brainstem responses (speech ABR) with variants of a synthetic vowel in quiet and in background noise. Its objectives were to study the noise robustness of the brainstem response at the fundamental frequency F0 and at the first formant F1, evaluate how the resolved/unresolved harmonics regions in speech contribute to the response at F0, and investigate the origin of the response at F0 to resolved and unresolved harmonics in speech. In total, 18 normal-hearing subjects (11 women, aged 18-33 years) participated in this study. Speech ABRs were recorded using variants of a 300 msec formant-synthesized /a/ vowel in quiet and in white noise. The first experiment employed three variants containing the first three formants F1 to F3, F1 only, and F2 and F3 only with relative formant levels following those reported in the literature. The second experiment employed three variants containing F1 only, F2 only, and F3 only, with the formants equalized to the same level and the signal-to-noise ratio (SNR) maintained at -5 dB. Overall response latency was estimated, and the amplitude and local SNR of the envelope following response at F0 and of the frequency following response at F1 were compared for the different stimulus variants in quiet and in noise. The response at F0 was more robust to noise than that at F1. There were no statistically significant differences in the response at F0 caused by the three stimulus variants in both experiments in quiet. However, the response at F0 with the variant dominated by resolved harmonics was more robust to noise than the response at F0 with the stimulus variants dominated by unresolved harmonics. The latencies of the responses in all cases were very similar in quiet, but the responses at F0 due to resolved and unresolved harmonics combined nonlinearly when both were present in the stimulus. Speech ABR has been suggested as a marker of central auditory processing. The results of this study support earlier work on the differential susceptibility to noise of the F0 and F1 components of the evoked response. In the case of F0, the results support the view that in speech, the pitch of resolved harmonics and that of unresolved harmonics are processed in different but interacting pathways that converge in the upper brainstem. Pitch plays an important role in speech perception, and speech ABR can offer a window into the neural extraction of the pitch of speech and how it may change with hearing impairment.
ERIC Educational Resources Information Center
Mersad, Karima; Nazzi, Thierry
2012-01-01
Transitional Probability (TP) computations are regarded as a powerful learning mechanism that is functional early in development and has been proposed as an initial bootstrapping device for speech segmentation. However, a recent study casts doubt on the robustness of early statistical word-learning. Johnson and Tyler (2010) showed that when…
Effects of emotion on different phoneme classes
NASA Astrophysics Data System (ADS)
Lee, Chul Min; Yildirim, Serdar; Bulut, Murtaza; Busso, Carlos; Kazemzadeh, Abe; Lee, Sungbok; Narayanan, Shrikanth
2004-10-01
This study investigates the effects of emotion on different phoneme classes using short-term spectral features. In the research on emotion in speech, most studies have focused on prosodic features of speech. In this study, based on the hypothesis that different emotions have varying effects on the properties of the different speech sounds, we investigate the usefulness of phoneme-class level acoustic modeling for automatic emotion classification. Hidden Markov models (HMM) based on short-term spectral features for five broad phonetic classes are used for this purpose using data obtained from recordings of two actresses. Each speaker produces 211 sentences with four different emotions (neutral, sad, angry, happy). Using the speech material we trained and compared the performances of two sets of HMM classifiers: a generic set of ``emotional speech'' HMMs (one for each emotion) and a set of broad phonetic-class based HMMs (vowel, glide, nasal, stop, fricative) for each emotion type considered. Comparison of classification results indicates that different phoneme classes were affected differently by emotional change and that the vowel sounds are the most important indicator of emotions in speech. Detailed results and their implications on the underlying speech articulation will be discussed.
Study of acoustic correlates associate with emotional speech
NASA Astrophysics Data System (ADS)
Yildirim, Serdar; Lee, Sungbok; Lee, Chul Min; Bulut, Murtaza; Busso, Carlos; Kazemzadeh, Ebrahim; Narayanan, Shrikanth
2004-10-01
This study investigates the acoustic characteristics of four different emotions expressed in speech. The aim is to obtain detailed acoustic knowledge on how a speech signal is modulated by changes from neutral to a certain emotional state. Such knowledge is necessary for automatic emotion recognition and classification and emotional speech synthesis. Speech data obtained from two semi-professional actresses are analyzed and compared. Each subject produces 211 sentences with four different emotions; neutral, sad, angry, happy. We analyze changes in temporal and acoustic parameters such as magnitude and variability of segmental duration, fundamental frequency and the first three formant frequencies as a function of emotion. Acoustic differences among the emotions are also explored with mutual information computation, multidimensional scaling and acoustic likelihood comparison with normal speech. Results indicate that speech associated with anger and happiness is characterized by longer duration, shorter interword silence, higher pitch and rms energy with wider ranges. Sadness is distinguished from other emotions by lower rms energy and longer interword silence. Interestingly, the difference in formant pattern between [happiness/anger] and [neutral/sadness] are better reflected in back vowels such as /a/(/father/) than in front vowels. Detailed results on intra- and interspeaker variability will be reported.
Toward dynamic magnetic resonance imaging of the vocal tract during speech production.
Ventura, Sandra M Rua; Freitas, Diamantino Rui S; Tavares, João Manuel R S
2011-07-01
The most recent and significant magnetic resonance imaging (MRI) improvements allow for the visualization of the vocal tract during speech production, which has been revealed to be a powerful tool in dynamic speech research. However, a synchronization technique with enhanced temporal resolution is still required. The study design was transversal in nature. Throughout this work, a technique for the dynamic study of the vocal tract with MRI by using the heart's signal to synchronize and trigger the imaging-acquisition process is presented and described. The technique in question is then used in the measurement of four speech articulatory parameters to assess three different syllables (articulatory gestures) of European Portuguese Language. The acquired MR images are automatically reconstructed so as to result in a variable sequence of images (slices) of different vocal tract shapes in articulatory positions associated with Portuguese speech sounds. The knowledge obtained as a result of the proposed technique represents a direct contribution to the improvement of speech synthesis algorithms, thereby allowing for novel perceptions in coarticulation studies, in addition to providing further efficient clinical guidelines in the pursuit of more proficient speech rehabilitation processes. Copyright © 2011 The Voice Foundation. Published by Mosby, Inc. All rights reserved.
Robust extraction of the aorta and pulmonary artery from 3D MDCT image data
NASA Astrophysics Data System (ADS)
Taeprasartsit, Pinyo; Higgins, William E.
2010-03-01
Accurate definition of the aorta and pulmonary artery from three-dimensional (3D) multi-detector CT (MDCT) images is important for pulmonary applications. This work presents robust methods for defining the aorta and pulmonary artery in the central chest. The methods work on both contrast enhanced and no-contrast 3D MDCT image data. The automatic methods use a common approach employing model fitting and selection and adaptive refinement. During the occasional event that more precise vascular extraction is desired or the method fails, we also have an alternate semi-automatic fail-safe method. The semi-automatic method extracts the vasculature by extending the medial axes into a user-guided direction. A ground-truth study over a series of 40 human 3D MDCT images demonstrates the efficacy, accuracy, robustness, and efficiency of the methods.
Characteristics of speaking style and implications for speech recognition.
Shinozaki, Takahiro; Ostendorf, Mari; Atlas, Les
2009-09-01
Differences in speaking style are associated with more or less spectral variability, as well as different modulation characteristics. The greater variation in some styles (e.g., spontaneous speech and infant-directed speech) poses challenges for recognition but possibly also opportunities for learning more robust models, as evidenced by prior work and motivated by child language acquisition studies. In order to investigate this possibility, this work proposes a new method for characterizing speaking style (the modulation spectrum), examines spontaneous, read, adult-directed, and infant-directed styles in this space, and conducts pilot experiments in style detection and sampling for improved speech recognizer training. Speaking style classification is improved by using the modulation spectrum in combination with standard pitch and energy variation. Speech recognition experiments on a small vocabulary conversational speech recognition task show that sampling methods for training with a small amount of data benefit from the new features.
Speech Cues Contribute to Audiovisual Spatial Integration
Bishop, Christopher W.; Miller, Lee M.
2011-01-01
Speech is the most important form of human communication but ambient sounds and competing talkers often degrade its acoustics. Fortunately the brain can use visual information, especially its highly precise spatial information, to improve speech comprehension in noisy environments. Previous studies have demonstrated that audiovisual integration depends strongly on spatiotemporal factors. However, some integrative phenomena such as McGurk interference persist even with gross spatial disparities, suggesting that spatial alignment is not necessary for robust integration of audiovisual place-of-articulation cues. It is therefore unclear how speech-cues interact with audiovisual spatial integration mechanisms. Here, we combine two well established psychophysical phenomena, the McGurk effect and the ventriloquist's illusion, to explore this dependency. Our results demonstrate that conflicting spatial cues may not interfere with audiovisual integration of speech, but conflicting speech-cues can impede integration in space. This suggests a direct but asymmetrical influence between ventral ‘what’ and dorsal ‘where’ pathways. PMID:21909378
Analysis of human scream and its impact on text-independent speaker verification.
Hansen, John H L; Nandwana, Mahesh Kumar; Shokouhi, Navid
2017-04-01
Scream is defined as sustained, high-energy vocalizations that lack phonological structure. Lack of phonological structure is how scream is identified from other forms of loud vocalization, such as "yell." This study investigates the acoustic aspects of screams and addresses those that are known to prevent standard speaker identification systems from recognizing the identity of screaming speakers. It is well established that speaker variability due to changes in vocal effort and Lombard effect contribute to degraded performance in automatic speech systems (i.e., speech recognition, speaker identification, diarization, etc.). However, previous research in the general area of speaker variability has concentrated on human speech production, whereas less is known about non-speech vocalizations. The UT-NonSpeech corpus is developed here to investigate speaker verification from scream samples. This study considers a detailed analysis in terms of fundamental frequency, spectral peak shift, frame energy distribution, and spectral tilt. It is shown that traditional speaker recognition based on the Gaussian mixture models-universal background model framework is unreliable when evaluated with screams.
Phonologically-based biomarkers for major depressive disorder
NASA Astrophysics Data System (ADS)
Trevino, Andrea Carolina; Quatieri, Thomas Francis; Malyska, Nicolas
2011-12-01
Of increasing importance in the civilian and military population is the recognition of major depressive disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we introduce vocal biomarkers that are derived automatically from phonologically-based measures of speech rate. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a 6-week duration. We find that dissecting average measures of speech rate into phone-specific characteristics and, in particular, combined phone-duration measures uncovers stronger relationships between speech rate and depression severity than global measures previously reported for a speech-rate biomarker. Results of this study are supported by correlation of our measures with depression severity and classification of depression state with these vocal measures. Our approach provides a general framework for analyzing individual symptom categories through phonological units, and supports the premise that speaking rate can be an indicator of psychomotor retardation severity.
Automatic measurement and representation of prosodic features
NASA Astrophysics Data System (ADS)
Ying, Goangshiuan Shawn
Effective measurement and representation of prosodic features of the acoustic signal for use in automatic speech recognition and understanding systems is the goal of this work. Prosodic features-stress, duration, and intonation-are variations of the acoustic signal whose domains are beyond the boundaries of each individual phonetic segment. Listeners perceive prosodic features through a complex combination of acoustic correlates such as intensity, duration, and fundamental frequency (F0). We have developed new tools to measure F0 and intensity features. We apply a probabilistic global error correction routine to an Average Magnitude Difference Function (AMDF) pitch detector. A new short-term frequency-domain Teager energy algorithm is used to measure the energy of a speech signal. We have conducted a series of experiments performing lexical stress detection on words in continuous English speech from two speech corpora. We have experimented with two different approaches, a segment-based approach and a rhythm unit-based approach, in lexical stress detection. The first approach uses pattern recognition with energy- and duration-based measurements as features to build Bayesian classifiers to detect the stress level of a vowel segment. In the second approach we define rhythm unit and use only the F0-based measurement and a scoring system to determine the stressed segment in the rhythm unit. A duration-based segmentation routine was developed to break polysyllabic words into rhythm units. The long-term goal of this work is to develop a system that can effectively detect the stress pattern for each word in continuous speech utterances. Stress information will be integrated as a constraint for pruning the word hypotheses in a word recognition system based on hidden Markov models.
Severity-Based Adaptation with Limited Data for ASR to Aid Dysarthric Speakers
Mustafa, Mumtaz Begum; Salim, Siti Salwah; Mohamed, Noraini; Al-Qatab, Bassam; Siong, Chng Eng
2014-01-01
Automatic speech recognition (ASR) is currently used in many assistive technologies, such as helping individuals with speech impairment in their communication ability. One challenge in ASR for speech-impaired individuals is the difficulty in obtaining a good speech database of impaired speakers for building an effective speech acoustic model. Because there are very few existing databases of impaired speech, which are also limited in size, the obvious solution to build a speech acoustic model of impaired speech is by employing adaptation techniques. However, issues that have not been addressed in existing studies in the area of adaptation for speech impairment are as follows: (1) identifying the most effective adaptation technique for impaired speech; and (2) the use of suitable source models to build an effective impaired-speech acoustic model. This research investigates the above-mentioned two issues on dysarthria, a type of speech impairment affecting millions of people. We applied both unimpaired and impaired speech as the source model with well-known adaptation techniques like the maximum likelihood linear regression (MLLR) and the constrained-MLLR(C-MLLR). The recognition accuracy of each impaired speech acoustic model is measured in terms of word error rate (WER), with further assessments, including phoneme insertion, substitution and deletion rates. Unimpaired speech when combined with limited high-quality speech-impaired data improves performance of ASR systems in recognising severely impaired dysarthric speech. The C-MLLR adaptation technique was also found to be better than MLLR in recognising mildly and moderately impaired speech based on the statistical analysis of the WER. It was found that phoneme substitution was the biggest contributing factor in WER in dysarthric speech for all levels of severity. The results show that the speech acoustic models derived from suitable adaptation techniques improve the performance of ASR systems in recognising impaired speech with limited adaptation data. PMID:24466004
NASA Astrophysics Data System (ADS)
Wu, Bo; Yang, Minglei; Li, Kehuang; Huang, Zhen; Siniscalchi, Sabato Marco; Wang, Tong; Lee, Chin-Hui
2017-12-01
A reverberation-time-aware deep-neural-network (DNN)-based multi-channel speech dereverberation framework is proposed to handle a wide range of reverberation times (RT60s). There are three key steps in designing a robust system. First, to accomplish simultaneous speech dereverberation and beamforming, we propose a framework, namely DNNSpatial, by selectively concatenating log-power spectral (LPS) input features of reverberant speech from multiple microphones in an array and map them into the expected output LPS features of anechoic reference speech based on a single deep neural network (DNN). Next, the temporal auto-correlation function of received signals at different RT60s is investigated to show that RT60-dependent temporal-spatial contexts in feature selection are needed in the DNNSpatial training stage in order to optimize the system performance in diverse reverberant environments. Finally, the RT60 is estimated to select the proper temporal and spatial contexts before feeding the log-power spectrum features to the trained DNNs for speech dereverberation. The experimental evidence gathered in this study indicates that the proposed framework outperforms the state-of-the-art signal processing dereverberation algorithm weighted prediction error (WPE) and conventional DNNSpatial systems without taking the reverberation time into account, even for extremely weak and severe reverberant conditions. The proposed technique generalizes well to unseen room size, array geometry and loudspeaker position, and is robust to reverberation time estimation error.
Nouns slow down speech across structurally and culturally diverse languages
Danielsen, Swintha; Hartmann, Iren; Pakendorf, Brigitte; Witzlack-Makarevich, Alena; de Jong, Nivja H.
2018-01-01
By force of nature, every bit of spoken language is produced at a particular speed. However, this speed is not constant—speakers regularly speed up and slow down. Variation in speech rate is influenced by a complex combination of factors, including the frequency and predictability of words, their information status, and their position within an utterance. Here, we use speech rate as an index of word-planning effort and focus on the time window during which speakers prepare the production of words from the two major lexical classes, nouns and verbs. We show that, when naturalistic speech is sampled from languages all over the world, there is a robust cross-linguistic tendency for slower speech before nouns compared with verbs, both in terms of slower articulation and more pauses. We attribute this slowdown effect to the increased amount of planning that nouns require compared with verbs. Unlike verbs, nouns can typically only be used when they represent new or unexpected information; otherwise, they have to be replaced by pronouns or be omitted. These conditions on noun use appear to outweigh potential advantages stemming from differences in internal complexity between nouns and verbs. Our findings suggest that, beneath the staggering diversity of grammatical structures and cultural settings, there are robust universals of language processing that are intimately tied to how speakers manage referential information when they communicate with one another. PMID:29760059
An optimization method for speech enhancement based on deep neural network
NASA Astrophysics Data System (ADS)
Sun, Haixia; Li, Sikun
2017-06-01
Now, this document puts forward a deep neural network (DNN) model with more credible data set and more robust structure. First, we take two regularization skills, dropout and sparsity constraint to strengthen the generalization ability of the model. In this way, not only the model is able to reach the consistency between the pre-training model and the fine-tuning model, but also it reduce resource consumption. Then network compression by weights sharing and quantization is allowed to reduce storage cost. In the end, we evaluate the quality of the reconstructed speech according to different criterion. The result proofs that the improved framework has good performance on speech enhancement and meets the requirement of speech processing.
Brain-to-text: decoding spoken phrases from phone representations in the brain.
Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja
2015-01-01
It has long been speculated whether communication between humans and machines based on natural speech related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings.Specifically, we implemented a system, which we call Brain-To-Text that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech.
Brain-to-text: decoding spoken phrases from phone representations in the brain
Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja
2015-01-01
It has long been speculated whether communication between humans and machines based on natural speech related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings.Specifically, we implemented a system, which we call Brain-To-Text that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech. PMID:26124702
On the Development of Speech Resources for the Mixtec Language
2013-01-01
The Mixtec language is one of the main native languages in Mexico. In general, due to urbanization, discrimination, and limited attempts to promote the culture, the native languages are disappearing. Most of the information available about the Mixtec language is in written form as in dictionaries which, although including examples about how to pronounce the Mixtec words, are not as reliable as listening to the correct pronunciation from a native speaker. Formal acoustic resources, as speech corpora, are almost non-existent for the Mixtec, and no speech technologies are known to have been developed for it. This paper presents the development of the following resources for the Mixtec language: (1) a speech database of traditional narratives of the Mixtec culture spoken by a native speaker (labelled at the phonetic and orthographic levels by means of spectral analysis) and (2) a native speaker-adaptive automatic speech recognition (ASR) system (trained with the speech database) integrated with a Mixtec-to-Spanish/Spanish-to-Mixtec text translator. The speech database, although small and limited to a single variant, was reliable enough to build the multiuser speech application which presented a mean recognition/translation performance up to 94.36% in experiments with non-native speakers (the target users). PMID:23710134
Is talking to an automated teller machine natural and fun?
Chan, F Y; Khalid, H M
Usability and affective issues of using automatic speech recognition technology to interact with an automated teller machine (ATM) are investigated in two experiments. The first uncovered dialogue patterns of ATM users for the purpose of designing the user interface for a simulated speech ATM system. Applying the Wizard-of-Oz methodology, multiple mapping and word spotting techniques, the speech driven ATM accommodates bilingual users of Bahasa Melayu and English. The second experiment evaluates the usability of a hybrid speech ATM, comparing it with a simulated manual ATM. The aim is to investigate how natural and fun can talking to a speech ATM be for these first-time users. Subjects performed the withdrawal and balance enquiry tasks. The ANOVA was performed on the usability and affective data. The results showed significant differences between systems in the ability to complete the tasks as well as in transaction errors. Performance was measured on the time taken by subjects to complete the task and the number of speech recognition errors that occurred. On the basis of user emotions, it can be said that the hybrid speech system enabled pleasurable interaction. Despite the limitations of speech recognition technology, users are set to talk to the ATM when it becomes available for public use.
Ait Khelifa-Gallois, N; Puget, S; Longaud, A; Laroussinie, F; Soria, C; Sainte-Rose, C; Dellatolas, G
2015-04-01
It has been suggested that the cerebellum is involved in reading acquisition and in particular in the progression from automatic grapheme-phoneme conversion to the internalization of speech required for silent reading. This idea is in line with clinical and neuroimaging data showing a cerebellar role in subvocal rehearsal for printed verbalizable material and with computational "internal models" of the cerebellum suggesting its role in inner speech (i.e. covert speech without mouthing the words). However, studies examining a possible cerebellar role in the suppression of articulatory movements during silent reading acquisition in children are lacking. Here, we report clinical evidence that the cerebellum plays a part in this transition. Reading performances were compared between a group of 17 paediatric patients treated for benign cerebellar tumours and a group of controls matched for age, gender, and parental socio-educational level. The patients scored significantly lower on all reading, but the most striking difference concerned silent reading, perfectly acquired by almost all controls, contrasting with 41 % of the patients who were unable to read any item silently. Silent reading was correlated with the Working Memory Index. The present findings converge with previous reports on an implication of the cerebellum in inner speech and in the automatization of reading. This cerebellar implication is probably not specific to reading, as it also seems to affect non-reading tasks such as counting.
Özçalışkan, Şeyda; Levine, Susan C.; Goldin-Meadow, Susan
2013-01-01
Children with pre/perinatal unilateral brain lesions (PL) show remarkable plasticity for language development. Is this plasticity characterized by the same developmental trajectory that characterizes typically developing (TD) children, with gesture leading the way into speech? We explored this question, comparing 11 children with PL—matched to 30 TD children on expressive vocabulary—in the second year of life. Children with PL showed similarities to TD children for simple but not complex sentence types. Children with PL produced simple sentences across gesture and speech several months before producing them entirely in speech, exhibiting parallel delays in both gesture+speech and speech-alone. However, unlike TD children, children with PL produced complex sentence types first in speech-alone. Overall, the gesture-speech system appears to be a robust feature of language-learning for simple—but not complex—sentence constructions, acting as a harbinger of change in language development even when that language is developing in an injured brain. PMID:23217292
Chaspari, Theodora; Soldatos, Constantin; Maragos, Petros
2015-01-01
The development of ecologically valid procedures for collecting reliable and unbiased emotional data towards computer interfaces with social and affective intelligence targeting patients with mental disorders. Following its development, presented with, the Athens Emotional States Inventory (AESI) proposes the design, recording and validation of an audiovisual database for five emotional states: anger, fear, joy, sadness and neutral. The items of the AESI consist of sentences each having content indicative of the corresponding emotion. Emotional content was assessed through a survey of 40 young participants with a questionnaire following the Latin square design. The emotional sentences that were correctly identified by 85% of the participants were recorded in a soundproof room with microphones and cameras. A preliminary validation of AESI is performed through automatic emotion recognition experiments from speech. The resulting database contains 696 recorded utterances in Greek language by 20 native speakers and has a total duration of approximately 28 min. Speech classification results yield accuracy up to 75.15% for automatically recognizing the emotions in AESI. These results indicate the usefulness of our approach for collecting emotional data with reliable content, balanced across classes and with reduced environmental variability.
Markov random field based automatic image alignment for electron tomography.
Amat, Fernando; Moussavi, Farshid; Comolli, Luis R; Elidan, Gal; Downing, Kenneth H; Horowitz, Mark
2008-03-01
We present a method for automatic full-precision alignment of the images in a tomographic tilt series. Full-precision automatic alignment of cryo electron microscopy images has remained a difficult challenge to date, due to the limited electron dose and low image contrast. These facts lead to poor signal to noise ratio (SNR) in the images, which causes automatic feature trackers to generate errors, even with high contrast gold particles as fiducial features. To enable fully automatic alignment for full-precision reconstructions, we frame the problem probabilistically as finding the most likely particle tracks given a set of noisy images, using contextual information to make the solution more robust to the noise in each image. To solve this maximum likelihood problem, we use Markov Random Fields (MRF) to establish the correspondence of features in alignment and robust optimization for projection model estimation. The resulting algorithm, called Robust Alignment and Projection Estimation for Tomographic Reconstruction, or RAPTOR, has not needed any manual intervention for the difficult datasets we have tried, and has provided sub-pixel alignment that is as good as the manual approach by an expert user. We are able to automatically map complete and partial marker trajectories and thus obtain highly accurate image alignment. Our method has been applied to challenging cryo electron tomographic datasets with low SNR from intact bacterial cells, as well as several plastic section and X-ray datasets.
Cohen, Alex S; Hong, S Lee; Guevara, Alvaro
2010-06-01
Emotional expression is an essential function for daily life that can be severely affected in some psychological disorders. Laboratory-based procedures designed to measure prosodic expression from natural speech have shown early promise for measuring individual differences in emotional expression but have yet to produce robust within-group prosodic changes across various evocative conditions. This report presents data from three separate studies (total N = 464) that digitally recorded subjects as they verbalized their reactions to various stimuli. Format and stimuli were modified to maximize prosodic expression. Our results suggest that use of evocative slides organized according to either a dimensional (e.g., high and low arousal - pleasant, unpleasant and neutral valence) or categorical (e.g., fear, surprise, happiness) models produced robust changes in subjective state but only negligible change in prosodic expression. Alternatively, speech from the recall of autobiographical memories resulted in meaningful changes in both subjective state and prosodic expression. Implications for the study of psychological disorders are discussed.
Creating speech-synchronized animation.
King, Scott A; Parent, Richard E
2005-01-01
We present a facial model designed primarily to support animated speech. Our facial model takes facial geometry as input and transforms it into a parametric deformable model. The facial model uses a muscle-based parameterization, allowing for easier integration between speech synchrony and facial expressions. Our facial model has a highly deformable lip model that is grafted onto the input facial geometry to provide the necessary geometric complexity needed for creating lip shapes and high-quality renderings. Our facial model also includes a highly deformable tongue model that can represent the shapes the tongue undergoes during speech. We add teeth, gums, and upper palate geometry to complete the inner mouth. To decrease the processing time, we hierarchically deform the facial surface. We also present a method to animate the facial model over time to create animated speech using a model of coarticulation that blends visemes together using dominance functions. We treat visemes as a dynamic shaping of the vocal tract by describing visemes as curves instead of keyframes. We show the utility of the techniques described in this paper by implementing them in a text-to-audiovisual-speech system that creates animation of speech from unrestricted text. The facial and coarticulation models must first be interactively initialized. The system then automatically creates accurate real-time animated speech from the input text. It is capable of cheaply producing tremendous amounts of animated speech with very low resource requirements.
Effects of stimulus response compatibility on covert imitation of vowels.
Adank, Patti; Nuttall, Helen; Bekkering, Harold; Maegherman, Gwijde
2018-03-13
When we observe someone else speaking, we tend to automatically activate the corresponding speech motor patterns. When listening, we therefore covertly imitate the observed speech. Simulation theories of speech perception propose that covert imitation of speech motor patterns supports speech perception. Covert imitation of speech has been studied with interference paradigms, including the stimulus-response compatibility paradigm (SRC). The SRC paradigm measures covert imitation by comparing articulation of a prompt following exposure to a distracter. Responses tend to be faster for congruent than for incongruent distracters; thus, showing evidence of covert imitation. Simulation accounts propose a key role for covert imitation in speech perception. However, covert imitation has thus far only been demonstrated for a select class of speech sounds, namely consonants, and it is unclear whether covert imitation extends to vowels. We aimed to demonstrate that covert imitation effects as measured with the SRC paradigm extend to vowels, in two experiments. We examined whether covert imitation occurs for vowels in a consonant-vowel-consonant context in visual, audio, and audiovisual modalities. We presented the prompt at four time points to examine how covert imitation varied over the distracter's duration. The results of both experiments clearly demonstrated covert imitation effects for vowels, thus supporting simulation theories of speech perception. Covert imitation was not affected by stimulus modality and was maximal for later time points.
Sound Classification in Hearing Aids Inspired by Auditory Scene Analysis
NASA Astrophysics Data System (ADS)
Büchler, Michael; Allegro, Silvia; Launer, Stefan; Dillier, Norbert
2005-12-01
A sound classification system for the automatic recognition of the acoustic environment in a hearing aid is discussed. The system distinguishes the four sound classes "clean speech," "speech in noise," "noise," and "music." A number of features that are inspired by auditory scene analysis are extracted from the sound signal. These features describe amplitude modulations, spectral profile, harmonicity, amplitude onsets, and rhythm. They are evaluated together with different pattern classifiers. Simple classifiers, such as rule-based and minimum-distance classifiers, are compared with more complex approaches, such as Bayes classifier, neural network, and hidden Markov model. Sounds from a large database are employed for both training and testing of the system. The achieved recognition rates are very high except for the class "speech in noise." Problems arise in the classification of compressed pop music, strongly reverberated speech, and tonal or fluctuating noises.
Abnormal laughter-like vocalisations replacing speech in primary progressive aphasia
Rohrer, Jonathan D.; Warren, Jason D.; Rossor, Martin N.
2009-01-01
We describe ten patients with a clinical diagnosis of primary progressive aphasia (PPA) (pathologically confirmed in three cases) who developed abnormal laughter-like vocalisations in the context of progressive speech output impairment leading to mutism. Failure of speech output was accompanied by increasing frequency of the abnormal vocalisations until ultimately they constituted the patient's only extended utterance. The laughter-like vocalisations did not show contextual sensitivity but occurred as an automatic vocal output that replaced speech. Acoustic analysis of the vocalisations in two patients revealed abnormal motor features including variable note duration and inter-note interval, loss of temporal symmetry of laugh notes and loss of the normal decrescendo. Abnormal laughter-like vocalisations may be a hallmark of a subgroup in the PPA spectrum with impaired control and production of nonverbal vocal behaviour due to disruption of fronto-temporal networks mediating vocalisation. PMID:19435636
Abnormal laughter-like vocalisations replacing speech in primary progressive aphasia.
Rohrer, Jonathan D; Warren, Jason D; Rossor, Martin N
2009-09-15
We describe ten patients with a clinical diagnosis of primary progressive aphasia (PPA) (pathologically confirmed in three cases) who developed abnormal laughter-like vocalisations in the context of progressive speech output impairment leading to mutism. Failure of speech output was accompanied by increasing frequency of the abnormal vocalisations until ultimately they constituted the patient's only extended utterance. The laughter-like vocalisations did not show contextual sensitivity but occurred as an automatic vocal output that replaced speech. Acoustic analysis of the vocalisations in two patients revealed abnormal motor features including variable note duration and inter-note interval, loss of temporal symmetry of laugh notes and loss of the normal decrescendo. Abnormal laughter-like vocalisations may be a hallmark of a subgroup in the PPA spectrum with impaired control and production of nonverbal vocal behaviour due to disruption of fronto-temporal networks mediating vocalisation.
Influence of musical training on understanding voiced and whispered speech in noise.
Ruggles, Dorea R; Freyman, Richard L; Oxenham, Andrew J
2014-01-01
This study tested the hypothesis that the previously reported advantage of musicians over non-musicians in understanding speech in noise arises from more efficient or robust coding of periodic voiced speech, particularly in fluctuating backgrounds. Speech intelligibility was measured in listeners with extensive musical training, and in those with very little musical training or experience, using normal (voiced) or whispered (unvoiced) grammatically correct nonsense sentences in noise that was spectrally shaped to match the long-term spectrum of the speech, and was either continuous or gated with a 16-Hz square wave. Performance was also measured in clinical speech-in-noise tests and in pitch discrimination. Musicians exhibited enhanced pitch discrimination, as expected. However, no systematic or statistically significant advantage for musicians over non-musicians was found in understanding either voiced or whispered sentences in either continuous or gated noise. Musicians also showed no statistically significant advantage in the clinical speech-in-noise tests. Overall, the results provide no evidence for a significant difference between young adult musicians and non-musicians in their ability to understand speech in noise.
Temporal Resolution Needed for Auditory Communication: Measurement With Mosaic Speech
Nakajima, Yoshitaka; Matsuda, Mizuki; Ueda, Kazuo; Remijn, Gerard B.
2018-01-01
Temporal resolution needed for Japanese speech communication was measured. A new experimental paradigm that can reflect the spectro-temporal resolution necessary for healthy listeners to perceive speech is introduced. As a first step, we report listeners' intelligibility scores of Japanese speech with a systematically degraded temporal resolution, so-called “mosaic speech”: speech mosaicized in the coordinates of time and frequency. The results of two experiments show that mosaic speech cut into short static segments was almost perfectly intelligible with a temporal resolution of 40 ms or finer. Intelligibility dropped for a temporal resolution of 80 ms, but was still around 50%-correct level. The data are in line with previous results showing that speech signals separated into short temporal segments of <100 ms can be remarkably robust in terms of linguistic-content perception against drastic manipulations in each segment, such as partial signal omission or temporal reversal. The human perceptual system thus can extract meaning from unexpectedly rough temporal information in speech. The process resembles that of the visual system stringing together static movie frames of ~40 ms into vivid motion. PMID:29740295
Prosodic Stress, Information, and Intelligibility of Speech in Noise
2009-02-28
across periods during which acoustic information has been suppressed. 15. SUBJECT TERMS Robust speech intelligibility Computational model of...Research Fellow at the Department of Computer Science at the University of Southern California). This research involved superimposing acoustic and...presented at an invitational-only session of the Acoustical Society of America’s and European Acoustic Association’s joint meeting in 2008. In summary, the
Analysis of speech sounds is left-hemisphere predominant at 100-150ms after sound onset.
Rinne, T; Alho, K; Alku, P; Holi, M; Sinkkonen, J; Virtanen, J; Bertrand, O; Näätänen, R
1999-04-06
Hemispheric specialization of human speech processing has been found in brain imaging studies using fMRI and PET. Due to the restricted time resolution, these methods cannot, however, determine the stage of auditory processing at which this specialization first emerges. We used a dense electrode array covering the whole scalp to record the mismatch negativity (MMN), an event-related brain potential (ERP) automatically elicited by occasional changes in sounds, which ranged from non-phonetic (tones) to phonetic (vowels). MMN can be used to probe auditory central processing on a millisecond scale with no attention-dependent task requirements. Our results indicate that speech processing occurs predominantly in the left hemisphere at the early, pre-attentive level of auditory analysis.
SaRAD: a Simple and Robust Abbreviation Dictionary.
Adar, Eytan
2004-03-01
Due to recent interest in the use of textual material to augment traditional experiments it has become necessary to automatically cluster, classify and filter natural language information. The Simple and Robust Abbreviation Dictionary (SaRAD) provides an easy to implement, high performance tool for the construction of a biomedical symbol dictionary. The algorithms, applied to the MEDLINE document set, result in a high quality dictionary and toolset to disambiguate abbreviation symbols automatically.
Speech training alters consonant and vowel responses in multiple auditory cortex fields
Engineer, Crystal T.; Rahebi, Kimiya C.; Buell, Elizabeth P.; Fink, Melyssa K.; Kilgard, Michael P.
2015-01-01
Speech sounds evoke unique neural activity patterns in primary auditory cortex (A1). Extensive speech sound discrimination training alters A1 responses. While the neighboring auditory cortical fields each contain information about speech sound identity, each field processes speech sounds differently. We hypothesized that while all fields would exhibit training-induced plasticity following speech training, there would be unique differences in how each field changes. In this study, rats were trained to discriminate speech sounds by consonant or vowel in quiet and in varying levels of background speech-shaped noise. Local field potential and multiunit responses were recorded from four auditory cortex fields in rats that had received 10 weeks of speech discrimination training. Our results reveal that training alters speech evoked responses in each of the auditory fields tested. The neural response to consonants was significantly stronger in anterior auditory field (AAF) and A1 following speech training. The neural response to vowels following speech training was significantly weaker in ventral auditory field (VAF) and posterior auditory field (PAF). This differential plasticity of consonant and vowel sound responses may result from the greater paired pulse depression, expanded low frequency tuning, reduced frequency selectivity, and lower tone thresholds, which occurred across the four auditory fields. These findings suggest that alterations in the distributed processing of behaviorally relevant sounds may contribute to robust speech discrimination. PMID:25827927
Speech recognition features for EEG signal description in detection of neonatal seizures.
Temko, A; Boylan, G; Marnane, W; Lightbody, G
2010-01-01
In this work, features which are usually employed in automatic speech recognition (ASR) are used for the detection of neonatal seizures in newborn EEG. Three conventional ASR feature sets are compared to the feature set which has been previously developed for this task. The results indicate that the thoroughly-studied spectral envelope based ASR features perform reasonably well on their own. Additionally, the SVM Recursive Feature Elimination routine is applied to all extracted features pooled together. It is shown that ASR features consistently appear among the top-rank features.
2010-09-01
the flies.”) or a present tense verb when describing what an airplane does (“An airplane flies.”) This disambiguation is, in general, computationally...as part-of-speech and dialog-act tagging, and yet the volume of data created makes human analysis impractical. We present a cross-genre part-of...acceptable automatic dialog-act determination. Furthermore, we show that a simple naı̈ve Bayes classifier achieves the same performance in a fraction of
A Deep Ensemble Learning Method for Monaural Speech Separation.
Zhang, Xiao-Lei; Wang, DeLiang
2016-03-01
Monaural speech separation is a fundamental problem in robust speech processing. Recently, deep neural network (DNN)-based speech separation methods, which predict either clean speech or an ideal time-frequency mask, have demonstrated remarkable performance improvement. However, a single DNN with a given window length does not leverage contextual information sufficiently, and the differences between the two optimization objectives are not well understood. In this paper, we propose a deep ensemble method, named multicontext networks, to address monaural speech separation. The first multicontext network averages the outputs of multiple DNNs whose inputs employ different window lengths. The second multicontext network is a stack of multiple DNNs. Each DNN in a module of the stack takes the concatenation of original acoustic features and expansion of the soft output of the lower module as its input, and predicts the ratio mask of the target speaker; the DNNs in the same module employ different contexts. We have conducted extensive experiments with three speech corpora. The results demonstrate the effectiveness of the proposed method. We have also compared the two optimization objectives systematically and found that predicting the ideal time-frequency mask is more efficient in utilizing clean training speech, while predicting clean speech is less sensitive to SNR variations.
Seeing a singer helps comprehension of the song's lyrics.
Jesse, Alexandra; Massaro, Dominic W
2010-06-01
When listening to speech, we often benefit when also seeing the speaker's face. If this advantage is not domain specific for speech, the recognition of sung lyrics should also benefit from seeing the singer's face. By independently varying the sight and sound of the lyrics, we found a substantial comprehension benefit of seeing a singer. This benefit was robust across participants, lyrics, and repetition of the test materials. This benefit was much larger than the benefit for sung lyrics obtained in previous research, which had not provided the visual information normally present in singing. Given that the comprehension of sung lyrics benefits from seeing the singer, just like speech comprehension benefits from seeing the speaker, both speech and music perception appear to be multisensory processes.
A Generative Model of Speech Production in Broca’s and Wernicke’s Areas
Price, Cathy J.; Crinion, Jenny T.; MacSweeney, Mairéad
2011-01-01
Speech production involves the generation of an auditory signal from the articulators and vocal tract. When the intended auditory signal does not match the produced sounds, subsequent articulatory commands can be adjusted to reduce the difference between the intended and produced sounds. This requires an internal model of the intended speech output that can be compared to the produced speech. The aim of this functional imaging study was to identify brain activation related to the internal model of speech production after activation related to vocalization, auditory feedback, and movement in the articulators had been controlled. There were four conditions: silent articulation of speech, non-speech mouth movements, finger tapping, and visual fixation. In the speech conditions, participants produced the mouth movements associated with the words “one” and “three.” We eliminated auditory feedback from the spoken output by instructing participants to articulate these words without producing any sound. The non-speech mouth movement conditions involved lip pursing and tongue protrusions to control for movement in the articulators. The main difference between our speech and non-speech mouth movement conditions is that prior experience producing speech sounds leads to the automatic and covert generation of auditory and phonological associations that may play a role in predicting auditory feedback. We found that, relative to non-speech mouth movements, silent speech activated Broca’s area in the left dorsal pars opercularis and Wernicke’s area in the left posterior superior temporal sulcus. We discuss these results in the context of a generative model of speech production and propose that Broca’s and Wernicke’s areas may be involved in predicting the speech output that follows articulation. These predictions could provide a mechanism by which rapid movement of the articulators is precisely matched to the intended speech outputs during future articulations. PMID:21954392
Sokka, Laura; Huotilainen, Minna; Leinikka, Marianne; Korpela, Jussi; Henelius, Andreas; Alain, Claude; Müller, Kiti; Pakarinen, Satu
2014-12-01
Job burnout is a significant cause of work absenteeism. Evidence from behavioral studies and patient reports suggests that job burnout is associated with impairments of attention and decreased working capacity, and it has overlapping elements with depression, anxiety and sleep disturbances. Here, we examined the electrophysiological correlates of automatic sound change detection and involuntary attention allocation in job burnout using scalp recordings of event-related potentials (ERP). Volunteers with job burnout symptoms but without severe depression and anxiety disorders and their non-burnout controls were presented with natural speech sound stimuli (standard and nine deviants), as well as three rarely occurring speech sounds with strong emotional prosody. All stimuli elicited mismatch negativity (MMN) responses that were comparable in both groups. The groups differed with respect to the P3a, an ERP component reflecting involuntary shift of attention: job burnout group showed a shorter P3a latency in response to the emotionally negative stimulus, and a longer latency in response to the positive stimulus. Results indicate that in job burnout, automatic speech sound discrimination is intact, but there is an attention capture tendency that is faster for negative, and slower to positive information compared to that of controls. Copyright © 2014 Elsevier B.V. All rights reserved.
Rhythm in disguise: why singing may not hold the key to recovery from aphasia
Kotz, Sonja A.; Henseler, Ilona; Turner, Robert; Geyer, Stefan
2011-01-01
The question of whether singing may be helpful for stroke patients with non-fluent aphasia has been debated for many years. However, the role of rhythm in speech recovery appears to have been neglected. In the current lesion study, we aimed to assess the relative importance of melody and rhythm for speech production in 17 non-fluent aphasics. Furthermore, we systematically alternated the lyrics to test for the influence of long-term memory and preserved motor automaticity in formulaic expressions. We controlled for vocal frequency variability, pitch accuracy, rhythmicity, syllable duration, phonetic complexity and other relevant factors, such as learning effects or the acoustic setting. Contrary to some opinion, our data suggest that singing may not be decisive for speech production in non-fluent aphasics. Instead, our results indicate that rhythm may be crucial, particularly for patients with lesions including the basal ganglia. Among the patients we studied, basal ganglia lesions accounted for more than 50% of the variance related to rhythmicity. Our findings therefore suggest that benefits typically attributed to melodic intoning in the past could actually have their roots in rhythm. Moreover, our data indicate that lyric production in non-fluent aphasics may be strongly mediated by long-term memory and motor automaticity, irrespective of whether lyrics are sung or spoken. PMID:21948939
A hypothetical neurological association between dehumanization and human rights abuses
Murrow, Gail B.; Murrow, Richard
2015-01-01
Dehumanization is anecdotally and historically associated with reduced empathy for the pain of dehumanized individuals and groups and with psychological and legal denial of their human rights and extreme violence against them. We hypothesize that ‘empathy’ for the pain and suffering of dehumanized social groups is automatically reduced because, as the research we review suggests, an individual's neural mechanisms of pain empathy best respond to (or produce empathy for) the pain of people whom the individual automatically or implicitly associates with her or his own species. This theory has implications for the philosophical conception of ‘human’ and of ‘legal personhood’ in human rights jurisprudence. It further has implications for First Amendment free speech jurisprudence, including the doctrine of ‘corporate personhood’ and consideration of the potential harm caused by dehumanizing hate speech. We suggest that the new, social neuroscience of empathy provides evidence that both the vagaries of the legal definition or legal fiction of ‘personhood’ and hate speech that explicitly and implicitly dehumanizes may (in their respective capacities to artificially humanize or dehumanize) manipulate the neural mechanisms of pain empathy in ways that could pose more of a true threat to human rights and rights-based democracy than previously appreciated. PMID:27774198
NASA Astrophysics Data System (ADS)
Kardava, Irakli; Tadyszak, Krzysztof; Gulua, Nana; Jurga, Stefan
2017-02-01
For more flexibility of environmental perception by artificial intelligence it is needed to exist the supporting software modules, which will be able to automate the creation of specific language syntax and to make a further analysis for relevant decisions based on semantic functions. According of our proposed approach, of which implementation it is possible to create the couples of formal rules of given sentences (in case of natural languages) or statements (in case of special languages) by helping of computer vision, speech recognition or editable text conversion system for further automatic improvement. In other words, we have developed an approach, by which it can be achieved to significantly improve the training process automation of artificial intelligence, which as a result will give us a higher level of self-developing skills independently from us (from users). At the base of our approach we have developed a software demo version, which includes the algorithm and software code for the entire above mentioned component's implementation (computer vision, speech recognition and editable text conversion system). The program has the ability to work in a multi - stream mode and simultaneously create a syntax based on receiving information from several sources.
Reading disorders and dyslexia.
Hulme, Charles; Snowling, Margaret J
2016-12-01
We review current knowledge about the nature of reading development and disorders, distinguishing between the processes involved in learning to decode print, and the processes involved in reading comprehension. Children with decoding difficulties/dyslexia experience deficits in phoneme awareness, letter-sound knowledge and rapid automatized naming in the preschool years and beyond. These phonological/language difficulties appear to be proximal causes of the problems in learning to decode print in dyslexia. We review data from a prospective study of children at high risk of dyslexia to show that being at family risk of dyslexia is a primary risk factor for poor reading and children with persistent language difficulties at school entry are more likely to develop reading problems. Early oral language difficulties are strong predictors of later difficulties in reading comprehension. There are two distinct forms of reading disorder in children: dyslexia (a difficulty in learning to translate print into speech) and reading comprehension impairment. Both forms of reading problem appear to be predominantly caused by deficits in underlying oral language skills. Implications for screening and for the delivery of robust interventions for language and reading are discussed.
Subcortical encoding of sound is enhanced in bilinguals and relates to executive function advantages
Krizman, Jennifer; Marian, Viorica; Shook, Anthony; Skoe, Erika; Kraus, Nina
2012-01-01
Bilingualism profoundly affects the brain, yielding functional and structural changes in cortical regions dedicated to language processing and executive function [Crinion J, et al. (2006) Science 312:1537–1540; Kim KHS, et al. (1997) Nature 388:171–174]. Comparatively, musical training, another type of sensory enrichment, translates to expertise in cognitive processing and refined biological processing of sound in both cortical and subcortical structures. Therefore, we asked whether bilingualism can also promote experience-dependent plasticity in subcortical auditory processing. We found that adolescent bilinguals, listening to the speech syllable [da], encoded the stimulus more robustly than age-matched monolinguals. Specifically, bilinguals showed enhanced encoding of the fundamental frequency, a feature known to underlie pitch perception and grouping of auditory objects. This enhancement was associated with executive function advantages. Thus, through experience-related tuning of attention, the bilingual auditory system becomes highly efficient in automatically processing sound. This study provides biological evidence for system-wide neural plasticity in auditory experts that facilitates a tight coupling of sensory and cognitive functions. PMID:22547804
Automatic Image Registration of Multimodal Remotely Sensed Data with Global Shearlet Features
NASA Technical Reports Server (NTRS)
Murphy, James M.; Le Moigne, Jacqueline; Harding, David J.
2015-01-01
Automatic image registration is the process of aligning two or more images of approximately the same scene with minimal human assistance. Wavelet-based automatic registration methods are standard, but sometimes are not robust to the choice of initial conditions. That is, if the images to be registered are too far apart relative to the initial guess of the algorithm, the registration algorithm does not converge or has poor accuracy, and is thus not robust. These problems occur because wavelet techniques primarily identify isotropic textural features and are less effective at identifying linear and curvilinear edge features. We integrate the recently developed mathematical construction of shearlets, which is more effective at identifying sparse anisotropic edges, with an existing automatic wavelet-based registration algorithm. Our shearlet features algorithm produces more distinct features than wavelet features algorithms; the separation of edges from textures is even stronger than with wavelets. Our algorithm computes shearlet and wavelet features for the images to be registered, then performs least squares minimization on these features to compute a registration transformation. Our algorithm is two-staged and multiresolution in nature. First, a cascade of shearlet features is used to provide a robust, though approximate, registration. This is then refined by registering with a cascade of wavelet features. Experiments across a variety of image classes show an improved robustness to initial conditions, when compared to wavelet features alone.
Automatic Image Registration of Multi-Modal Remotely Sensed Data with Global Shearlet Features
Murphy, James M.; Le Moigne, Jacqueline; Harding, David J.
2017-01-01
Automatic image registration is the process of aligning two or more images of approximately the same scene with minimal human assistance. Wavelet-based automatic registration methods are standard, but sometimes are not robust to the choice of initial conditions. That is, if the images to be registered are too far apart relative to the initial guess of the algorithm, the registration algorithm does not converge or has poor accuracy, and is thus not robust. These problems occur because wavelet techniques primarily identify isotropic textural features and are less effective at identifying linear and curvilinear edge features. We integrate the recently developed mathematical construction of shearlets, which is more effective at identifying sparse anisotropic edges, with an existing automatic wavelet-based registration algorithm. Our shearlet features algorithm produces more distinct features than wavelet features algorithms; the separation of edges from textures is even stronger than with wavelets. Our algorithm computes shearlet and wavelet features for the images to be registered, then performs least squares minimization on these features to compute a registration transformation. Our algorithm is two-staged and multiresolution in nature. First, a cascade of shearlet features is used to provide a robust, though approximate, registration. This is then refined by registering with a cascade of wavelet features. Experiments across a variety of image classes show an improved robustness to initial conditions, when compared to wavelet features alone. PMID:29123329
Research on Spoken Dialogue Systems
NASA Technical Reports Server (NTRS)
Aist, Gregory; Hieronymus, James; Dowding, John; Hockey, Beth Ann; Rayner, Manny; Chatzichrisafis, Nikos; Farrell, Kim; Renders, Jean-Michel
2010-01-01
Research in the field of spoken dialogue systems has been performed with the goal of making such systems more robust and easier to use in demanding situations. The term "spoken dialogue systems" signifies unified software systems containing speech-recognition, speech-synthesis, dialogue management, and ancillary components that enable human users to communicate, using natural spoken language or nearly natural prescribed spoken language, with other software systems that provide information and/or services.
Tone classification of syllable-segmented Thai speech based on multilayer perception
NASA Astrophysics Data System (ADS)
Satravaha, Nuttavudh; Klinkhachorn, Powsiri; Lass, Norman
2002-05-01
Thai is a monosyllabic tonal language that uses tone to convey lexical information about the meaning of a syllable. Thus to completely recognize a spoken Thai syllable, a speech recognition system not only has to recognize a base syllable but also must correctly identify a tone. Hence, tone classification of Thai speech is an essential part of a Thai speech recognition system. Thai has five distinctive tones (``mid,'' ``low,'' ``falling,'' ``high,'' and ``rising'') and each tone is represented by a single fundamental frequency (F0) pattern. However, several factors, including tonal coarticulation, stress, intonation, and speaker variability, affect the F0 pattern of a syllable in continuous Thai speech. In this study, an efficient method for tone classification of syllable-segmented Thai speech, which incorporates the effects of tonal coarticulation, stress, and intonation, as well as a method to perform automatic syllable segmentation, were developed. Acoustic parameters were used as the main discriminating parameters. The F0 contour of a segmented syllable was normalized by using a z-score transformation before being presented to a tone classifier. The proposed system was evaluated on 920 test utterances spoken by 8 speakers. A recognition rate of 91.36% was achieved by the proposed system.
A novel probabilistic framework for event-based speech recognition
NASA Astrophysics Data System (ADS)
Juneja, Amit; Espy-Wilson, Carol
2003-10-01
One of the reasons for unsatisfactory performance of the state-of-the-art automatic speech recognition (ASR) systems is the inferior acoustic modeling of low-level acoustic-phonetic information in the speech signal. An acoustic-phonetic approach to ASR, on the other hand, explicitly targets linguistic information in the speech signal, but such a system for continuous speech recognition (CSR) is not known to exist. A probabilistic and statistical framework for CSR based on the idea of the representation of speech sounds by bundles of binary valued articulatory phonetic features is proposed. Multiple probabilistic sequences of linguistically motivated landmarks are obtained using binary classifiers of manner phonetic features-syllabic, sonorant and continuant-and the knowledge-based acoustic parameters (APs) that are acoustic correlates of those features. The landmarks are then used for the extraction of knowledge-based APs for source and place phonetic features and their binary classification. Probabilistic landmark sequences are constrained using manner class language models for isolated or connected word recognition. The proposed method could overcome the disadvantages encountered by the early acoustic-phonetic knowledge-based systems that led the ASR community to switch to systems highly dependent on statistical pattern analysis methods and probabilistic language or grammar models.
Vogel, Adam P; Block, Susan; Kefalianos, Elaina; Onslow, Mark; Eadie, Patricia; Barth, Ben; Conway, Laura; Mundt, James C; Reilly, Sheena
2015-04-01
To investigate the feasibility of adopting automated interactive voice response (IVR) technology for remotely capturing standardized speech samples from stuttering children. Participants were 10 6-year-old stuttering children. Their parents called a toll-free number from their homes and were prompted to elicit speech from their children using a standard protocol involving conversation, picture description and games. The automated IVR system was implemented using an off-the-shelf telephony software program and delivered by a standard desktop computer. The software infrastructure utilizes voice over internet protocol. Speech samples were automatically recorded during the calls. Video recordings were simultaneously acquired in the home at the time of the call to evaluate the fidelity of the telephone collected samples. Key outcome measures included syllables spoken, percentage of syllables stuttered and an overall rating of stuttering severity using a 10-point scale. Data revealed a high level of relative reliability in terms of intra-class correlation between the video and telephone acquired samples on all outcome measures during the conversation task. Findings were less consistent for speech samples during picture description and games. Results suggest that IVR technology can be used successfully to automate remote capture of child speech samples.
Automatic detection of obstructive sleep apnea using speech signals.
Goldshtein, Evgenia; Tarasiuk, Ariel; Zigel, Yaniv
2011-05-01
Obstructive sleep apnea (OSA) is a common disorder associated with anatomical abnormalities of the upper airways that affects 5% of the population. Acoustic parameters may be influenced by the vocal tract structure and soft tissue properties. We hypothesize that speech signal properties of OSA patients will be different than those of control subjects not having OSA. Using speech signal processing techniques, we explored acoustic speech features of 93 subjects who were recorded using a text-dependent speech protocol and a digital audio recorder immediately prior to polysomnography study. Following analysis of the study, subjects were divided into OSA (n=67) and non-OSA (n=26) groups. A Gaussian mixture model-based system was developed to model and classify between the groups; discriminative features such as vocal tract length and linear prediction coefficients were selected using feature selection technique. Specificity and sensitivity of 83% and 79% were achieved for the male OSA and 86% and 84% for the female OSA patients, respectively. We conclude that acoustic features from speech signals during wakefulness can detect OSA patients with good specificity and sensitivity. Such a system can be used as a basis for future development of a tool for OSA screening. © 2011 IEEE
Speech rhythm alterations in Spanish-speaking individuals with Alzheimer's disease.
Martínez-Sánchez, Francisco; Meilán, Juan J G; Vera-Ferrandiz, Juan Antonio; Carro, Juan; Pujante-Valverde, Isabel M; Ivanova, Olga; Carcavilla, Nuria
2017-07-01
Rhythm is the speech property related to the temporal organization of sounds. Considerable evidence is now available for suggesting that dementia of Alzheimer's type is associated with impairments in speech rhythm. The aim of this study is to assess the use of an automatic computerized system for measuring speech rhythm characteristics in an oral reading task performed by 45 patients with Alzheimer's disease (AD) compared with those same characteristics among 82 healthy older adults without a diagnosis of dementia, and matched by age, sex and cultural background. Ranges of rhythmic-metric and clinical measurements were applied. The results show rhythmic differences between the groups, with higher variability of syllabic intervals in AD patients. Signal processing algorithms applied to oral reading recordings prove to be capable of differentiating between AD patients and older adults without dementia with an accuracy of 87% (specificity 81.7%, sensitivity 82.2%), based on the standard deviation of the duration of syllabic intervals. Experimental results show that the syllabic variability measurements extracted from the speech signal can be used to distinguish between older adults without a diagnosis of dementia and those with AD, and may be useful as a tool for the objective study and quantification of speech deficits in AD.
When speaker identity is unavoidable: Neural processing of speaker identity cues in natural speech.
Tuninetti, Alba; Chládková, Kateřina; Peter, Varghese; Schiller, Niels O; Escudero, Paola
2017-11-01
Speech sound acoustic properties vary largely across speakers and accents. When perceiving speech, adult listeners normally disregard non-linguistic variation caused by speaker or accent differences, in order to comprehend the linguistic message, e.g. to correctly identify a speech sound or a word. Here we tested whether the process of normalizing speaker and accent differences, facilitating the recognition of linguistic information, is found at the level of neural processing, and whether it is modulated by the listeners' native language. In a multi-deviant oddball paradigm, native and nonnative speakers of Dutch were exposed to naturally-produced Dutch vowels varying in speaker, sex, accent, and phoneme identity. Unexpectedly, the analysis of mismatch negativity (MMN) amplitudes elicited by each type of change shows a large degree of early perceptual sensitivity to non-linguistic cues. This finding on perception of naturally-produced stimuli contrasts with previous studies examining the perception of synthetic stimuli wherein adult listeners automatically disregard acoustic cues to speaker identity. The present finding bears relevance to speech normalization theories, suggesting that at an unattended level of processing, listeners are indeed sensitive to changes in fundamental frequency in natural speech tokens. Copyright © 2017 Elsevier Inc. All rights reserved.
Fang, Chunying; Li, Haifeng; Ma, Lin; Zhang, Mancai
2017-01-01
Pathological speech usually refers to speech distortion resulting from illness or other biological insults. The assessment of pathological speech plays an important role in assisting the experts, while automatic evaluation of speech intelligibility is difficult because it is usually nonstationary and mutational. In this paper, we carry out an independent innovation of feature extraction and reduction, and we describe a multigranularity combined feature scheme which is optimized by the hierarchical visual method. A novel method of generating feature set based on S -transform and chaotic analysis is proposed. There are BAFS (430, basic acoustics feature), local spectral characteristics MSCC (84, Mel S -transform cepstrum coefficients), and chaotic features (12). Finally, radar chart and F -score are proposed to optimize the features by the hierarchical visual fusion. The feature set could be optimized from 526 to 96 dimensions based on NKI-CCRT corpus and 104 dimensions based on SVD corpus. The experimental results denote that new features by support vector machine (SVM) have the best performance, with a recognition rate of 84.4% on NKI-CCRT corpus and 78.7% on SVD corpus. The proposed method is thus approved to be effective and reliable for pathological speech intelligibility evaluation.
Tsanas, Athanasios; Zañartu, Matías; Little, Max A.; Fox, Cynthia; Ramig, Lorraine O.; Clifford, Gari D.
2014-01-01
There has been consistent interest among speech signal processing researchers in the accurate estimation of the fundamental frequency (F0) of speech signals. This study examines ten F0 estimation algorithms (some well-established and some proposed more recently) to determine which of these algorithms is, on average, better able to estimate F0 in the sustained vowel /a/. Moreover, a robust method for adaptively weighting the estimates of individual F0 estimation algorithms based on quality and performance measures is proposed, using an adaptive Kalman filter (KF) framework. The accuracy of the algorithms is validated using (a) a database of 117 synthetic realistic phonations obtained using a sophisticated physiological model of speech production and (b) a database of 65 recordings of human phonations where the glottal cycles are calculated from electroglottograph signals. On average, the sawtooth waveform inspired pitch estimator and the nearly defect-free algorithms provided the best individual F0 estimates, and the proposed KF approach resulted in a ∼16% improvement in accuracy over the best single F0 estimation algorithm. These findings may be useful in speech signal processing applications where sustained vowels are used to assess vocal quality, when very accurate F0 estimation is required. PMID:24815269
A speech-controlled environmental control system for people with severe dysarthria.
Hawley, Mark S; Enderby, Pam; Green, Phil; Cunningham, Stuart; Brownsell, Simon; Carmichael, James; Parker, Mark; Hatzis, Athanassios; O'Neill, Peter; Palmer, Rebecca
2007-06-01
Automatic speech recognition (ASR) can provide a rapid means of controlling electronic assistive technology. Off-the-shelf ASR systems function poorly for users with severe dysarthria because of the increased variability of their articulations. We have developed a limited vocabulary speaker dependent speech recognition application which has greater tolerance to variability of speech, coupled with a computerised training package which assists dysarthric speakers to improve the consistency of their vocalisations and provides more data for recogniser training. These applications, and their implementation as the interface for a speech-controlled environmental control system (ECS), are described. The results of field trials to evaluate the training program and the speech-controlled ECS are presented. The user-training phase increased the recognition rate from 88.5% to 95.4% (p<0.001). Recognition rates were good for people with even the most severe dysarthria in everyday usage in the home (mean word recognition rate 86.9%). Speech-controlled ECS were less accurate (mean task completion accuracy 78.6% versus 94.8%) but were faster to use than switch-scanning systems, even taking into account the need to repeat unsuccessful operations (mean task completion time 7.7s versus 16.9s, p<0.001). It is concluded that a speech-controlled ECS is a viable alternative to switch-scanning systems for some people with severe dysarthria and would lead, in many cases, to more efficient control of the home.
Parbery-Clark, Alexandra; Anderson, Samira; Hittner, Emily; Kraus, Nina
2012-01-01
Older adults frequently complain that while they can hear a person talking, they cannot understand what is being said; this difficulty is exacerbated by background noise. Peripheral hearing loss cannot fully account for this age-related decline in speech-in-noise ability, as declines in central processing also contribute to this problem. Given that musicians have enhanced speech-in-noise perception, we aimed to define the effects of musical experience on subcortical responses to speech and speech-in-noise perception in middle-aged adults. Results reveal that musicians have enhanced neural encoding of speech in quiet and noisy settings. Enhancements include faster neural response timing, higher neural response consistency, more robust encoding of speech harmonics, and greater neural precision. Taken together, we suggest that musical experience provides perceptual benefits in an aging population by strengthening the underlying neural pathways necessary for the accurate representation of important temporal and spectral features of sound. PMID:23189051
Speech versus non-speech as irrelevant sound: controlling acoustic variation.
Little, Jason S; Martin, Frances Heritage; Thomson, Richard H S
2010-09-01
Functional differences between speech and non-speech within the irrelevant sound effect were investigated using repeated and changing formats of irrelevant sounds in the form of intelligible words and unintelligible signal correlated noise (SCN) versions of the words. Event-related potentials were recorded from 25 females aged between 18 and 25 while they completed a serial order recall task in the presence of irrelevant sound or silence. As expected and in line with the changing-state hypothesis both words and SCN produced robust changing-state effects. However, words produced a greater changing-state effect than SCN indicating that the spectral detail inherent within speech accounts for the greater irrelevant sound effect and changing-state effect typically observed with speech. ERP data in the form of N1 amplitude was modulated within some irrelevant sound conditions suggesting that attentional aspects are involved in the elicitation of the irrelevant sound effect. Copyright (c) 2010 Elsevier B.V. All rights reserved.
McCabe, Patricia J
2018-06-01
Evidence-based practice (EBP) is a well-accepted theoretical framework around which speech-language pathologists strive to build their clinical decisions. The profession's conceptualisation of EBP has been evolving over the last 20 years with the practice of EBP now needing to balance research evidence, clinical data and informed patient choices. However, although EBP is not a new concept, as a profession, we seem to be no closer to closing the gap between research evidence and practice than we were at the start of the movement toward EBP in the late 1990s. This paper examines why speech-language pathologists find it difficult to change our own practice when we are experts in changing the behaviour of others. Using the lens of behavioural economics to examine the heuristics and cognitive processes which facilitate and inhibit change, the paper explores research showing how inconsistency of belief and action, or cognitive dissonance, is inevitable unless we act reflectively instead of automatically. The paper argues that heuristics that prevent us changing our practice toward EBP include the sunk cost fallacy, loss aversion, social desirability bias, choice overload and inertia. These automatic cognitive processes work to inhibit change and may partially account for the slow translation of research into practice. Fortunately, understanding and using other heuristics such as the framing effect, reciprocity, social proof, consistency and commitment may help us to understand our own behaviour as speech-language pathologists and help the profession, and those we work with, move towards EBP.
Speech waveform perturbation analysis: a perceptual-acoustical comparison of seven measures.
Askenfelt, A G; Hammarberg, B
1986-03-01
The performance of seven acoustic measures of cycle-to-cycle variations (perturbations) in the speech waveform was compared. All measures were calculated automatically and applied on running speech. Three of the measures refer to the frequency of occurrence and severity of waveform perturbations in special selected parts of the speech, identified by means of the rate of change in the fundamental frequency. Three other measures refer to statistical properties of the distribution of the relative frequency differences between adjacent pitch periods. One perturbation measure refers to the percentage of consecutive pitch period differences with alternating signs. The acoustic measures were tested on tape recorded speech samples from 41 voice patients, before and after successful therapy. Scattergrams of acoustic waveform perturbation data versus an average of perceived deviant voice qualities, as rated by voice clinicians, are presented. The perturbation measures were compared with regard to the acoustic-perceptual correlation and their ability to discriminate between normal and pathological voice status. The standard deviation of the distribution of the relative frequency differences was suggested as the most useful acoustic measure of waveform perturbations for clinical applications.
Design of a robust baseband LPC coder for speech transmission over 9.6 kbit/s noisy channels
NASA Astrophysics Data System (ADS)
Viswanathan, V. R.; Russell, W. H.; Higgins, A. L.
1982-04-01
This paper describes the design of a baseband Linear Predictive Coder (LPC) which transmits speech over 9.6 kbit/sec synchronous channels with random bit errors of up to 1%. Presented are the results of our investigation of a number of aspects of the baseband LPC coder with the goal of maximizing the quality of the transmitted speech. Important among these aspects are: bandwidth of the baseband, coding of the baseband residual, high-frequency regeneration, and error protection of important transmission parameters. The paper discusses these and other issues, presents the results of speech-quality tests conducted during the various stages of optimization, and describes the details of the optimized speech coder. This optimized speech coding algorithm has been implemented as a real-time full-duplex system on an array processor. Informal listening tests of the real-time coder have shown that the coder produces good speech quality in the absence of channel bit errors and introduces only a slight degradation in quality for channel bit error rates of up to 1%.
Cortical activity patterns predict robust speech discrimination ability in noise
Shetake, Jai A.; Wolf, Jordan T.; Cheung, Ryan J.; Engineer, Crystal T.; Ram, Satyananda K.; Kilgard, Michael P.
2012-01-01
The neural mechanisms that support speech discrimination in noisy conditions are poorly understood. In quiet conditions, spike timing information appears to be used in the discrimination of speech sounds. In this study, we evaluated the hypothesis that spike timing is also used to distinguish between speech sounds in noisy conditions that significantly degrade neural responses to speech sounds. We tested speech sound discrimination in rats and recorded primary auditory cortex (A1) responses to speech sounds in background noise of different intensities and spectral compositions. Our behavioral results indicate that rats, like humans, are able to accurately discriminate consonant sounds even in the presence of background noise that is as loud as the speech signal. Our neural recordings confirm that speech sounds evoke degraded but detectable responses in noise. Finally, we developed a novel neural classifier that mimics behavioral discrimination. The classifier discriminates between speech sounds by comparing the A1 spatiotemporal activity patterns evoked on single trials with the average spatiotemporal patterns evoked by known sounds. Unlike classifiers in most previous studies, this classifier is not provided with the stimulus onset time. Neural activity analyzed with the use of relative spike timing was well correlated with behavioral speech discrimination in quiet and in noise. Spike timing information integrated over longer intervals was required to accurately predict rat behavioral speech discrimination in noisy conditions. The similarity of neural and behavioral discrimination of speech in noise suggests that humans and rats may employ similar brain mechanisms to solve this problem. PMID:22098331
Dissecting choral speech: properties of the accompanist critical to stuttering reduction.
Kiefte, Michael; Armson, Joy
2008-01-01
The effects of choral speech and altered auditory feedback (AAF) on stuttering frequency were compared to identify those properties of choral speech that make it a more effective condition for stuttering reduction. Seventeen adults who stutter (AWS) participated in an experiment consisting of special choral speech conditions that were manipulated to selectively eliminate specific differences between choral speech and AAF. Consistent with previous findings, results showed that both choral speech and AAF reduced stuttering compared to solo reading. Although reductions under AAF were substantial, they were less dramatic than those for choral speech. Stuttering reduction for choral speech was highly robust even when the accompanist's voice temporally lagged that of the AWS, when there was no opportunity for dynamic interplay between the AWS and accompanist, and when the accompanist was replaced by the AWS's own voice, all of which approximate specific features of AAF. Choral speech was also highly effective in reducing stuttering across changes in speech rate and for both familiar and unfamiliar passages. We concluded that differences in properties between choral speech and AAF other than those that were manipulated in this experiment must account for differences in stuttering reduction. The reader will be able to (1) describe differences in stuttering reduction associated with altered auditory feedback compared to choral speech conditions and (2) describe differences between delivery of a second voice signal as an altered rendition of the speakers own voice (altered auditory feedback) and alterations in the voice of an accompanist (choral speech).
Approximated mutual information training for speech recognition using myoelectric signals.
Guo, Hua J; Chan, A D C
2006-01-01
A new training algorithm called the approximated maximum mutual information (AMMI) is proposed to improve the accuracy of myoelectric speech recognition using hidden Markov models (HMMs). Previous studies have demonstrated that automatic speech recognition can be performed using myoelectric signals from articulatory muscles of the face. Classification of facial myoelectric signals can be performed using HMMs that are trained using the maximum likelihood (ML) algorithm; however, this algorithm maximizes the likelihood of the observations in the training sequence, which is not directly associated with optimal classification accuracy. The AMMI training algorithm attempts to maximize the mutual information, thereby training the HMMs to optimize their parameters for discrimination. Our results show that AMMI training consistently reduces the error rates compared to these by the ML training, increasing the accuracy by approximately 3% on average.
Highlight summarization in golf videos using audio signals
NASA Astrophysics Data System (ADS)
Kim, Hyoung-Gook; Kim, Jin Young
2008-01-01
In this paper, we present an automatic summarization of highlights in golf videos based on audio information alone without video information. The proposed highlight summarization system is carried out based on semantic audio segmentation and detection on action units from audio signals. Studio speech, field speech, music, and applause are segmented by means of sound classification. Swing is detected by the methods of impulse onset detection. Sounds like swing and applause form a complete action unit, while studio speech and music parts are used to anchor the program structure. With the advantage of highly precise detection of applause, highlights are extracted effectively. Our experimental results obtain high classification precision on 18 golf games. It proves that the proposed system is very effective and computationally efficient to apply the technology to embedded consumer electronic devices.
Role of Visual Speech in Phonological Processing by Children With Hearing Loss
Jerger, Susan; Tye-Murray, Nancy; Abdi, Hervé
2011-01-01
Purpose This research assessed the influence of visual speech on phonological processing by children with hearing loss (HL). Method Children with HL and children with normal hearing (NH) named pictures while attempting to ignore auditory or audiovisual speech distractors whose onsets relative to the pictures were either congruent, conflicting in place of articulation, or conflicting in voicing—for example, the picture “pizza” coupled with the distractors “peach,” “teacher,” or “beast,” respectively. Speed of picture naming was measured. Results The conflicting conditions slowed naming, and phonological processing by children with HL displayed the age-related shift in sensitivity to visual speech seen in children with NH, although with developmental delay. Younger children with HL exhibited a disproportionately large influence of visual speech and a negligible influence of auditory speech, whereas older children with HL showed a robust influence of auditory speech with no benefit to performance from adding visual speech. The congruent conditions did not speed naming in children with HL, nor did the addition of visual speech influence performance. Unexpectedly, the /∧/-vowel congruent distractors slowed naming in children with HL and decreased articulatory proficiency. Conclusions Results for the conflicting conditions are consistent with the hypothesis that speech representations in children with HL (a) are initially disproportionally structured in terms of visual speech and (b) become better specified with age in terms of auditorily encoded information. PMID:19339701
Shao, Yu; Chang, Chip-Hong
2007-08-01
We present a new speech enhancement scheme for a single-microphone system to meet the demand for quality noise reduction algorithms capable of operating at a very low signal-to-noise ratio. A psychoacoustic model is incorporated into the generalized perceptual wavelet denoising method to reduce the residual noise and improve the intelligibility of speech. The proposed method is a generalized time-frequency subtraction algorithm, which advantageously exploits the wavelet multirate signal representation to preserve the critical transient information. Simultaneous masking and temporal masking of the human auditory system are modeled by the perceptual wavelet packet transform via the frequency and temporal localization of speech components. The wavelet coefficients are used to calculate the Bark spreading energy and temporal spreading energy, from which a time-frequency masking threshold is deduced to adaptively adjust the subtraction parameters of the proposed method. An unvoiced speech enhancement algorithm is also integrated into the system to improve the intelligibility of speech. Through rigorous objective and subjective evaluations, it is shown that the proposed speech enhancement system is capable of reducing noise with little speech degradation in adverse noise environments and the overall performance is superior to several competitive methods.
NASA Astrophysics Data System (ADS)
Palaniswamy, Sumithra; Duraisamy, Prakash; Alam, Mohammad Showkat; Yuan, Xiaohui
2012-04-01
Automatic speech processing systems are widely used in everyday life such as mobile communication, speech and speaker recognition, and for assisting the hearing impaired. In speech communication systems, the quality and intelligibility of speech is of utmost importance for ease and accuracy of information exchange. To obtain an intelligible speech signal and one that is more pleasant to listen, noise reduction is essential. In this paper a new Time Adaptive Discrete Bionic Wavelet Thresholding (TADBWT) scheme is proposed. The proposed technique uses Daubechies mother wavelet to achieve better enhancement of speech from additive non- stationary noises which occur in real life such as street noise and factory noise. Due to the integration of human auditory system model into the wavelet transform, bionic wavelet transform (BWT) has great potential for speech enhancement which may lead to a new path in speech processing. In the proposed technique, at first, discrete BWT is applied to noisy speech to derive TADBWT coefficients. Then the adaptive nature of the BWT is captured by introducing a time varying linear factor which updates the coefficients at each scale over time. This approach has shown better performance than the existing algorithms at lower input SNR due to modified soft level dependent thresholding on time adaptive coefficients. The objective and subjective test results confirmed the competency of the TADBWT technique. The effectiveness of the proposed technique is also evaluated for speaker recognition task under noisy environment. The recognition results show that the TADWT technique yields better performance when compared to alternate methods specifically at lower input SNR.
Watch what you say, your computer might be listening: A review of automated speech recognition
NASA Technical Reports Server (NTRS)
Degennaro, Stephen V.
1991-01-01
Spoken language is the most convenient and natural means by which people interact with each other and is, therefore, a promising candidate for human-machine interactions. Speech also offers an additional channel for hands-busy applications, complementing the use of motor output channels for control. Current speech recognition systems vary considerably across a number of important characteristics, including vocabulary size, speaking mode, training requirements for new speakers, robustness to acoustic environments, and accuracy. Algorithmically, these systems range from rule-based techniques through more probabilistic or self-learning approaches such as hidden Markov modeling and neural networks. This tutorial begins with a brief summary of the relevant features of current speech recognition systems and the strengths and weaknesses of the various algorithmic approaches.
A robust and hierarchical approach for the automatic co-registration of intensity and visible images
NASA Astrophysics Data System (ADS)
González-Aguilera, Diego; Rodríguez-Gonzálvez, Pablo; Hernández-López, David; Luis Lerma, José
2012-09-01
This paper presents a new robust approach to integrate intensity and visible images which have been acquired with a terrestrial laser scanner and a calibrated digital camera, respectively. In particular, an automatic and hierarchical method for the co-registration of both sensors is developed. The approach integrates several existing solutions to improve the performance of the co-registration between range-based and visible images: the Affine Scale-Invariant Feature Transform (A-SIFT), the epipolar geometry, the collinearity equations, the Groebner basis solution and the RANdom SAmple Consensus (RANSAC), integrating a voting scheme. The approach presented herein improves the existing co-registration approaches in automation, robustness, reliability and accuracy.
Fifty years of progress in speech and speaker recognition
NASA Astrophysics Data System (ADS)
Furui, Sadaoki
2004-10-01
Speech and speaker recognition technology has made very significant progress in the past 50 years. The progress can be summarized by the following changes: (1) from template matching to corpus-base statistical modeling, e.g., HMM and n-grams, (2) from filter bank/spectral resonance to Cepstral features (Cepstrum + DCepstrum + DDCepstrum), (3) from heuristic time-normalization to DTW/DP matching, (4) from gdistanceh-based to likelihood-based methods, (5) from maximum likelihood to discriminative approach, e.g., MCE/GPD and MMI, (6) from isolated word to continuous speech recognition, (7) from small vocabulary to large vocabulary recognition, (8) from context-independent units to context-dependent units for recognition, (9) from clean speech to noisy/telephone speech recognition, (10) from single speaker to speaker-independent/adaptive recognition, (11) from monologue to dialogue/conversation recognition, (12) from read speech to spontaneous speech recognition, (13) from recognition to understanding, (14) from single-modality (audio signal only) to multi-modal (audio/visual) speech recognition, (15) from hardware recognizer to software recognizer, and (16) from no commercial application to many practical commercial applications. Most of these advances have taken place in both the fields of speech recognition and speaker recognition. The majority of technological changes have been directed toward the purpose of increasing robustness of recognition, including many other additional important techniques not noted above.
Ding, Nai; Pan, Xunyi; Luo, Cheng; Su, Naifei; Zhang, Wen; Zhang, Jianfeng
2018-01-31
How the brain groups sequential sensory events into chunks is a fundamental question in cognitive neuroscience. This study investigates whether top-down attention or specific tasks are required for the brain to apply lexical knowledge to group syllables into words. Neural responses tracking the syllabic and word rhythms of a rhythmic speech sequence were concurrently monitored using electroencephalography (EEG). The participants performed different tasks, attending to either the rhythmic speech sequence or a distractor, which was another speech stream or a nonlinguistic auditory/visual stimulus. Attention to speech, but not a lexical-meaning-related task, was required for reliable neural tracking of words, even when the distractor was a nonlinguistic stimulus presented cross-modally. Neural tracking of syllables, however, was reliably observed in all tested conditions. These results strongly suggest that neural encoding of individual auditory events (i.e., syllables) is automatic, while knowledge-based construction of temporal chunks (i.e., words) crucially relies on top-down attention. SIGNIFICANCE STATEMENT Why we cannot understand speech when not paying attention is an old question in psychology and cognitive neuroscience. Speech processing is a complex process that involves multiple stages, e.g., hearing and analyzing the speech sound, recognizing words, and combining words into phrases and sentences. The current study investigates which speech-processing stage is blocked when we do not listen carefully. We show that the brain can reliably encode syllables, basic units of speech sounds, even when we do not pay attention. Nevertheless, when distracted, the brain cannot group syllables into multisyllabic words, which are basic units for speech meaning. Therefore, the process of converting speech sound into meaning crucially relies on attention. Copyright © 2018 the authors 0270-6474/18/381178-11$15.00/0.
Methods for eliciting, annotating, and analyzing databases for child speech development.
Beckman, Mary E; Plummer, Andrew R; Munson, Benjamin; Reidy, Patrick F
2017-09-01
Methods from automatic speech recognition (ASR), such as segmentation and forced alignment, have facilitated the rapid annotation and analysis of very large adult speech databases and databases of caregiver-infant interaction, enabling advances in speech science that were unimaginable just a few decades ago. This paper centers on two main problems that must be addressed in order to have analogous resources for developing and exploiting databases of young children's speech. The first problem is to understand and appreciate the differences between adult and child speech that cause ASR models developed for adult speech to fail when applied to child speech. These differences include the fact that children's vocal tracts are smaller than those of adult males and also changing rapidly in size and shape over the course of development, leading to between-talker variability across age groups that dwarfs the between-talker differences between adult men and women. Moreover, children do not achieve fully adult-like speech motor control until they are young adults, and their vocabularies and phonological proficiency are developing as well, leading to considerably more within-talker variability as well as more between-talker variability. The second problem then is to determine what annotation schemas and analysis techniques can most usefully capture relevant aspects of this variability. Indeed, standard acoustic characterizations applied to child speech reveal that adult-centered annotation schemas fail to capture phenomena such as the emergence of covert contrasts in children's developing phonological systems, while also revealing children's nonuniform progression toward community speech norms as they acquire the phonological systems of their native languages. Both problems point to the need for more basic research into the growth and development of the articulatory system (as well as of the lexicon and phonological system) that is oriented explicitly toward the construction of age-appropriate computational models.
An automatic speech recognition system with speaker-independent identification support
NASA Astrophysics Data System (ADS)
Caranica, Alexandru; Burileanu, Corneliu
2015-02-01
The novelty of this work relies on the application of an open source research software toolkit (CMU Sphinx) to train, build and evaluate a speech recognition system, with speaker-independent support, for voice-controlled hardware applications. Moreover, we propose to use the trained acoustic model to successfully decode offline voice commands on embedded hardware, such as an ARMv6 low-cost SoC, Raspberry PI. This type of single-board computer, mainly used for educational and research activities, can serve as a proof-of-concept software and hardware stack for low cost voice automation systems.
Research Directory for Manpower, Personnel, Training, and Human Factors.
1991-01-01
Enhance Automatic Recognition of Speech in Noisy, Highly Stressful Environments Cofod R* Lica Systems Inc 703-359-0996 Smart Contract Preparation...Lab 301-278-2946 Smart Contract Preparation Expediter Frezell T LTCOL Human Engineering Lab 301-278-5998 Impulse Noise Hazard Information Processing R&D
NASA Astrophysics Data System (ADS)
Viswanathan, V. R.; Makhoul, J.; Schwartz, R. M.; Huggins, A. W. F.
1982-04-01
The variable frame rate (VFR) transmission methodology developed, implemented, and tested in the years 1973-1978 for efficiently transmitting linear predictive coding (LPC) vocoder parameters extracted from the input speech at a fixed frame rate is reviewed. With the VFR method, parameters are transmitted only when their values have changed sufficiently over the interval since their preceding transmission. Two distinct approaches to automatic implementation of the VFR method are discussed. The first bases the transmission decisions on comparisons between the parameter values of the present frame and the last transmitted frame. The second, which is based on a functional perceptual model of speech, compares the parameter values of all the frames that lie in the interval between the present frame and the last transmitted frame against a linear model of parameter variation over that interval. Also considered is the application of VFR transmission to the design of narrow-band LPC speech coders with average bit rates of 2000-2400 bts/s.
Sound and speech detection and classification in a Health Smart Home.
Fleury, A; Noury, N; Vacher, M; Glasson, H; Seri, J F
2008-01-01
Improvements in medicine increase life expectancy in the world and create a new bottleneck at the entrance of specialized and equipped institutions. To allow elderly people to stay at home, researchers work on ways to monitor them in their own environment, with non-invasive sensors. To meet this goal, smart homes, equipped with lots of sensors, deliver information on the activities of the person and can help detect distress situations. In this paper, we present a global speech and sound recognition system that can be set-up in a flat. We placed eight microphones in the Health Smart Home of Grenoble (a real living flat of 47m(2)) and we automatically analyze and sort out the different sounds recorded in the flat and the speech uttered (to detect normal or distress french sentences). We introduce the methods for the sound and speech recognition, the post-processing of the data and finally the experimental results obtained in real conditions in the flat.
Study of wavelet packet energy entropy for emotion classification in speech and glottal signals
NASA Astrophysics Data System (ADS)
He, Ling; Lech, Margaret; Zhang, Jing; Ren, Xiaomei; Deng, Lihua
2013-07-01
The automatic speech emotion recognition has important applications in human-machine communication. Majority of current research in this area is focused on finding optimal feature parameters. In recent studies, several glottal features were examined as potential cues for emotion differentiation. In this study, a new type of feature parameter is proposed, which calculates energy entropy on values within selected Wavelet Packet frequency bands. The modeling and classification tasks are conducted using the classical GMM algorithm. The experiments use two data sets: the Speech Under Simulated Emotion (SUSE) data set annotated with three different emotions (angry, neutral and soft) and Berlin Emotional Speech (BES) database annotated with seven different emotions (angry, bored, disgust, fear, happy, sad and neutral). The average classification accuracy achieved for the SUSE data (74%-76%) is significantly higher than the accuracy achieved for the BES data (51%-54%). In both cases, the accuracy was significantly higher than the respective random guessing levels (33% for SUSE and 14.3% for BES).
Bradlow, Ann R.; Kim, Midam; Blasingame, Michael
2017-01-01
Second-language (L2) speech is consistently slower than first-language (L1) speech, and L1 speaking rate varies within- and across-talkers depending on many individual, situational, linguistic, and sociolinguistic factors. It is asked whether speaking rate is also determined by a language-independent talker-specific trait such that, across a group of bilinguals, L1 speaking rate significantly predicts L2 speaking rate. Two measurements of speaking rate were automatically extracted from recordings of read and spontaneous speech by English monolinguals (n = 27) and bilinguals from ten L1 backgrounds (n = 86): speech rate (syllables/second), and articulation rate (syllables/second excluding silent pauses). Replicating prior work, L2 speaking rates were significantly slower than L1 speaking rates both across-groups (monolinguals' L1 English vs bilinguals' L2 English), and across L1 and L2 within bilinguals. Critically, within the bilingual group, L1 speaking rate significantly predicted L2 speaking rate, suggesting that a significant portion of inter-talker variation in L2 speech is derived from inter-talker variation in L1 speech, and that individual variability in L2 spoken language production may be best understood within the context of individual variability in L1 spoken language production. PMID:28253679
Restoring the missing features of the corrupted speech using linear interpolation methods
NASA Astrophysics Data System (ADS)
Rassem, Taha H.; Makbol, Nasrin M.; Hasan, Ali Muttaleb; Zaki, Siti Syazni Mohd; Girija, P. N.
2017-10-01
One of the main challenges in the Automatic Speech Recognition (ASR) is the noise. The performance of the ASR system reduces significantly if the speech is corrupted by noise. In spectrogram representation of a speech signal, after deleting low Signal to Noise Ratio (SNR) elements, the incomplete spectrogram is obtained. In this case, the speech recognizer should make modifications to the spectrogram in order to restore the missing elements, which is one direction. In another direction, speech recognizer should be able to restore the missing elements due to deleting low SNR elements before performing the recognition. This is can be done using different spectrogram reconstruction methods. In this paper, the geometrical spectrogram reconstruction methods suggested by some researchers are implemented as a toolbox. In these geometrical reconstruction methods, the linear interpolation along time or frequency methods are used to predict the missing elements between adjacent observed elements in the spectrogram. Moreover, a new linear interpolation method using time and frequency together is presented. The CMU Sphinx III software is used in the experiments to test the performance of the linear interpolation reconstruction method. The experiments are done under different conditions such as different lengths of the window and different lengths of utterances. Speech corpus consists of 20 males and 20 females; each one has two different utterances are used in the experiments. As a result, 80% recognition accuracy is achieved with 25% SNR ratio.
Liu, Xunying; Zhang, Chao; Woodland, Phil; Fonteneau, Elisabeth
2017-01-01
There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental ‘machine states’, generated as the ASR analysis progresses over time, to the incremental ‘brain states’, measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain. PMID:28945744
NASA Astrophysics Data System (ADS)
Kattoju, Ravi Kiran; Barber, Daniel J.; Abich, Julian; Harris, Jonathan
2016-05-01
With increasing necessity for intuitive Soldier-robot communication in military operations and advancements in interactive technologies, autonomous robots have transitioned from assistance tools to functional and operational teammates able to service an array of military operations. Despite improvements in gesture and speech recognition technologies, their effectiveness in supporting Soldier-robot communication is still uncertain. The purpose of the present study was to evaluate the performance of gesture and speech interface technologies to facilitate Soldier-robot communication during a spatial-navigation task with an autonomous robot. Gesture and speech semantically based spatial-navigation commands leveraged existing lexicons for visual and verbal communication from the U.S Army field manual for visual signaling and a previously established Squad Level Vocabulary (SLV). Speech commands were recorded by a Lapel microphone and Microsoft Kinect, and classified by commercial off-the-shelf automatic speech recognition (ASR) software. Visual signals were captured and classified using a custom wireless gesture glove and software. Participants in the experiment commanded a robot to complete a simulated ISR mission in a scaled down urban scenario by delivering a sequence of gesture and speech commands, both individually and simultaneously, to the robot. Performance and reliability of gesture and speech hardware interfaces and recognition tools were analyzed and reported. Analysis of experimental results demonstrated the employed gesture technology has significant potential for enabling bidirectional Soldier-robot team dialogue based on the high classification accuracy and minimal training required to perform gesture commands.
A real-time phoneme counting algorithm and application for speech rate monitoring.
Aharonson, Vered; Aharonson, Eran; Raichlin-Levi, Katia; Sotzianu, Aviv; Amir, Ofer; Ovadia-Blechman, Zehava
2017-03-01
Adults who stutter can learn to control and improve their speech fluency by modifying their speaking rate. Existing speech therapy technologies can assist this practice by monitoring speaking rate and providing feedback to the patient, but cannot provide an accurate, quantitative measurement of speaking rate. Moreover, most technologies are too complex and costly to be used for home practice. We developed an algorithm and a smartphone application that monitor a patient's speaking rate in real time and provide user-friendly feedback to both patient and therapist. Our speaking rate computation is performed by a phoneme counting algorithm which implements spectral transition measure extraction to estimate phoneme boundaries. The algorithm is implemented in real time in a mobile application that presents its results in a user-friendly interface. The application incorporates two modes: one provides the patient with visual feedback of his/her speech rate for self-practice and another provides the speech therapist with recordings, speech rate analysis and tools to manage the patient's practice. The algorithm's phoneme counting accuracy was validated on ten healthy subjects who read a paragraph at slow, normal and fast paces, and was compared to manual counting of speech experts. Test-retest and intra-counter reliability were assessed. Preliminary results indicate differences of -4% to 11% between automatic and human phoneme counting. Differences were largest for slow speech. The application can thus provide reliable, user-friendly, real-time feedback for speaking rate control practice. Copyright © 2017 Elsevier Inc. All rights reserved.
Interactive segmentation of tongue contours in ultrasound video sequences using quality maps
NASA Astrophysics Data System (ADS)
Ghrenassia, Sarah; Ménard, Lucie; Laporte, Catherine
2014-03-01
Ultrasound (US) imaging is an effective and non invasive way of studying the tongue motions involved in normal and pathological speech, and the results of US studies are of interest for the development of new strategies in speech therapy. State-of-the-art tongue shape analysis techniques based on US images depend on semi-automated tongue segmentation and tracking techniques. Recent work has mostly focused on improving the accuracy of the tracking techniques themselves. However, occasional errors remain inevitable, regardless of the technique used, and the tongue tracking process must thus be supervised by a speech scientist who will correct these errors manually or semi-automatically. This paper proposes an interactive framework to facilitate this process. In this framework, the user is guided towards potentially problematic portions of the US image sequence by a segmentation quality map that is based on the normalized energy of an active contour model and automatically produced during tracking. When a problematic segmentation is identified, corrections to the segmented contour can be made on one image and propagated both forward and backward in the problematic subsequence, thereby improving the user experience. The interactive tools were tested in combination with two different tracking algorithms. Preliminary results illustrate the potential of the proposed framework, suggesting that the proposed framework generally improves user interaction time, with little change in segmentation repeatability.
Modeling the Perceptual Learning of Novel Dialect Features
ERIC Educational Resources Information Center
Tatman, Rachael
2017-01-01
All language use reflects the user's social identity in systematic ways. While humans can easily adapt to this sociolinguistic variation, automatic speech recognition (ASR) systems continue to struggle with it. This dissertation makes three main contributions. The first is to provide evidence that modern state-of-the-art commercial ASR systems…
Morphosyntactic Neural Analysis for Generalized Lexical Normalization
ERIC Educational Resources Information Center
Leeman-Munk, Samuel Paul
2016-01-01
The phenomenal growth of social media, web forums, and online reviews has spurred a growing interest in automated analysis of user-generated text. At the same time, a proliferation of voice recordings and efforts to archive culture heritage documents are fueling demand for effective automatic speech recognition (ASR) and optical character…
Automatic Intention Recognition in Conversation Processing
ERIC Educational Resources Information Center
Holtgraves, Thomas
2008-01-01
A fundamental assumption of many theories of conversation is that comprehension of a speaker's utterance involves recognition of the speaker's intention in producing that remark. However, the nature of intention recognition is not clear. One approach is to conceptualize a speaker's intention in terms of speech acts [Searle, J. (1969). "Speech…
Multilingual Videos for MOOCs and OER
ERIC Educational Resources Information Center
Valor Miró, Juan Daniel; Baquero-Arnal, Pau; Civera, Jorge; Turró, Carlos; Juan, Alfons
2018-01-01
Massive Open Online Courses (MOOCs) and Open Educational Resources (OER) are rapidly growing, but are not usually offered in multiple languages due to the lack of cost-effective solutions to translate the different objects comprising them and particularly videos. However, current state-of-the-art automatic speech recognition (ASR) and machine…
Envelope responses in single-trial EEG indicate attended speaker in a 'cocktail party'.
Horton, Cort; Srinivasan, Ramesh; D'Zmura, Michael
2014-08-01
Recent studies have shown that auditory cortex better encodes the envelope of attended speech than that of unattended speech during multi-speaker ('cocktail party') situations. We investigated whether these differences were sufficiently robust within single-trial electroencephalographic (EEG) data to accurately determine where subjects attended. Additionally, we compared this measure to other established EEG markers of attention. High-resolution EEG was recorded while subjects engaged in a two-speaker 'cocktail party' task. Cortical responses to speech envelopes were extracted by cross-correlating the envelopes with each EEG channel. We also measured steady-state responses (elicited via high-frequency amplitude modulation of the speech) and alpha-band power, both of which have been sensitive to attention in previous studies. Using linear classifiers, we then examined how well each of these features could be used to predict the subjects' side of attention at various epoch lengths. We found that the attended speaker could be determined reliably from the envelope responses calculated from short periods of EEG, with accuracy improving as a function of sample length. Furthermore, envelope responses were far better indicators of attention than changes in either alpha power or steady-state responses. These results suggest that envelope-related signals recorded in EEG data can be used to form robust auditory BCI's that do not require artificial manipulation (e.g., amplitude modulation) of stimuli to function.
Shahamiri, Seyed Reza; Salim, Siti Salwah Binti
2014-09-01
Automatic speech recognition (ASR) can be very helpful for speakers who suffer from dysarthria, a neurological disability that damages the control of motor speech articulators. Although a few attempts have been made to apply ASR technologies to sufferers of dysarthria, previous studies show that such ASR systems have not attained an adequate level of performance. In this study, a dysarthric multi-networks speech recognizer (DM-NSR) model is provided using a realization of multi-views multi-learners approach called multi-nets artificial neural networks, which tolerates variability of dysarthric speech. In particular, the DM-NSR model employs several ANNs (as learners) to approximate the likelihood of ASR vocabulary words and to deal with the complexity of dysarthric speech. The proposed DM-NSR approach was presented as both speaker-dependent and speaker-independent paradigms. In order to highlight the performance of the proposed model over legacy models, multi-views single-learner models of the DM-NSRs were also provided and their efficiencies were compared in detail. Moreover, a comparison among the prominent dysarthric ASR methods and the proposed one is provided. The results show that the DM-NSR recorded improved recognition rate by up to 24.67% and the error rate was reduced by up to 8.63% over the reference model.
Accurate visible speech synthesis based on concatenating variable length motion capture data.
Ma, Jiyong; Cole, Ron; Pellom, Bryan; Ward, Wayne; Wise, Barbara
2006-01-01
We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is desrcribed. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective evaluation and objective evaluation are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
Studies in automatic speech recognition and its application in aerospace
NASA Astrophysics Data System (ADS)
Taylor, Michael Robinson
Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.
Li, X; Yang, Y; Ren, G
2009-06-16
Language is often perceived together with visual information. Recent experimental evidences indicated that, during spoken language comprehension, the brain can immediately integrate visual information with semantic or syntactic information from speech. Here we used the mismatch negativity to further investigate whether prosodic information from speech could be immediately integrated into a visual scene context or not, and especially the time course and automaticity of this integration process. Sixteen Chinese native speakers participated in the study. The materials included Chinese spoken sentences and picture pairs. In the audiovisual situation, relative to the concomitant pictures, the spoken sentence was appropriately accented in the standard stimuli, but inappropriately accented in the two kinds of deviant stimuli. In the purely auditory situation, the speech sentences were presented without pictures. It was found that the deviants evoked mismatch responses in both audiovisual and purely auditory situations; the mismatch negativity in the purely auditory situation peaked at the same time as, but was weaker than that evoked by the same deviant speech sounds in the audiovisual situation. This pattern of results suggested immediate integration of prosodic information from speech and visual information from pictures in the absence of focused attention.
Factors that influence the performance of experienced speech recognition users.
Koester, Heidi Horstmann
2006-01-01
Performance on automatic speech recognition (ASR) systems for users with physical disabilities varies widely between individuals. The goal of this study was to discover some key factors that account for that variation. Using data from 23 experienced ASR users with physical disabilities, the effect of 20 different independent variables on recognition accuracy and text entry rate with ASR was measured using bivariate and multivariate analyses. The results show that use of appropriate correction strategies had the strongest influence on user performance with ASR. The amount of time the user spent on his or her computer, the user's manual typing speed, and the speed with which the ASR system recognized speech were all positively associated with better performance. The amount or perceived adequacy of ASR training did not have a significant impact on performance for this user group.
Effects of a metronome on the filled pauses of fluent speakers.
Christenfeld, N
1996-12-01
Filled pauses (the "ums" and "uhs" that litter spontaneous speech) seem to be a product of the speaker paying deliberate attention to the normally automatic act of talking. This is the same sort of explanation that has been offered for stuttering. In this paper we explore whether a manipulation that has long been known to decrease stuttering, synchronizing speech to the beats of a metronome, will then also decrease filled pauses. Two experiments indicate that a metronome has a dramatic effect on the production of filled pauses. This effect is not due to any simplification or slowing of the speech and supports the view that a metronome causes speakers to attend more to how they are talking and less to what they are saying. It also lends support to the connection between stutters and filled pauses.
Neural Recruitment for the Production of Native and Novel Speech Sounds
Moser, Dana; Fridriksson, Julius; Bonilha, Leonardo; Healy, Eric W.; Baylis, Gordon; Baker, Julie; Rorden, Chris
2010-01-01
Two primary areas of damage have been implicated in apraxia of speech (AOS) based on the time post-stroke: (1) the left inferior frontal gyrus (IFG) in acute patients, and (2) the left anterior insula (aIns) in chronic patients. While AOS is widely characterized as a disorder in motor speech planning, little is known about the specific contributions of each of these regions in speech. The purpose of this study was to investigate cortical activation during speech production with a specific focus on the aIns and the IFG in normal adults. While undergoing sparse fMRI, 30 normal adults completed a 30-minute speech-repetition task consisting of three-syllable nonwords that contained either (a) English (native) syllables or (b) Non-English (novel) syllables. When the novel syllable productions were compared to the native syllable productions, greater neural activation was observed in the aIns and IFG, particularly during the first 10 minutes of the task when novelty was the greatest. Although activation in the aIns remained high throughout the task for novel productions, greater activation was clearly demonstrated when the initial 10 minutes were compared to the final 10 minutes of the task. These results suggest increased activity within an extensive neural network, including the aIns and IFG, when the motor speech system is taxed, such as during the production of novel speech. We speculate that the amount of left aIns recruitment during speech production may be related to the internal construction of the motor speech unit such that the degree of novelty/automaticity would result in more or less demands respectively. The role of the IFG as a storehouse and integrative processor for previously acquired routines is also discussed. PMID:19385020
Multi-sensory learning and learning to read.
Blomert, Leo; Froyen, Dries
2010-09-01
The basis of literacy acquisition in alphabetic orthographies is the learning of the associations between the letters and the corresponding speech sounds. In spite of this primacy in learning to read, there is only scarce knowledge on how this audiovisual integration process works and which mechanisms are involved. Recent electrophysiological studies of letter-speech sound processing have revealed that normally developing readers take years to automate these associations and dyslexic readers hardly exhibit automation of these associations. It is argued that the reason for this effortful learning may reside in the nature of the audiovisual process that is recruited for the integration of in principle arbitrarily linked elements. It is shown that letter-speech sound integration does not resemble the processes involved in the integration of natural audiovisual objects such as audiovisual speech. The automatic symmetrical recruitment of the assumedly uni-sensory visual and auditory cortices in audiovisual speech integration does not occur for letter and speech sound integration. It is also argued that letter-speech sound integration only partly resembles the integration of arbitrarily linked unfamiliar audiovisual objects. Letter-sound integration and artificial audiovisual objects share the necessity of a narrow time window for integration to occur. However, they differ from these artificial objects, because they constitute an integration of partly familiar elements which acquire meaning through the learning of an orthography. Although letter-speech sound pairs share similarities with audiovisual speech processing as well as with unfamiliar, arbitrary objects, it seems that letter-speech sound pairs develop into unique audiovisual objects that furthermore have to be processed in a unique way in order to enable fluent reading and thus very likely recruit other neurobiological learning mechanisms than the ones involved in learning natural or arbitrary unfamiliar audiovisual associations. Copyright 2010 Elsevier B.V. All rights reserved.
Vector Sum Excited Linear Prediction (VSELP) speech coding at 4.8 kbps
NASA Technical Reports Server (NTRS)
Gerson, Ira A.; Jasiuk, Mark A.
1990-01-01
Code Excited Linear Prediction (CELP) speech coders exhibit good performance at data rates as low as 4800 bps. The major drawback to CELP type coders is their larger computational requirements. The Vector Sum Excited Linear Prediction (VSELP) speech coder utilizes a codebook with a structure which allows for a very efficient search procedure. Other advantages of the VSELP codebook structure is discussed and a detailed description of a 4.8 kbps VSELP coder is given. This coder is an improved version of the VSELP algorithm, which finished first in the NSA's evaluation of the 4.8 kbps speech coders. The coder uses a subsample resolution single tap long term predictor, a single VSELP excitation codebook, a novel gain quantizer which is robust to channel errors, and a new adaptive pre/postfilter arrangement.
Synchronized and noise-robust audio recordings during realtime magnetic resonance imaging scans.
Bresch, Erik; Nielsen, Jon; Nayak, Krishna; Narayanan, Shrikanth
2006-10-01
This letter describes a data acquisition setup for recording, and processing, running speech from a person in a magnetic resonance imaging (MRI) scanner. The main focus is on ensuring synchronicity between image and audio acquisition, and in obtaining good signal to noise ratio to facilitate further speech analysis and modeling. A field-programmable gate array based hardware design for synchronizing the scanner image acquisition to other external data such as audio is described. The audio setup itself features two fiber optical microphones and a noise-canceling filter. Two noise cancellation methods are described including a novel approach using a pulse sequence specific model of the gradient noise of the MRI scanner. The setup is useful for scientific speech production studies. Sample results of speech and singing data acquired and processed using the proposed method are given.
Synchronized and noise-robust audio recordings during realtime magnetic resonance imaging scans (L)
Bresch, Erik; Nielsen, Jon; Nayak, Krishna; Narayanan, Shrikanth
2007-01-01
This letter describes a data acquisition setup for recording, and processing, running speech from a person in a magnetic resonance imaging (MRI) scanner. The main focus is on ensuring synchronicity between image and audio acquisition, and in obtaining good signal to noise ratio to facilitate further speech analysis and modeling. A field-programmable gate array based hardware design for synchronizing the scanner image acquisition to other external data such as audio is described. The audio setup itself features two fiber optical microphones and a noise-canceling filter. Two noise cancellation methods are described including a novel approach using a pulse sequence specific model of the gradient noise of the MRI scanner. The setup is useful for scientific speech production studies. Sample results of speech and singing data acquired and processed using the proposed method are given. PMID:17069275
Suppression of competing speech through entrainment of cortical oscillations
D'Zmura, Michael; Srinivasan, Ramesh
2013-01-01
People are highly skilled at attending to one speaker in the presence of competitors, but the neural mechanisms supporting this remain unclear. Recent studies have argued that the auditory system enhances the gain of a speech stream relative to competitors by entraining (or “phase-locking”) to the rhythmic structure in its acoustic envelope, thus ensuring that syllables arrive during periods of high neuronal excitability. We hypothesized that such a mechanism could also suppress a competing speech stream by ensuring that syllables arrive during periods of low neuronal excitability. To test this, we analyzed high-density EEG recorded from human adults while they attended to one of two competing, naturalistic speech streams. By calculating the cross-correlation between the EEG channels and the speech envelopes, we found evidence of entrainment to the attended speech's acoustic envelope as well as weaker yet significant entrainment to the unattended speech's envelope. An independent component analysis (ICA) decomposition of the data revealed sources in the posterior temporal cortices that displayed robust correlations to both the attended and unattended envelopes. Critically, in these components the signs of the correlations when attended were opposite those when unattended, consistent with the hypothesized entrainment-based suppressive mechanism. PMID:23515789
Messaoud-Galusi, Souhila; Hazan, Valerie; Rosen, Stuart
2012-01-01
Purpose The claim that speech perception abilities are impaired in dyslexia was investigated in a group of 62 dyslexic children and 51 average readers matched in age. Method To test whether there was robust evidence of speech perception deficits in children with dyslexia, speech perception in noise and quiet was measured using eight different tasks involving the identification and discrimination of a complex and highly natural synthetic ‘pea’-‘bee’ contrast (copy synthesised from natural models) and the perception of naturally-produced words. Results Children with dyslexia, on average, performed more poorly than average readers in the synthetic syllables identification task in quiet and in across-category discrimination (but not when tested using an adaptive procedure). They did not differ from average readers on two tasks of word recognition in noise or identification of synthetic syllables in noise. For all tasks, a majority of individual children with dyslexia performed within norms. Finally, speech perception generally did not correlate with pseudo-word reading or phonological processing, the core skills related to dyslexia. Conclusions On the tasks and speech stimuli we used, most children with dyslexia do not appear to show a consistent deficit in speech perception. PMID:21930615
Robust relationship between reading span and speech recognition in noise
Souza, Pamela; Arehart, Kathryn
2015-01-01
Objective Working memory refers to a cognitive system that manages information processing and temporary storage. Recent work has demonstrated that individual differences in working memory capacity measured using a reading span task are related to ability to recognize speech in noise. In this project, we investigated whether the specific implementation of the reading span task influenced the strength of the relationship between working memory capacity and speech recognition. Design The relationship between speech recognition and working memory capacity was examined for two different working memory tests that varied in approach, using a within-subject design. Data consisted of audiometric results along with the two different working memory tests; one speech-in-noise test; and a reading comprehension test. Study sample The test group included 94 older adults with varying hearing loss and 30 younger adults with normal hearing. Results Listeners with poorer working memory capacity had more difficulty understanding speech in noise after accounting for age and degree of hearing loss. That relationship did not differ significantly between the two different implementations of reading span. Conclusions Our findings suggest that different implementations of a verbal reading span task do not affect the strength of the relationship between working memory capacity and speech recognition. PMID:25975360
Robust relationship between reading span and speech recognition in noise.
Souza, Pamela; Arehart, Kathryn
2015-01-01
Working memory refers to a cognitive system that manages information processing and temporary storage. Recent work has demonstrated that individual differences in working memory capacity measured using a reading span task are related to ability to recognize speech in noise. In this project, we investigated whether the specific implementation of the reading span task influenced the strength of the relationship between working memory capacity and speech recognition. The relationship between speech recognition and working memory capacity was examined for two different working memory tests that varied in approach, using a within-subject design. Data consisted of audiometric results along with the two different working memory tests; one speech-in-noise test; and a reading comprehension test. The test group included 94 older adults with varying hearing loss and 30 younger adults with normal hearing. Listeners with poorer working memory capacity had more difficulty understanding speech in noise after accounting for age and degree of hearing loss. That relationship did not differ significantly between the two different implementations of reading span. Our findings suggest that different implementations of a verbal reading span task do not affect the strength of the relationship between working memory capacity and speech recognition.
Automated Electroglottographic Inflection Events Detection. A Pilot Study.
Codino, Juliana; Torres, María Eugenia; Rubin, Adam; Jackson-Menaldi, Cristina
2016-11-01
Vocal-fold vibration can be analyzed in a noninvasive way by registering impedance changes within the glottis, through electroglottography. The morphology of the electroglottographic (EGG) signal is related to different vibratory patterns. In the literature, a characteristic knee in the descending portion of the signal has been reported. Some EGG signals do not exhibit this particular knee and have other types of events (inflection events) throughout the ascending and/or descending portion of the vibratory cycle. The goal of this work is to propose an automatic method to identify and classify these events. A computational algorithm was developed based on the mathematical properties of the EGG signal, which detects and reports events throughout the contact phase. Retrospective analysis of EGG signals obtained during routine voice evaluation of adult individuals with a variety of voice disorders was performed using the algorithm as well as human raters. Two judges, both experts in clinical voice analysis, and three general speech pathologists performed manual and visual evaluation of the sample set. The results obtained by the automatic method were compared with those of the human raters. Statistical analysis revealed a significant level of agreement. This automatic tool could allow professionals in the clinical setting to obtain an automatic quantitative and qualitative report of such events present in a voice sample, without having to manually analyze the whole EGG signal. In addition, it might provide the speech pathologist with more information that would complement the standard voice evaluation. It could also be a valuable tool in voice research. Copyright © 2016 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
[A research in speech endpoint detection based on boxes-coupling generalization dimension].
Wang, Zimei; Yang, Cuirong; Wu, Wei; Fan, Yingle
2008-06-01
In this paper, a new calculating method of generalized dimension, based on boxes-coupling principle, is proposed to overcome the edge effects and to improve the capability of the speech endpoint detection which is based on the original calculating method of generalized dimension. This new method has been applied to speech endpoint detection. Firstly, the length of overlapping border was determined, and through calculating the generalized dimension by covering the speech signal with overlapped boxes, three-dimension feature vectors including the box dimension, the information dimension and the correlation dimension were obtained. Secondly, in the light of the relation between feature distance and similarity degree, feature extraction was conducted by use of common distance. Lastly, bi-threshold method was used to classify the speech signals. The results of experiment indicated that, by comparison with the original generalized dimension (OGD) and the spectral entropy (SE) algorithm, the proposed method is more robust and effective for detecting the speech signals which contain different kinds of noise in different signal noise ratio (SNR), especially in low SNR.
Masking effects of speech and music: does the masker's hierarchical structure matter?
Shi, Lu-Feng; Law, Yvonne
2010-04-01
Speech and music are time-varying signals organized by parallel hierarchical rules. Through a series of four experiments, this study compared the masking effects of single-talker speech and instrumental music on speech perception while manipulating the complexity of hierarchical and temporal structures of the maskers. Listeners' word recognition was found to be similar between hierarchically intact and disrupted speech or classical music maskers (Experiment 1). When sentences served as the signal, significantly greater masking effects were observed with disrupted than intact speech or classical music maskers (Experiment 2), although not with jazz or serial music maskers, which differed from the classical music masker in their hierarchical structures (Experiment 3). Removing the classical music masker's temporal dynamics or partially restoring it affected listeners' sentence recognition; yet, differences in performance between intact and disrupted maskers remained robust (Experiment 4). Hence, the effect of structural expectancy was largely present across maskers when comparing them before and after their hierarchical structure was purposefully disrupted. This effect seemed to lend support to the auditory stream segregation theory.
Inferring imagined speech using EEG signals: a new approach using Riemannian manifold features
NASA Astrophysics Data System (ADS)
Nguyen, Chuong H.; Karavas, George K.; Artemiadis, Panagiotis
2018-02-01
Objective. In this paper, we investigate the suitability of imagined speech for brain-computer interface (BCI) applications. Approach. A novel method based on covariance matrix descriptors, which lie in Riemannian manifold, and the relevance vector machines classifier is proposed. The method is applied on electroencephalographic (EEG) signals and tested in multiple subjects. Main results. The method is shown to outperform other approaches in the field with respect to accuracy and robustness. The algorithm is validated on various categories of speech, such as imagined pronunciation of vowels, short words and long words. The classification accuracy of our methodology is in all cases significantly above chance level, reaching a maximum of 70% for cases where we classify three words and 95% for cases of two words. Significance. The results reveal certain aspects that may affect the success of speech imagery classification from EEG signals, such as sound, meaning and word complexity. This can potentially extend the capability of utilizing speech imagery in future BCI applications. The dataset of speech imagery collected from total 15 subjects is also published.
Robust recognition of loud and Lombard speech in the fighter cockpit environment
NASA Astrophysics Data System (ADS)
Stanton, Bill J., Jr.
1988-08-01
There are a number of challenges associated with incorporating speech recognition technology into the fighter cockpit. One of the major problems is the wide range of variability in the pilot's voice. That can result from changing levels of stress and workload. Increasing the training set to include abnormal speech is not an attractive option because of the innumerable conditions that would have to be represented and the inordinate amount of time to collect such a training set. A more promising approach is to study subsets of abnormal speech that have been produced under controlled cockpit conditions with the purpose of characterizing reliable shifts that occur relative to normal speech. Such was the initiative of this research. Analyses were conducted for 18 features on 17671 phoneme tokens across eight speakers for normal, loud, and Lombard speech. It was discovered that there was a consistent migration of energy in the sonorants. This discovery of reliable energy shifts led to the development of a method to reduce or eliminate these shifts in the Euclidean distances between LPC log magnitude spectra. This combination significantly improved recognition performance of loud and Lombard speech. Discrepancies in recognition error rates between normal and abnormal speech were reduced by approximately 50 percent for all eight speakers combined.
Significance of parametric spectral ratio methods in detection and recognition of whispered speech
NASA Astrophysics Data System (ADS)
Mathur, Arpit; Reddy, Shankar M.; Hegde, Rajesh M.
2012-12-01
In this article the significance of a new parametric spectral ratio method that can be used to detect whispered speech segments within normally phonated speech is described. Adaptation methods based on the maximum likelihood linear regression (MLLR) are then used to realize a mismatched train-test style speech recognition system. This proposed parametric spectral ratio method computes a ratio spectrum of the linear prediction (LP) and the minimum variance distortion-less response (MVDR) methods. The smoothed ratio spectrum is then used to detect whispered segments of speech within neutral speech segments effectively. The proposed LP-MVDR ratio method exhibits robustness at different SNRs as indicated by the whisper diarization experiments conducted on the CHAINS and the cell phone whispered speech corpus. The proposed method also performs reasonably better than the conventional methods for whisper detection. In order to integrate the proposed whisper detection method into a conventional speech recognition engine with minimal changes, adaptation methods based on the MLLR are used herein. The hidden Markov models corresponding to neutral mode speech are adapted to the whispered mode speech data in the whispered regions as detected by the proposed ratio method. The performance of this method is first evaluated on whispered speech data from the CHAINS corpus. The second set of experiments are conducted on the cell phone corpus of whispered speech. This corpus is collected using a set up that is used commercially for handling public transactions. The proposed whisper speech recognition system exhibits reasonably better performance when compared to several conventional methods. The results shown indicate the possibility of a whispered speech recognition system for cell phone based transactions.
López-Linares, Karen; Aranjuelo, Nerea; Kabongo, Luis; Maclair, Gregory; Lete, Nerea; Ceresa, Mario; García-Familiar, Ainhoa; Macía, Iván; González Ballester, Miguel A
2018-05-01
Computerized Tomography Angiography (CTA) based follow-up of Abdominal Aortic Aneurysms (AAA) treated with Endovascular Aneurysm Repair (EVAR) is essential to evaluate the progress of the patient and detect complications. In this context, accurate quantification of post-operative thrombus volume is required. However, a proper evaluation is hindered by the lack of automatic, robust and reproducible thrombus segmentation algorithms. We propose a new fully automatic approach based on Deep Convolutional Neural Networks (DCNN) for robust and reproducible thrombus region of interest detection and subsequent fine thrombus segmentation. The DetecNet detection network is adapted to perform region of interest extraction from a complete CTA and a new segmentation network architecture, based on Fully Convolutional Networks and a Holistically-Nested Edge Detection Network, is presented. These networks are trained, validated and tested in 13 post-operative CTA volumes of different patients using a 4-fold cross-validation approach to provide more robustness to the results. Our pipeline achieves a Dice score of more than 82% for post-operative thrombus segmentation and provides a mean relative volume difference between ground truth and automatic segmentation that lays within the experienced human observer variance without the need of human intervention in most common cases. Copyright © 2018 Elsevier B.V. All rights reserved.
Albà, Xènia; Figueras I Ventura, Rosa M; Lekadir, Karim; Tobon-Gomez, Catalina; Hoogendoorn, Corné; Frangi, Alejandro F
2014-12-01
Magnetic resonance imaging (MRI), specifically late-enhanced MRI, is the standard clinical imaging protocol to assess cardiac viability. Segmentation of myocardial walls is a prerequisite for this assessment. Automatic and robust multisequence segmentation is required to support processing massive quantities of data. A generic rule-based framework to automatically segment the left ventricle myocardium is presented here. We use intensity information, and include shape and interslice smoothness constraints, providing robustness to subject- and study-specific changes. Our automatic initialization considers the geometrical and appearance properties of the left ventricle, as well as interslice information. The segmentation algorithm uses a decoupled, modified graph cut approach with control points, providing a good balance between flexibility and robustness. The method was evaluated on late-enhanced MRI images from a 20-patient in-house database, and on cine-MRI images from a 15-patient open access database, both using as reference manually delineated contours. Segmentation agreement, measured using the Dice coefficient, was 0.81±0.05 and 0.92±0.04 for late-enhanced MRI and cine-MRI, respectively. The method was also compared favorably to a three-dimensional Active Shape Model approach. The experimental validation with two magnetic resonance sequences demonstrates increased accuracy and versatility. © 2013 Wiley Periodicals, Inc.
A hardware experimental platform for neural circuits in the auditory cortex
NASA Astrophysics Data System (ADS)
Rodellar-Biarge, Victoria; García-Dominguez, Pablo; Ruiz-Rizaldos, Yago; Gómez-Vilda, Pedro
2011-05-01
Speech processing in the human brain is a very complex process far from being fully understood although much progress has been done recently. Neuromorphic Speech Processing is a new research orientation in bio-inspired systems approach to find solutions to automatic treatment of specific problems (recognition, synthesis, segmentation, diarization, etc) which can not be adequately solved using classical algorithms. In this paper a neuromorphic speech processing architecture is presented. The systematic bottom-up synthesis of layered structures reproduce the dynamic feature detection of speech related to plausible neural circuits which work as interpretation centres located in the Auditory Cortex. The elementary model is based on Hebbian neuron-like units. For the computation of the architecture a flexible framework is proposed in the environment of Matlab®/Simulink®/HDL, which allows building models in different description styles, complexity and implementation levels. It provides a flexible platform for experimenting on the influence of the number of neurons and interconnections, in the precision of the results and in performance evaluation. The experimentation with different architecture configurations may help both in better understanding how neural circuits may work in the brain as well as in how speech processing can benefit from this understanding.
Computational validation of the motor contribution to speech perception.
Badino, Leonardo; D'Ausilio, Alessandro; Fadiga, Luciano; Metta, Giorgio
2014-07-01
Action perception and recognition are core abilities fundamental for human social interaction. A parieto-frontal network (the mirror neuron system) matches visually presented biological motion information onto observers' motor representations. This process of matching the actions of others onto our own sensorimotor repertoire is thought to be important for action recognition, providing a non-mediated "motor perception" based on a bidirectional flow of information along the mirror parieto-frontal circuits. State-of-the-art machine learning strategies for hand action identification have shown better performances when sensorimotor data, as opposed to visual information only, are available during learning. As speech is a particular type of action (with acoustic targets), it is expected to activate a mirror neuron mechanism. Indeed, in speech perception, motor centers have been shown to be causally involved in the discrimination of speech sounds. In this paper, we review recent neurophysiological and machine learning-based studies showing (a) the specific contribution of the motor system to speech perception and (b) that automatic phone recognition is significantly improved when motor data are used during training of classifiers (as opposed to learning from purely auditory data). Copyright © 2014 Cognitive Science Society, Inc.
Keller, M David; Ziriax, John M; Barns, William; Sheffield, Benjamin; Brungart, Douglas; Thomas, Tony; Jaeger, Bobby; Yankaskas, Kurt
2017-06-01
Noise, hearing loss, and electronic signal distortion, which are common problems in military environments, can impair speech intelligibility and thereby jeopardize mission success. The current study investigated the impact that impaired communication has on operational performance in a command and control environment by parametrically degrading speech intelligibility in a simulated shipborne Combat Information Center. Experienced U.S. Navy personnel served as the study participants and were required to monitor information from multiple sources and respond appropriately to communications initiated by investigators playing the roles of other personnel involved in a realistic Naval scenario. In each block of the scenario, an adaptive intelligibility modification system employing automatic gain control was used to adjust the signal-to-noise ratio to achieve one of four speech intelligibility levels on a Modified Rhyme Test: No Loss, 80%, 60%, or 40%. Objective and subjective measures of operational performance suggested that performance systematically degraded with decreasing speech intelligibility, with the largest drop occurring between 80% and 60%. These results confirm the importance of noise reduction, good communication design, and effective hearing conservation programs to maximize the operational effectiveness of military personnel. Published by Elsevier B.V.
Development of coffee maker service robot using speech and face recognition systems using POMDP
NASA Astrophysics Data System (ADS)
Budiharto, Widodo; Meiliana; Santoso Gunawan, Alexander Agung
2016-07-01
There are many development of intelligent service robot in order to interact with user naturally. This purpose can be done by embedding speech and face recognition ability on specific tasks to the robot. In this research, we would like to propose Intelligent Coffee Maker Robot which the speech recognition is based on Indonesian language and powered by statistical dialogue systems. This kind of robot can be used in the office, supermarket or restaurant. In our scenario, robot will recognize user's face and then accept commands from the user to do an action, specifically in making a coffee. Based on our previous work, the accuracy for speech recognition is about 86% and face recognition is about 93% in laboratory experiments. The main problem in here is to know the intention of user about how sweetness of the coffee. The intelligent coffee maker robot should conclude the user intention through conversation under unreliable automatic speech in noisy environment. In this paper, this spoken dialog problem is treated as a partially observable Markov decision process (POMDP). We describe how this formulation establish a promising framework by empirical results. The dialog simulations are presented which demonstrate significant quantitative outcome.
Imitation of contrastive lexical stress in children with speech delay
NASA Astrophysics Data System (ADS)
Vick, Jennell C.; Moore, Christopher A.
2005-09-01
This study examined the relationship between acoustic correlates of stress in trochaic (strong-weak), spondaic (strong-strong), and iambic (weak-strong) nonword bisyllables produced by children (30-50) with normal speech acquisition and children with speech delay. Ratios comparing the acoustic measures (vowel duration, rms, and f0) of the first syllable to the second syllable were calculated to evaluate the extent to which each phonetic parameter was used to mark stress. In addition, a calculation of the variability of jaw movement in each bisyllable was made. Finally, perceptual judgments of accuracy of stress production were made. Analysis of perceptual judgments indicated a robust difference between groups: While both groups of children produced errors in imitating the contrastive lexical stress models (~40%), the children with normal speech acquisition tended to produce trochaic forms in substitution for other stress types, whereas children with speech delay showed no preference for trochees. The relationship between segmental acoustic parameters, kinematic variability, and the ratings of stress by trained listeners will be presented.
NASA Astrophysics Data System (ADS)
Selouani, Sid-Ahmed; O'Shaughnessy, Douglas
2003-12-01
Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axis issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms that of the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to[InlineEquation not available: see fulltext.] dB. We also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.
2015-10-01
Scoring, Gaussian Backend , etc.) as shown in Fig. 39. The methods in this domain also emphasized the ability to perform data purification for both...investigation using the same infrastructure was undertaken to explore Lombard effect “flavor” detection for improved speaker ID. The study The presence of...dimension selection and compared to a common N-gram frequency based selection. 2.1.2: Exploration on NN/DBN backend : Since Deep Neural Networks (DNN) have
A Voice Enabled Procedure Browser for the International Space Station
NASA Technical Reports Server (NTRS)
Rayner, Manny; Chatzichrisafis, Nikos; Hockey, Beth Ann; Farrell, Kim; Renders, Jean-Michel
2005-01-01
Clarissa, an experimental voice enabled procedure browser that has recently been deployed on the International Space Station (ISS), is to the best of our knowledge the first spoken dialog system in space. This paper gives background on the system and the ISS procedures, then discusses the research developed to address three key problems: grammar-based speech recognition using the Regulus toolkit; SVM based methods for open microphone speech recognition; and robust side-effect free dialogue management for handling undos, corrections and confirmations.
The Timing and Effort of Lexical Access in Natural and Degraded Speech
Wagner, Anita E.; Toffanin, Paolo; Başkent, Deniz
2016-01-01
Understanding speech is effortless in ideal situations, and although adverse conditions, such as caused by hearing impairment, often render it an effortful task, they do not necessarily suspend speech comprehension. A prime example of this is speech perception by cochlear implant users, whose hearing prostheses transmit speech as a significantly degraded signal. It is yet unknown how mechanisms of speech processing deal with such degraded signals, and whether they are affected by effortful processing of speech. This paper compares the automatic process of lexical competition between natural and degraded speech, and combines gaze fixations, which capture the course of lexical disambiguation, with pupillometry, which quantifies the mental effort involved in processing speech. Listeners’ ocular responses were recorded during disambiguation of lexical embeddings with matching and mismatching durational cues. Durational cues were selected due to their substantial role in listeners’ quick limitation of the number of lexical candidates for lexical access in natural speech. Results showed that lexical competition increased mental effort in processing natural stimuli in particular in presence of mismatching cues. Signal degradation reduced listeners’ ability to quickly integrate durational cues in lexical selection, and delayed and prolonged lexical competition. The effort of processing degraded speech was increased overall, and because it had its sources at the pre-lexical level this effect can be attributed to listening to degraded speech rather than to lexical disambiguation. In sum, the course of lexical competition was largely comparable for natural and degraded speech, but showed crucial shifts in timing, and different sources of increased mental effort. We argue that well-timed progress of information from sensory to pre-lexical and lexical stages of processing, which is the result of perceptual adaptation during speech development, is the reason why in ideal situations speech is perceived as an undemanding task. Degradation of the signal or the receiver channel can quickly bring this well-adjusted timing out of balance and lead to increase in mental effort. Incomplete and effortful processing at the early pre-lexical stages has its consequences on lexical processing as it adds uncertainty to the forming and revising of lexical hypotheses. PMID:27065901
A robust automatic phase correction method for signal dense spectra
NASA Astrophysics Data System (ADS)
Bao, Qingjia; Feng, Jiwen; Chen, Li; Chen, Fang; Liu, Zao; Jiang, Bin; Liu, Chaoyang
2013-09-01
A robust automatic phase correction method for Nuclear Magnetic Resonance (NMR) spectra is presented. In this work, a new strategy combining ‘coarse tuning' with ‘fine tuning' is introduced to correct various spectra accurately. In the ‘coarse tuning' procedure, a new robust baseline recognition method is proposed for determining the positions of the tail ends of the peaks, and then the preliminary phased spectra are obtained by minimizing the objective function based on the height difference of these tail ends. After the ‘coarse tuning', the peaks in the preliminary corrected spectra can be categorized into three classes: positive, negative, and distorted. Based on the classification result, a new custom negative penalty function used in the step of ‘fine tuning' is constructed to avoid the negative peak points in the spectra excluded in the negative peaks and distorted peaks. Finally, the fine phased spectra can be obtained by minimizing the custom negative penalty function. This method is proven to be very robust for it is tolerant to low signal-to-noise ratio, large baseline distortion and independent of the starting search points of phasing parameters. The experimental results on both 1D metabonomics spectra with over-crowded peaks and 2D spectra demonstrate the high efficiency of this automatic method.
Revis, J; Galant, C; Fredouille, C; Ghio, A; Giovanni, A
2012-01-01
Widely studied in terms of perception, acoustics or aerodynamics, dysphonia stays nevertheless a speech phenomenon, closely related to the phonetic composition of the message conveyed by the voice. In this paper, we present a series of three works with the aim to understand the implications of the phonetic manifestation of dysphonia. Our first study proposes a new approach to the perceptual analysis of dysphonia (the phonetic labeling), which principle is to listen and evaluate each phoneme in a sentence separately. This study confirms the hypothesis of Laver that the dysphonia is not a constant noise added to the speech signal, but a discontinuous phenomenon, occurring on certain phonemes, based on the phonetic context. However, the burden of executing the task has led us to look to the techniques of automatic speaker recognition (ASR) to automate the procedure. With the collaboration of the LIA, we have developed a system for automatic classification of dysphonia from the techniques of ASR. This is the subject of our second study. The first results obtained with this system suggest that the unvoiced consonants show predominant performance in the task of automatic classification of dysphonia. This result is surprising since it is often assumed that dysphonia occurs only on laryngeal vibration. We started looking for explanations of this phenomenon and we present our assumptions and experiences in the third work we present.
NASA Technical Reports Server (NTRS)
1979-01-01
The pilot's perception and performance in flight simulators is examined. The areas investigated include: vestibular stimulation, flight management and man cockpit information interfacing, and visual perception in flight simulation. The effects of higher levels of rotary acceleration on response time to constant acceleration, tracking performance, and thresholds for angular acceleration are examined. Areas of flight management examined are cockpit display of traffic information, work load, synthetic speech call outs during the landing phase of flight, perceptual factors in the use of a microwave landing system, automatic speech recognition, automation of aircraft operation, and total simulation of flight training.
Kouider, Sid; Dupoux, Emmanuel
2005-08-01
We present a novel subliminal priming technique that operates in the auditory modality. Masking is achieved by hiding a spoken word within a stream of time-compressed speechlike sounds with similar spectral characteristics. Participants were unable to consciously identify the hidden words, yet reliable repetition priming was found. This effect was unaffected by a change in the speaker's voice and remained restricted to lexical processing. The results show that the speech modality, like the written modality, involves the automatic extraction of abstract word-form representations that do not include nonlinguistic details. In both cases, priming operates at the level of discrete and abstract lexical entries and is little influenced by overlap in form or semantics.
Robust automatic measurement of 3D scanned models for the human body fat estimation.
Giachetti, Andrea; Lovato, Christian; Piscitelli, Francesco; Milanese, Chiara; Zancanaro, Carlo
2015-03-01
In this paper, we present an automatic tool for estimating geometrical parameters from 3-D human scans independent on pose and robustly against the topological noise. It is based on an automatic segmentation of body parts exploiting curve skeleton processing and ad hoc heuristics able to remove problems due to different acquisition poses and body types. The software is able to locate body trunk and limbs, detect their directions, and compute parameters like volumes, areas, girths, and lengths. Experimental results demonstrate that measurements provided by our system on 3-D body scans of normal and overweight subjects acquired in different poses are highly correlated with the body fat estimates obtained on the same subjects with dual-energy X-rays absorptiometry (DXA) scanning. In particular, maximal lengths and girths, not requiring precise localization of anatomical landmarks, demonstrate a good correlation (up to 96%) with the body fat and trunk fat. Regression models based on our automatic measurements can be used to predict body fat values reasonably well.
The Roles of Suprasegmental Features in Predicting English Oral Proficiency with an Automated System
ERIC Educational Resources Information Center
Kang, Okim; Johnson, David
2018-01-01
Suprasegmental features have received growing attention in the field of oral assessment. In this article we describe a set of computer algorithms that automatically scores the oral proficiency of non-native speakers using unconstrained English speech. The algorithms employ machine learning and 11 suprasegmental measures divided into four groups…
Spoken Grammar Practice and Feedback in an ASR-Based CALL System
ERIC Educational Resources Information Center
de Vries, Bart Penning; Cucchiarini, Catia; Bodnar, Stephen; Strik, Helmer; van Hout, Roeland
2015-01-01
Speaking practice is important for learners of a second language. Computer assisted language learning (CALL) systems can provide attractive opportunities for speaking practice when combined with automatic speech recognition (ASR) technology. In this paper, we present a CALL system that offers spoken practice of word order, an important aspect of…
Effectiveness of Feedback for Enhancing English Pronunciation in an ASR-Based CALL System
ERIC Educational Resources Information Center
Wang, Y.-H.; Young, S. S.-C.
2015-01-01
This paper presents a study on implementing the ASR-based CALL (computer-assisted language learning based upon automatic speech recognition) system embedded with both formative and summative feedback approaches and using implicit and explicit strategies to enhance adult and young learners' English pronunciation. Two groups of learners including 18…
The Influence of Anticipation of Word Misrecognition on the Likelihood of Stuttering
ERIC Educational Resources Information Center
Brocklehurst, Paul H.; Lickley, Robin J.; Corley, Martin
2012-01-01
This study investigates whether the experience of stuttering can result from the speaker's anticipation of his words being misrecognized. Twelve adults who stutter (AWS) repeated single words into what appeared to be an automatic speech-recognition system. Following each iteration of each word, participants provided a self-rating of whether they…
Exploration of Acoustic Features for Automatic Vowel Discrimination in Spontaneous Speech
ERIC Educational Resources Information Center
Tyson, Na'im R.
2012-01-01
In an attempt to understand what acoustic/auditory feature sets motivated transcribers towards certain labeling decisions, I built machine learning models that were capable of discriminating between canonical and non-canonical vowels excised from the Buckeye Corpus. Specifically, I wanted to model when the dictionary form and the transcribed-form…
Vieira, Manuel; Fonseca, Paulo J; Amorim, M Clara P; Teixeira, Carlos J C
2015-12-01
The study of acoustic communication in animals often requires not only the recognition of species specific acoustic signals but also the identification of individual subjects, all in a complex acoustic background. Moreover, when very long recordings are to be analyzed, automatic recognition and identification processes are invaluable tools to extract the relevant biological information. A pattern recognition methodology based on hidden Markov models is presented inspired by successful results obtained in the most widely known and complex acoustical communication signal: human speech. This methodology was applied here for the first time to the detection and recognition of fish acoustic signals, specifically in a stream of round-the-clock recordings of Lusitanian toadfish (Halobatrachus didactylus) in their natural estuarine habitat. The results show that this methodology is able not only to detect the mating sounds (boatwhistles) but also to identify individual male toadfish, reaching an identification rate of ca. 95%. Moreover this method also proved to be a powerful tool to assess signal durations in large data sets. However, the system failed in recognizing other sound types.
The influence of tone inventory on ERP without focal attention: a cross-language study.
Zheng, Hong-Ying; Peng, Gang; Chen, Jian-Yong; Zhang, Caicai; Minett, James W; Wang, William S-Y
2014-01-01
This study investigates the effect of tone inventories on brain activities underlying pitch without focal attention. We find that the electrophysiological responses to across-category stimuli are larger than those to within-category stimuli when the pitch contours are superimposed on nonspeech stimuli; however, there is no electrophysiological response difference associated with category status in speech stimuli. Moreover, this category effect in nonspeech stimuli is stronger for Cantonese speakers. Results of previous and present studies lead us to conclude that brain activities to the same native lexical tone contrasts are modulated by speakers' language experiences not only in active phonological processing but also in automatic feature detection without focal attention. In contrast to the condition with focal attention, where phonological processing is stronger for speech stimuli, the feature detection (pitch contours in this study) without focal attention as shaped by language background is superior in relatively regular stimuli, that is, the nonspeech stimuli. The results suggest that Cantonese listeners outperform Mandarin listeners in automatic detection of pitch features because of the denser Cantonese tone system.
Hisagi, Miwako; Shafer, Valerie L; Miyagawa, Shigeru; Kotek, Hadas; Sugawara, Ayaka; Pantazis, Dimitrios
2016-12-01
We examined discrimination of a second-language (L2) vowel duration contrast in English learners of Japanese (JP) with different amounts of experience using the magnetoencephalography mismatch field (MMF) component. Twelve L2 learners were tested before and after a second semester of college-level JP; half attended a regular rate course and half an accelerated course with more hours per week. Results showed no significant change in MMF for either the regular or accelerated learning group from beginning to end of the course. We also compared these groups against nine L2 learners who had completed four semesters of college-level JP. These 4-semester learners did not significantly differ from 2-semester learners, in that only a difference in hemisphere activation (interacting with time) between the two groups approached significance. These findings suggest that targeted training of L2 phonology may be necessary to allow for changes in processing of L2 speech contrasts at an early, automatic level. Copyright © 2016 Elsevier B.V. All rights reserved.
Zendel, Benjamin Rich; Tremblay, Charles-David; Belleville, Sylvie; Peretz, Isabelle
2015-05-01
Musicians have enhanced auditory processing abilities. In some studies, these abilities are paralleled by an improved understanding of speech in noisy environments, partially due to more robust encoding of speech signals in noise at the level of the brainstem. Little is known about the impact of musicianship on attention-dependent cortical activity related to lexical access during a speech-in-noise task. To address this issue, we presented musicians and nonmusicians with single words mixed with three levels of background noise, across two conditions, while monitoring electrical brain activity. In the active condition, listeners repeated the words aloud, and in the passive condition, they ignored the words and watched a silent film. When background noise was most intense, musicians repeated more words correctly compared with nonmusicians. Auditory evoked responses were attenuated and delayed with the addition of background noise. In musicians, P1 amplitude was marginally enhanced during active listening and was related to task performance in the most difficult listening condition. By comparing ERPs from the active and passive conditions, we isolated an N400 related to lexical access. The amplitude of the N400 was not influenced by the level of background noise in musicians, whereas N400 amplitude increased with the level of background noise in nonmusicians. In nonmusicians, the increase in N400 amplitude was related to a reduction in task performance. In musicians only, there was a rightward shift of the sources contributing to the N400 as the level of background noise increased. This pattern of results supports the hypothesis that encoding of speech in noise is more robust in musicians and suggests that this facilitates lexical access. Moreover, the shift in sources suggests that musicians, to a greater extent than nonmusicians, may increasingly rely on acoustic cues to understand speech in noise.
Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions
NASA Astrophysics Data System (ADS)
Wang, Longbiao; Minami, Kazue; Yamamoto, Kazumasa; Nakagawa, Seiichi
In this paper, we investigate the effectiveness of phase for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost speaker recognition methods are based on MFCCs even in noisy conditions. For MFCCs which dominantly capture vocal tract information, only the magnitude of the Fourier Transform of time-domain speech frames is used and phase information has been ignored. High complement of the phase information and MFCCs is expected because the phase information includes rich voice source information. Furthermore, some researches have reported that phase based feature was robust to noise. In our previous study, a phase information extraction method that normalizes the change variation in the phase depending on the clipping position of the input speech was proposed, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method skipping frames with low energy/Signal-to-Noise (SN) and noisy speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database added with stationary/non-stationary noise were used to evaluate our proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. The individual result of the phase information was even better than that of MFCCs in many cases by clean speech training models. By deleting unreliable frames (frames having low energy/SN), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error reduction rate was about 30%-60% compared with the standard MFCC-based method.
Intra-oral pressure-based voicing control of electrolaryngeal speech with intra-oral vibrator.
Takahashi, Hirokazu; Nakao, Masayuki; Kikuchi, Yataro; Kaga, Kimitaka
2008-07-01
In normal speech, coordinated activities of intrinsic laryngeal muscles suspend a glottal sound at utterance of voiceless consonants, automatically realizing a voicing control. In electrolaryngeal speech, however, the lack of voicing control is one of the causes of unclear voice, voiceless consonants tending to be misheard as the corresponding voiced consonants. In the present work, we developed an intra-oral vibrator with an intra-oral pressure sensor that detected utterance of voiceless phonemes during the intra-oral electrolaryngeal speech, and demonstrated that an intra-oral pressure-based voicing control could improve the intelligibility of the speech. The test voices were obtained from one electrolaryngeal speaker and one normal speaker. We first investigated on the speech analysis software how a voice onset time (VOT) and first formant (F1) transition of the test consonant-vowel syllables contributed to voiceless/voiced contrasts, and developed an adequate voicing control strategy. We then compared the intelligibility of consonant-vowel syllables among the intra-oral electrolaryngeal speech with and without online voicing control. The increase of intra-oral pressure, typically with a peak ranging from 10 to 50 gf/cm2, could reliably identify utterance of voiceless consonants. The speech analysis and intelligibility test then demonstrated that a short VOT caused the misidentification of the voiced consonants due to a clear F1 transition. Finally, taking these results together, the online voicing control, which suspended the prosthetic tone while the intra-oral pressure exceeded 2.5 gf/cm2 and during the 35 milliseconds that followed, proved efficient to improve the voiceless/voiced contrast.
Carlson, Matthew T
2018-04-01
Language-specific restrictions on sound sequences in words can lead to automatic perceptual repair of illicit sound sequences. As an example, no Spanish words begin with /s/-consonant sequences ([#sC]), and where necessary (e.g., foreign loanwords) [#sC] is repaired by inserting an initial [e], (e.g. foreign loanwords, cf., esnob, from English snob). As a result, Spanish speakers tend to perceive an illusory [e] before [#sC] sequences. Interestingly, this perceptual illusion is weaker in early Spanish-English bilinguals, whose other language, English, allows [#sC]. The present study explored whether this apparent influence of the English language on Spanish is restricted to early bilinguals, whose early language experience includes a mixture of both languages, or whether later learning of second language (L2) English can also induce a weakening of the first language (L1) perceptual illusion. Two groups of late Spanish-English bilinguals, immersed in Spanish or English, were tested on the same Spanish AX (same-different) discrimination task used in a study by Carlson et al., (2016) and their results compared with the Spanish monolinguals from Carlson et al.'s study. Like early bilinguals, late bilinguals exhibited a reduced impact of perceptual prothesis on discrimination accuracy. Additionally, late bilinguals, particularly in English immersion, were slowest when responding against the Spanish perceptual illusion. Robust L1 perceptual illusions thus appear to be malleable in the face of later L2 learning. It is argued that these results are consonant with the need for late bilinguals to navigate alternative, conflicting representations of the same acoustic material, even in unilingual L1 speech perception tasks.
Sensory Intelligence for Extraction of an Abstract Auditory Rule: A Cross-Linguistic Study.
Guo, Xiao-Tao; Wang, Xiao-Dong; Liang, Xiu-Yuan; Wang, Ming; Chen, Lin
2018-02-21
In a complex linguistic environment, while speech sounds can greatly vary, some shared features are often invariant. These invariant features constitute so-called abstract auditory rules. Our previous study has shown that with auditory sensory intelligence, the human brain can automatically extract the abstract auditory rules in the speech sound stream, presumably serving as the neural basis for speech comprehension. However, whether the sensory intelligence for extraction of abstract auditory rules in speech is inherent or experience-dependent remains unclear. To address this issue, we constructed a complex speech sound stream using auditory materials in Mandarin Chinese, in which syllables had a flat lexical tone but differed in other acoustic features to form an abstract auditory rule. This rule was occasionally and randomly violated by the syllables with the rising, dipping or falling tone. We found that both Chinese and foreign speakers detected the violations of the abstract auditory rule in the speech sound stream at a pre-attentive stage, as revealed by the whole-head recordings of mismatch negativity (MMN) in a passive paradigm. However, MMNs peaked earlier in Chinese speakers than in foreign speakers. Furthermore, Chinese speakers showed different MMN peak latencies for the three deviant types, which paralleled recognition points. These findings indicate that the sensory intelligence for extraction of abstract auditory rules in speech sounds is innate but shaped by language experience. Copyright © 2018 IBRO. Published by Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.
2016-10-01
Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
Ng, Elaine H N; Classon, Elisabet; Larsby, Birgitta; Arlinger, Stig; Lunner, Thomas; Rudner, Mary; Rönnberg, Jerker
2014-11-23
The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced. © The Author(s) 2014.
NASA Astrophysics Data System (ADS)
Wang, Hongcui; Kawahara, Tatsuya
CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it still remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or the ASR grammar network. However, this approach easily falls in the trade-off of coverage of errors and the increase of perplexity. To solve the problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, to achieve both better coverage of errors and smaller perplexity, resulting in significant improvement in ASR accuracy.
Malavasi, Massimiliano; Turri, Enrico; Atria, Jose Joaquin; Christensen, Heidi; Marxer, Ricard; Desideri, Lorenzo; Coy, Andre; Tamburini, Fabio; Green, Phil
2017-01-01
A better use of the increasing functional capabilities of home automation systems and Internet of Things (IoT) devices to support the needs of users with disability, is the subject of a research project currently conducted by Area Ausili (Assistive Technology Area), a department of Polo Tecnologico Regionale Corte Roncati of the Local Health Trust of Bologna (Italy), in collaboration with AIAS Ausilioteca Assistive Technology (AT) Team. The main aim of the project is to develop experimental low cost systems for environmental control through simplified and accessible user interfaces. Many of the activities are focused on automatic speech recognition and are developed in the framework of the CloudCAST project. In this paper we report on the first technical achievements of the project and discuss future possible developments and applications within and outside CloudCAST.
Automatic mechanisms for measuring subjective unit of discomfort.
Hartanto, D W I; Kang, Ni; Brinkman, Willem-Paul; Kampmann, Isabel L; Morina, Nexhmedin; Emmelkamp, Paul G M; Neerincx, Mark A
2012-01-01
Current practice in Virtual Reality Exposure Therapy (VRET) is that therapists ask patients about their anxiety level by means of the Subjective Unit of Discomfort (SUD) scale. With an aim of developing a home-based VRET system, this measurement ideally should be done using speech technology. In a VRET system for social phobia with scripted avatar-patient dialogues, the timing of asking patients to give their SUD score becomes relevant. This study examined three timing mechanisms: (1) dialogue dependent (i.e. naturally in the flow of the dialogue); (2) speech dependent (i.e. when both patient and avatar are silent); and (3) context independent (i.e. randomly). Results of an experiment with non-patients (n = 24) showed a significant effect for the timing mechanisms on the perceived dialogue flow, user preference, reported presence and user dialog replies. Overall, dialogue dependent timing mechanism seems superior followed by the speech dependent and context independent timing mechanism.
Analysis of high-speed digital phonoscopy pediatric images
NASA Astrophysics Data System (ADS)
Unnikrishnan, Harikrishnan; Donohue, Kevin D.; Patel, Rita R.
2012-02-01
The quantitative characterization of vocal fold (VF) motion can greatly enhance the diagnosis and treatment of speech pathologies. The recent availability of high-speed systems has created new opportunities to understand VF dynamics. This paper presents quantitative methods for analyzing VF dynamics with high-speed digital phonoscopy, with a focus on expected VF changes during childhood. A robust method for automatic VF edge tracking during phonation is introduced and evaluated against 4 expert human observers. Results from 100 test frames show a subpixel difference between the VF edges selected by algorithm and expert observers. Waveforms created from the VF edge displacement are used to created motion features with limited sensitivity to variations of camera resolution on the imaging plane. New features are introduced based on acceleration ratios of critical points over each phonation cycle, which have the potential for studying issues related to impact stress. A novel denoising and hybrid interpolation/extrapolation scheme is also introduced to reduce the impact of quantization errors and large sampling intervals relative to the phonation cycle. Features extracted from groups of 4 adults and 5 children show large differences for features related to asymmetry between the right and left fold and consistent differences for impact acceleration ratio.
Intelligibility in Context Scale: Normative and Validation Data for English-Speaking Preschoolers.
McLeod, Sharynne; Crowe, Kathryn; Shahaeian, Ameneh
2015-07-01
The purpose of this study was to describe normative and validation data on the Intelligibility in Context Scale (ICS; McLeod, Harrison, & McCormack, 2012c) for English-speaking children. The ICS is a 7-item, parent-report measure of children's speech intelligibility with a range of communicative partners. Data were collected from the parents of 803 Australian English-speaking children ranging in age from 4;0 (years;months) to 5;5 (37.0% were multilingual). The mean ICS score was 4.4 (SD = 0.7) out of a possible total score of 5. Children's speech was reported to be most intelligible to their parents, followed by their immediate family, friends, and teachers; children's speech was least intelligible to strangers. The ICS had high internal consistency (α = .94). Significant differences in scores were identified on the basis of sex and age but not on the basis of socioeconomic status or the number of languages spoken. There were significant differences in scores between children whose parents had concerns about their child's speech (M = 3.9) and those who did not (M = 4.6). A sensitivity of .82 and a specificity of .58 were established as the optimal cutoff. Test-retest reliability and criterion validity were established for 184 children with a speech sound disorder. There was a significant low correlation between the ICS mean score and percentage of phonemes correct (r = .30), percentage of consonants correct (r = .24), and percentage of vowels correct (r = .30) on the Diagnostic Evaluation of Articulation and Phonology (Dodd, Hua, Crosbie, Holm, & Ozanne, 2002). Thirty-one parents completed the ICS related to English and another language spoken by their child with a speech sound disorder. The significant correlations between the scores suggest that the ICS may be robust between languages. This article provides normative ICS data for English-speaking children and additional validation of the psychometric properties of the ICS. The robustness of the ICS was suggested because mean ICS scores were not affected by socioeconomic status, number of languages spoken, or whether the ICS was completed in relation to English or another language. The ICS is recommended as a screening measure of children's speech intelligibility.
Evaluation of an Expert System for the Generation of Speech and Language Therapy Plans.
Robles-Bykbaev, Vladimir; López-Nores, Martín; García-Duque, Jorge; Pazos-Arias, José J; Arévalo-Lucero, Daysi
2016-07-01
Speech and language pathologists (SLPs) deal with a wide spectrum of disorders, arising from many different conditions, that affect voice, speech, language, and swallowing capabilities in different ways. Therefore, the outcomes of Speech and Language Therapy (SLT) are highly dependent on the accurate, consistent, and complete design of personalized therapy plans. However, SLPs often have very limited time to work with their patients and to browse the large (and growing) catalogue of activities and specific exercises that can be put into therapy plans. As a consequence, many plans are suboptimal and fail to address the specific needs of each patient. We aimed to evaluate an expert system that automatically generates plans for speech and language therapy, containing semiannual activities in the five areas of hearing, oral structure and function, linguistic formulation, expressive language and articulation, and receptive language. The goal was to assess whether the expert system speeds up the SLPs' work and leads to more accurate, consistent, and complete therapy plans for their patients. We examined the evaluation results of the SPELTA expert system in supporting the decision making of 4 SLPs treating children in three special education institutions in Ecuador. The expert system was first trained with data from 117 cases, including medical data; diagnosis for voice, speech, language and swallowing capabilities; and therapy plans created manually by the SLPs. It was then used to automatically generate new therapy plans for 13 new patients. The SLPs were finally asked to evaluate the accuracy, consistency, and completeness of those plans. A four-fold cross-validation experiment was also run on the original corpus of 117 cases in order to assess the significance of the results. The evaluation showed that 87% of the outputs provided by the SPELTA expert system were considered valid therapy plans for the different areas. The SLPs rated the overall accuracy, consistency, and completeness of the proposed activities with 4.65, 4.6, and 4.6 points (to a maximum of 5), respectively. The ratings for the subplans generated for the areas of hearing, oral structure and function, and linguistic formulation were nearly perfect, whereas the subplans for expressive language and articulation and for receptive language failed to deal properly with some of the subject cases. Overall, the SLPs indicated that over 90% of the subplans generated automatically were "better than" or "as good as" what the SLPs would have created manually if given the average time they can devote to the task. The cross-validation experiment yielded very similar results. The results show that the SPELTA expert system provides valuable input for SLPs to design proper therapy plans for their patients, in a shorter time and considering a larger set of activities than proceeding manually. The algorithms worked well even in the presence of a sparse corpus, and the evidence suggests that the system will become more reliable as it is trained with more subjects.
Evaluation of an Expert System for the Generation of Speech and Language Therapy Plans
López-Nores, Martín; García-Duque, Jorge; Pazos-Arias, José J; Arévalo-Lucero, Daysi
2016-01-01
Background Speech and language pathologists (SLPs) deal with a wide spectrum of disorders, arising from many different conditions, that affect voice, speech, language, and swallowing capabilities in different ways. Therefore, the outcomes of Speech and Language Therapy (SLT) are highly dependent on the accurate, consistent, and complete design of personalized therapy plans. However, SLPs often have very limited time to work with their patients and to browse the large (and growing) catalogue of activities and specific exercises that can be put into therapy plans. As a consequence, many plans are suboptimal and fail to address the specific needs of each patient. Objective We aimed to evaluate an expert system that automatically generates plans for speech and language therapy, containing semiannual activities in the five areas of hearing, oral structure and function, linguistic formulation, expressive language and articulation, and receptive language. The goal was to assess whether the expert system speeds up the SLPs’ work and leads to more accurate, consistent, and complete therapy plans for their patients. Methods We examined the evaluation results of the SPELTA expert system in supporting the decision making of 4 SLPs treating children in three special education institutions in Ecuador. The expert system was first trained with data from 117 cases, including medical data; diagnosis for voice, speech, language and swallowing capabilities; and therapy plans created manually by the SLPs. It was then used to automatically generate new therapy plans for 13 new patients. The SLPs were finally asked to evaluate the accuracy, consistency, and completeness of those plans. A four-fold cross-validation experiment was also run on the original corpus of 117 cases in order to assess the significance of the results. Results The evaluation showed that 87% of the outputs provided by the SPELTA expert system were considered valid therapy plans for the different areas. The SLPs rated the overall accuracy, consistency, and completeness of the proposed activities with 4.65, 4.6, and 4.6 points (to a maximum of 5), respectively. The ratings for the subplans generated for the areas of hearing, oral structure and function, and linguistic formulation were nearly perfect, whereas the subplans for expressive language and articulation and for receptive language failed to deal properly with some of the subject cases. Overall, the SLPs indicated that over 90% of the subplans generated automatically were “better than” or “as good as” what the SLPs would have created manually if given the average time they can devote to the task. The cross-validation experiment yielded very similar results. Conclusions The results show that the SPELTA expert system provides valuable input for SLPs to design proper therapy plans for their patients, in a shorter time and considering a larger set of activities than proceeding manually. The algorithms worked well even in the presence of a sparse corpus, and the evidence suggests that the system will become more reliable as it is trained with more subjects. PMID:27370070
Wireless sensor network effectively controls center pivot irrigation of sorghum
USDA-ARS?s Scientific Manuscript database
Robust automatic irrigation scheduling has been demonstrated using wired sensors and sensor network systems with subsurface drip and moving irrigation systems. However, there are limited studies that report on crop yield and water use efficiency resulting from the use of wireless networks to automat...
Partanen, Eino; Leminen, Alina; de Paoli, Stine; Bundgaard, Anette; Kingo, Osman Skjold; Krøjgaard, Peter; Shtyrov, Yury
2017-07-15
Children learn new words and word forms with ease, often acquiring a new word after very few repetitions. Recent neurophysiological research on word form acquisition in adults indicates that novel words can be acquired within minutes of repetitive exposure to them, regardless of the individual's focused attention on the speech input. Although it is well-known that children surpass adults in language acquisition, the developmental aspects of such rapid and automatic neural acquisition mechanisms remain unexplored. To address this open question, we used magnetoencephalography (MEG) to scrutinise brain dynamics elicited by spoken words and word-like sounds in healthy monolingual (Danish) children throughout a 20-min repetitive passive exposure session. We found rapid neural dynamics manifested as an enhancement of early (~100ms) brain activity over the short exposure session, with distinct spatiotemporal patterns for different novel sounds. For novel Danish word forms, signs of such enhancement were seen in the left temporal regions only, suggesting reliance on pre-existing language circuits for acquisition of novel word forms with native phonology. In contrast, exposure both to novel word forms with non-native phonology and to novel non-speech sounds led to activity enhancement in both left and right hemispheres, suggesting that more wide-spread cortical networks contribute to the build-up of memory traces for non-native and non-speech sounds. Similar studies in adults have previously reported more sluggish (~15-25min, as opposed to 4min in the present study) or non-existent neural dynamics for non-native sound acquisition, which might be indicative of a higher degree of plasticity in the children's brain. Overall, the results indicate a rapid and highly plastic mechanism for a dynamic build-up of memory traces for novel acoustic information in the children's brain that operates automatically and recruits bilateral temporal cortical circuits. Copyright © 2017 Elsevier Inc. All rights reserved.
Tavano, Alessandro; Pesarin, Anna; Murino, Vittorio; Cristani, Marco
2014-01-01
Individuals with Asperger syndrome/High Functioning Autism fail to spontaneously attribute mental states to the self and others, a life-long phenotypic characteristic known as mindblindness. We hypothesized that mindblindness would affect the dynamics of conversational interaction. Using generative models, in particular Gaussian mixture models and observed influence models, conversations were coded as interacting Markov processes, operating on novel speech/silence patterns, termed Steady Conversational Periods (SCPs). SCPs assume that whenever an agent's process changes state (e.g., from silence to speech), it causes a general transition of the entire conversational process, forcing inter-actant synchronization. SCPs fed into observed influence models, which captured the conversational dynamics of children and adolescents with Asperger syndrome/High Functioning Autism, and age-matched typically developing participants. Analyzing the parameters of the models by means of discriminative classifiers, the dialogs of patients were successfully distinguished from those of control participants. We conclude that meaning-free speech/silence sequences, reflecting inter-actant synchronization, at least partially encode typical and atypical conversational dynamics. This suggests a direct influence of theory of mind abilities onto basic speech initiative behavior.
Automated Depression Analysis Using Convolutional Neural Networks from Speech.
He, Lang; Cao, Cui
2018-05-28
To help clinicians to efficiently diagnose the severity of a person's depression, the affective computing community and the artificial intelligence field have shown a growing interest in designing automated systems. The speech features have useful information for the diagnosis of depression. However, manually designing and domain knowledge are still important for the selection of the feature, which makes the process labor consuming and subjective. In recent years, deep-learned features based on neural networks have shown superior performance to hand-crafted features in various areas. In this paper, to overcome the difficulties mentioned above, we propose a combination of hand-crafted and deep-learned features which can effectively measure the severity of depression from speech. In the proposed method, Deep Convolutional Neural Networks (DCNN) are firstly built to learn deep-learned features from spectrograms and raw speech waveforms. Then we manually extract the state-of-the-art texture descriptors named median robust extended local binary patterns (MRELBP) from spectrograms. To capture the complementary information within the hand-crafted features and deep-learned features, we propose joint fine-tuning layers to combine the raw and spectrogram DCNN to boost the depression recognition performance. Moreover, to address the problems with small samples, a data augmentation method was proposed. Experiments conducted on AVEC2013 and AVEC2014 depression databases show that our approach is robust and effective for the diagnosis of depression when compared to state-of-the-art audio-based methods. Copyright © 2018. Published by Elsevier Inc.
Envelope responses in single-trial EEG indicate attended speaker in a “cocktail party”
Horton, Cort; Srinivasan, Ramesh; D’Zmura, Michael
2014-01-01
Objective Recent studies have shown that auditory cortex better encodes the envelope of attended speech than that of unattended speech during multi-speaker (“cocktail party”) situations. We investigated whether these differences were sufficiently robust within single-trial EEG data to accurately determine where subjects attended. Additionally, we compared this measure to other established EEG markers of attention. Approach High-resolution EEG was recorded while subjects engaged in a two-speaker “cocktail party” task. Cortical responses to speech envelopes were extracted by cross-correlating the envelopes with each EEG channel. We also measured steady-state responses (elicited via high-frequency amplitude modulation of the speech) and alpha-band power, both of which have been sensitive to attention in previous studies. Using linear classifiers, we then examined how well each of these features could be used to predict the subjects’ side of attention at various epoch lengths. Main Results We found that the attended speaker could be determined reliably from the envelope responses calculated from short periods of EEG, with accuracy improving as a function of sample length. Furthermore, envelope responses were far better indicators of attention than changes in either alpha power or steady-state responses. Significance These results suggest that envelope-related signals recorded in EEG data can be used to form robust auditory BCI’s that do not require artificial manipulation (e.g., amplitude modulation) of stimuli to function. PMID:24963838
Scarbel, Lucie; Beautemps, Denis; Schwartz, Jean-Luc; Sato, Marc
2017-07-01
Speech communication can be viewed as an interactive process involving a functional coupling between sensory and motor systems. One striking example comes from phonetic convergence, when speakers automatically tend to mimic their interlocutor's speech during communicative interaction. The goal of this study was to investigate sensory-motor linkage in speech production in postlingually deaf cochlear implanted participants and normal hearing elderly adults through phonetic convergence and imitation. To this aim, two vowel production tasks, with or without instruction to imitate an acoustic vowel, were proposed to three groups of young adults with normal hearing, elderly adults with normal hearing and post-lingually deaf cochlear-implanted patients. Measure of the deviation of each participant's f 0 from their own mean f 0 was measured to evaluate the ability to converge to each acoustic target. showed that cochlear-implanted participants have the ability to converge to an acoustic target, both intentionally and unintentionally, albeit with a lower degree than young and elderly participants with normal hearing. By providing evidence for phonetic convergence and speech imitation, these results suggest that, as in young adults, perceptuo-motor relationships are efficient in elderly adults with normal hearing and that cochlear-implanted adults recovered significant perceptuo-motor abilities following cochlear implantation. Copyright © 2017 Elsevier Ltd. All rights reserved.
Noise Suppression Methods for Robust Speech Processing
1981-04-01
1]. Techniques available for voice processor modification to account for noise contamination are being developed [4]. Preprocessor noise reduction...analysis window function. Principles governing discrete implementation of the transform pair are discussed, and relationships are formalized which specify
Cost-sensitive learning for emotion robust speaker recognition.
Li, Dongdong; Yang, Yingchun; Dai, Weihui
2014-01-01
In the field of information security, voice is one of the most important parts in biometrics. Especially, with the development of voice communication through the Internet or telephone system, huge voice data resources are accessed. In speaker recognition, voiceprint can be applied as the unique password for the user to prove his/her identity. However, speech with various emotions can cause an unacceptably high error rate and aggravate the performance of speaker recognition system. This paper deals with this problem by introducing a cost-sensitive learning technology to reweight the probability of test affective utterances in the pitch envelop level, which can enhance the robustness in emotion-dependent speaker recognition effectively. Based on that technology, a new architecture of recognition system as well as its components is proposed in this paper. The experiment conducted on the Mandarin Affective Speech Corpus shows that an improvement of 8% identification rate over the traditional speaker recognition is achieved.
Cost-Sensitive Learning for Emotion Robust Speaker Recognition
Li, Dongdong; Yang, Yingchun
2014-01-01
In the field of information security, voice is one of the most important parts in biometrics. Especially, with the development of voice communication through the Internet or telephone system, huge voice data resources are accessed. In speaker recognition, voiceprint can be applied as the unique password for the user to prove his/her identity. However, speech with various emotions can cause an unacceptably high error rate and aggravate the performance of speaker recognition system. This paper deals with this problem by introducing a cost-sensitive learning technology to reweight the probability of test affective utterances in the pitch envelop level, which can enhance the robustness in emotion-dependent speaker recognition effectively. Based on that technology, a new architecture of recognition system as well as its components is proposed in this paper. The experiment conducted on the Mandarin Affective Speech Corpus shows that an improvement of 8% identification rate over the traditional speaker recognition is achieved. PMID:24999492
A real-time freehand ultrasound calibration system with automatic accuracy feedback and control.
Chen, Thomas Kuiran; Thurston, Adrian D; Ellis, Randy E; Abolmaesumi, Purang
2009-01-01
This article describes a fully automatic, real-time, freehand ultrasound calibration system. The system was designed to be simple and sterilizable, intended for operating-room usage. The calibration system employed an automatic-error-retrieval and accuracy-control mechanism based on a set of ground-truth data. Extensive validations were conducted on a data set of 10,000 images in 50 independent calibration trials to thoroughly investigate the accuracy, robustness, and performance of the calibration system. On average, the calibration accuracy (measured in three-dimensional reconstruction error against a known ground truth) of all 50 trials was 0.66 mm. In addition, the calibration errors converged to submillimeter in 98% of all trials within 12.5 s on average. Overall, the calibration system was able to consistently, efficiently and robustly achieve high calibration accuracy with real-time performance.
Long Term Suboxone™ Emotional Reactivity As Measured by Automatic Detection in Speech
Hill, Edward; Han, David; Dumouchel, Pierre; Dehak, Najim; Quatieri, Thomas; Moehs, Charles; Oscar-Berman, Marlene; Giordano, John; Simpatico, Thomas; Blum, Kenneth
2013-01-01
Addictions to illicit drugs are among the nation’s most critical public health and societal problems. The current opioid prescription epidemic and the need for buprenorphine/naloxone (Suboxone®; SUBX) as an opioid maintenance substance, and its growing street diversion provided impetus to determine affective states (“true ground emotionality”) in long-term SUBX patients. Toward the goal of effective monitoring, we utilized emotion-detection in speech as a measure of “true” emotionality in 36 SUBX patients compared to 44 individuals from the general population (GP) and 33 members of Alcoholics Anonymous (AA). Other less objective studies have investigated emotional reactivity of heroin, methadone and opioid abstinent patients. These studies indicate that current opioid users have abnormal emotional experience, characterized by heightened response to unpleasant stimuli and blunted response to pleasant stimuli. However, this is the first study to our knowledge to evaluate “true ground” emotionality in long-term buprenorphine/naloxone combination (Suboxone™). We found in long-term SUBX patients a significantly flat affect (p<0.01), and they had less self-awareness of being happy, sad, and anxious compared to both the GP and AA groups. We caution definitive interpretation of these seemingly important results until we compare the emotional reactivity of an opioid abstinent control using automatic detection in speech. These findings encourage continued research strategies in SUBX patients to target the specific brain regions responsible for relapse prevention of opioid addiction. PMID:23874860
Yildiz, Izzet B.; von Kriegstein, Katharina; Kiebel, Stefan J.
2013-01-01
Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents—an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments. PMID:24068902
Yildiz, Izzet B; von Kriegstein, Katharina; Kiebel, Stefan J
2013-01-01
Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents-an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments.
Speech privacy performance of a new hospital and medical office building
NASA Astrophysics Data System (ADS)
Roy, Kenneth P.; Good, Kenneth W.; Snader, Anita M.; Hatzel, Sharon K.
2005-09-01
Shortly after the occupation of a new hospital and medical office building, both objective and subjective evaluations of the acoustic performance of these facilities were made. The goals of this work were twofold: first, to survey the occupants' subjective perception of the acoustic environment relative to noise, distractions, speech privacy, etc; and second, to relate the subjective perception to objective measures of noise isolation rating (NIC), background noise (dBA), and speech privacy rating (PI). Knowing the construction details of the walls, ceiling, doors, etc. also allowed a comparison of the measured NIC to the expected STC for each type of construction. In this way it was possible to identify robust architectural systems versus weak systems with inherent flanking and leakage paths.
A variable rate speech compressor for mobile applications
NASA Technical Reports Server (NTRS)
Yeldener, S.; Kondoz, A. M.; Evans, B. G.
1990-01-01
One of the most promising speech coder at the bit rate of 9.6 to 4.8 kbits/s is CELP. Code Excited Linear Prediction (CELP) has been dominating 9.6 to 4.8 kbits/s region during the past 3 to 4 years. Its set back however, is its expensive implementation. As an alternative to CELP, the Base-Band CELP (CELP-BB) was developed which produced good quality speech comparable to CELP and a single chip implementable complexity as reported previously. Its robustness was also improved to tolerate errors up to 1.0 pct. and maintain intelligibility up to 5.0 pct. and more. Although, CELP-BB produces good quality speech at around 4.8 kbits/s, it has a fundamental problem when updating the pitch filter memory. A sub-optimal solution is proposed for this problem. Below 4.8 kbits/s, however, CELP-BB suffers from noticeable quantization noise as a result of the large vector dimensions used. Efficient representation of speech below 4.8 kbits/s is reported by introducing Sinusoidal Transform Coding (STC) to represent the LPC excitation which is called Sine Wave Excited LPC (SWELP). In this case, natural sounding good quality synthetic speech is obtained at around 2.4 kbits/s.
Dimension-Based Statistical Learning Affects Both Speech Perception and Production.
Lehet, Matthew; Holt, Lori L
2017-04-01
Multiple acoustic dimensions signal speech categories. However, dimensions vary in their informativeness; some are more diagnostic of category membership than others. Speech categorization reflects these dimensional regularities such that diagnostic dimensions carry more "perceptual weight" and more effectively signal category membership to native listeners. Yet perceptual weights are malleable. When short-term experience deviates from long-term language norms, such as in a foreign accent, the perceptual weight of acoustic dimensions in signaling speech category membership rapidly adjusts. The present study investigated whether rapid adjustments in listeners' perceptual weights in response to speech that deviates from the norms also affects listeners' own speech productions. In a word recognition task, the correlation between two acoustic dimensions signaling consonant categories, fundamental frequency (F0) and voice onset time (VOT), matched the correlation typical of English, and then shifted to an "artificial accent" that reversed the relationship, and then shifted back. Brief, incidental exposure to the artificial accent caused participants to down-weight perceptual reliance on F0, consistent with previous research. Throughout the task, participants were intermittently prompted with pictures to produce these same words. In the block in which listeners heard the artificial accent with a reversed F0 × VOT correlation, F0 was a less robust cue to voicing in listeners' own speech productions. The statistical regularities of short-term speech input affect both speech perception and production, as evidenced via shifts in how acoustic dimensions are weighted. Copyright © 2016 Cognitive Science Society, Inc.
Musicians change their tune: how hearing loss alters the neural code.
Parbery-Clark, Alexandra; Anderson, Samira; Kraus, Nina
2013-08-01
Individuals with sensorineural hearing loss have difficulty understanding speech, especially in background noise. This deficit remains even when audibility is restored through amplification, suggesting that mechanisms beyond a reduction in peripheral sensitivity contribute to the perceptual difficulties associated with hearing loss. Given that normal-hearing musicians have enhanced auditory perceptual skills, including speech-in-noise perception, coupled with heightened subcortical responses to speech, we aimed to determine whether similar advantages could be observed in middle-aged adults with hearing loss. Results indicate that musicians with hearing loss, despite self-perceptions of average performance for understanding speech in noise, have a greater ability to hear in noise relative to nonmusicians. This is accompanied by more robust subcortical encoding of sound (e.g., stimulus-to-response correlations and response consistency) as well as more resilient neural responses to speech in the presence of background noise (e.g., neural timing). Musicians with hearing loss also demonstrate unique neural signatures of spectral encoding relative to nonmusicians: enhanced neural encoding of the speech-sound's fundamental frequency but not of its upper harmonics. This stands in contrast to previous outcomes in normal-hearing musicians, who have enhanced encoding of the harmonics but not the fundamental frequency. Taken together, our data suggest that although hearing loss modifies a musician's spectral encoding of speech, the musician advantage for perceiving speech in noise persists in a hearing-impaired population by adaptively strengthening underlying neural mechanisms for speech-in-noise perception. Copyright © 2013 Elsevier B.V. All rights reserved.
Dimension-based statistical learning affects both speech perception and production
Lehet, Matthew; Holt, Lori L.
2016-01-01
Multiple acoustic dimensions signal speech categories. However, dimensions vary in their informativeness; some are more diagnostic of category membership than others. Speech categorization reflects these dimensional regularities such that diagnostic dimensions carry more “perceptual weight” and more effectively signal category membership to native listeners. Yet, perceptual weights are malleable. When short-term experience deviates from long-term language norms, such as in a foreign accent, the perceptual weight of acoustic dimensions in signaling speech category membership rapidly adjusts. The present study investigated whether rapid adjustments in listeners’ perceptual weights in response to speech that deviates from the norms also affects listeners’ own speech productions. In a word recognition task, the correlation between two acoustic dimensions signaling consonant categories, fundamental frequency (F0) and voice onset time (VOT), matched the correlation typical of English, then shifted to an “artificial accent” that reversed the relationship, and then shifted back. Brief, incidental exposure to the artificial accent caused participants to down-weight perceptual reliance on F0, consistent with previous research. Throughout the task, participants were intermittently prompted with pictures to produce these same words. In the block in which listeners heard the artificial accent with a reversed F0 x VOT correlation, F0 was a less robust cue to voicing in listeners’ own speech productions. The statistical regularities of short-term speech input affect both speech perception and production, as evidenced via shifts in how acoustic dimensions are weighted. PMID:27666146
Performance of a wireless sensor network for crop monitoring and irrigation control
USDA-ARS?s Scientific Manuscript database
Robust automatic irrigation scheduling has been demonstrated using wired sensors and sensor network systems with subsurface drip and moving irrigation systems. However, there are limited studies that report on crop yield and water use efficiency resulting from the use of wireless networks to automat...
Başkent, Deniz; Eiler, Cheryl L; Edwards, Brent
2007-06-01
To present a comprehensive analysis of the feasibility of genetic algorithms (GA) for finding the best fit of hearing aids or cochlear implants for individual users in clinical or research settings, where the algorithm is solely driven by subjective human input. Due to varying pathology, the best settings of an auditory device differ for each user. It is also likely that listening preferences vary at the same time. The settings of a device customized for a particular user can only be evaluated by the user. When optimization algorithms are used for fitting purposes, this situation poses a difficulty for a systematic and quantitative evaluation of the suitability of the fitting parameters produced by the algorithm. In the present study, an artificial listening environment was generated by distorting speech using a noiseband vocoder. The settings produced by the GA for this listening problem could objectively be evaluated by measuring speech recognition and comparing the performance to the best vocoder condition where speech was least distorted. Nine normal-hearing subjects participated in the study. The parameters to be optimized were the number of vocoder channels, the shift between the input frequency range and the synthesis frequency range, and the compression-expansion of the input frequency range over the synthesis frequency range. The subjects listened to pairs of sentences processed with the vocoder, and entered a preference for the sentence with better intelligibility. The GA modified the solutions iteratively according to the subject preferences. The program converged when the user ranked the same set of parameters as the best in three consecutive steps. The results produced by the GA were analyzed for quality by measuring speech intelligibility, for test-retest reliability by running the GA three times with each subject, and for convergence properties. Speech recognition scores averaged across subjects were similar for the best vocoder solution and for the solutions produced by the GA. The average number of iterations was 8 and the average convergence time was 25.5 minutes. The settings produced by different GA runs for the same subject were slightly different; however, speech recognition scores measured with these settings were similar. Individual data from subjects showed that in each run, a small number of GA solutions produced poorer speech intelligibility than for the best setting. This was probably a result of the combination of the inherent randomness of the GA, the convergence criterion used in the present study, and possible errors that the users might have made during the paired comparisons. On the other hand, the effect of these errors was probably small compared to the other two factors, as a comparison between subjective preferences and objective measures showed that for many subjects the two were in good agreement. The results showed that the GA was able to produce good solutions by using listener preferences in a relatively short time. For practical applications, the program can be made more robust by running the GA twice or by not using an automatic stopping criterion, and it can be made faster by optimizing the number of the paired comparisons completed in each iteration.
Robust audio-visual speech recognition under noisy audio-video conditions.
Stewart, Darryl; Seymour, Rowan; Pass, Adrian; Ming, Ji
2014-02-01
This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audio-visual speech recognition in environments, where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements and can be used alongside many of the other approaches that have been proposed in the literature for this problem. For evaluation we used the large XM2VTS database for speaker-independent audio-visual speech recognition. The extensive tests include both clean and corrupted utterances with corruption added in either/both the video and audio streams using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison to another well-known dynamic stream weighting approach and also compared to any fixed-weighted integration approach in both clean conditions or when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.
NASA Astrophysics Data System (ADS)
Chupina, K. V.; Kataev, E. V.; Khannanov, A. M.; Korshunov, V. N.; Sennikov, I. A.
2018-05-01
The paper is devoted to a problem of synthesis of the robust control system for a distributed parameters plant. The vessel descent-rise device has a heave compensation function for stabilization of the towed underwater vehicle on a set depth. A sea state code, parameters of the underwater vehicle and cable vary during underwater operations, the vessel heave is a stochastic process. It means that the plant and external disturbances have uncertainty. That is why it is necessary to use the robust theory for synthesis of an automatic control system, but without use of traditional methods of optimization, because this cable has distributed parameters. The offered technique has allowed one to design an effective control system for stabilization of immersion depth of the towed underwater vehicle for various degrees of sea roughness and to provide its robustness to deviations of parameters of the vehicle and cable’s length.
ERIC Educational Resources Information Center
Saetrevik, Bjorn; Specht, Karsten
2012-01-01
It has previously been shown that task performance and frontal cortical activation increase after cognitive conflict. This has been argued to support a model of attention where the level of conflict automatically adjusts the amount of cognitive control applied. Conceivably, conflict could also modulate lower-level processing pathways, which would…
Developing Swedish Spelling Exercises on the ICALL Platform Lärka
ERIC Educational Resources Information Center
Pijetlovic, Dijana; Volodina, Elena
2013-01-01
In this project we developed web services on the ICALL platform Lärka for automatic generation of Swedish spelling exercises using Text-To-Speech (TTS) technology which allows L2 learners to train their spelling and listening individually at home. The spelling exercises contain five different linguistic levels, whereby the language learner has the…
The Game Embedded CALL System to Facilitate English Vocabulary Acquisition and Pronunciation
ERIC Educational Resources Information Center
Young, Shelley Shwu-Ching; Wang, Yi-Hsuan
2014-01-01
The aim of this study is to make a new attempt to explore the potential of integrating game strategies with automatic speech recognition technologies to provide learners with individual opportunities for English pronunciation learning. The study developed the Game Embedded CALL (GeCALL) system with two activities for on-line speaking practice. For…
Frequency of Use Leads to Automaticity of Production: Evidence from Repair in Conversation
ERIC Educational Resources Information Center
Kapatsinski, Vsevolod
2010-01-01
In spontaneous speech, speakers sometimes replace a word they have just produced or started producing by another word. The present study reports that in these replacement repairs, low-frequency replaced words are more likely to be interrupted prior to completion than high-frequency words, providing support to the hypothesis that the production of…
Arabic Language Modeling with Stem-Derived Morphemes for Automatic Speech Recognition
ERIC Educational Resources Information Center
Heintz, Ilana
2010-01-01
The goal of this dissertation is to introduce a method for deriving morphemes from Arabic words using stem patterns, a feature of Arabic morphology. The motivations are three-fold: modeling with morphemes rather than words should help address the out-of-vocabulary problem; working with stem patterns should prove to be a cross-dialectally valid…
ERIC Educational Resources Information Center
Callahan, Sarah M.; Walenski, Matthew; Love, Tracy
2012-01-01
Purpose: To examine children's comprehension of verb phrase (VP) ellipsis constructions in light of their automatic, online structural processing abilities and conscious, metalinguistic reflective skill. Method: Forty-two children ages 5 through 12 years listened to VP ellipsis constructions involving the strict/sloppy ambiguity (e.g., "The…
ERIC Educational Resources Information Center
Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya
2017-01-01
This study proposes a method to detect problematic speech segments automatically for second language (L2) listeners, considering both lexical and acoustic aspects. It introduces a tool, Partial and Synchronized Caption (PSC), which provides assistance for language learners and fosters L2 listening skills. PSC presents purposively selected words…
Comparison of Classification Methods for Detecting Emotion from Mandarin Speech
NASA Astrophysics Data System (ADS)
Pao, Tsang-Long; Chen, Yu-Te; Yeh, Jun-Heng
It is said that technology comes out from humanity. What is humanity? The very definition of humanity is emotion. Emotion is the basis for all human expression and the underlying theme behind everything that is done, said, thought or imagined. Making computers being able to perceive and respond to human emotion, the human-computer interaction will be more natural. Several classifiers are adopted for automatically assigning an emotion category, such as anger, happiness or sadness, to a speech utterance. These classifiers were designed independently and tested on various emotional speech corpora, making it difficult to compare and evaluate their performance. In this paper, we first compared several popular classification methods and evaluated their performance by applying them to a Mandarin speech corpus consisting of five basic emotions, including anger, happiness, boredom, sadness and neutral. The extracted feature streams contain MFCC, LPCC, and LPC. The experimental results show that the proposed WD-MKNN classifier achieves an accuracy of 81.4% for the 5-class emotion recognition and outperforms other classification techniques, including KNN, MKNN, DW-KNN, LDA, QDA, GMM, HMM, SVM, and BPNN. Then, to verify the advantage of the proposed method, we compared these classifiers by applying them to another Mandarin expressive speech corpus consisting of two emotions. The experimental results still show that the proposed WD-MKNN outperforms others.
Oral reading fluency analysis in patients with Alzheimer disease and asymptomatic control subjects.
Martínez-Sánchez, F; Meilán, J J G; García-Sevilla, J; Carro, J; Arana, J M
2013-01-01
Many studies highlight that an impaired ability to communicate is one of the key clinical features of Alzheimer disease (AD). To study temporal organisation of speech in an oral reading task in patients with AD and in matched healthy controls using a semi-automatic method, and evaluate that method's ability to discriminate between the 2 groups. A test with an oral reading task was administered to 70 subjects, comprising 35 AD patients and 35 controls. Before speech samples were recorded, participants completed a battery of neuropsychological tests. There were no differences between groups with regard to age, sex, or educational level. All of the study variables showed impairment in the AD group. According to the results, AD patients' oral reading was marked by reduced speech and articulation rates, low effectiveness of phonation time, and increases in the number and proportion of pauses. Signal processing algorithms applied to reading fluency recordings were shown to be capable of differentiating between AD patients and controls with an accuracy of 80% (specificity 74.2%, sensitivity 77.1%) based on speech rate. Analysis of oral reading fluency may be useful as a tool for the objective study and quantification of speech deficits in AD. Copyright © 2012 Sociedad Española de Neurología. Published by Elsevier Espana. All rights reserved.
The influence of ambient speech on adult speech productions through unintentional imitation.
Delvaux, Véronique; Soquet, Alain
2007-01-01
This paper deals with the influence of ambient speech on individual speech productions. A methodological framework is defined to gather the experimental data necessary to feed computer models simulating self-organisation in phonological systems. Two experiments were carried out. Experiment 1 was run on French native speakers from two regiolects of Belgium: two from Liège and two from Brussels. When exposed to the way of speaking of the other regiolect via loudspeakers, the speakers of one regiolect produced vowels that were significantly different from their typical realisations, and significantly closer to the way of speaking specific of the other regiolect. Experiment 2 achieved a replication of the results for 8 Mons speakers hearing a Liège speaker. A significant part of the imitative effect remained up to 10 min after the end of the exposure to the other regiolect productions. As a whole, the results suggest that: (i) imitation occurs automatically and unintentionally, (ii) the modified realisations leave a memory trace, in which case the mechanism may be better defined as 'mimesis' than as 'imitation'. The potential effects of multiple imitative speech interactions on sound change are discussed in this paper, as well as the implications for a general theory of phonetic implementation and phonetic representation.
Stimulated Deep Neural Network for Speech Recognition
2016-09-08
making network regularization and robust adaptation challenging. Stimulated training has recently been proposed to address this problem by encouraging...potential to improve regularization and adaptation. This paper investigates stimulated training of DNNs for both of these options. These schemes take
Modification of computational auditory scene analysis (CASA) for noise-robust acoustic feature
NASA Astrophysics Data System (ADS)
Kwon, Minseok
While there have been many attempts to mitigate interferences of background noise, the performance of automatic speech recognition (ASR) still can be deteriorated by various factors with ease. However, normal hearing listeners can accurately perceive sounds of their interests, which is believed to be a result of Auditory Scene Analysis (ASA). As a first attempt, the simulation of the human auditory processing, called computational auditory scene analysis (CASA), was fulfilled through physiological and psychological investigations of ASA. CASA comprised of Zilany-Bruce auditory model, followed by tracking fundamental frequency for voice segmentation and detecting pairs of onset/offset at each characteristic frequency (CF) for unvoiced segmentation. The resulting Time-Frequency (T-F) representation of acoustic stimulation was converted into acoustic feature, gammachirp-tone frequency cepstral coefficients (GFCC). 11 keywords with various environmental conditions are used and the robustness of GFCC was evaluated by spectral distance (SD) and dynamic time warping distance (DTW). In "clean" and "noisy" conditions, the application of CASA generally improved noise robustness of the acoustic feature compared to a conventional method with or without noise suppression using MMSE estimator. The intial study, however, not only showed the noise-type dependency at low SNR, but also called the evaluation methods in question. Some modifications were made to capture better spectral continuity from an acoustic feature matrix, to obtain faster processing speed, and to describe the human auditory system more precisely. The proposed framework includes: 1) multi-scale integration to capture more accurate continuity in feature extraction, 2) contrast enhancement (CE) of each CF by competition with neighboring frequency bands, and 3) auditory model modifications. The model modifications contain the introduction of higher Q factor, middle ear filter more analogous to human auditory system, the regulation of time constant update for filters in signal/control path as well as level-independent frequency glides with fixed frequency modulation. First, we scrutinized performance development in keyword recognition using the proposed methods in quiet and noise-corrupted environments. The results argue that multi-scale integration should be used along with CE in order to avoid ambiguous continuity in unvoiced segments. Moreover, the inclusion of the all modifications was observed to guarantee the noise-type-independent robustness particularly with severe interference. Moreover, the CASA with the auditory model was implemented into a single/dual-channel ASR using reference TIMIT corpus so as to get more general result. Hidden Markov model (HTK) toolkit was used for phone recognition in various environmental conditions. In a single-channel ASR, the results argue that unmasked acoustic features (unmasked GFCC) should combine with target estimates from the mask to compensate for missing information. From the observation of a dual-channel ASR, the combined GFCC guarantees the highest performance regardless of interferences within speech. Moreover, consistent improvement of noise robustness by GFCC (unmasked or combined) shows the validity of our proposed CASA implementation in dual microphone system. In conclusion, the proposed framework proves the robustness of the acoustic features in various background interferences via both direct distance evaluation and statistical assessment. In addition, the introduction of dual microphone system using the framework in this study shows the potential of the effective implementation of the auditory model-based CASA in ASR.
Telephony-based voice pathology assessment using automated speech analysis.
Moran, Rosalyn J; Reilly, Richard B; de Chazal, Philip; Lacy, Peter D
2006-03-01
A system for remotely detecting vocal fold pathologies using telephone-quality speech is presented. The system uses a linear classifier, processing measurements of pitch perturbation, amplitude perturbation and harmonic-to-noise ratio derived from digitized speech recordings. Voice recordings from the Disordered Voice Database Model 4337 system were used to develop and validate the system. Results show that while a sustained phonation, recorded in a controlled environment, can be classified as normal or pathologic with accuracy of 89.1%, telephone-quality speech can be classified as normal or pathologic with an accuracy of 74.2%, using the same scheme. Amplitude perturbation features prove most robust for telephone-quality speech. The pathologic recordings were then subcategorized into four groups, comprising normal, neuromuscular pathologic, physical pathologic and mixed (neuromuscular with physical) pathologic. A separate classifier was developed for classifying the normal group from each pathologic subcategory. Results show that neuromuscular disorders could be detected remotely with an accuracy of 87%, physical abnormalities with an accuracy of 78% and mixed pathology voice with an accuracy of 61%. This study highlights the real possibility for remote detection and diagnosis of voice pathology.
Lip movements affect infants' audiovisual speech perception.
Yeung, H Henny; Werker, Janet F
2013-05-01
Speech is robustly audiovisual from early in infancy. Here we show that audiovisual speech perception in 4.5-month-old infants is influenced by sensorimotor information related to the lip movements they make while chewing or sucking. Experiment 1 consisted of a classic audiovisual matching procedure, in which two simultaneously displayed talking faces (visual [i] and [u]) were presented with a synchronous vowel sound (audio /i/ or /u/). Infants' looking patterns were selectively biased away from the audiovisual matching face when the infants were producing lip movements similar to those needed to produce the heard vowel. Infants' looking patterns returned to those of a baseline condition (no lip movements, looking longer at the audiovisual matching face) when they were producing lip movements that did not match the heard vowel. Experiment 2 confirmed that these sensorimotor effects interacted with the heard vowel, as looking patterns differed when infants produced these same lip movements while seeing and hearing a talking face producing an unrelated vowel (audio /a/). These findings suggest that the development of speech perception and speech production may be mutually informative.
Content-based TV sports video retrieval using multimodal analysis
NASA Astrophysics Data System (ADS)
Yu, Yiqing; Liu, Huayong; Wang, Hongbin; Zhou, Dongru
2003-09-01
In this paper, we propose content-based video retrieval, which is a kind of retrieval by its semantical contents. Because video data is composed of multimodal information streams such as video, auditory and textual streams, we describe a strategy of using multimodal analysis for automatic parsing sports video. The paper first defines the basic structure of sports video database system, and then introduces a new approach that integrates visual stream analysis, speech recognition, speech signal processing and text extraction to realize video retrieval. The experimental results for TV sports video of football games indicate that the multimodal analysis is effective for video retrieval by quickly browsing tree-like video clips or inputting keywords within predefined domain.
Temporal processing of speech in a time-feature space
NASA Astrophysics Data System (ADS)
Avendano, Carlos
1997-09-01
The performance of speech communication systems often degrades under realistic environmental conditions. Adverse environmental factors include additive noise sources, room reverberation, and transmission channel distortions. This work studies the processing of speech in the temporal-feature or modulation spectrum domain, aiming for alleviation of the effects of such disturbances. Speech reflects the geometry of the vocal organs, and the linguistically dominant component is in the shape of the vocal tract. At any given point in time, the shape of the vocal tract is reflected in the short-time spectral envelope of the speech signal. The rate of change of the vocal tract shape appears to be important for the identification of linguistic components. This rate of change, or the rate of change of the short-time spectral envelope can be described by the modulation spectrum, i.e. the spectrum of the time trajectories described by the short-time spectral envelope. For a wide range of frequency bands, the modulation spectrum of speech exhibits a maximum at about 4 Hz, the average syllabic rate. Disturbances often have modulation frequency components outside the speech range, and could in principle be attenuated without significantly affecting the range with relevant linguistic information. Early efforts for exploiting the modulation spectrum domain (temporal processing), such as the dynamic cepstrum or the RASTA processing, used ad hoc designed processing and appear to be suboptimal. As a major contribution, in this dissertation we aim for a systematic data-driven design of temporal processing. First we analytically derive and discuss some properties and merits of temporal processing for speech signals. We attempt to formalize the concept and provide a theoretical background which has been lacking in the field. In the experimental part we apply temporal processing to a number of problems including adaptive noise reduction in cellular telephone environments, reduction of reverberation for speech enhancement, and improvements on automatic recognition of speech degraded by linear distortions and reverberation.
Semi Automatic Ontology Instantiation in the domain of Risk Management
NASA Astrophysics Data System (ADS)
Makki, Jawad; Alquier, Anne-Marie; Prince, Violaine
One of the challenging tasks in the context of Ontological Engineering is to automatically or semi-automatically support the process of Ontology Learning and Ontology Population from semi-structured documents (texts). In this paper we describe a Semi-Automatic Ontology Instantiation method from natural language text, in the domain of Risk Management. This method is composed from three steps 1 ) Annotation with part-of-speech tags, 2) Semantic Relation Instances Extraction, 3) Ontology instantiation process. It's based on combined NLP techniques using human intervention between steps 2 and 3 for control and validation. Since it heavily relies on linguistic knowledge it is not domain dependent which is a good feature for portability between the different fields of risk management application. The proposed methodology uses the ontology of the PRIMA1 project (supported by the European community) as a Generic Domain Ontology and populates it via an available corpus. A first validation of the approach is done through an experiment with Chemical Fact Sheets from Environmental Protection Agency2.
Four-Channel Biosignal Analysis and Feature Extraction for Automatic Emotion Recognition
NASA Astrophysics Data System (ADS)
Kim, Jonghwa; André, Elisabeth
This paper investigates the potential of physiological signals as a reliable channel for automatic recognition of user's emotial state. For the emotion recognition, little attention has been paid so far to physiological signals compared to audio-visual emotion channels such as facial expression or speech. All essential stages of automatic recognition system using biosignals are discussed, from recording physiological dataset up to feature-based multiclass classification. Four-channel biosensors are used to measure electromyogram, electrocardiogram, skin conductivity and respiration changes. A wide range of physiological features from various analysis domains, including time/frequency, entropy, geometric analysis, subband spectra, multiscale entropy, etc., is proposed in order to search the best emotion-relevant features and to correlate them with emotional states. The best features extracted are specified in detail and their effectiveness is proven by emotion recognition results.
Rohlfing, Katharina J.; Nachtigäller, Kerstin
2016-01-01
The learning of spatial prepositions is assumed to be based on experience in space. In a slow mapping study, we investigated whether 31 German 28-month-old children could robustly learn the German spatial prepositions hinter [behind] and neben [next to] from pictures, and whether a narrative input can compensate for a lack of immediate experience in space. One group of children received pictures with a narrative input as a training to understand spatial prepositions. In two further groups, we controlled (a) for the narrative input by providing unconnected speech during the training and (b) for the learning material by training the children on toys rather than pictures. We assessed children’s understanding of spatial prepositions at three different time points: pretest, immediate test, and delayed posttest. Results showed improved word retention in children from the narrative but not the control group receiving unconnected speech. Neither of the trained groups succeeded in generalization to novel referents. Finally, all groups were instructed to deal with untrained material in the test to investigate the robustness of learning across tasks. None of the groups succeeded in this task transfer. PMID:27471479
Obermeier, Christian; Holle, Henning; Gunter, Thomas C
2011-07-01
The present series of experiments explores several issues related to gesture-speech integration and synchrony during sentence processing. To be able to more precisely manipulate gesture-speech synchrony, we used gesture fragments instead of complete gestures, thereby avoiding the usual long temporal overlap of gestures with their coexpressive speech. In a pretest, the minimal duration of an iconic gesture fragment needed to disambiguate a homonym (i.e., disambiguation point) was therefore identified. In three subsequent ERP experiments, we then investigated whether the gesture information available at the disambiguation point has immediate as well as delayed consequences on the processing of a temporarily ambiguous spoken sentence, and whether these gesture-speech integration processes are susceptible to temporal synchrony. Experiment 1, which used asynchronous stimuli as well as an explicit task, showed clear N400 effects at the homonym as well as at the target word presented further downstream, suggesting that asynchrony does not prevent integration under explicit task conditions. No such effects were found when asynchronous stimuli were presented using a more shallow task (Experiment 2). Finally, when gesture fragment and homonym were synchronous, similar results as in Experiment 1 were found, even under shallow task conditions (Experiment 3). We conclude that when iconic gesture fragments and speech are in synchrony, their interaction is more or less automatic. When they are not, more controlled, active memory processes are necessary to be able to combine the gesture fragment and speech context in such a way that the homonym is disambiguated correctly.
Zamaninezhad, Ladan; Hohmann, Volker; Büchner, Andreas; Schädler, Marc René; Jürgens, Tim
2017-02-01
This study introduces a speech intelligibility model for cochlear implant users with ipsilateral preserved acoustic hearing that aims at simulating the observed speech-in-noise intelligibility benefit when receiving simultaneous electric and acoustic stimulation (EA-benefit). The model simulates the auditory nerve spiking in response to electric and/or acoustic stimulation. The temporally and spatially integrated spiking patterns were used as the final internal representation of noisy speech. Speech reception thresholds (SRTs) in stationary noise were predicted for a sentence test using an automatic speech recognition framework. The model was employed to systematically investigate the effect of three physiologically relevant model factors on simulated SRTs: (1) the spatial spread of the electric field which co-varies with the number of electrically stimulated auditory nerves, (2) the "internal" noise simulating the deprivation of auditory system, and (3) the upper bound frequency limit of acoustic hearing. The model results show that the simulated SRTs increase monotonically with increasing spatial spread for fixed internal noise, and also increase with increasing the internal noise strength for a fixed spatial spread. The predicted EA-benefit does not follow such a systematic trend and depends on the specific combination of the model parameters. Beyond 300 Hz, the upper bound limit for preserved acoustic hearing is less influential on speech intelligibility of EA-listeners in stationary noise. The proposed model-predicted EA-benefits are within the range of EA-benefits shown by 18 out of 21 actual cochlear implant listeners with preserved acoustic hearing. Copyright © 2016 Elsevier B.V. All rights reserved.
Distributed neural signatures of natural audiovisual speech and music in the human auditory cortex.
Salmi, Juha; Koistinen, Olli-Pekka; Glerean, Enrico; Jylänki, Pasi; Vehtari, Aki; Jääskeläinen, Iiro P; Mäkelä, Sasu; Nummenmaa, Lauri; Nummi-Kuisma, Katarina; Nummi, Ilari; Sams, Mikko
2017-08-15
During a conversation or when listening to music, auditory and visual information are combined automatically into audiovisual objects. However, it is still poorly understood how specific type of visual information shapes neural processing of sounds in lifelike stimulus environments. Here we applied multi-voxel pattern analysis to investigate how naturally matching visual input modulates supratemporal cortex activity during processing of naturalistic acoustic speech, singing and instrumental music. Bayesian logistic regression classifiers with sparsity-promoting priors were trained to predict whether the stimulus was audiovisual or auditory, and whether it contained piano playing, speech, or singing. The predictive performances of the classifiers were tested by leaving one participant at a time for testing and training the model using the remaining 15 participants. The signature patterns associated with unimodal auditory stimuli encompassed distributed locations mostly in the middle and superior temporal gyrus (STG/MTG). A pattern regression analysis, based on a continuous acoustic model, revealed that activity in some of these MTG and STG areas were associated with acoustic features present in speech and music stimuli. Concurrent visual stimulus modulated activity in bilateral MTG (speech), lateral aspect of right anterior STG (singing), and bilateral parietal opercular cortex (piano). Our results suggest that specific supratemporal brain areas are involved in processing complex natural speech, singing, and piano playing, and other brain areas located in anterior (facial speech) and posterior (music-related hand actions) supratemporal cortex are influenced by related visual information. Those anterior and posterior supratemporal areas have been linked to stimulus identification and sensory-motor integration, respectively. Copyright © 2017 Elsevier Inc. All rights reserved.
TRECVID: the utility of a content-based video retrieval evaluation
NASA Astrophysics Data System (ADS)
Hauptmann, Alexander G.
2006-01-01
TRECVID, an annual retrieval evaluation benchmark organized by NIST, encourages research in information retrieval from digital video. TRECVID benchmarking covers both interactive and manual searching by end users, as well as the benchmarking of some supporting technologies including shot boundary detection, extraction of semantic features, and the automatic segmentation of TV news broadcasts. Evaluations done in the context of the TRECVID benchmarks show that generally, speech transcripts and annotations provide the single most important clue for successful retrieval. However, automatically finding the individual images is still a tremendous and unsolved challenge. The evaluations repeatedly found that none of the multimedia analysis and retrieval techniques provide a significant benefit over retrieval using only textual information such as from automatic speech recognition transcripts or closed captions. In interactive systems, we do find significant differences among the top systems, indicating that interfaces can make a huge difference for effective video/image search. For interactive tasks efficient interfaces require few key clicks, but display large numbers of images for visual inspection by the user. The text search finds the right context region in the video in general, but to select specific relevant images we need good interfaces to easily browse the storyboard pictures. In general, TRECVID has motivated the video retrieval community to be honest about what we don't know how to do well (sometimes through painful failures), and has focused us to work on the actual task of video retrieval, as opposed to flashy demos based on technological capabilities.
Optimization and automation of quantitative NMR data extraction.
Bernstein, Michael A; Sýkora, Stan; Peng, Chen; Barba, Agustín; Cobas, Carlos
2013-06-18
NMR is routinely used to quantitate chemical species. The necessary experimental procedures to acquire quantitative data are well-known, but relatively little attention has been applied to data processing and analysis. We describe here a robust expert system that can be used to automatically choose the best signals in a sample for overall concentration determination and determine analyte concentration using all accepted methods. The algorithm is based on the complete deconvolution of the spectrum which makes it tolerant of cases where signals are very close to one another and includes robust methods for the automatic classification of NMR resonances and molecule-to-spectrum multiplets assignments. With the functionality in place and optimized, it is then a relatively simple matter to apply the same workflow to data in a fully automatic way. The procedure is desirable for both its inherent performance and applicability to NMR data acquired for very large sample sets.
An automatic rat brain extraction method based on a deformable surface model.
Li, Jiehua; Liu, Xiaofeng; Zhuo, Jiachen; Gullapalli, Rao P; Zara, Jason M
2013-08-15
The extraction of the brain from the skull in medical images is a necessary first step before image registration or segmentation. While pre-clinical MR imaging studies on small animals, such as rats, are increasing, fully automatic imaging processing techniques specific to small animal studies remain lacking. In this paper, we present an automatic rat brain extraction method, the Rat Brain Deformable model method (RBD), which adapts the popular human brain extraction tool (BET) through the incorporation of information on the brain geometry and MR image characteristics of the rat brain. The robustness of the method was demonstrated on T2-weighted MR images of 64 rats and compared with other brain extraction methods (BET, PCNN, PCNN-3D). The results demonstrate that RBD reliably extracts the rat brain with high accuracy (>92% volume overlap) and is robust against signal inhomogeneity in the images. Copyright © 2013 Elsevier B.V. All rights reserved.
Inoue, Kentaro; Maeda, Kazuhiro; Miyabe, Takaaki; Matsuoka, Yu; Kurata, Hiroyuki
2014-09-01
Mathematical modeling has become a standard technique to understand the dynamics of complex biochemical systems. To promote the modeling, we had developed the CADLIVE dynamic simulator that automatically converted a biochemical map into its associated mathematical model, simulated its dynamic behaviors and analyzed its robustness. To enhance the feasibility by CADLIVE and extend its functions, we propose the CADLIVE toolbox available for MATLAB, which implements not only the existing functions of the CADLIVE dynamic simulator, but also the latest tools including global parameter search methods with robustness analysis. The seamless, bottom-up processes consisting of biochemical network construction, automatic construction of its dynamic model, simulation, optimization, and S-system analysis greatly facilitate dynamic modeling, contributing to the research of systems biology and synthetic biology. This application can be freely downloaded from http://www.cadlive.jp/CADLIVE_MATLAB/ together with an instruction.
NASA Astrophysics Data System (ADS)
Sleeman, Reinoud; van Eck, Torild
1999-06-01
The onset of a seismic signal is determined through joint AR modeling of the noise and the seismic signal, and the application of the Akaike Information Criterion (AIC) using the onset time as parameter. This so-called AR-AIC phase picker has been tested successfully and implemented on the Z-component of the broadband station HGN to provide automatic P-phase picks for a rapid warning system. The AR-AIC picker is shown to provide accurate and robust automatic picks on a large experimental database. Out of 1109 P-phase onsets with signal-to-noise ratio (SNR) above 1 from local, regional and teleseismic earthquakes, our implementation detects 71% and gives a mean difference with manual picks of 0.1 s. An optimal version of the well-established picker of Baer and Kradolfer [Baer, M., Kradolfer, U., An automatic phase picker for local and teleseismic events, Bull. Seism. Soc. Am. 77 (1987) 1437-1445] detects less than 41% and gives a mean difference with manual picks of 0.3 s using the same dataset.
Robust Spacecraft Component Detection in Point Clouds.
Wei, Quanmao; Jiang, Zhiguo; Zhang, Haopeng
2018-03-21
Automatic component detection of spacecraft can assist in on-orbit operation and space situational awareness. Spacecraft are generally composed of solar panels and cuboidal or cylindrical modules. These components can be simply represented by geometric primitives like plane, cuboid and cylinder. Based on this prior, we propose a robust automatic detection scheme to automatically detect such basic components of spacecraft in three-dimensional (3D) point clouds. In the proposed scheme, cylinders are first detected in the iteration of the energy-based geometric model fitting and cylinder parameter estimation. Then, planes are detected by Hough transform and further described as bounded patches with their minimum bounding rectangles. Finally, the cuboids are detected with pair-wise geometry relations from the detected patches. After successive detection of cylinders, planar patches and cuboids, a mid-level geometry representation of the spacecraft can be delivered. We tested the proposed component detection scheme on spacecraft 3D point clouds synthesized by computer-aided design (CAD) models and those recovered by image-based reconstruction, respectively. Experimental results illustrate that the proposed scheme can detect the basic geometric components effectively and has fine robustness against noise and point distribution density.
Robust Spacecraft Component Detection in Point Clouds
Wei, Quanmao; Jiang, Zhiguo
2018-01-01
Automatic component detection of spacecraft can assist in on-orbit operation and space situational awareness. Spacecraft are generally composed of solar panels and cuboidal or cylindrical modules. These components can be simply represented by geometric primitives like plane, cuboid and cylinder. Based on this prior, we propose a robust automatic detection scheme to automatically detect such basic components of spacecraft in three-dimensional (3D) point clouds. In the proposed scheme, cylinders are first detected in the iteration of the energy-based geometric model fitting and cylinder parameter estimation. Then, planes are detected by Hough transform and further described as bounded patches with their minimum bounding rectangles. Finally, the cuboids are detected with pair-wise geometry relations from the detected patches. After successive detection of cylinders, planar patches and cuboids, a mid-level geometry representation of the spacecraft can be delivered. We tested the proposed component detection scheme on spacecraft 3D point clouds synthesized by computer-aided design (CAD) models and those recovered by image-based reconstruction, respectively. Experimental results illustrate that the proposed scheme can detect the basic geometric components effectively and has fine robustness against noise and point distribution density. PMID:29561828
NASA Technical Reports Server (NTRS)
Brenner, Malcolm; Shipp, Thomas
1988-01-01
In a study of the validity of eight candidate voice measures (fundamental frequency, amplitude, speech rate, frequency jitter, amplitude shimmer, Psychological Stress Evaluator scores, energy distribution, and the derived measure of the above measures) for determining psychological stress, 17 males age 21 to 35 were subjected to a tracking task on a microcomputer CRT while parameters of vocal production as well as heart rate were measured. Findings confirm those of earlier studies that increases in fundamental frequency, amplitude, and speech rate are found in speakers involved in extreme levels of stress. In addition, it was found that the same changes appear to occur in a regular fashion within a more subtle level of stress that may be characteristic, for example, of routine flying situations. None of the individual speech measures performed as robustly as did heart rate.
Neurophysiology underlying influence of stimulus reliability on audiovisual integration.
Shatzer, Hannah; Shen, Stanley; Kerlin, Jess R; Pitt, Mark A; Shahin, Antoine J
2018-01-24
We tested the predictions of the dynamic reweighting model (DRM) of audiovisual (AV) speech integration, which posits that spectrotemporally reliable (informative) AV speech stimuli induce a reweighting of processing from low-level to high-level auditory networks. This reweighting decreases sensitivity to acoustic onsets and in turn increases tolerance to AV onset asynchronies (AVOA). EEG was recorded while subjects watched videos of a speaker uttering trisyllabic nonwords that varied in spectrotemporal reliability and asynchrony of the visual and auditory inputs. Subjects judged the stimuli as in-sync or out-of-sync. Results showed that subjects exhibited greater AVOA tolerance for non-blurred than blurred visual speech and for less than more degraded acoustic speech. Increased AVOA tolerance was reflected in reduced amplitude of the P1-P2 auditory evoked potentials, a neurophysiological indication of reduced sensitivity to acoustic onsets and successful AV integration. There was also sustained visual alpha band (8-14 Hz) suppression (desynchronization) following acoustic speech onsets for non-blurred vs. blurred visual speech, consistent with continuous engagement of the visual system as the speech unfolds. The current findings suggest that increased spectrotemporal reliability of acoustic and visual speech promotes robust AV integration, partly by suppressing sensitivity to acoustic onsets, in support of the DRM's reweighting mechanism. Increased visual signal reliability also sustains the engagement of the visual system with the auditory system to maintain alignment of information across modalities. © 2018 Federation of European Neuroscience Societies and John Wiley & Sons Ltd.
Statistical properties of Chinese phonemic networks
NASA Astrophysics Data System (ADS)
Yu, Shuiyuan; Liu, Haitao; Xu, Chunshan
2011-04-01
The study of properties of speech sound systems is of great significance in understanding the human cognitive mechanism and the working principles of speech sound systems. Some properties of speech sound systems, such as the listener-oriented feature and the talker-oriented feature, have been unveiled with the statistical study of phonemes in human languages and the research of the interrelations between human articulatory gestures and the corresponding acoustic parameters. With all the phonemes of speech sound systems treated as a coherent whole, our research, which focuses on the dynamic properties of speech sound systems in operation, investigates some statistical parameters of Chinese phoneme networks based on real text and dictionaries. The findings are as follows: phonemic networks have high connectivity degrees and short average distances; the degrees obey normal distribution and the weighted degrees obey power law distribution; vowels enjoy higher priority than consonants in the actual operation of speech sound systems; the phonemic networks have high robustness against targeted attacks and random errors. In addition, for investigating the structural properties of a speech sound system, a statistical study of dictionaries is conducted, which shows the higher frequency of shorter words and syllables and the tendency that the longer a word is, the shorter the syllables composing it are. From these structural properties and dynamic properties one can derive the following conclusion: the static structure of a speech sound system tends to promote communication efficiency and save articulation effort while the dynamic operation of this system gives preference to reliable transmission and easy recognition. In short, a speech sound system is an effective, efficient and reliable communication system optimized in many aspects.
Intonation and dialog context as constraints for speech recognition.
Taylor, P; King, S; Isard, S; Wright, H
1998-01-01
This paper describes a way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Maptask corpus, a corpus of spontaneous task-oriented dialog speech. This corpus has been tagged according to a dialog analysis scheme that assigns each utterance to one of 12 "move types," such as "acknowledge," "query-yes/no" or "instruct." Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the "correct" move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops. Of course when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type we use an intonation model combined with a dialog model that puts constraints on possible sequences of move types, as well as the speech recognizer likelihoods for the different move-specific models. In the full recognition system, the combination of automatic move type recognition with the move specific language models reduces the overall word error rate by a small but significant amount when compared with a baseline system that does not take intonation or dialog acts into account. Interestingly, the word error improvement is restricted to "initiating" move types, where word recognition is important. In "response" move types, where the important information is conveyed by the move type itself--for example, positive versus negative response--there is no word error improvement, but recognition of the response types themselves is good. The paper discusses the intonation model, the language models, and the dialog model in detail and describes the architecture in which they are combined.
Integrating Speech and Iconic Gestures in a Stroop-Like Task: Evidence for Automatic Processing
ERIC Educational Resources Information Center
Kelly, Spencer D.; Creigh, Peter; Bartolotti, James
2010-01-01
Previous research has demonstrated a link between language and action in the brain. The present study investigates the strength of this neural relationship by focusing on a potential interface between the two systems: cospeech iconic gesture. Participants performed a Stroop-like task in which they watched videos of a man and a woman speaking and…
ERIC Educational Resources Information Center
Santos-Oliveira, Daniela Cristina
2017-01-01
Models of speech perception suggest a dorsal stream connecting the temporal and inferior parietal lobe with the inferior frontal gyrus. This stream is thought to involve an auditory motor loop that translates acoustic information into motor/articulatory commands and is further influenced by decision making processes that involve maintenance of…
ERIC Educational Resources Information Center
Ali, Saandia
2016-01-01
This paper reports on the early stages of a locally funded research and development project taking place at Rennes 2 university. It aims at developing a comprehensive pedagogical framework for pronunciation training for adult learners of English. This framework will combine a direct approach to pronunciation training (face-to-face teaching) with…
Automatic speech recognition in air-ground data link
NASA Technical Reports Server (NTRS)
Armstrong, Herbert B.
1989-01-01
In the present air traffic system, information presented to the transport aircraft cockpit crew may originate from a variety of sources and may be presented to the crew in visual or aural form, either through cockpit instrument displays or, most often, through voice communication. Voice radio communications are the most error prone method for air-ground data link. Voice messages can be misstated or misunderstood and radio frequency congestion can delay or obscure important messages. To prevent proliferation, a multiplexed data link display can be designed to present information from multiple data link sources on a shared cockpit display unit (CDU) or multi-function display (MFD) or some future combination of flight management and data link information. An aural data link which incorporates an automatic speech recognition (ASR) system for crew response offers several advantages over visual displays. The possibility of applying ASR to the air-ground data link was investigated. The first step was to review current efforts in ASR applications in the cockpit and in air traffic control and evaluated their possible data line application. Next, a series of preliminary research questions is to be developed for possible future collaboration.
Measures of voiced frication for automatic classification
NASA Astrophysics Data System (ADS)
Jackson, Philip J. B.; Jesus, Luis M. T.; Shadle, Christine H.; Pincas, Jonathan
2004-05-01
As an approach to understanding the characteristics of the acoustic sources in voiced fricatives, it seems apt to draw on knowledge of vowels and voiceless fricatives, which have been relatively well studied. However, the presence of both phonation and frication in these mixed-source sounds offers the possibility of mutual interaction effects, with variations across place of articulation. This paper examines the acoustic and articulatory consequences of these interactions and explores automatic techniques for finding parametric and statistical descriptions of these phenomena. A reliable and consistent set of such acoustic cues could be used for phonetic classification or speech recognition. Following work on devoicing of European Portuguese voiced fricatives [Jesus and Shadle, in Mamede et al. (eds.) (Springer-Verlag, Berlin, 2003), pp. 1-8]. and the modulating effect of voicing on frication [Jackson and Shadle, J. Acoust. Soc. Am. 108, 1421-1434 (2000)], the present study focuses on three types of information: (i) sequences and durations of acoustic events in VC transitions, (ii) temporal, spectral and modulation measures from the periodic and aperiodic components of the acoustic signal, and (iii) voicing activity derived from simultaneous EGG data. Analysis of interactions observed in British/American English and European Portuguese speech corpora will be compared, and the principal findings discussed.
Automated Vocal Analysis of Children with Hearing Loss and Their Typical and Atypical Peers
VanDam, Mark; Oller, D. Kimbrough; Ambrose, Sophie E.; Gray, Sharmistha; Richards, Jeffrey A.; Xu, Dongxin; Gilkerson, Jill; Silbert, Noah H.; Moeller, Mary Pat
2014-01-01
Objectives This study investigated automatic assessment of vocal development in children with hearing loss as compared with children who are typically developing, have language delays, and autism spectrum disorder. Statistical models are examined for performance in a classification model and to predict age within the four groups of children. Design The vocal analysis system analyzed over 1900 whole-day, naturalistic acoustic recordings from 273 toddlers and preschoolers comprising children who were typically developing, hard of hearing, language delayed, or autistic. Results Samples from children who were hard-of-hearing patterned more similarly to those of typically-developing children than to the language-delayed or autistic samples. The statistical models were able to classify children from the four groups examined and estimate developmental age based on automated vocal analysis. Conclusions This work shows a broad similarity between children with hearing loss and typically developing children, although children with hearing loss show some delay in their production of speech. Automatic acoustic analysis can now be used to quantitatively compare vocal development in children with and without speech-related disorders. The work may serve to better distinguish among various developmental disorders and ultimately contribute to improved intervention. PMID:25587667
Khan, Arif Ul Maula; Torelli, Angelo; Wolf, Ivo; Gretz, Norbert
2018-05-08
In biological assays, automated cell/colony segmentation and counting is imperative owing to huge image sets. Problems occurring due to drifting image acquisition conditions, background noise and high variation in colony features in experiments demand a user-friendly, adaptive and robust image processing/analysis method. We present AutoCellSeg (based on MATLAB) that implements a supervised automatic and robust image segmentation method. AutoCellSeg utilizes multi-thresholding aided by a feedback-based watershed algorithm taking segmentation plausibility criteria into account. It is usable in different operation modes and intuitively enables the user to select object features interactively for supervised image segmentation method. It allows the user to correct results with a graphical interface. This publicly available tool outperforms tools like OpenCFU and CellProfiler in terms of accuracy and provides many additional useful features for end-users.
Bartos, Anthony L; Cipr, Tomas; Nelson, Douglas J; Schwarz, Petr; Banowetz, John; Jerabek, Ladislav
2018-04-01
A method is presented in which conventional speech algorithms are applied, with no modifications, to improve their performance in extremely noisy environments. It has been demonstrated that, for eigen-channel algorithms, pre-training multiple speaker identification (SID) models at a lattice of signal-to-noise-ratio (SNR) levels and then performing SID using the appropriate SNR dependent model was successful in mitigating noise at all SNR levels. In those tests, it was found that SID performance was optimized when the SNR of the testing and training data were close or identical. In this current effort multiple i-vector algorithms were used, greatly improving both processing throughput and equal error rate classification accuracy. Using identical approaches in the same noisy environment, performance of SID, language identification, gender identification, and diarization were significantly improved. A critical factor in this improvement is speech activity detection (SAD) that performs reliably in extremely noisy environments, where the speech itself is barely audible. To optimize SAD operation at all SNR levels, two algorithms were employed. The first maximized detection probability at low levels (-10 dB ≤ SNR < +10 dB) using just the voiced speech envelope, and the second exploited features extracted from the original speech to improve overall accuracy at higher quality levels (SNR ≥ +10 dB).
Tavano, Alessandro; Pesarin, Anna; Murino, Vittorio; Cristani, Marco
2014-01-01
Individuals with Asperger syndrome/High Functioning Autism fail to spontaneously attribute mental states to the self and others, a life-long phenotypic characteristic known as mindblindness. We hypothesized that mindblindness would affect the dynamics of conversational interaction. Using generative models, in particular Gaussian mixture models and observed influence models, conversations were coded as interacting Markov processes, operating on novel speech/silence patterns, termed Steady Conversational Periods (SCPs). SCPs assume that whenever an agent's process changes state (e.g., from silence to speech), it causes a general transition of the entire conversational process, forcing inter-actant synchronization. SCPs fed into observed influence models, which captured the conversational dynamics of children and adolescents with Asperger syndrome/High Functioning Autism, and age-matched typically developing participants. Analyzing the parameters of the models by means of discriminative classifiers, the dialogs of patients were successfully distinguished from those of control participants. We conclude that meaning-free speech/silence sequences, reflecting inter-actant synchronization, at least partially encode typical and atypical conversational dynamics. This suggests a direct influence of theory of mind abilities onto basic speech initiative behavior. PMID:24489674
Action Unit Models of Facial Expression of Emotion in the Presence of Speech
Shah, Miraj; Cooper, David G.; Cao, Houwei; Gur, Ruben C.; Nenkova, Ani; Verma, Ragini
2014-01-01
Automatic recognition of emotion using facial expressions in the presence of speech poses a unique challenge because talking reveals clues for the affective state of the speaker but distorts the canonical expression of emotion on the face. We introduce a corpus of acted emotion expression where speech is either present (talking) or absent (silent). The corpus is uniquely suited for analysis of the interplay between the two conditions. We use a multimodal decision level fusion classifier to combine models of emotion from talking and silent faces as well as from audio to recognize five basic emotions: anger, disgust, fear, happy and sad. Our results strongly indicate that emotion prediction in the presence of speech from action unit facial features is less accurate when the person is talking. Modeling talking and silent expressions separately and fusing the two models greatly improves accuracy of prediction in the talking setting. The advantages are most pronounced when silent and talking face models are fused with predictions from audio features. In this multi-modal prediction both the combination of modalities and the separate models of talking and silent facial expression of emotion contribute to the improvement. PMID:25525561
Wren, Yvonne; Humphries, Kerry; Stock, Nicola Marie; Rumsey, Nichola; Lewis, Sarah; Davies, Amy; Bennett, Rhiannon; Sandy, Jonathan
2018-05-01
Efforts to increase the evidence base in speech and language therapy are often limited by methodological factors that have restricted the strength of the evidence to the lower levels of the evidence hierarchy. Where higher graded studies, such as randomized controlled trials, have been carried out, it has sometimes been difficult to obtain sufficient power to detect a potential effect of intervention owing to small sample sizes or heterogeneity in the participants. With certain clinical groups such as cleft lip and palate, systematic reviews of intervention studies have shown that there is no robust evidence to support the efficacy of any one intervention protocol over another. To describe the setting up of an observational clinical cohort study and to present this as an alternative design for answering research questions relating to prevalence, risk factors and outcomes from intervention. The Cleft Collective Speech and Language (CC-SL) study is a national cohort study of children born with cleft palate. Working in partnership with regional clinical cleft centres, a sample size of over 600 children and 600 parents is being recruited and followed up from birth to age 5 years. Variables being collected include demographic, psychological, surgical, hearing, and speech and language data. The process of setting up the study has led to the creation of a unique, large-scale data set which is available for researchers to access now and in future. As well as exploring predictive factors, the data can be used to explore the impact of interventions in relation to individual differences. Findings from these investigations can be used to provide information on sample criteria and definitions of intervention and dosage which can be used in future trials. The observational cohort study is a useful alternative design to explore questions around prevalence, risk factors and intervention for clinical groups where robust research data are not yet available. Findings from such a study can be used to guide service-delivery decisions and to determine power for future clinical trials. © 2017 Royal College of Speech and Language Therapists.
EMG-based speech recognition using hidden markov models with global control variables.
Lee, Ki-Seung
2008-03-01
It is well known that a strong relationship exists between human voices and the movement of articulatory facial muscles. In this paper, we utilize this knowledge to implement an automatic speech recognition scheme which uses solely surface electromyogram (EMG) signals. The sequence of EMG signals for each word is modelled by a hidden Markov model (HMM) framework. The main objective of the work involves building a model for state observation density when multichannel observation sequences are given. The proposed model reflects the dependencies between each of the EMG signals, which are described by introducing a global control variable. We also develop an efficient model training method, based on a maximum likelihood criterion. In a preliminary study, 60 isolated words were used as recognition variables. EMG signals were acquired from three articulatory facial muscles. The findings indicate that such a system may have the capacity to recognize speech signals with an accuracy of up to 87.07%, which is superior to the independent probabilistic model.
NASA Astrophysics Data System (ADS)
Wang, Longbiao; Odani, Kyohei; Kai, Atsuhiko
2012-12-01
A blind dereverberation method based on power spectral subtraction (SS) using a multi-channel least mean squares algorithm was previously proposed to suppress the reverberant speech without additive noise. The results of isolated word speech recognition experiments showed that this method achieved significant improvements over conventional cepstral mean normalization (CMN) in a reverberant environment. In this paper, we propose a blind dereverberation method based on generalized spectral subtraction (GSS), which has been shown to be effective for noise reduction, instead of power SS. Furthermore, we extend the missing feature theory (MFT), which was initially proposed to enhance the robustness of additive noise, to dereverberation. A one-stage dereverberation and denoising method based on GSS is presented to simultaneously suppress both the additive noise and nonstationary multiplicative noise (reverberation). The proposed dereverberation method based on GSS with MFT is evaluated on a large vocabulary continuous speech recognition task. When the additive noise was absent, the dereverberation method based on GSS with MFT using only 2 microphones achieves a relative word error reduction rate of 11.4 and 32.6% compared to the dereverberation method based on power SS and the conventional CMN, respectively. For the reverberant and noisy speech, the dereverberation and denoising method based on GSS achieves a relative word error reduction rate of 12.8% compared to the conventional CMN with GSS-based additive noise reduction method. We also analyze the effective factors of the compensation parameter estimation for the dereverberation method based on SS, such as the number of channels (the number of microphones), the length of reverberation to be suppressed, and the length of the utterance used for parameter estimation. The experimental results showed that the SS-based method is robust in a variety of reverberant environments for both isolated and continuous speech recognition and under various parameter estimation conditions.