Speech enhancement on smartphone voice recording
NASA Astrophysics Data System (ADS)
Tris Atmaja, Bagus; Nur Farid, Mifta; Arifianto, Dhany
2016-11-01
Speech enhancement is challenging task in audio signal processing to enhance the quality of targeted speech signal while suppress other noises. In the beginning, the speech enhancement algorithm growth rapidly from spectral subtraction, Wiener filtering, spectral amplitude MMSE estimator to Non-negative Matrix Factorization (NMF). Smartphone as revolutionary device now is being used in all aspect of life including journalism; personally and professionally. Although many smartphones have two microphones (main and rear) the only main microphone is widely used for voice recording. This is why the NMF algorithm widely used for this purpose of speech enhancement. This paper evaluate speech enhancement on smartphone voice recording by using some algorithms mentioned previously. We also extend the NMF algorithm to Kulback-Leibler NMF with supervised separation. The last algorithm shows improved result compared to others by spectrogram and PESQ score evaluation.
Subjective comparison and evaluation of speech enhancement algorithms
Hu, Yi; Loizou, Philipos C.
2007-01-01
Making meaningful comparisons between the performance of the various speech enhancement algorithms proposed over the years, has been elusive due to lack of a common speech database, differences in the types of noise used and differences in the testing methodology. To facilitate such comparisons, we report on the development of a noisy speech corpus suitable for evaluation of speech enhancement algorithms. This corpus is subsequently used for the subjective evaluation of 13 speech enhancement methods encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model based and Wiener-type algorithms. The subjective evaluation was performed by Dynastat, Inc. using the ITU-T P.835 methodology designed to evaluate the speech quality along three dimensions: signal distortion, noise distortion and overall quality. This paper reports the results of the subjective tests. PMID:18046463
Low-dimensional recurrent neural network-based Kalman filter for speech enhancement.
Xia, Youshen; Wang, Jun
2015-07-01
This paper proposes a new recurrent neural network-based Kalman filter for speech enhancement, based on a noise-constrained least squares estimate. The parameters of speech signal modeled as autoregressive process are first estimated by using the proposed recurrent neural network and the speech signal is then recovered from Kalman filtering. The proposed recurrent neural network is globally asymptomatically stable to the noise-constrained estimate. Because the noise-constrained estimate has a robust performance against non-Gaussian noise, the proposed recurrent neural network-based speech enhancement algorithm can minimize the estimation error of Kalman filter parameters in non-Gaussian noise. Furthermore, having a low-dimensional model feature, the proposed neural network-based speech enhancement algorithm has a much faster speed than two existing recurrent neural networks-based speech enhancement algorithms. Simulation results show that the proposed recurrent neural network-based speech enhancement algorithm can produce a good performance with fast computation and noise reduction. Copyright © 2015 Elsevier Ltd. All rights reserved.
Speech enhancement based on modified phase-opponency detectors
NASA Astrophysics Data System (ADS)
Deshmukh, Om D.; Espy-Wilson, Carol Y.
2005-09-01
A speech enhancement algorithm based on a neural model was presented by Deshmukh et al., [149th meeting of the Acoustical Society America, 2005]. The algorithm consists of a bank of Modified Phase Opponency (MPO) filter pairs tuned to different center frequencies. This algorithm is able to enhance salient spectral features in speech signals even at low signal-to-noise ratios. However, the algorithm introduces musical noise and sometimes misses a spectral peak that is close in frequency to a stronger spectral peak. Refinement in the design of the MPO filters was recently made that takes advantage of the falling spectrum of the speech signal in sonorant regions. The modified set of filters leads to better separation of the noise and speech signals, and more accurate enhancement of spectral peaks. The improvements also lead to a significant reduction in musical noise. Continuity algorithms based on the properties of speech signals are used to further reduce the musical noise effect. The efficiency of the proposed method in enhancing the speech signal when the level of the background noise is fluctuating will be demonstrated. The performance of the improved speech enhancement method will be compared with various spectral subtraction-based methods. [Work supported by NSF BCS0236707.
Real-time spectrum estimation–based dual-channel speech-enhancement algorithm for cochlear implant
2012-01-01
Background Improvement of the cochlear implant (CI) front-end signal acquisition is needed to increase speech recognition in noisy environments. To suppress the directional noise, we introduce a speech-enhancement algorithm based on microphone array beamforming and spectral estimation. The experimental results indicate that this method is robust to directional mobile noise and strongly enhances the desired speech, thereby improving the performance of CI devices in a noisy environment. Methods The spectrum estimation and the array beamforming methods were combined to suppress the ambient noise. The directivity coefficient was estimated in the noise-only intervals, and was updated to fit for the mobile noise. Results The proposed algorithm was realized in the CI speech strategy. For actual parameters, we use Maxflat filter to obtain fractional sampling points and cepstrum method to differentiate the desired speech frame and the noise frame. The broadband adjustment coefficients were added to compensate the energy loss in the low frequency band. Discussions The approximation of the directivity coefficient is tested and the errors are discussed. We also analyze the algorithm constraint for noise estimation and distortion in CI processing. The performance of the proposed algorithm is analyzed and further be compared with other prevalent methods. Conclusions The hardware platform was constructed for the experiments. The speech-enhancement results showed that our algorithm can suppresses the non-stationary noise with high SNR. Excellent performance of the proposed algorithm was obtained in the speech enhancement experiments and mobile testing. And signal distortion results indicate that this algorithm is robust with high SNR improvement and low speech distortion. PMID:23006896
Wang, Yulin; Tian, Xuelong
2014-08-01
In order to improve the speech quality and auditory perceptiveness of electronic cochlear implant under strong noise background, a speech enhancement system used for electronic cochlear implant front-end was constructed. Taking digital signal processing (DSP) as the core, the system combines its multi-channel buffered serial port (McBSP) data transmission channel with extended audio interface chip TLV320AIC10, so speech signal acquisition and output with high speed are realized. Meanwhile, due to the traditional speech enhancement method which has the problems as bad adaptability, slow convergence speed and big steady-state error, versiera function and de-correlation principle were used to improve the existing adaptive filtering algorithm, which effectively enhanced the quality of voice communications. Test results verified the stability of the system and the de-noising performance of the algorithm, and it also proved that they could provide clearer speech signals for the deaf or tinnitus patients.
Chen, Yung-Yue
2018-05-08
Mobile devices are often used in our daily lives for the purposes of speech and communication. The speech quality of mobile devices is always degraded due to the environmental noises surrounding mobile device users. Regretfully, an effective background noise reduction solution cannot easily be developed for this speech enhancement problem. Due to these depicted reasons, a methodology is systematically proposed to eliminate the effects of background noises for the speech communication of mobile devices. This methodology integrates a dual microphone array with a background noise elimination algorithm. The proposed background noise elimination algorithm includes a whitening process, a speech modelling method and an H ₂ estimator. Due to the adoption of the dual microphone array, a low-cost design can be obtained for the speech enhancement of mobile devices. Practical tests have proven that this proposed method is immune to random background noises, and noiseless speech can be obtained after executing this denoise process.
A comparative intelligibility study of single-microphone noise reduction algorithms.
Hu, Yi; Loizou, Philipos C
2007-09-01
The evaluation of intelligibility of noise reduction algorithms is reported. IEEE sentences and consonants were corrupted by four types of noise including babble, car, street and train at two signal-to-noise ratio levels (0 and 5 dB), and then processed by eight speech enhancement methods encompassing four classes of algorithms: spectral subtractive, sub-space, statistical model based and Wiener-type algorithms. The enhanced speech was presented to normal-hearing listeners for identification. With the exception of a single noise condition, no algorithm produced significant improvements in speech intelligibility. Information transmission analysis of the consonant confusion matrices indicated that no algorithm improved significantly the place feature score, significantly, which is critically important for speech recognition. The algorithms which were found in previous studies to perform the best in terms of overall quality, were not the same algorithms that performed the best in terms of speech intelligibility. The subspace algorithm, for instance, was previously found to perform the worst in terms of overall quality, but performed well in the present study in terms of preserving speech intelligibility. Overall, the analysis of consonant confusion matrices suggests that in order for noise reduction algorithms to improve speech intelligibility, they need to improve the place and manner feature scores.
NASA Astrophysics Data System (ADS)
Dat, Tran Huy; Takeda, Kazuya; Itakura, Fumitada
We present a multichannel speech enhancement method based on MAP speech spectral magnitude estimation using a generalized gamma model of speech prior distribution, where the model parameters are adapted from actual noisy speech in a frame-by-frame manner. The utilization of a more general prior distribution with its online adaptive estimation is shown to be effective for speech spectral estimation in noisy environments. Furthermore, the multi-channel information in terms of cross-channel statistics are shown to be useful to better adapt the prior distribution parameters to the actual observation, resulting in better performance of speech enhancement algorithm. We tested the proposed algorithm in an in-car speech database and obtained significant improvements of the speech recognition performance, particularly under non-stationary noise conditions such as music, air-conditioner and open window.
Shao, Yu; Chang, Chip-Hong
2007-08-01
We present a new speech enhancement scheme for a single-microphone system to meet the demand for quality noise reduction algorithms capable of operating at a very low signal-to-noise ratio. A psychoacoustic model is incorporated into the generalized perceptual wavelet denoising method to reduce the residual noise and improve the intelligibility of speech. The proposed method is a generalized time-frequency subtraction algorithm, which advantageously exploits the wavelet multirate signal representation to preserve the critical transient information. Simultaneous masking and temporal masking of the human auditory system are modeled by the perceptual wavelet packet transform via the frequency and temporal localization of speech components. The wavelet coefficients are used to calculate the Bark spreading energy and temporal spreading energy, from which a time-frequency masking threshold is deduced to adaptively adjust the subtraction parameters of the proposed method. An unvoiced speech enhancement algorithm is also integrated into the system to improve the intelligibility of speech. Through rigorous objective and subjective evaluations, it is shown that the proposed speech enhancement system is capable of reducing noise with little speech degradation in adverse noise environments and the overall performance is superior to several competitive methods.
Goehring, Tobias; Bolner, Federico; Monaghan, Jessica J M; van Dijk, Bas; Zarowski, Andrzej; Bleeck, Stefan
2017-02-01
Speech understanding in noisy environments is still one of the major challenges for cochlear implant (CI) users in everyday life. We evaluated a speech enhancement algorithm based on neural networks (NNSE) for improving speech intelligibility in noise for CI users. The algorithm decomposes the noisy speech signal into time-frequency units, extracts a set of auditory-inspired features and feeds them to the neural network to produce an estimation of which frequency channels contain more perceptually important information (higher signal-to-noise ratio, SNR). This estimate is used to attenuate noise-dominated and retain speech-dominated CI channels for electrical stimulation, as in traditional n-of-m CI coding strategies. The proposed algorithm was evaluated by measuring the speech-in-noise performance of 14 CI users using three types of background noise. Two NNSE algorithms were compared: a speaker-dependent algorithm, that was trained on the target speaker used for testing, and a speaker-independent algorithm, that was trained on different speakers. Significant improvements in the intelligibility of speech in stationary and fluctuating noises were found relative to the unprocessed condition for the speaker-dependent algorithm in all noise types and for the speaker-independent algorithm in 2 out of 3 noise types. The NNSE algorithms used noise-specific neural networks that generalized to novel segments of the same noise type and worked over a range of SNRs. The proposed algorithm has the potential to improve the intelligibility of speech in noise for CI users while meeting the requirements of low computational complexity and processing delay for application in CI devices. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
Reference-free automatic quality assessment of tracheoesophageal speech.
Huang, Andy; Falk, Tiago H; Chan, Wai-Yip; Parsa, Vijay; Doyle, Philip
2009-01-01
Evaluation of the quality of tracheoesophageal (TE) speech using machines instead of human experts can enhance the voice rehabilitation process for patients who have undergone total laryngectomy and voice restoration. Towards the goal of devising a reference-free TE speech quality estimation algorithm, we investigate the efficacy of speech signal features that are used in standard telephone-speech quality assessment algorithms, in conjunction with a recently introduced speech modulation spectrum measure. Tests performed on two TE speech databases demonstrate that the modulation spectral measure and a subset of features in the standard ITU-T P.563 algorithm estimate TE speech quality with better correlation (up to 0.9) than previously proposed features.
Speech Enhancement, Gain, and Noise Spectrum Adaptation Using Approximate Bayesian Estimation
Hao, Jiucang; Attias, Hagai; Nagarajan, Srikantan; Lee, Te-Won; Sejnowski, Terrence J.
2010-01-01
This paper presents a new approximate Bayesian estimator for enhancing a noisy speech signal. The speech model is assumed to be a Gaussian mixture model (GMM) in the log-spectral domain. This is in contrast to most current models in frequency domain. Exact signal estimation is a computationally intractable problem. We derive three approximations to enhance the efficiency of signal estimation. The Gaussian approximation transforms the log-spectral domain GMM into the frequency domain using minimal Kullback–Leiber (KL)-divergency criterion. The frequency domain Laplace method computes the maximum a posteriori (MAP) estimator for the spectral amplitude. Correspondingly, the log-spectral domain Laplace method computes the MAP estimator for the log-spectral amplitude. Further, the gain and noise spectrum adaptation are implemented using the expectation–maximization (EM) algorithm within the GMM under Gaussian approximation. The proposed algorithms are evaluated by applying them to enhance the speeches corrupted by the speech-shaped noise (SSN). The experimental results demonstrate that the proposed algorithms offer improved signal-to-noise ratio, lower word recognition error rate, and less spectral distortion. PMID:20428253
Power Spectral Density Error Analysis of Spectral Subtraction Type of Speech Enhancement Methods
NASA Astrophysics Data System (ADS)
Händel, Peter
2006-12-01
A theoretical framework for analysis of speech enhancement algorithms is introduced for performance assessment of spectral subtraction type of methods. The quality of the enhanced speech is related to physical quantities of the speech and noise (such as stationarity time and spectral flatness), as well as to design variables of the noise suppressor. The derived theoretical results are compared with the outcome of subjective listening tests as well as successful design strategies, performed by independent research groups.
Deep neural network and noise classification-based speech enhancement
NASA Astrophysics Data System (ADS)
Shi, Wenhua; Zhang, Xiongwei; Zou, Xia; Han, Wei
2017-07-01
In this paper, a speech enhancement method using noise classification and Deep Neural Network (DNN) was proposed. Gaussian mixture model (GMM) was employed to determine the noise type in speech-absent frames. DNN was used to model the relationship between noisy observation and clean speech. Once the noise type was determined, the corresponding DNN model was applied to enhance the noisy speech. GMM was trained with mel-frequency cepstrum coefficients (MFCC) and the parameters were estimated with an iterative expectation-maximization (EM) algorithm. Noise type was updated by spectrum entropy-based voice activity detection (VAD). Experimental results demonstrate that the proposed method could achieve better objective speech quality and smaller distortion under stationary and non-stationary conditions.
A novel speech processing algorithm based on harmonicity cues in cochlear implant
NASA Astrophysics Data System (ADS)
Wang, Jian; Chen, Yousheng; Zhang, Zongping; Chen, Yan; Zhang, Weifeng
2017-08-01
This paper proposed a novel speech processing algorithm in cochlear implant, which used harmonicity cues to enhance tonal information in Mandarin Chinese speech recognition. The input speech was filtered by a 4-channel band-pass filter bank. The frequency ranges for the four bands were: 300-621, 621-1285, 1285-2657, and 2657-5499 Hz. In each pass band, temporal envelope and periodicity cues (TEPCs) below 400 Hz were extracted by full wave rectification and low-pass filtering. The TEPCs were modulated by a sinusoidal carrier, the frequency of which was fundamental frequency (F0) and its harmonics most close to the center frequency of each band. Signals from each band were combined together to obtain an output speech. Mandarin tone, word, and sentence recognition in quiet listening conditions were tested for the extensively used continuous interleaved sampling (CIS) strategy and the novel F0-harmonic algorithm. Results found that the F0-harmonic algorithm performed consistently better than CIS strategy in Mandarin tone, word, and sentence recognition. In addition, sentence recognition rate was higher than word recognition rate, as a result of contextual information in the sentence. Moreover, tone 3 and 4 performed better than tone 1 and tone 2, due to the easily identified features of the former. In conclusion, the F0-harmonic algorithm could enhance tonal information in cochlear implant speech processing due to the use of harmonicity cues, thereby improving Mandarin tone, word, and sentence recognition. Further study will focus on the test of the F0-harmonic algorithm in noisy listening conditions.
NASA Astrophysics Data System (ADS)
Yousefian Jazi, Nima
Spatial filtering and directional discrimination has been shown to be an effective pre-processing approach for noise reduction in microphone array systems. In dual-microphone hearing aids, fixed and adaptive beamforming techniques are the most common solutions for enhancing the desired speech and rejecting unwanted signals captured by the microphones. In fact, beamformers are widely utilized in systems where spatial properties of target source (usually in front of the listener) is assumed to be known. In this dissertation, some dual-microphone coherence-based speech enhancement techniques applicable to hearing aids are proposed. All proposed algorithms operate in the frequency domain and (like traditional beamforming techniques) are purely based on the spatial properties of the desired speech source and does not require any knowledge of noise statistics for calculating the noise reduction filter. This benefit gives our algorithms the ability to address adverse noise conditions, such as situations where interfering talker(s) speaks simultaneously with the target speaker. In such cases, the (adaptive) beamformers lose their effectiveness in suppressing interference, since the noise channel (reference) cannot be built and updated accordingly. This difference is the main advantage of the proposed techniques in the dissertation over traditional adaptive beamformers. Furthermore, since the suggested algorithms are independent of noise estimation, they offer significant improvement in scenarios that the power level of interfering sources are much more than that of target speech. The dissertation also shows the premise behind the proposed algorithms can be extended and employed to binaural hearing aids. The main purpose of the investigated techniques is to enhance the intelligibility level of speech, measured through subjective listening tests with normal hearing and cochlear implant listeners. However, the improvement in quality of the output speech achieved by the algorithms are also presented to show that the proposed methods can be potential candidates for future use in commercial hearing aids and cochlear implant devices.
Robust Speech Enhancement Using Two-Stage Filtered Minima Controlled Recursive Averaging
NASA Astrophysics Data System (ADS)
Ghourchian, Negar; Selouani, Sid-Ahmed; O'Shaughnessy, Douglas
In this paper we propose an algorithm for estimating noise in highly non-stationary noisy environments, which is a challenging problem in speech enhancement. This method is based on minima-controlled recursive averaging (MCRA) whereby an accurate, robust and efficient noise power spectrum estimation is demonstrated. We propose a two-stage technique to prevent the appearance of musical noise after enhancement. This algorithm filters the noisy speech to achieve a robust signal with minimum distortion in the first stage. Subsequently, it estimates the residual noise using MCRA and removes it with spectral subtraction. The proposed Filtered MCRA (FMCRA) performance is evaluated using objective tests on the Aurora database under various noisy environments. These measures indicate the higher output SNR and lower output residual noise and distortion.
Simulation for noise cancellation using LMS adaptive filter
NASA Astrophysics Data System (ADS)
Lee, Jia-Haw; Ooi, Lu-Ean; Ko, Ying-Hao; Teoh, Choe-Yung
2017-06-01
In this paper, the fundamental algorithm of noise cancellation, Least Mean Square (LMS) algorithm is studied and enhanced with adaptive filter. The simulation of the noise cancellation using LMS adaptive filter algorithm is developed. The noise corrupted speech signal and the engine noise signal are used as inputs for LMS adaptive filter algorithm. The filtered signal is compared to the original noise-free speech signal in order to highlight the level of attenuation of the noise signal. The result shows that the noise signal is successfully canceled by the developed adaptive filter. The difference of the noise-free speech signal and filtered signal are calculated and the outcome implies that the filtered signal is approaching the noise-free speech signal upon the adaptive filtering. The frequency range of the successfully canceled noise by the LMS adaptive filter algorithm is determined by performing Fast Fourier Transform (FFT) on the signals. The LMS adaptive filter algorithm shows significant noise cancellation at lower frequency range.
Enhancing speech recognition using improved particle swarm optimization based hidden Markov model.
Selvaraj, Lokesh; Ganesan, Balakrishnan
2014-01-01
Enhancing speech recognition is the primary intention of this work. In this paper a novel speech recognition method based on vector quantization and improved particle swarm optimization (IPSO) is suggested. The suggested methodology contains four stages, namely, (i) denoising, (ii) feature mining (iii), vector quantization, and (iv) IPSO based hidden Markov model (HMM) technique (IP-HMM). At first, the speech signals are denoised using median filter. Next, characteristics such as peak, pitch spectrum, Mel frequency Cepstral coefficients (MFCC), mean, standard deviation, and minimum and maximum of the signal are extorted from the denoised signal. Following that, to accomplish the training process, the extracted characteristics are given to genetic algorithm based codebook generation in vector quantization. The initial populations are created by selecting random code vectors from the training set for the codebooks for the genetic algorithm process and IP-HMM helps in doing the recognition. At this point the creativeness will be done in terms of one of the genetic operation crossovers. The proposed speech recognition technique offers 97.14% accuracy.
NASA Astrophysics Data System (ADS)
Selouani, Sid-Ahmed; O'Shaughnessy, Douglas
2003-12-01
Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axis issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms that of the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to[InlineEquation not available: see fulltext.] dB. We also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.
Speech Enhancement Using Gaussian Scale Mixture Models
Hao, Jiucang; Lee, Te-Won; Sejnowski, Terrence J.
2011-01-01
This paper presents a novel probabilistic approach to speech enhancement. Instead of a deterministic logarithmic relationship, we assume a probabilistic relationship between the frequency coefficients and the log-spectra. The speech model in the log-spectral domain is a Gaussian mixture model (GMM). The frequency coefficients obey a zero-mean Gaussian whose covariance equals to the exponential of the log-spectra. This results in a Gaussian scale mixture model (GSMM) for the speech signal in the frequency domain, since the log-spectra can be regarded as scaling factors. The probabilistic relation between frequency coefficients and log-spectra allows these to be treated as two random variables, both to be estimated from the noisy signals. Expectation-maximization (EM) was used to train the GSMM and Bayesian inference was used to compute the posterior signal distribution. Because exact inference of this full probabilistic model is computationally intractable, we developed two approaches to enhance the efficiency: the Laplace method and a variational approximation. The proposed methods were applied to enhance speech corrupted by Gaussian noise and speech-shaped noise (SSN). For both approximations, signals reconstructed from the estimated frequency coefficients provided higher signal-to-noise ratio (SNR) and those reconstructed from the estimated log-spectra produced lower word recognition error rate because the log-spectra fit the inputs to the recognizer better. Our algorithms effectively reduced the SSN, which algorithms based on spectral analysis were not able to suppress. PMID:21359139
Goldsworthy, Raymond L.; Delhorne, Lorraine A.; Desloge, Joseph G.; Braida, Louis D.
2014-01-01
This article introduces and provides an assessment of a spatial-filtering algorithm based on two closely-spaced (∼1 cm) microphones in a behind-the-ear shell. The evaluated spatial-filtering algorithm used fast (∼10 ms) temporal-spectral analysis to determine the location of incoming sounds and to enhance sounds arriving from straight ahead of the listener. Speech reception thresholds (SRTs) were measured for eight cochlear implant (CI) users using consonant and vowel materials under three processing conditions: An omni-directional response, a dipole-directional response, and the spatial-filtering algorithm. The background noise condition used three simultaneous time-reversed speech signals as interferers located at 90°, 180°, and 270°. Results indicated that the spatial-filtering algorithm can provide speech reception benefits of 5.8 to 10.7 dB SRT compared to an omni-directional response in a reverberant room with multiple noise sources. Given the observed SRT benefits, coupled with an efficient design, the proposed algorithm is promising as a CI noise-reduction solution. PMID:25096120
Frequency-domain beamformers using conjugate gradient techniques for speech enhancement.
Zhao, Shengkui; Jones, Douglas L; Khoo, Suiyang; Man, Zhihong
2014-09-01
A multiple-iteration constrained conjugate gradient (MICCG) algorithm and a single-iteration constrained conjugate gradient (SICCG) algorithm are proposed to realize the widely used frequency-domain minimum-variance-distortionless-response (MVDR) beamformers and the resulting algorithms are applied to speech enhancement. The algorithms are derived based on the Lagrange method and the conjugate gradient techniques. The implementations of the algorithms avoid any form of explicit or implicit autocorrelation matrix inversion. Theoretical analysis establishes formal convergence of the algorithms. Specifically, the MICCG algorithm is developed based on a block adaptation approach and it generates a finite sequence of estimates that converge to the MVDR solution. For limited data records, the estimates of the MICCG algorithm are better than the conventional estimators and equivalent to the auxiliary vector algorithms. The SICCG algorithm is developed based on a continuous adaptation approach with a sample-by-sample updating procedure and the estimates asymptotically converge to the MVDR solution. An illustrative example using synthetic data from a uniform linear array is studied and an evaluation on real data recorded by an acoustic vector sensor array is demonstrated. Performance of the MICCG algorithm and the SICCG algorithm are compared with the state-of-the-art approaches.
Na, Sung Dae; Wei, Qun; Seong, Ki Woong; Cho, Jin Ho; Kim, Myoung Nam
2018-01-01
The conventional methods of speech enhancement, noise reduction, and voice activity detection are based on the suppression of noise or non-speech components of the target air-conduction signals. However, air-conduced speech is hard to differentiate from babble or white noise signals. To overcome this problem, the proposed algorithm uses the bone-conduction speech signals and soft thresholding based on the Shannon entropy principle and cross-correlation of air- and bone-conduction signals. A new algorithm for speech detection and noise reduction is proposed, which makes use of the Shannon entropy principle and cross-correlation with the bone-conduction speech signals to threshold the wavelet packet coefficients of the noisy speech. The proposed method can be get efficient result by objective quality measure that are PESQ, RMSE, Correlation, SNR. Each threshold is generated by the entropy and cross-correlation approaches in the decomposed bands using the wavelet packet decomposition. As a result, the noise is reduced by the proposed method using the MATLAB simulation. To verify the method feasibility, we compared the air- and bone-conduction speech signals and their spectra by the proposed method. As a result, high performance of the proposed method is confirmed, which makes it quite instrumental to future applications in communication devices, noisy environment, construction, and military operations.
Techniques for the Enhancement of Linear Predictive Speech Coding in Adverse Conditions
NASA Astrophysics Data System (ADS)
Wrench, Alan A.
Available from UMI in association with The British Library. Requires signed TDF. The Linear Prediction model was first applied to speech two and a half decades ago. Since then it has been the subject of intense research and continues to be one of the principal tools in the analysis of speech. Its mathematical tractability makes it a suitable subject for study and its proven success in practical applications makes the study worthwhile. The model is known to be unsuited to speech corrupted by background noise. This has led many researchers to investigate ways of enhancing the speech signal prior to Linear Predictive analysis. In this thesis this body of work is extended. The chosen application is low bit-rate (2.4 kbits/sec) speech coding. For this task the performance of the Linear Prediction algorithm is crucial because there is insufficient bandwidth to encode the error between the modelled speech and the original input. A review of the fundamentals of Linear Prediction and an independent assessment of the relative performance of methods of Linear Prediction modelling are presented. A new method is proposed which is fast and facilitates stability checking, however, its stability is shown to be unacceptably poorer than existing methods. A novel supposition governing the positioning of the analysis frame relative to a voiced speech signal is proposed and supported by observation. The problem of coding noisy speech is examined. Four frequency domain speech processing techniques are developed and tested. These are: (i) Combined Order Linear Prediction Spectral Estimation; (ii) Frequency Scaling According to an Aural Model; (iii) Amplitude Weighting Based on Perceived Loudness; (iv) Power Spectrum Squaring. These methods are compared with the Recursive Linearised Maximum a Posteriori method. Following on from work done in the frequency domain, a time domain implementation of spectrum squaring is developed. In addition, a new method of power spectrum estimation is developed based on the Minimum Variance approach. This new algorithm is shown to be closely related to Linear Prediction but produces slightly broader spectral peaks. Spectrum squaring is applied to both the new algorithm and standard Linear Prediction and their relative performance is assessed. (Abstract shortened by UMI.).
The Speech multi features fusion perceptual hash algorithm based on tensor decomposition
NASA Astrophysics Data System (ADS)
Huang, Y. B.; Fan, M. H.; Zhang, Q. Y.
2018-03-01
With constant progress in modern speech communication technologies, the speech data is prone to be attacked by the noise or maliciously tampered. In order to make the speech perception hash algorithm has strong robustness and high efficiency, this paper put forward a speech perception hash algorithm based on the tensor decomposition and multi features is proposed. This algorithm analyses the speech perception feature acquires each speech component wavelet packet decomposition. LPCC, LSP and ISP feature of each speech component are extracted to constitute the speech feature tensor. Speech authentication is done by generating the hash values through feature matrix quantification which use mid-value. Experimental results showing that the proposed algorithm is robust for content to maintain operations compared with similar algorithms. It is able to resist the attack of the common background noise. Also, the algorithm is highly efficiency in terms of arithmetic, and is able to meet the real-time requirements of speech communication and complete the speech authentication quickly.
Baumgärtel, Regina M; Hu, Hongmei; Krawczyk-Becker, Martin; Marquardt, Daniel; Herzke, Tobias; Coleman, Graham; Adiloğlu, Kamil; Bomke, Katrin; Plotz, Karsten; Gerkmann, Timo; Doclo, Simon; Kollmeier, Birger; Hohmann, Volker; Dietz, Mathias
2015-12-30
Several binaural audio signal enhancement algorithms were evaluated with respect to their potential to improve speech intelligibility in noise for users of bilateral cochlear implants (CIs). 50% speech reception thresholds (SRT50) were assessed using an adaptive procedure in three distinct, realistic noise scenarios. All scenarios were highly nonstationary, complex, and included a significant amount of reverberation. Other aspects, such as the perfectly frontal target position, were idealized laboratory settings, allowing the algorithms to perform better than in corresponding real-world conditions. Eight bilaterally implanted CI users, wearing devices from three manufacturers, participated in the study. In all noise conditions, a substantial improvement in SRT50 compared to the unprocessed signal was observed for most of the algorithms tested, with the largest improvements generally provided by binaural minimum variance distortionless response (MVDR) beamforming algorithms. The largest overall improvement in speech intelligibility was achieved by an adaptive binaural MVDR in a spatially separated, single competing talker noise scenario. A no-pre-processing condition and adaptive differential microphones without a binaural link served as the two baseline conditions. SRT50 improvements provided by the binaural MVDR beamformers surpassed the performance of the adaptive differential microphones in most cases. Speech intelligibility improvements predicted by instrumental measures were shown to account for some but not all aspects of the perceptually obtained SRT50 improvements measured in bilaterally implanted CI users. © The Author(s) 2015.
Automated Speech Rate Measurement in Dysarthria.
Martens, Heidi; Dekens, Tomas; Van Nuffelen, Gwen; Latacz, Lukas; Verhelst, Werner; De Bodt, Marc
2015-06-01
In this study, a new algorithm for automated determination of speech rate (SR) in dysarthric speech is evaluated. We investigated how reliably the algorithm calculates the SR of dysarthric speech samples when compared with calculation performed by speech-language pathologists. The new algorithm was trained and tested using Dutch speech samples of 36 speakers with no history of speech impairment and 40 speakers with mild to moderate dysarthria. We tested the algorithm under various conditions: according to speech task type (sentence reading, passage reading, and storytelling) and algorithm optimization method (speaker group optimization and individual speaker optimization). Correlations between automated and human SR determination were calculated for each condition. High correlations between automated and human SR determination were found in the various testing conditions. The new algorithm measures SR in a sufficiently reliable manner. It is currently being integrated in a clinical software tool for assessing and managing prosody in dysarthric speech. Further research is needed to fine-tune the algorithm to severely dysarthric speech, to make the algorithm less sensitive to background noise, and to evaluate how the algorithm deals with syllabic consonants.
de Taillez, Tobias; Grimm, Giso; Kollmeier, Birger; Neher, Tobias
2018-06-01
To investigate the influence of an algorithm designed to enhance or magnify interaural difference cues on speech signals in noisy, spatially complex conditions using both technical and perceptual measurements. To also investigate the combination of interaural magnification (IM), monaural microphone directionality (DIR), and binaural coherence-based noise reduction (BC). Speech-in-noise stimuli were generated using virtual acoustics. A computational model of binaural hearing was used to analyse the spatial effects of IM. Predicted speech quality changes and signal-to-noise-ratio (SNR) improvements were also considered. Additionally, a listening test was carried out to assess speech intelligibility and quality. Listeners aged 65-79 years with and without sensorineural hearing loss (N = 10 each). IM increased the horizontal separation of concurrent directional sound sources without introducing any major artefacts. In situations with diffuse noise, however, the interaural difference cues were distorted. Preprocessing the binaural input signals with DIR reduced distortion. IM influenced neither speech intelligibility nor speech quality. The IM algorithm tested here failed to improve speech perception in noise, probably because of the dispersion and inconsistent magnification of interaural difference cues in complex environments.
Automated Speech Rate Measurement in Dysarthria
ERIC Educational Resources Information Center
Martens, Heidi; Dekens, Tomas; Van Nuffelen, Gwen; Latacz, Lukas; Verhelst, Werner; De Bodt, Marc
2015-01-01
Purpose: In this study, a new algorithm for automated determination of speech rate (SR) in dysarthric speech is evaluated. We investigated how reliably the algorithm calculates the SR of dysarthric speech samples when compared with calculation performed by speech-language pathologists. Method: The new algorithm was trained and tested using Dutch…
NASA Astrophysics Data System (ADS)
Palaniswamy, Sumithra; Duraisamy, Prakash; Alam, Mohammad Showkat; Yuan, Xiaohui
2012-04-01
Automatic speech processing systems are widely used in everyday life such as mobile communication, speech and speaker recognition, and for assisting the hearing impaired. In speech communication systems, the quality and intelligibility of speech is of utmost importance for ease and accuracy of information exchange. To obtain an intelligible speech signal and one that is more pleasant to listen, noise reduction is essential. In this paper a new Time Adaptive Discrete Bionic Wavelet Thresholding (TADBWT) scheme is proposed. The proposed technique uses Daubechies mother wavelet to achieve better enhancement of speech from additive non- stationary noises which occur in real life such as street noise and factory noise. Due to the integration of human auditory system model into the wavelet transform, bionic wavelet transform (BWT) has great potential for speech enhancement which may lead to a new path in speech processing. In the proposed technique, at first, discrete BWT is applied to noisy speech to derive TADBWT coefficients. Then the adaptive nature of the BWT is captured by introducing a time varying linear factor which updates the coefficients at each scale over time. This approach has shown better performance than the existing algorithms at lower input SNR due to modified soft level dependent thresholding on time adaptive coefficients. The objective and subjective test results confirmed the competency of the TADBWT technique. The effectiveness of the proposed technique is also evaluated for speaker recognition task under noisy environment. The recognition results show that the TADWT technique yields better performance when compared to alternate methods specifically at lower input SNR.
Alternative Speech Communication System for Persons with Severe Speech Disorders
NASA Astrophysics Data System (ADS)
Selouani, Sid-Ahmed; Sidi Yakoub, Mohammed; O'Shaughnessy, Douglas
2009-12-01
Assistive speech-enabled systems are proposed to help both French and English speaking persons with various speech disorders. The proposed assistive systems use automatic speech recognition (ASR) and speech synthesis in order to enhance the quality of communication. These systems aim at improving the intelligibility of pathologic speech making it as natural as possible and close to the original voice of the speaker. The resynthesized utterances use new basic units, a new concatenating algorithm and a grafting technique to correct the poorly pronounced phonemes. The ASR responses are uttered by the new speech synthesis system in order to convey an intelligible message to listeners. Experiments involving four American speakers with severe dysarthria and two Acadian French speakers with sound substitution disorders (SSDs) are carried out to demonstrate the efficiency of the proposed methods. An improvement of the Perceptual Evaluation of the Speech Quality (PESQ) value of 5% and more than 20% is achieved by the speech synthesis systems that deal with SSD and dysarthria, respectively.
Comparing Binaural Pre-processing Strategies II
Hu, Hongmei; Krawczyk-Becker, Martin; Marquardt, Daniel; Herzke, Tobias; Coleman, Graham; Adiloğlu, Kamil; Bomke, Katrin; Plotz, Karsten; Gerkmann, Timo; Doclo, Simon; Kollmeier, Birger; Hohmann, Volker; Dietz, Mathias
2015-01-01
Several binaural audio signal enhancement algorithms were evaluated with respect to their potential to improve speech intelligibility in noise for users of bilateral cochlear implants (CIs). 50% speech reception thresholds (SRT50) were assessed using an adaptive procedure in three distinct, realistic noise scenarios. All scenarios were highly nonstationary, complex, and included a significant amount of reverberation. Other aspects, such as the perfectly frontal target position, were idealized laboratory settings, allowing the algorithms to perform better than in corresponding real-world conditions. Eight bilaterally implanted CI users, wearing devices from three manufacturers, participated in the study. In all noise conditions, a substantial improvement in SRT50 compared to the unprocessed signal was observed for most of the algorithms tested, with the largest improvements generally provided by binaural minimum variance distortionless response (MVDR) beamforming algorithms. The largest overall improvement in speech intelligibility was achieved by an adaptive binaural MVDR in a spatially separated, single competing talker noise scenario. A no-pre-processing condition and adaptive differential microphones without a binaural link served as the two baseline conditions. SRT50 improvements provided by the binaural MVDR beamformers surpassed the performance of the adaptive differential microphones in most cases. Speech intelligibility improvements predicted by instrumental measures were shown to account for some but not all aspects of the perceptually obtained SRT50 improvements measured in bilaterally implanted CI users. PMID:26721921
Razza, Sergio; Zaccone, Monica; Meli, Aannalisa; Cristofari, Eliana
2017-12-01
Children affected by hearing loss can experience difficulties in challenging and noisy environments even when deafness is corrected by Cochlear implant (CI) devices. These patients have a selective attention deficit in multiple listening conditions. At present, the most effective ways to improve the performance of speech recognition in noise consists of providing CI processors with noise reduction algorithms and of providing patients with bilateral CIs. The aim of this study was to compare speech performances in noise, across increasing noise levels, in CI recipients using two kinds of wireless remote-microphone radio systems that use digital radio frequency transmission: the Roger Inspiro accessory and the Cochlear Wireless Mini Microphone accessory. Eleven Nucleus Cochlear CP910 CI young user subjects were studied. The signal/noise ratio, at a speech reception threshold (SRT) value of 50%, was measured in different conditions for each patient: with CI only, with the Roger or with the MiniMic accessory. The effect of the application of the SNR-noise reduction algorithm in each of these conditions was also assessed. The tests were performed with the subject positioned in front of the main speaker, at a distance of 2.5 m. Another two speakers were positioned at 3.50 m. The main speaker at 65 dB issued disyllabic words. Babble noise signal was delivered through the other speakers, with variable intensity. The use of both wireless remote microphones improved the SRT results. Both systems improved gain of speech performances. The gain was higher with the Mini Mic system (SRT = -4.76) than the Roger system (SRT = -3.01). The addition of the NR algorithm did not statistically further improve the results. There is significant improvement in speech recognition results with both wireless digital remote microphone accessories, in particular with the Mini Mic system when used with the CP910 processor. The use of a remote microphone accessory surpasses the benefit of application of NR algorithm. Copyright © 2017. Published by Elsevier B.V.
Shin, Young Hoon; Seo, Jiwon
2016-01-01
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing. PMID:27801867
Shin, Young Hoon; Seo, Jiwon
2016-10-29
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.
Pitch-Learning Algorithm For Speech Encoders
NASA Technical Reports Server (NTRS)
Bhaskar, B. R. Udaya
1988-01-01
Adaptive algorithm detects and corrects errors in sequence of estimates of pitch period of speech. Algorithm operates in conjunction with techniques used to estimate pitch period. Used in such parametric and hybrid speech coders as linear predictive coders and adaptive predictive coders.
Zhang, Xiaoheng; Wang, Lirui; Cao, Yao; Wang, Pin; Zhang, Cheng; Yang, Liuyang; Li, Yongming; Zhang, Yanling; Cheng, Oumei
2018-02-01
Diagnosis of Parkinson's disease (PD) based on speech data has been proved to be an effective way in recent years. However, current researches just care about the feature extraction and classifier design, and do not consider the instance selection. Former research by authors showed that the instance selection can lead to improvement on classification accuracy. However, no attention is paid on the relationship between speech sample and feature until now. Therefore, a new diagnosis algorithm of PD is proposed in this paper by simultaneously selecting speech sample and feature based on relevant feature weighting algorithm and multiple kernel method, so as to find their synergy effects, thereby improving classification accuracy. Experimental results showed that this proposed algorithm obtained apparent improvement on classification accuracy. It can obtain mean classification accuracy of 82.5%, which was 30.5% higher than the relevant algorithm. Besides, the proposed algorithm detected the synergy effects of speech sample and feature, which is valuable for speech marker extraction.
Chung, King; Nelson, Lance; Teske, Melissa
2012-09-01
The purpose of this study was to investigate whether a multichannel adaptive directional microphone and a modulation-based noise reduction algorithm could enhance cochlear implant performance in reverberant noise fields. A hearing aid was modified to output electrical signals (ePreprocessor) and a cochlear implant speech processor was modified to receive electrical signals (eProcessor). The ePreprocessor was programmed to flat frequency response and linear amplification. Cochlear implant listeners wore the ePreprocessor-eProcessor system in three reverberant noise fields: 1) one noise source with variable locations; 2) three noise sources with variable locations; and 3) eight evenly spaced noise sources from 0° to 360°. Listeners' speech recognition scores were tested when the ePreprocessor was programmed to omnidirectional microphone (OMNI), omnidirectional microphone plus noise reduction algorithm (OMNI + NR), and adaptive directional microphone plus noise reduction algorithm (ADM + NR). They were also tested with their own cochlear implant speech processor (CI_OMNI) in the three noise fields. Additionally, listeners rated overall sound quality preferences on recordings made in the noise fields. Results indicated that ADM+NR produced the highest speech recognition scores and the most preferable rating in all noise fields. Factors requiring attention in the hearing aid-cochlear implant integration process are discussed. Copyright © 2012 Elsevier B.V. All rights reserved.
An algorithm that improves speech intelligibility in noise for normal-hearing listeners.
Kim, Gibak; Lu, Yang; Hu, Yi; Loizou, Philipos C
2009-09-01
Traditional noise-suppression algorithms have been shown to improve speech quality, but not speech intelligibility. Motivated by prior intelligibility studies of speech synthesized using the ideal binary mask, an algorithm is proposed that decomposes the input signal into time-frequency (T-F) units and makes binary decisions, based on a Bayesian classifier, as to whether each T-F unit is dominated by the target or the masker. Speech corrupted at low signal-to-noise ratio (SNR) levels (-5 and 0 dB) using different types of maskers is synthesized by this algorithm and presented to normal-hearing listeners for identification. Results indicated substantial improvements in intelligibility (over 60% points in -5 dB babble) over that attained by human listeners with unprocessed stimuli. The findings from this study suggest that algorithms that can estimate reliably the SNR in each T-F unit can improve speech intelligibility.
Impact of Noise Reduction Algorithm in Cochlear Implant Processing on Music Enjoyment.
Kohlberg, Gavriel D; Mancuso, Dean M; Griffin, Brianna M; Spitzer, Jaclyn B; Lalwani, Anil K
2016-06-01
Noise reduction algorithm (NRA) in speech processing strategy has positive impact on speech perception among cochlear implant (CI) listeners. We sought to evaluate the effect of NRA on music enjoyment. Prospective analysis of music enjoyment. Academic medical center. Normal-hearing (NH) adults (N = 16) and CI listeners (N = 9). Subjective rating of music excerpts. NH and CI listeners evaluated country music piece on three enjoyment modalities: pleasantness, musicality, and naturalness. Participants listened to the original version and 20 modified, less complex versions created by including subsets of musical instruments from the original song. NH participants listened to the segments through CI simulation and CI listeners listened to the segments with their usual speech processing strategy, with and without NRA. Decreasing the number of instruments was significantly associated with increase in the pleasantness and naturalness in both NH and CI subjects (p < 0.05). However, there was no difference in music enjoyment with or without NRA for either NH listeners with CI simulation or CI listeners across all three modalities of pleasantness, musicality, and naturalness (p > 0.05): this was true for the original and the modified music segments with one to three instruments (p > 0.05). NRA does not affect music enjoyment in CI listener or NH individual with CI simulation. This suggests that strategies to enhance speech processing will not necessarily have a positive impact on music enjoyment. However, reducing the complexity of music shows promise in enhancing music enjoyment and should be further explored.
Incorporating Auditory Models in Speech/Audio Applications
NASA Astrophysics Data System (ADS)
Krishnamoorthi, Harish
2011-12-01
Following the success in incorporating perceptual models in audio coding algorithms, their application in other speech/audio processing systems is expanding. In general, all perceptual speech/audio processing algorithms involve minimization of an objective function that directly/indirectly incorporates properties of human perception. This dissertation primarily investigates the problems associated with directly embedding an auditory model in the objective function formulation and proposes possible solutions to overcome high complexity issues for use in real-time speech/audio algorithms. Specific problems addressed in this dissertation include: 1) the development of approximate but computationally efficient auditory model implementations that are consistent with the principles of psychoacoustics, 2) the development of a mapping scheme that allows synthesizing a time/frequency domain representation from its equivalent auditory model output. The first problem is aimed at addressing the high computational complexity involved in solving perceptual objective functions that require repeated application of auditory model for evaluation of different candidate solutions. In this dissertation, a frequency pruning and a detector pruning algorithm is developed that efficiently implements the various auditory model stages. The performance of the pruned model is compared to that of the original auditory model for different types of test signals in the SQAM database. Experimental results indicate only a 4-7% relative error in loudness while attaining up to 80-90 % reduction in computational complexity. Similarly, a hybrid algorithm is developed specifically for use with sinusoidal signals and employs the proposed auditory pattern combining technique together with a look-up table to store representative auditory patterns. The second problem obtains an estimate of the auditory representation that minimizes a perceptual objective function and transforms the auditory pattern back to its equivalent time/frequency representation. This avoids the repeated application of auditory model stages to test different candidate time/frequency vectors in minimizing perceptual objective functions. In this dissertation, a constrained mapping scheme is developed by linearizing certain auditory model stages that ensures obtaining a time/frequency mapping corresponding to the estimated auditory representation. This paradigm was successfully incorporated in a perceptual speech enhancement algorithm and a sinusoidal component selection task.
Adaptive spatial filtering improves speech reception in noise while preserving binaural cues.
Bissmeyer, Susan R S; Goldsworthy, Raymond L
2017-09-01
Hearing loss greatly reduces an individual's ability to comprehend speech in the presence of background noise. Over the past decades, numerous signal-processing algorithms have been developed to improve speech reception in these situations for cochlear implant and hearing aid users. One challenge is to reduce background noise while not introducing interaural distortion that would degrade binaural hearing. The present study evaluates a noise reduction algorithm, referred to as binaural Fennec, that was designed to improve speech reception in background noise while preserving binaural cues. Speech reception thresholds were measured for normal-hearing listeners in a simulated environment with target speech generated in front of the listener and background noise originating 90° to the right of the listener. Lateralization thresholds were also measured in the presence of background noise. These measures were conducted in anechoic and reverberant environments. Results indicate that the algorithm improved speech reception thresholds, even in highly reverberant environments. Results indicate that the algorithm also improved lateralization thresholds for the anechoic environment while not affecting lateralization thresholds for the reverberant environments. These results provide clear evidence that this algorithm can improve speech reception in background noise while preserving binaural cues used to lateralize sound.
Wolfe, Jace; Morais, Mila; Schafer, Erin; Agrawal, Smita; Koch, Dawn
2015-05-01
Cochlear implant recipients often experience difficulty with understanding speech in the presence of noise. Cochlear implant manufacturers have developed sound processing algorithms designed to improve speech recognition in noise, and research has shown these technologies to be effective. Remote microphone technology utilizing adaptive, digital wireless radio transmission has also been shown to provide significant improvement in speech recognition in noise. There are no studies examining the potential improvement in speech recognition in noise when these two technologies are used simultaneously. The goal of this study was to evaluate the potential benefits and limitations associated with the simultaneous use of a sound processing algorithm designed to improve performance in noise (Advanced Bionics ClearVoice) and a remote microphone system that incorporates adaptive, digital wireless radio transmission (Phonak Roger). A two-by-two way repeated measures design was used to examine performance differences obtained without these technologies compared to the use of each technology separately as well as the simultaneous use of both technologies. Eleven Advanced Bionics (AB) cochlear implant recipients, ages 11 to 68 yr. AzBio sentence recognition was measured in quiet and in the presence of classroom noise ranging in level from 50 to 80 dBA in 5-dB steps. Performance was evaluated in four conditions: (1) No ClearVoice and no Roger, (2) ClearVoice enabled without the use of Roger, (3) ClearVoice disabled with Roger enabled, and (4) simultaneous use of ClearVoice and Roger. Speech recognition in quiet was better than speech recognition in noise for all conditions. Use of ClearVoice and Roger each provided significant improvement in speech recognition in noise. The best performance in noise was obtained with the simultaneous use of ClearVoice and Roger. ClearVoice and Roger technology each improves speech recognition in noise, particularly when used at the same time. Because ClearVoice does not degrade performance in quiet settings, clinicians should consider recommending ClearVoice for routine, full-time use for AB implant recipients. Roger should be used in all instances in which remote microphone technology may assist the user in understanding speech in the presence of noise. American Academy of Audiology.
An algorithm to improve speech recognition in noise for hearing-impaired listeners
Healy, Eric W.; Yoho, Sarah E.; Wang, Yuxuan; Wang, DeLiang
2013-01-01
Despite considerable effort, monaural (single-microphone) algorithms capable of increasing the intelligibility of speech in noise have remained elusive. Successful development of such an algorithm is especially important for hearing-impaired (HI) listeners, given their particular difficulty in noisy backgrounds. In the current study, an algorithm based on binary masking was developed to separate speech from noise. Unlike the ideal binary mask, which requires prior knowledge of the premixed signals, the masks used to segregate speech from noise in the current study were estimated by training the algorithm on speech not used during testing. Sentences were mixed with speech-shaped noise and with babble at various signal-to-noise ratios (SNRs). Testing using normal-hearing and HI listeners indicated that intelligibility increased following processing in all conditions. These increases were larger for HI listeners, for the modulated background, and for the least-favorable SNRs. They were also often substantial, allowing several HI listeners to improve intelligibility from scores near zero to values above 70%. PMID:24116438
NASA Astrophysics Data System (ADS)
Thoonsaengngam, Rattapol; Tangsangiumvisai, Nisachon
This paper proposes an enhanced method for estimating the a priori Signal-to-Disturbance Ratio (SDR) to be employed in the Acoustic Echo and Noise Suppression (AENS) system for full-duplex hands-free communications. The proposed a priori SDR estimation technique is modified based upon the Two-Step Noise Reduction (TSNR) algorithm to suppress the background noise while preserving speech spectral components. In addition, a practical approach to determine accurately the Echo Spectrum Variance (ESV) is presented based upon the linear relationship assumption between the power spectrum of far-end speech and acoustic echo signals. The ESV estimation technique is then employed to alleviate the acoustic echo problem. The performance of the AENS system that employs these two proposed estimation techniques is evaluated through the Echo Attenuation (EA), Noise Attenuation (NA), and two speech distortion measures. Simulation results based upon real speech signals guarantee that our improved AENS system is able to mitigate efficiently the problem of acoustic echo and background noise, while preserving the speech quality and speech intelligibility.
Li, Junfeng; Yang, Lin; Zhang, Jianping; Yan, Yonghong; Hu, Yi; Akagi, Masato; Loizou, Philipos C
2011-05-01
A large number of single-channel noise-reduction algorithms have been proposed based largely on mathematical principles. Most of these algorithms, however, have been evaluated with English speech. Given the different perceptual cues used by native listeners of different languages including tonal languages, it is of interest to examine whether there are any language effects when the same noise-reduction algorithm is used to process noisy speech in different languages. A comparative evaluation and investigation is taken in this study of various single-channel noise-reduction algorithms applied to noisy speech taken from three languages: Chinese, Japanese, and English. Clean speech signals (Chinese words and Japanese words) were first corrupted by three types of noise at two signal-to-noise ratios and then processed by five single-channel noise-reduction algorithms. The processed signals were finally presented to normal-hearing listeners for recognition. Intelligibility evaluation showed that the majority of noise-reduction algorithms did not improve speech intelligibility. Consistent with a previous study with the English language, the Wiener filtering algorithm produced small, but statistically significant, improvements in intelligibility for car and white noise conditions. Significant differences between the performances of noise-reduction algorithms across the three languages were observed.
Noise suppression methods for robust speech processing
NASA Astrophysics Data System (ADS)
Boll, S. F.; Ravindra, H.; Randall, G.; Armantrout, R.; Power, R.
1980-05-01
Robust speech processing in practical operating environments requires effective environmental and processor noise suppression. This report describes the technical findings and accomplishments during this reporting period for the research program funded to develop real time, compressed speech analysis synthesis algorithms whose performance in invariant under signal contamination. Fulfillment of this requirement is necessary to insure reliable secure compressed speech transmission within realistic military command and control environments. Overall contributions resulting from this research program include the understanding of how environmental noise degrades narrow band, coded speech, development of appropriate real time noise suppression algorithms, and development of speech parameter identification methods that consider signal contamination as a fundamental element in the estimation process. This report describes the current research and results in the areas of noise suppression using the dual input adaptive noise cancellation using the short time Fourier transform algorithms, articulation rate change techniques, and a description of an experiment which demonstrated that the spectral subtraction noise suppression algorithm can improve the intelligibility of 2400 bps, LPC 10 coded, helicopter speech by 10.6 point.
Zhang, He-Hua; Yang, Liuyang; Liu, Yuchuan; Wang, Pin; Yin, Jun; Li, Yongming; Qiu, Mingguo; Zhu, Xueru; Yan, Fang
2016-11-16
The use of speech based data in the classification of Parkinson disease (PD) has been shown to provide an effect, non-invasive mode of classification in recent years. Thus, there has been an increased interest in speech pattern analysis methods applicable to Parkinsonism for building predictive tele-diagnosis and tele-monitoring models. One of the obstacles in optimizing classifications is to reduce noise within the collected speech samples, thus ensuring better classification accuracy and stability. While the currently used methods are effect, the ability to invoke instance selection has been seldomly examined. In this study, a PD classification algorithm was proposed and examined that combines a multi-edit-nearest-neighbor (MENN) algorithm and an ensemble learning algorithm. First, the MENN algorithm is applied for selecting optimal training speech samples iteratively, thereby obtaining samples with high separability. Next, an ensemble learning algorithm, random forest (RF) or decorrelated neural network ensembles (DNNE), is used to generate trained samples from the collected training samples. Lastly, the trained ensemble learning algorithms are applied to the test samples for PD classification. This proposed method was examined using a more recently deposited public datasets and compared against other currently used algorithms for validation. Experimental results showed that the proposed algorithm obtained the highest degree of improved classification accuracy (29.44%) compared with the other algorithm that was examined. Furthermore, the MENN algorithm alone was found to improve classification accuracy by as much as 45.72%. Moreover, the proposed algorithm was found to exhibit a higher stability, particularly when combining the MENN and RF algorithms. This study showed that the proposed method could improve PD classification when using speech data and can be applied to future studies seeking to improve PD classification methods.
Isaacson, M D; Srinivasan, S; Lloyd, L L
2010-01-01
MathSpeak is a set of rules for non speaking of mathematical expressions. These rules have been incorporated into a computerised module that translates printed mathematics into the non-ambiguous MathSpeak form for synthetic speech rendering. Differences between individual utterances produced with the translator module are difficult to discern because of insufficient pausing between utterances; hence, the purpose of this study was to develop an algorithm for improving the synthetic speech rendering of MathSpeak. To improve synthetic speech renderings, an algorithm for inserting pauses was developed based upon recordings of middle and high school math teachers speaking mathematic expressions. Efficacy testing of this algorithm was conducted with college students without disabilities and high school/college students with visual impairments. Parameters measured included reception accuracy, short-term memory retention, MathSpeak processing capacity and various rankings concerning the quality of synthetic speech renderings. All parameters measured showed statistically significant improvements when the algorithm was used. The algorithm improves the quality and information processing capacity of synthetic speech renderings of MathSpeak. This increases the capacity of individuals with print disabilities to perform mathematical activities and to successfully fulfill science, technology, engineering and mathematics academic and career objectives.
NASA Astrophysics Data System (ADS)
Pishravian, Arash; Aghabozorgi Sahaf, Masoud Reza
2012-12-01
In this paper speech-music separation using Blind Source Separation is discussed. The separating algorithm is based on the mutual information minimization where the natural gradient algorithm is used for minimization. In order to do that, score function estimation from observation signals (combination of speech and music) samples is needed. The accuracy and the speed of the mentioned estimation will affect on the quality of the separated signals and the processing time of the algorithm. The score function estimation in the presented algorithm is based on Gaussian mixture based kernel density estimation method. The experimental results of the presented algorithm on the speech-music separation and comparing to the separating algorithm which is based on the Minimum Mean Square Error estimator, indicate that it can cause better performance and less processing time
Harlander, Niklas; Rosenkranz, Tobias; Hohmann, Volker
2012-08-01
Single channel noise reduction has been well investigated and seems to have reached its limits in terms of speech intelligibility improvement, however, the quality of such schemes can still be advanced. This study tests to what extent novel model-based processing schemes might improve performance in particular for non-stationary noise conditions. Two prototype model-based algorithms, a speech-model-based, and a auditory-model-based algorithm were compared to a state-of-the-art non-parametric minimum statistics algorithm. A speech intelligibility test, preference rating, and listening effort scaling were performed. Additionally, three objective quality measures for the signal, background, and overall distortions were applied. For a better comparison of all algorithms, particular attention was given to the usage of the similar Wiener-based gain rule. The perceptual investigation was performed with fourteen hearing-impaired subjects. The results revealed that the non-parametric algorithm and the auditory model-based algorithm did not affect speech intelligibility, whereas the speech-model-based algorithm slightly decreased intelligibility. In terms of subjective quality, both model-based algorithms perform better than the unprocessed condition and the reference in particular for highly non-stationary noise environments. Data support the hypothesis that model-based algorithms are promising for improving performance in non-stationary noise conditions.
Comparing Binaural Pre-processing Strategies I: Instrumental Evaluation.
Baumgärtel, Regina M; Krawczyk-Becker, Martin; Marquardt, Daniel; Völker, Christoph; Hu, Hongmei; Herzke, Tobias; Coleman, Graham; Adiloğlu, Kamil; Ernst, Stephan M A; Gerkmann, Timo; Doclo, Simon; Kollmeier, Birger; Hohmann, Volker; Dietz, Mathias
2015-12-30
In a collaborative research project, several monaural and binaural noise reduction algorithms have been comprehensively evaluated. In this article, eight selected noise reduction algorithms were assessed using instrumental measures, with a focus on the instrumental evaluation of speech intelligibility. Four distinct, reverberant scenarios were created to reflect everyday listening situations: a stationary speech-shaped noise, a multitalker babble noise, a single interfering talker, and a realistic cafeteria noise. Three instrumental measures were employed to assess predicted speech intelligibility and predicted sound quality: the intelligibility-weighted signal-to-noise ratio, the short-time objective intelligibility measure, and the perceptual evaluation of speech quality. The results show substantial improvements in predicted speech intelligibility as well as sound quality for the proposed algorithms. The evaluated coherence-based noise reduction algorithm was able to provide improvements in predicted audio signal quality. For the tested single-channel noise reduction algorithm, improvements in intelligibility-weighted signal-to-noise ratio were observed in all but the nonstationary cafeteria ambient noise scenario. Binaural minimum variance distortionless response beamforming algorithms performed particularly well in all noise scenarios. © The Author(s) 2015.
Comparing Binaural Pre-processing Strategies I
Krawczyk-Becker, Martin; Marquardt, Daniel; Völker, Christoph; Hu, Hongmei; Herzke, Tobias; Coleman, Graham; Adiloğlu, Kamil; Ernst, Stephan M. A.; Gerkmann, Timo; Doclo, Simon; Kollmeier, Birger; Hohmann, Volker; Dietz, Mathias
2015-01-01
In a collaborative research project, several monaural and binaural noise reduction algorithms have been comprehensively evaluated. In this article, eight selected noise reduction algorithms were assessed using instrumental measures, with a focus on the instrumental evaluation of speech intelligibility. Four distinct, reverberant scenarios were created to reflect everyday listening situations: a stationary speech-shaped noise, a multitalker babble noise, a single interfering talker, and a realistic cafeteria noise. Three instrumental measures were employed to assess predicted speech intelligibility and predicted sound quality: the intelligibility-weighted signal-to-noise ratio, the short-time objective intelligibility measure, and the perceptual evaluation of speech quality. The results show substantial improvements in predicted speech intelligibility as well as sound quality for the proposed algorithms. The evaluated coherence-based noise reduction algorithm was able to provide improvements in predicted audio signal quality. For the tested single-channel noise reduction algorithm, improvements in intelligibility-weighted signal-to-noise ratio were observed in all but the nonstationary cafeteria ambient noise scenario. Binaural minimum variance distortionless response beamforming algorithms performed particularly well in all noise scenarios. PMID:26721920
Toward dynamic magnetic resonance imaging of the vocal tract during speech production.
Ventura, Sandra M Rua; Freitas, Diamantino Rui S; Tavares, João Manuel R S
2011-07-01
The most recent and significant magnetic resonance imaging (MRI) improvements allow for the visualization of the vocal tract during speech production, which has been revealed to be a powerful tool in dynamic speech research. However, a synchronization technique with enhanced temporal resolution is still required. The study design was transversal in nature. Throughout this work, a technique for the dynamic study of the vocal tract with MRI by using the heart's signal to synchronize and trigger the imaging-acquisition process is presented and described. The technique in question is then used in the measurement of four speech articulatory parameters to assess three different syllables (articulatory gestures) of European Portuguese Language. The acquired MR images are automatically reconstructed so as to result in a variable sequence of images (slices) of different vocal tract shapes in articulatory positions associated with Portuguese speech sounds. The knowledge obtained as a result of the proposed technique represents a direct contribution to the improvement of speech synthesis algorithms, thereby allowing for novel perceptions in coarticulation studies, in addition to providing further efficient clinical guidelines in the pursuit of more proficient speech rehabilitation processes. Copyright © 2011 The Voice Foundation. Published by Mosby, Inc. All rights reserved.
Speech endpoint detection with non-language speech sounds for generic speech processing applications
NASA Astrophysics Data System (ADS)
McClain, Matthew; Romanowski, Brian
2009-05-01
Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known apriori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden-Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detection certain types of NLSS, such as breaths and clicks, while detection of other types of NLSS such as filled pauses will require future research.
Effects of Computer Architecture on FFT (Fast Fourier Transform) Algorithm Performance.
1983-12-01
Criteria for Efficient Implementation of FFT Algorithms," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-30, pp. 107-109, Feb...1982. Burrus, C. S. and P. W. Eschenbacher. "An In-Place, In-Order Prime Factor FFT Algorithm," IEEE Transactions on Acoustics, Speech, and Signal... Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-30, pp. 217-226, Apr. 1982. Control Data Corporation. CDC Cyber 170 Computer Systems
A real-time phoneme counting algorithm and application for speech rate monitoring.
Aharonson, Vered; Aharonson, Eran; Raichlin-Levi, Katia; Sotzianu, Aviv; Amir, Ofer; Ovadia-Blechman, Zehava
2017-03-01
Adults who stutter can learn to control and improve their speech fluency by modifying their speaking rate. Existing speech therapy technologies can assist this practice by monitoring speaking rate and providing feedback to the patient, but cannot provide an accurate, quantitative measurement of speaking rate. Moreover, most technologies are too complex and costly to be used for home practice. We developed an algorithm and a smartphone application that monitor a patient's speaking rate in real time and provide user-friendly feedback to both patient and therapist. Our speaking rate computation is performed by a phoneme counting algorithm which implements spectral transition measure extraction to estimate phoneme boundaries. The algorithm is implemented in real time in a mobile application that presents its results in a user-friendly interface. The application incorporates two modes: one provides the patient with visual feedback of his/her speech rate for self-practice and another provides the speech therapist with recordings, speech rate analysis and tools to manage the patient's practice. The algorithm's phoneme counting accuracy was validated on ten healthy subjects who read a paragraph at slow, normal and fast paces, and was compared to manual counting of speech experts. Test-retest and intra-counter reliability were assessed. Preliminary results indicate differences of -4% to 11% between automatic and human phoneme counting. Differences were largest for slow speech. The application can thus provide reliable, user-friendly, real-time feedback for speaking rate control practice. Copyright © 2017 Elsevier Inc. All rights reserved.
FPGA implementation of ICA algorithm for blind signal separation and adaptive noise canceling.
Kim, Chang-Min; Park, Hyung-Min; Kim, Taesu; Choi, Yoon-Kyung; Lee, Soo-Young
2003-01-01
An field programmable gate array (FPGA) implementation of independent component analysis (ICA) algorithm is reported for blind signal separation (BSS) and adaptive noise canceling (ANC) in real time. In order to provide enormous computing power for ICA-based algorithms with multipath reverberation, a special digital processor is designed and implemented in FPGA. The chip design fully utilizes modular concept and several chips may be put together for complex applications with a large number of noise sources. Experimental results with a fabricated test board are reported for ANC only, BSS only, and simultaneous ANC/BSS, which demonstrates successful speech enhancement in real environments in real time.
A Comparison of LBG and ADPCM Speech Compression Techniques
NASA Astrophysics Data System (ADS)
Bachu, Rajesh G.; Patel, Jignasa; Barkana, Buket D.
Speech compression is the technology of converting human speech into an efficiently encoded representation that can later be decoded to produce a close approximation of the original signal. In all speech there is a degree of predictability and speech coding techniques exploit this to reduce bit rates yet still maintain a suitable level of quality. This paper is a study and implementation of Linde-Buzo-Gray Algorithm (LBG) and Adaptive Differential Pulse Code Modulation (ADPCM) algorithms to compress speech signals. In here we implemented the methods using MATLAB 7.0. The methods we used in this study gave good results and performance in compressing the speech and listening tests showed that efficient and high quality coding is achieved.
2014-01-01
This study evaluates a spatial-filtering algorithm as a method to improve speech reception for cochlear-implant (CI) users in reverberant environments with multiple noise sources. The algorithm was designed to filter sounds using phase differences between two microphones situated 1 cm apart in a behind-the-ear hearing-aid capsule. Speech reception thresholds (SRTs) were measured using a Coordinate Response Measure for six CI users in 27 listening conditions including each combination of reverberation level (T60 = 0, 270, and 540 ms), number of noise sources (1, 4, and 11), and signal-processing algorithm (omnidirectional response, dipole-directional response, and spatial-filtering algorithm). Noise sources were time-reversed speech segments randomly drawn from the Institute of Electrical and Electronics Engineers sentence recordings. Target speech and noise sources were processed using a room simulation method allowing precise control over reverberation times and sound-source locations. The spatial-filtering algorithm was found to provide improvements in SRTs on the order of 6.5 to 11.0 dB across listening conditions compared with the omnidirectional response. This result indicates that such phase-based spatial filtering can improve speech reception for CI users even in highly reverberant conditions with multiple noise sources. PMID:25330772
Improved Speech Coding Based on Open-Loop Parameter Estimation
NASA Technical Reports Server (NTRS)
Juang, Jer-Nan; Chen, Ya-Chin; Longman, Richard W.
2000-01-01
A nonlinear optimization algorithm for linear predictive speech coding was developed early that not only optimizes the linear model coefficients for the open loop predictor, but does the optimization including the effects of quantization of the transmitted residual. It also simultaneously optimizes the quantization levels used for each speech segment. In this paper, we present an improved method for initialization of this nonlinear algorithm, and demonstrate substantial improvements in performance. In addition, the new procedure produces monotonically improving speech quality with increasing numbers of bits used in the transmitted error residual. Examples of speech encoding and decoding are given for 8 speech segments and signal to noise levels as high as 47 dB are produced. As in typical linear predictive coding, the optimization is done on the open loop speech analysis model. Here we demonstrate that minimizing the error of the closed loop speech reconstruction, instead of the simpler open loop optimization, is likely to produce negligible improvement in speech quality. The examples suggest that the algorithm here is close to giving the best performance obtainable from a linear model, for the chosen order with the chosen number of bits for the codebook.
Hansen, J H; Nandkumar, S
1995-01-01
The formulation of reliable signal processing algorithms for speech coding and synthesis require the selection of a prior criterion of performance. Though coding efficiency (bits/second) or computational requirements can be used, a final performance measure must always include speech quality. In this paper, three objective speech quality measures are considered with respect to quality assessment for American English, noisy American English, and noise-free versions of seven languages. The purpose is to determine whether objective quality measures can be used to quantify changes in quality for a given voice coding method, with a known subjective performance level, as background noise or language conditions are changed. The speech coding algorithm chosen is regular-pulse excitation with long-term prediction (RPE-LTP), which has been chosen as the standard voice compression algorithm for the European Digital Mobile Radio system. Three areas are considered for objective quality assessment which include: (i) vocoder performance for American English in a noise-free environment, (ii) speech quality variation for three additive background noise sources, and (iii) noise-free performance for seven languages which include English, Japanese, Finnish, German, Hindi, Spanish, and French. It is suggested that although existing objective quality measures will never replace subjective testing, they can be a useful means of assessing changes in performance, identifying areas for improvement in algorithm design, and augmenting subjective quality tests for voice coding/compression algorithms in noise-free, noisy, and/or non-English applications.
NASA Astrophysics Data System (ADS)
Nishiura, Takanobu; Nakamura, Satoshi
2002-11-01
It is very important to capture distant-talking speech for a hands-free speech interface with high quality. A microphone array is an ideal candidate for this purpose. However, this approach requires localizing the target talker. Conventional talker localization algorithms in multiple sound source environments not only have difficulty localizing the multiple sound sources accurately, but also have difficulty localizing the target talker among known multiple sound source positions. To cope with these problems, we propose a new talker localization algorithm consisting of two algorithms. One is DOA (direction of arrival) estimation algorithm for multiple sound source localization based on CSP (cross-power spectrum phase) coefficient addition method. The other is statistical sound source identification algorithm based on GMM (Gaussian mixture model) for localizing the target talker position among localized multiple sound sources. In this paper, we particularly focus on the talker localization performance based on the combination of these two algorithms with a microphone array. We conducted evaluation experiments in real noisy reverberant environments. As a result, we confirmed that multiple sound signals can be identified accurately between ''speech'' or ''non-speech'' by the proposed algorithm. [Work supported by ATR, and MEXT of Japan.
Sequential Organization and Room Reverberation for Speech Segregation
2012-02-28
we have proposed two algorithms for sequential organization, an unsupervised clustering algorithm applicable to monaural recordings and a binaural ...algorithm that integrates monaural and binaural analyses. In addition, we have conducted speech intelligibility tests that Firmly establish the...comprehensive version is currently under review for journal publication. A binaural approach in room reverberation Most existing approaches to binaural or
NASA Astrophysics Data System (ADS)
Kaddoura, Tarek; Vadlamudi, Karunakar; Kumar, Shine; Bobhate, Prashant; Guo, Long; Jain, Shreepal; Elgendi, Mohamed; Coe, James Y.; Kim, Daniel; Taylor, Dylan; Tymchak, Wayne; Schuurmans, Dale; Zemp, Roger J.; Adatia, Ian
2016-09-01
We hypothesized that an automated speech- recognition-inspired classification algorithm could differentiate between the heart sounds in subjects with and without pulmonary hypertension (PH) and outperform physicians. Heart sounds, electrocardiograms, and mean pulmonary artery pressures (mPAp) were recorded simultaneously. Heart sound recordings were digitized to train and test speech-recognition-inspired classification algorithms. We used mel-frequency cepstral coefficients to extract features from the heart sounds. Gaussian-mixture models classified the features as PH (mPAp ≥ 25 mmHg) or normal (mPAp < 25 mmHg). Physicians blinded to patient data listened to the same heart sound recordings and attempted a diagnosis. We studied 164 subjects: 86 with mPAp ≥ 25 mmHg (mPAp 41 ± 12 mmHg) and 78 with mPAp < 25 mmHg (mPAp 17 ± 5 mmHg) (p < 0.005). The correct diagnostic rate of the automated speech-recognition-inspired algorithm was 74% compared to 56% by physicians (p = 0.005). The false positive rate for the algorithm was 34% versus 50% (p = 0.04) for clinicians. The false negative rate for the algorithm was 23% and 68% (p = 0.0002) for physicians. We developed an automated speech-recognition-inspired classification algorithm for the acoustic diagnosis of PH that outperforms physicians that could be used to screen for PH and encourage earlier specialist referral.
NASA Astrophysics Data System (ADS)
Athaudage, Chandranath R. N.; Bradley, Alan B.; Lech, Margaret
2003-12-01
A dynamic programming-based optimization strategy for a temporal decomposition (TD) model of speech and its application to low-rate speech coding in storage and broadcasting is presented. In previous work with the spectral stability-based event localizing (SBEL) TD algorithm, the event localization was performed based on a spectral stability criterion. Although this approach gave reasonably good results, there was no assurance on the optimality of the event locations. In the present work, we have optimized the event localizing task using a dynamic programming-based optimization strategy. Simulation results show that an improved TD model accuracy can be achieved. A methodology of incorporating the optimized TD algorithm within the standard MELP speech coder for the efficient compression of speech spectral information is also presented. The performance evaluation results revealed that the proposed speech coding scheme achieves 50%-60% compression of speech spectral information with negligible degradation in the decoded speech quality.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hogden, J.
The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation maymore » decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.« less
Bartos, Anthony L; Cipr, Tomas; Nelson, Douglas J; Schwarz, Petr; Banowetz, John; Jerabek, Ladislav
2018-04-01
A method is presented in which conventional speech algorithms are applied, with no modifications, to improve their performance in extremely noisy environments. It has been demonstrated that, for eigen-channel algorithms, pre-training multiple speaker identification (SID) models at a lattice of signal-to-noise-ratio (SNR) levels and then performing SID using the appropriate SNR dependent model was successful in mitigating noise at all SNR levels. In those tests, it was found that SID performance was optimized when the SNR of the testing and training data were close or identical. In this current effort multiple i-vector algorithms were used, greatly improving both processing throughput and equal error rate classification accuracy. Using identical approaches in the same noisy environment, performance of SID, language identification, gender identification, and diarization were significantly improved. A critical factor in this improvement is speech activity detection (SAD) that performs reliably in extremely noisy environments, where the speech itself is barely audible. To optimize SAD operation at all SNR levels, two algorithms were employed. The first maximized detection probability at low levels (-10 dB ≤ SNR < +10 dB) using just the voiced speech envelope, and the second exploited features extracted from the original speech to improve overall accuracy at higher quality levels (SNR ≥ +10 dB).
Lee, Jong-Seok; Park, Cheol Hoon
2010-08-01
We propose a novel stochastic optimization algorithm, hybrid simulated annealing (SA), to train hidden Markov models (HMMs) for visual speech recognition. In our algorithm, SA is combined with a local optimization operator that substitutes a better solution for the current one to improve the convergence speed and the quality of solutions. We mathematically prove that the sequence of the objective values converges in probability to the global optimum in the algorithm. The algorithm is applied to train HMMs that are used as visual speech recognizers. While the popular training method of HMMs, the expectation-maximization algorithm, achieves only local optima in the parameter space, the proposed method can perform global optimization of the parameters of HMMs and thereby obtain solutions yielding improved recognition performance. The superiority of the proposed algorithm to the conventional ones is demonstrated via isolated word recognition experiments.
Optimal pattern synthesis for speech recognition based on principal component analysis
NASA Astrophysics Data System (ADS)
Korsun, O. N.; Poliyev, A. V.
2018-02-01
The algorithm for building an optimal pattern for the purpose of automatic speech recognition, which increases the probability of correct recognition, is developed and presented in this work. The optimal pattern forming is based on the decomposition of an initial pattern to principal components, which enables to reduce the dimension of multi-parameter optimization problem. At the next step the training samples are introduced and the optimal estimates for principal components decomposition coefficients are obtained by a numeric parameter optimization algorithm. Finally, we consider the experiment results that show the improvement in speech recognition introduced by the proposed optimization algorithm.
Objective Quality and Intelligibility Prediction for Users of Assistive Listening Devices
Falk, Tiago H.; Parsa, Vijay; Santos, João F.; Arehart, Kathryn; Hazrati, Oldooz; Huber, Rainer; Kates, James M.; Scollie, Susan
2015-01-01
This article presents an overview of twelve existing objective speech quality and intelligibility prediction tools. Two classes of algorithms are presented, namely intrusive and non-intrusive, with the former requiring the use of a reference signal, while the latter does not. Investigated metrics include both those developed for normal hearing listeners, as well as those tailored particularly for hearing impaired (HI) listeners who are users of assistive listening devices (i.e., hearing aids, HAs, and cochlear implants, CIs). Representative examples of those optimized for HI listeners include the speech-to-reverberation modulation energy ratio, tailored to hearing aids (SRMR-HA) and to cochlear implants (SRMR-CI); the modulation spectrum area (ModA); the hearing aid speech quality (HASQI) and perception indices (HASPI); and the PErception MOdel - hearing impairment quality (PEMO-Q-HI). The objective metrics are tested on three subjectively-rated speech datasets covering reverberation-alone, noise-alone, and reverberation-plus-noise degradation conditions, as well as degradations resultant from nonlinear frequency compression and different speech enhancement strategies. The advantages and limitations of each measure are highlighted and recommendations are given for suggested uses of the different tools under specific environmental and processing conditions. PMID:26052190
Effects of noise and working memory capacity on memory processing of speech for hearing-aid users.
Ng, Elaine Hoi Ning; Rudner, Mary; Lunner, Thomas; Pedersen, Michael Syskind; Rönnberg, Jerker
2013-07-01
It has been shown that noise reduction algorithms can reduce the negative effects of noise on memory processing in persons with normal hearing. The objective of the present study was to investigate whether a similar effect can be obtained for persons with hearing impairment and whether such an effect is dependent on individual differences in working memory capacity. A sentence-final word identification and recall (SWIR) test was conducted in two noise backgrounds with and without noise reduction as well as in quiet. Working memory capacity was measured using a reading span (RS) test. Twenty-six experienced hearing-aid users with moderate to moderately severe sensorineural hearing loss. Noise impaired recall performance. Competing speech disrupted memory performance more than speech-shaped noise. For late list items the disruptive effect of the competing speech background was virtually cancelled out by noise reduction for persons with high working memory capacity. Noise reduction can reduce the adverse effect of noise on memory for speech for persons with good working memory capacity. We argue that the mechanism behind this is faster word identification that enhances encoding into working memory.
Nawaz, Tabassam; Mehmood, Zahid; Rashid, Muhammad; Habib, Hafiz Adnan
2018-01-01
Recent research on speech segregation and music fingerprinting has led to improvements in speech segregation and music identification algorithms. Speech and music segregation generally involves the identification of music followed by speech segregation. However, music segregation becomes a challenging task in the presence of noise. This paper proposes a novel method of speech segregation for unlabelled stationary noisy audio signals using the deep belief network (DBN) model. The proposed method successfully segregates a music signal from noisy audio streams. A recurrent neural network (RNN)-based hidden layer segregation model is applied to remove stationary noise. Dictionary-based fisher algorithms are employed for speech classification. The proposed method is tested on three datasets (TIMIT, MIR-1K, and MusicBrainz), and the results indicate the robustness of proposed method for speech segregation. The qualitative and quantitative analysis carried out on three datasets demonstrate the efficiency of the proposed method compared to the state-of-the-art speech segregation and classification-based methods. PMID:29558485
NASA Astrophysics Data System (ADS)
Franck, Bas A. M.; Dreschler, Wouter A.; Lyzenga, Johannes
2004-12-01
In this study we investigated the reliability and convergence characteristics of an adaptive multidirectional pattern search procedure, relative to a nonadaptive multidirectional pattern search procedure. The procedure was designed to optimize three speech-processing strategies. These comprise noise reduction, spectral enhancement, and spectral lift. The search is based on a paired-comparison paradigm, in which subjects evaluated the listening comfort of speech-in-noise fragments. The procedural and nonprocedural factors that influence the reliability and convergence of the procedure are studied using various test conditions. The test conditions combine different tests, initial settings, background noise types, and step size configurations. Seven normal hearing subjects participated in this study. The results indicate that the reliability of the optimization strategy may benefit from the use of an adaptive step size. Decreasing the step size increases accuracy, while increasing the step size can be beneficial to create clear perceptual differences in the comparisons. The reliability also depends on starting point, stop criterion, step size constraints, background noise, algorithms used, as well as the presence of drifting cues and suboptimal settings. There appears to be a trade-off between reliability and convergence, i.e., when the step size is enlarged the reliability improves, but the convergence deteriorates. .
Tsanas, Athanasios; Zañartu, Matías; Little, Max A.; Fox, Cynthia; Ramig, Lorraine O.; Clifford, Gari D.
2014-01-01
There has been consistent interest among speech signal processing researchers in the accurate estimation of the fundamental frequency (F0) of speech signals. This study examines ten F0 estimation algorithms (some well-established and some proposed more recently) to determine which of these algorithms is, on average, better able to estimate F0 in the sustained vowel /a/. Moreover, a robust method for adaptively weighting the estimates of individual F0 estimation algorithms based on quality and performance measures is proposed, using an adaptive Kalman filter (KF) framework. The accuracy of the algorithms is validated using (a) a database of 117 synthetic realistic phonations obtained using a sophisticated physiological model of speech production and (b) a database of 65 recordings of human phonations where the glottal cycles are calculated from electroglottograph signals. On average, the sawtooth waveform inspired pitch estimator and the nearly defect-free algorithms provided the best individual F0 estimates, and the proposed KF approach resulted in a ∼16% improvement in accuracy over the best single F0 estimation algorithm. These findings may be useful in speech signal processing applications where sustained vowels are used to assess vocal quality, when very accurate F0 estimation is required. PMID:24815269
Drijvers, Linda; Özyürek, Asli
2017-01-01
This study investigated whether and to what extent iconic co-speech gestures contribute to information from visible speech to enhance degraded speech comprehension at different levels of noise-vocoding. Previous studies of the contributions of these 2 visual articulators to speech comprehension have only been performed separately. Twenty participants watched videos of an actress uttering an action verb and completed a free-recall task. The videos were presented in 3 speech conditions (2-band noise-vocoding, 6-band noise-vocoding, clear), 3 multimodal conditions (speech + lips blurred, speech + visible speech, speech + visible speech + gesture), and 2 visual-only conditions (visible speech, visible speech + gesture). Accuracy levels were higher when both visual articulators were present compared with 1 or none. The enhancement effects of (a) visible speech, (b) gestural information on top of visible speech, and (c) both visible speech and iconic gestures were larger in 6-band than 2-band noise-vocoding or visual-only conditions. Gestural enhancement in 2-band noise-vocoding did not differ from gestural enhancement in visual-only conditions. When perceiving degraded speech in a visual context, listeners benefit more from having both visual articulators present compared with 1. This benefit was larger at 6-band than 2-band noise-vocoding, where listeners can benefit from both phonological cues from visible speech and semantic cues from iconic gestures to disambiguate speech.
Speech recognition for embedded automatic positioner for laparoscope
NASA Astrophysics Data System (ADS)
Chen, Xiaodong; Yin, Qingyun; Wang, Yi; Yu, Daoyin
2014-07-01
In this paper a novel speech recognition methodology based on Hidden Markov Model (HMM) is proposed for embedded Automatic Positioner for Laparoscope (APL), which includes a fixed point ARM processor as the core. The APL system is designed to assist the doctor in laparoscopic surgery, by implementing the specific doctor's vocal control to the laparoscope. Real-time respond to the voice commands asks for more efficient speech recognition algorithm for the APL. In order to reduce computation cost without significant loss in recognition accuracy, both arithmetic and algorithmic optimizations are applied in the method presented. First, depending on arithmetic optimizations most, a fixed point frontend for speech feature analysis is built according to the ARM processor's character. Then the fast likelihood computation algorithm is used to reduce computational complexity of the HMM-based recognition algorithm. The experimental results show that, the method shortens the recognition time within 0.5s, while the accuracy higher than 99%, demonstrating its ability to achieve real-time vocal control to the APL.
Towards Artificial Speech Therapy: A Neural System for Impaired Speech Segmentation.
Iliya, Sunday; Neri, Ferrante
2016-09-01
This paper presents a neural system-based technique for segmenting short impaired speech utterances into silent, unvoiced, and voiced sections. Moreover, the proposed technique identifies those points of the (voiced) speech where the spectrum becomes steady. The resulting technique thus aims at detecting that limited section of the speech which contains the information about the potential impairment of the speech. This section is of interest to the speech therapist as it corresponds to the possibly incorrect movements of speech organs (lower lip and tongue with respect to the vocal tract). Two segmentation models to detect and identify the various sections of the disordered (impaired) speech signals have been developed and compared. The first makes use of a combination of four artificial neural networks. The second is based on a support vector machine (SVM). The SVM has been trained by means of an ad hoc nested algorithm whose outer layer is a metaheuristic while the inner layer is a convex optimization algorithm. Several metaheuristics have been tested and compared leading to the conclusion that some variants of the compact differential evolution (CDE) algorithm appears to be well-suited to address this problem. Numerical results show that the SVM model with a radial basis function is capable of effective detection of the portion of speech that is of interest to a therapist. The best performance has been achieved when the system is trained by the nested algorithm whose outer layer is hybrid-population-based/CDE. A population-based approach displays the best performance for the isolation of silence/noise sections, and the detection of unvoiced sections. On the other hand, a compact approach appears to be clearly well-suited to detect the beginning of the steady state of the voiced signal. Both the proposed segmentation models display outperformed two modern segmentation techniques based on Gaussian mixture model and deep learning.
2017-01-05
1 Performance Evaluation of Glottal Inverse Filtering Algorithms Using a Physiologically Based Articulatory Speech Synthesizer Yu-Ren Chien, Daryush...D. Mehta, Member, IEEE, Jón Guðnason, Matías Zañartu, Member, IEEE, and Thomas F. Quatieri, Fellow, IEEE Abstract—Glottal inverse filtering aims to...of inverse filtering performance has been challenging due to the practical difficulty in measuring the true glottal signals while speech signals are
Loss tolerant speech decoder for telecommunications
NASA Technical Reports Server (NTRS)
Prieto, Jr., Jaime L. (Inventor)
1999-01-01
A method and device for extrapolating past signal-history data for insertion into missing data segments in order to conceal digital speech frame errors. The extrapolation method uses past-signal history that is stored in a buffer. The method is implemented with a device that utilizes a finite-impulse response (FIR) multi-layer feed-forward artificial neural network that is trained by back-propagation for one-step extrapolation of speech compression algorithm (SCA) parameters. Once a speech connection has been established, the speech compression algorithm device begins sending encoded speech frames. As the speech frames are received, they are decoded and converted back into speech signal voltages. During the normal decoding process, pre-processing of the required SCA parameters will occur and the results stored in the past-history buffer. If a speech frame is detected to be lost or in error, then extrapolation modules are executed and replacement SCA parameters are generated and sent as the parameters required by the SCA. In this way, the information transfer to the SCA is transparent, and the SCA processing continues as usual. The listener will not normally notice that a speech frame has been lost because of the smooth transition between the last-received, lost, and next-received speech frames.
ERIC Educational Resources Information Center
Tierney, Joseph; Mack, Molly
1987-01-01
Stimuli used in research on the perception of the speech signal have often been obtained from simple filtering and distortion of the speech waveform, sometimes accompanied by noise. However, for more complex stimulus generation, the parameters of speech can be manipulated, after analysis and before synthesis, using various types of algorithms to…
Approximated mutual information training for speech recognition using myoelectric signals.
Guo, Hua J; Chan, A D C
2006-01-01
A new training algorithm called the approximated maximum mutual information (AMMI) is proposed to improve the accuracy of myoelectric speech recognition using hidden Markov models (HMMs). Previous studies have demonstrated that automatic speech recognition can be performed using myoelectric signals from articulatory muscles of the face. Classification of facial myoelectric signals can be performed using HMMs that are trained using the maximum likelihood (ML) algorithm; however, this algorithm maximizes the likelihood of the observations in the training sequence, which is not directly associated with optimal classification accuracy. The AMMI training algorithm attempts to maximize the mutual information, thereby training the HMMs to optimize their parameters for discrimination. Our results show that AMMI training consistently reduces the error rates compared to these by the ML training, increasing the accuracy by approximately 3% on average.
Performance of a low data rate speech codec for land-mobile satellite communications
NASA Technical Reports Server (NTRS)
Gersho, Allen; Jedrey, Thomas C.
1990-01-01
In an effort to foster the development of new technologies for the emerging land mobile satellite communications services, JPL funded two development contracts in 1984: one to the Univ. of Calif., Santa Barbara and the other to the Georgia Inst. of Technology, to develop algorithms and real time hardware for near toll quality speech compression at 4800 bits per second. Both universities have developed and delivered speech codecs to JPL, and the UCSB codec was extensively tested by JPL in a variety of experimental setups. The basic UCSB speech codec algorithms and the test results of the various experiments performed with this codec are presented.
Dual Key Speech Encryption Algorithm Based Underdetermined BSS
Zhao, Huan; Chen, Zuo; Zhang, Xixiang
2014-01-01
When the number of the mixed signals is less than that of the source signals, the underdetermined blind source separation (BSS) is a significant difficult problem. Due to the fact that the great amount data of speech communications and real-time communication has been required, we utilize the intractability of the underdetermined BSS problem to present a dual key speech encryption method. The original speech is mixed with dual key signals which consist of random key signals (one-time pad) generated by secret seed and chaotic signals generated from chaotic system. In the decryption process, approximate calculation is used to recover the original speech signals. The proposed algorithm for speech signals encryption can resist traditional attacks against the encryption system, and owing to approximate calculation, decryption becomes faster and more accurate. It is demonstrated that the proposed method has high level of security and can recover the original signals quickly and efficiently yet maintaining excellent audio quality. PMID:24955430
Flanagan, Sheila; Zorilă, Tudor-Cătălin; Stylianou, Yannis; Moore, Brian C J
2018-01-01
Auditory processing disorder (APD) may be diagnosed when a child has listening difficulties but has normal audiometric thresholds. For adults with normal hearing and with mild-to-moderate hearing impairment, an algorithm called spectral shaping with dynamic range compression (SSDRC) has been shown to increase the intelligibility of speech when background noise is added after the processing. Here, we assessed the effect of such processing using 8 children with APD and 10 age-matched control children. The loudness of the processed and unprocessed sentences was matched using a loudness model. The task was to repeat back sentences produced by a female speaker when presented with either speech-shaped noise (SSN) or a male competing speaker (CS) at two signal-to-background ratios (SBRs). Speech identification was significantly better with SSDRC processing than without, for both groups. The benefit of SSDRC processing was greater for the SSN than for the CS background. For the SSN, scores were similar for the two groups at both SBRs. For the CS, the APD group performed significantly more poorly than the control group. The overall improvement produced by SSDRC processing could be useful for enhancing communication in a classroom where the teacher's voice is broadcast using a wireless system.
Healy, Eric W; Yoho, Sarah E
2016-08-01
A primary complaint of hearing-impaired individuals involves poor speech understanding when background noise is present. Hearing aids and cochlear implants often allow good speech understanding in quiet backgrounds. But hearing-impaired individuals are highly noise intolerant, and existing devices are not very effective at combating background noise. As a result, speech understanding in noise is often quite poor. In accord with the significance of the problem, considerable effort has been expended toward understanding and remedying this issue. Fortunately, our understanding of the underlying issues is reasonably good. In sharp contrast, effective solutions have remained elusive. One solution that seems promising involves a single-microphone machine-learning algorithm to extract speech from background noise. Data from our group indicate that the algorithm is capable of producing vast increases in speech understanding by hearing-impaired individuals. This paper will first provide an overview of the speech-in-noise problem and outline why hearing-impaired individuals are so noise intolerant. An overview of our approach to solving this problem will follow.
An acoustic feature-based similarity scoring system for speech rehabilitation assistance.
Syauqy, Dahnial; Wu, Chao-Min; Setyawati, Onny
2016-08-01
The purpose of this study is to develop a tool to assist speech therapy and rehabilitation, which focused on automatic scoring based on the comparison of the patient's speech with another normal speech on several aspects including pitch, vowel, voiced-unvoiced segments, strident fricative and sound intensity. The pitch estimation employed the use of cepstrum-based algorithm for its robustness; the vowel classification used multilayer perceptron (MLP) to classify vowel from pitch and formants; and the strident fricative detection was based on the major peak spectral intensity, location and the pitch existence in the segment. In order to evaluate the performance of the system, this study analyzed eight patient's speech recordings (four males, four females; 4-58-years-old), which had been recorded in previous study in cooperation with Taipei Veterans General Hospital and Taoyuan General Hospital. The experiment result on pitch algorithm showed that the cepstrum method had 5.3% of gross pitch error from a total of 2086 frames. On the vowel classification algorithm, MLP method provided 93% accuracy (men), 87% (women) and 84% (children). In total, the overall results showed that 156 tool's grading results (81%) were consistent compared to 192 audio and visual observations done by four experienced respondents. Implication for Rehabilitation Difficulties in communication may limit the ability of a person to transfer and exchange information. The fact that speech is one of the primary means of communication has encouraged the needs of speech diagnosis and rehabilitation. The advances of technology in computer-assisted speech therapy (CAST) improve the quality, time efficiency of the diagnosis and treatment of the disorders. The present study attempted to develop tool to assist speech therapy and rehabilitation, which provided simple interface to let the assessment be done even by the patient himself without the need of particular knowledge of speech processing while at the same time, also provided further deep analysis of the speech, which can be useful for the speech therapist.
Noise Reduction with Microphone Arrays for Speaker Identification
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cohen, Z
Reducing acoustic noise in audio recordings is an ongoing problem that plagues many applications. This noise is hard to reduce because of interfering sources and non-stationary behavior of the overall background noise. Many single channel noise reduction algorithms exist but are limited in that the more the noise is reduced; the more the signal of interest is distorted due to the fact that the signal and noise overlap in frequency. Specifically acoustic background noise causes problems in the area of speaker identification. Recording a speaker in the presence of acoustic noise ultimately limits the performance and confidence of speaker identificationmore » algorithms. In situations where it is impossible to control the environment where the speech sample is taken, noise reduction filtering algorithms need to be developed to clean the recorded speech of background noise. Because single channel noise reduction algorithms would distort the speech signal, the overall challenge of this project was to see if spatial information provided by microphone arrays could be exploited to aid in speaker identification. The goals are: (1) Test the feasibility of using microphone arrays to reduce background noise in speech recordings; (2) Characterize and compare different multichannel noise reduction algorithms; (3) Provide recommendations for using these multichannel algorithms; and (4) Ultimately answer the question - Can the use of microphone arrays aid in speaker identification?« less
A Comparative Analysis of Pitch Detection Methods Under the Influence of Different Noise Conditions.
Sukhostat, Lyudmila; Imamverdiyev, Yadigar
2015-07-01
Pitch is one of the most important components in various speech processing systems. The aim of this study was to evaluate different pitch detection methods in terms of various noise conditions. Prospective study. For evaluation of pitch detection algorithms, time-domain, frequency-domain, and hybrid methods were considered by using Keele and CSTR speech databases. Each of them has its own advantages and disadvantages. Experiments have shown that BaNa method achieves the highest pitch detection accuracy. The development of methods for pitch detection, which are robust to additive noise at different signal-to-noise ratio, is an important field of research with many opportunities for enhancement the modern methods. Copyright © 2015 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
Ng, Elaine Hoi Ning; Rudner, Mary; Lunner, Thomas; Rönnberg, Jerker
2015-01-01
A hearing aid noise reduction (NR) algorithm reduces the adverse effect of competing speech on memory for target speech for individuals with hearing impairment with high working memory capacity. In the present study, we investigated whether the positive effect of NR could be extended to individuals with low working memory capacity, as well as how NR influences recall performance for target native speech when the masker language is non-native. A sentence-final word identification and recall (SWIR) test was administered to 26 experienced hearing aid users. In this test, target spoken native language (Swedish) sentence lists were presented in competing native (Swedish) or foreign (Cantonese) speech with or without binary masking NR algorithm. After each sentence list, free recall of sentence final words was prompted. Working memory capacity was measured using a reading span (RS) test. Recall performance was associated with RS. However, the benefit obtained from NR was not associated with RS. Recall performance was more disrupted by native than foreign speech babble and NR improved recall performance in native but not foreign competing speech. Noise reduction improved memory for speech heard in competing speech for hearing aid users. Memory for native speech was more disrupted by native babble than foreign babble, but the disruptive effect of native speech babble was reduced to that of foreign babble when there was NR.
NASA Astrophysics Data System (ADS)
Liang, Ruiyu; Xi, Ji; Bao, Yongqiang
2017-07-01
To improve the performance of gain compensation based on three-segment sound pressure level (SPL) in hearing aids, an improved multichannel loudness compensation method based on eight-segment SPL was proposed. Firstly, the uniform cosine modulated filter bank was designed. Then, the adjacent channels which have low or gradual slopes were adaptively merged to obtain the corresponding non-uniform cosine modulated filter according to the audiogram of hearing impaired persons. Secondly, the input speech was decomposed into sub-band signals and the SPL of every sub-band signal was computed. Meanwhile, the audible SPL range from 0 dB SPL to 120 dB SPL was equally divided into eight segments. Based on these segments, a different prescription formula was designed to compute more detailed gain to compensate according to the audiogram and the computed SPL. Finally, the enhanced signal was synthesized. Objective experiments showed the decomposed signals after cosine modulated filter bank have little distortion. Objective experiments showed that the hearing aids speech perception index (HASPI) and hearing aids speech quality index (HASQI) increased 0.083 and 0.082 on average, respectively. Subjective experiments showed the proposed algorithm can effectively improve the speech recognition of six hearing impaired persons.
Chung, King
2004-01-01
This review discusses the challenges in hearing aid design and fitting and the recent developments in advanced signal processing technologies to meet these challenges. The first part of the review discusses the basic concepts and the building blocks of digital signal processing algorithms, namely, the signal detection and analysis unit, the decision rules, and the time constants involved in the execution of the decision. In addition, mechanisms and the differences in the implementation of various strategies used to reduce the negative effects of noise are discussed. These technologies include the microphone technologies that take advantage of the spatial differences between speech and noise and the noise reduction algorithms that take advantage of the spectral difference and temporal separation between speech and noise. The specific technologies discussed in this paper include first-order directional microphones, adaptive directional microphones, second-order directional microphones, microphone matching algorithms, array microphones, multichannel adaptive noise reduction algorithms, and synchrony detection noise reduction algorithms. Verification data for these technologies, if available, are also summarized. PMID:15678225
ERIC Educational Resources Information Center
Drijvers, Linda; Ozyurek, Asli
2017-01-01
Purpose: This study investigated whether and to what extent iconic co-speech gestures contribute to information from visible speech to enhance degraded speech comprehension at different levels of noise-vocoding. Previous studies of the contributions of these 2 visual articulators to speech comprehension have only been performed separately. Method:…
Rhythmic Priming Enhances the Phonological Processing of Speech
ERIC Educational Resources Information Center
Cason, Nia; Schon, Daniele
2012-01-01
While natural speech does not possess the same degree of temporal regularity found in music, there is recent evidence to suggest that temporal regularity enhances speech processing. The aim of this experiment was to examine whether speech processing would be enhanced by the prior presentation of a rhythmical prime. We recorded electrophysiological…
A novel speech watermarking algorithm by line spectrum pair modification
NASA Astrophysics Data System (ADS)
Zhang, Qian; Yang, Senbin; Chen, Guang; Zhou, Jun
2011-10-01
To explore digital watermarking specifically suitable for the speech domain, this paper experimentally investigates the properties of line spectrum pair (LSP) parameters firstly. The results show that the differences between contiguous LSPs are robust against common signal processing operations and small modifications of LSPs are imperceptible to the human auditory system (HAS). According to these conclusions, three contiguous LSPs of a speech frame are selected to embed a watermark bit. The middle LSP is slightly altered to modify the differences of these LSPs when embedding watermark. Correspondingly, the watermark is extracted by comparing these differences. The proposed algorithm's transparency is adjustable to meet the needs of different applications. The algorithm has good robustness against additive noise, quantization, amplitude scale and MP3 compression attacks, for the bit error rate (BER) is less than 5%. In addition, the algorithm allows a relatively low capacity, which approximates to 50 bps.
Ping, Lichuan; Wang, Ningyuan; Tang, Guofang; Lu, Thomas; Yin, Li; Tu, Wenhe; Fu, Qian-Jie
2017-09-01
Because of limited spectral resolution, Mandarin-speaking cochlear implant (CI) users have difficulty perceiving fundamental frequency (F0) cues that are important to lexical tone recognition. To improve Mandarin tone recognition in CI users, we implemented and evaluated a novel real-time algorithm (C-tone) to enhance the amplitude contour, which is strongly correlated with the F0 contour. The C-tone algorithm was implemented in clinical processors and evaluated in eight users of the Nurotron NSP-60 CI system. Subjects were given 2 weeks of experience with C-tone. Recognition of Chinese tones, monosyllables, and disyllables in quiet was measured with and without the C-tone algorithm. Subjective quality ratings were also obtained for C-tone. After 2 weeks of experience with C-tone, there were small but significant improvements in recognition of lexical tones, monosyllables, and disyllables (P < 0.05 in all cases). Among lexical tones, the largest improvements were observed for Tone 3 (falling-rising) and the smallest for Tone 4 (falling). Improvements with C-tone were greater for disyllables than for monosyllables. Subjective quality ratings showed no strong preference for or against C-tone, except for perception of own voice, where C-tone was preferred. The real-time C-tone algorithm provided small but significant improvements for speech performance in quiet with no change in sound quality. Pre-processing algorithms to reduce noise and better real-time F0 extraction would improve the benefits of C-tone in complex listening environments. Chinese CI users' speech recognition in quiet can be significantly improved by modifying the amplitude contour to better resemble the F0 contour.
Neural and Behavioral Mechanisms of Clear Speech
ERIC Educational Resources Information Center
Luque, Jenna Silver
2017-01-01
Clear speech is a speaking style that has been shown to improve intelligibility in adverse listening conditions, for various listener and talker populations. Clear-speech phonetic enhancements include a slowed speech rate, expanded vowel space, and expanded pitch range. Although clear-speech phonetic enhancements have been demonstrated across a…
NASA Astrophysics Data System (ADS)
Wang, Longbiao; Odani, Kyohei; Kai, Atsuhiko
2012-12-01
A blind dereverberation method based on power spectral subtraction (SS) using a multi-channel least mean squares algorithm was previously proposed to suppress the reverberant speech without additive noise. The results of isolated word speech recognition experiments showed that this method achieved significant improvements over conventional cepstral mean normalization (CMN) in a reverberant environment. In this paper, we propose a blind dereverberation method based on generalized spectral subtraction (GSS), which has been shown to be effective for noise reduction, instead of power SS. Furthermore, we extend the missing feature theory (MFT), which was initially proposed to enhance the robustness of additive noise, to dereverberation. A one-stage dereverberation and denoising method based on GSS is presented to simultaneously suppress both the additive noise and nonstationary multiplicative noise (reverberation). The proposed dereverberation method based on GSS with MFT is evaluated on a large vocabulary continuous speech recognition task. When the additive noise was absent, the dereverberation method based on GSS with MFT using only 2 microphones achieves a relative word error reduction rate of 11.4 and 32.6% compared to the dereverberation method based on power SS and the conventional CMN, respectively. For the reverberant and noisy speech, the dereverberation and denoising method based on GSS achieves a relative word error reduction rate of 12.8% compared to the conventional CMN with GSS-based additive noise reduction method. We also analyze the effective factors of the compensation parameter estimation for the dereverberation method based on SS, such as the number of channels (the number of microphones), the length of reverberation to be suppressed, and the length of the utterance used for parameter estimation. The experimental results showed that the SS-based method is robust in a variety of reverberant environments for both isolated and continuous speech recognition and under various parameter estimation conditions.
Speech coding at 4800 bps for mobile satellite communications
NASA Technical Reports Server (NTRS)
Gersho, Allen; Chan, Wai-Yip; Davidson, Grant; Chen, Juin-Hwey; Yong, Mei
1988-01-01
A speech compression project has recently been completed to develop a speech coding algorithm suitable for operation in a mobile satellite environment aimed at providing telephone quality natural speech at 4.8 kbps. The work has resulted in two alternative techniques which achieve reasonably good communications quality at 4.8 kbps while tolerating vehicle noise and rather severe channel impairments. The algorithms are embodied in a compact self-contained prototype consisting of two AT and T 32-bit floating-point DSP32 digital signal processors (DSP). A Motorola 68HC11 microcomputer chip serves as the board controller and interface handler. On a wirewrapped card, the prototype's circuit footprint amounts to only 200 sq cm, and consumes about 9 watts of power.
Huber, Rainer; Bisitz, Thomas; Gerkmann, Timo; Kiessling, Jürgen; Meister, Hartmut; Kollmeier, Birger
2018-06-01
The perceived qualities of nine different single-microphone noise reduction (SMNR) algorithms were to be evaluated and compared in subjective listening tests with normal hearing and hearing impaired (HI) listeners. Speech samples added with traffic noise or with party noise were processed by the SMNR algorithms. Subjects rated the amount of speech distortions, intrusiveness of background noise, listening effort and overall quality, using a simplified MUSHRA (ITU-R, 2003 ) assessment method. 18 normal hearing and 18 moderately HI subjects participated in the study. Significant differences between the rating behaviours of the two subject groups were observed: While normal hearing subjects clearly differentiated between different SMNR algorithms, HI subjects rated all processed signals very similarly. Moreover, HI subjects rated speech distortions of the unprocessed, noisier signals as being more severe than the distortions of the processed signals, in contrast to normal hearing subjects. It seems harder for HI listeners to distinguish between additive noise and speech distortions or/and they might have a different understanding of the term "speech distortion" than normal hearing listeners have. The findings confirm that the evaluation of SMNR schemes for hearing aids should always involve HI listeners.
Speech Emotion Feature Selection Method Based on Contribution Analysis Algorithm of Neural Network
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang Xiaojia; Mao Qirong; Zhan Yongzhao
There are many emotion features. If all these features are employed to recognize emotions, redundant features may be existed. Furthermore, recognition result is unsatisfying and the cost of feature extraction is high. In this paper, a method to select speech emotion features based on contribution analysis algorithm of NN is presented. The emotion features are selected by using contribution analysis algorithm of NN from the 95 extracted features. Cluster analysis is applied to analyze the effectiveness for the features selected, and the time of feature extraction is evaluated. Finally, 24 emotion features selected are used to recognize six speech emotions.more » The experiments show that this method can improve the recognition rate and the time of feature extraction.« less
What does voice-processing technology support today?
Nakatsu, R; Suzuki, Y
1995-01-01
This paper describes the state of the art in applications of voice-processing technologies. In the first part, technologies concerning the implementation of speech recognition and synthesis algorithms are described. Hardware technologies such as microprocessors and DSPs (digital signal processors) are discussed. Software development environment, which is a key technology in developing applications software, ranging from DSP software to support software also is described. In the second part, the state of the art of algorithms from the standpoint of applications is discussed. Several issues concerning evaluation of speech recognition/synthesis algorithms are covered, as well as issues concerning the robustness of algorithms in adverse conditions. Images Fig. 3 PMID:7479720
Watch what you say, your computer might be listening: A review of automated speech recognition
NASA Technical Reports Server (NTRS)
Degennaro, Stephen V.
1991-01-01
Spoken language is the most convenient and natural means by which people interact with each other and is, therefore, a promising candidate for human-machine interactions. Speech also offers an additional channel for hands-busy applications, complementing the use of motor output channels for control. Current speech recognition systems vary considerably across a number of important characteristics, including vocabulary size, speaking mode, training requirements for new speakers, robustness to acoustic environments, and accuracy. Algorithmically, these systems range from rule-based techniques through more probabilistic or self-learning approaches such as hidden Markov modeling and neural networks. This tutorial begins with a brief summary of the relevant features of current speech recognition systems and the strengths and weaknesses of the various algorithmic approaches.
Automatic measurement of voice onset time using discriminative structured prediction.
Sonderegger, Morgan; Keshet, Joseph
2012-12-01
A discriminative large-margin algorithm for automatic measurement of voice onset time (VOT) is described, considered as a case of predicting structured output from speech. Manually labeled data are used to train a function that takes as input a speech segment of an arbitrary length containing a voiceless stop, and outputs its VOT. The function is explicitly trained to minimize the difference between predicted and manually measured VOT; it operates on a set of acoustic feature functions designed based on spectral and temporal cues used by human VOT annotators. The algorithm is applied to initial voiceless stops from four corpora, representing different types of speech. Using several evaluation methods, the algorithm's performance is near human intertranscriber reliability, and compares favorably with previous work. Furthermore, the algorithm's performance is minimally affected by training and testing on different corpora, and remains essentially constant as the amount of training data is reduced to 50-250 manually labeled examples, demonstrating the method's practical applicability to new datasets.
An efficient robust sound classification algorithm for hearing aids.
Nordqvist, Peter; Leijon, Arne
2004-06-01
An efficient robust sound classification algorithm based on hidden Markov models is presented. The system would enable a hearing aid to automatically change its behavior for differing listening environments according to the user's preferences. This work attempts to distinguish between three listening environment categories: speech in traffic noise, speech in babble, and clean speech, regardless of the signal-to-noise ratio. The classifier uses only the modulation characteristics of the signal. The classifier ignores the absolute sound pressure level and the absolute spectrum shape, resulting in an algorithm that is robust against irrelevant acoustic variations. The measured classification hit rate was 96.7%-99.5% when the classifier was tested with sounds representing one of the three environment categories included in the classifier. False-alarm rates were 0.2%-1.7% in these tests. The algorithm is robust and efficient and consumes a small amount of instructions and memory. It is fully possible to implement the classifier in a DSP-based hearing instrument.
Intelligent hearing aids: the next revolution.
Tao Zhang; Mustiere, Fred; Micheyl, Christophe
2016-08-01
The first revolution in hearing aids came from nonlinear amplification, which allows better compensation for both soft and loud sounds. The second revolution stemmed from the introduction of digital signal processing, which allows better programmability and more sophisticated algorithms. The third revolution in hearing aids is wireless, which allows seamless connectivity between a pair of hearing aids and with more and more external devices. Each revolution has fundamentally transformed hearing aids and pushed the entire industry forward significantly. Machine learning has received significant attention in recent years and has been applied in many other industries, e.g., robotics, speech recognition, genetics, and crowdsourcing. We argue that the next revolution in hearing aids is machine intelligence. In fact, this revolution is already quietly happening. We will review the development in at least three major areas: applications of machine learning in speech enhancement; applications of machine learning in individualization and customization of signal processing algorithms; applications of machine learning in improving the efficiency and effectiveness of clinical tests. With the advent of the internet of things, the above developments will accelerate. This revolution will bring patient satisfactions to a new level that has never been seen before.
Lee, Jun Chang; Nam, Kyoung Won; Jang, Dong Pyo; Kim, In Young
2015-12-01
Previously suggested diagonal-steering algorithms for binaural hearing support devices have commonly assumed that the direction of the speech signal is known in advance, which is not always the case in many real circumstances. In this study, a new diagonal-steering-based binaural speech localization (BSL) algorithm is proposed, and the performances of the BSL algorithm and the binaural beamforming algorithm, which integrates the BSL and diagonal-steering algorithms, were evaluated using actual speech-in-noise signals in several simulated listening scenarios. Testing sounds were recorded in a KEMAR mannequin setup and two objective indices, improvements in signal-to-noise ratio (SNRi ) and segmental SNR (segSNRi ), were utilized for performance evaluation. Experimental results demonstrated that the accuracy of the BSL was in the 90-100% range when input SNR was -10 to +5 dB range. The average differences between the γ-adjusted and γ-fixed diagonal-steering algorithms (for -15 to +5 dB input SNR) in the talking in the restaurant scenario were 0.203-0.937 dB for SNRi and 0.052-0.437 dB for segSNRi , and in the listening while car driving scenario, the differences were 0.387-0.835 dB for SNRi and 0.259-1.175 dB for segSNRi . In addition, the average difference between the BSL-turned-on and the BSL-turned-off cases for the binaural beamforming algorithm in the listening while car driving scenario was 1.631-4.246 dB for SNRi and 0.574-2.784 dB for segSNRi . In all testing conditions, the γ-adjusted diagonal-steering and BSL algorithm improved the values of the indices more than the conventional algorithms. The binaural beamforming algorithm, which integrates the proposed BSL and diagonal-steering algorithm, is expected to improve the performance of the binaural hearing support devices in noisy situations. Copyright © 2015 International Center for Artificial Organs and Transplantation and Wiley Periodicals, Inc.
Enhancing Speech Intelligibility: Interactions among Context, Modality, Speech Style, and Masker
ERIC Educational Resources Information Center
Van Engen, Kristin J.; Phelps, Jasmine E. B.; Smiljanic, Rajka; Chandrasekaran, Bharath
2014-01-01
Purpose: The authors sought to investigate interactions among intelligibility-enhancing speech cues (i.e., semantic context, clearly produced speech, and visual information) across a range of masking conditions. Method: Sentence recognition in noise was assessed for 29 normal-hearing listeners. Testing included semantically normal and anomalous…
Huda, Shamsul; Yearwood, John; Togneri, Roberto
2009-02-01
This paper attempts to overcome the tendency of the expectation-maximization (EM) algorithm to locate a local rather than global maximum when applied to estimate the hidden Markov model (HMM) parameters in speech signal modeling. We propose a hybrid algorithm for estimation of the HMM in automatic speech recognition (ASR) using a constraint-based evolutionary algorithm (EA) and EM, the CEL-EM. The novelty of our hybrid algorithm (CEL-EM) is that it is applicable for estimation of the constraint-based models with many constraints and large numbers of parameters (which use EM) like HMM. Two constraint-based versions of the CEL-EM with different fusion strategies have been proposed using a constraint-based EA and the EM for better estimation of HMM in ASR. The first one uses a traditional constraint-handling mechanism of EA. The other version transforms a constrained optimization problem into an unconstrained problem using Lagrange multipliers. Fusion strategies for the CEL-EM use a staged-fusion approach where EM has been plugged with the EA periodically after the execution of EA for a specific period of time to maintain the global sampling capabilities of EA in the hybrid algorithm. A variable initialization approach (VIA) has been proposed using a variable segmentation to provide a better initialization for EA in the CEL-EM. Experimental results on the TIMIT speech corpus show that CEL-EM obtains higher recognition accuracies than the traditional EM algorithm as well as a top-standard EM (VIA-EM, constructed by applying the VIA to EM).
Using Web Speech Technology with Language Learning Applications
ERIC Educational Resources Information Center
Daniels, Paul
2015-01-01
In this article, the author presents the history of human-to-computer interaction based upon the design of sophisticated computerized speech recognition algorithms. Advancements such as the arrival of cloud-based computing and software like Google's Web Speech API allows anyone with an Internet connection and Chrome browser to take advantage of…
Application of an auditory model to speech recognition.
Cohen, J R
1989-06-01
Some aspects of auditory processing are incorporated in a front end for the IBM speech-recognition system [F. Jelinek, "Continuous speech recognition by statistical methods," Proc. IEEE 64 (4), 532-556 (1976)]. This new process includes adaptation, loudness scaling, and mel warping. Tests show that the design is an improvement over previous algorithms.
An audiovisual emotion recognition system
NASA Astrophysics Data System (ADS)
Han, Yi; Wang, Guoyin; Yang, Yong; He, Kun
2007-12-01
Human emotions could be expressed by many bio-symbols. Speech and facial expression are two of them. They are both regarded as emotional information which is playing an important role in human-computer interaction. Based on our previous studies on emotion recognition, an audiovisual emotion recognition system is developed and represented in this paper. The system is designed for real-time practice, and is guaranteed by some integrated modules. These modules include speech enhancement for eliminating noises, rapid face detection for locating face from background image, example based shape learning for facial feature alignment, and optical flow based tracking algorithm for facial feature tracking. It is known that irrelevant features and high dimensionality of the data can hurt the performance of classifier. Rough set-based feature selection is a good method for dimension reduction. So 13 speech features out of 37 ones and 10 facial features out of 33 ones are selected to represent emotional information, and 52 audiovisual features are selected due to the synchronization when speech and video fused together. The experiment results have demonstrated that this system performs well in real-time practice and has high recognition rate. Our results also show that the work in multimodules fused recognition will become the trend of emotion recognition in the future.
More About Vector Adaptive/Predictive Coding Of Speech
NASA Technical Reports Server (NTRS)
Jedrey, Thomas C.; Gersho, Allen
1992-01-01
Report presents additional information about digital speech-encoding and -decoding system described in "Vector Adaptive/Predictive Encoding of Speech" (NPO-17230). Summarizes development of vector adaptive/predictive coding (VAPC) system and describes basic functions of algorithm. Describes refinements introduced enabling receiver to cope with errors. VAPC algorithm implemented in integrated-circuit coding/decoding processors (codecs). VAPC and other codecs tested under variety of operating conditions. Tests designed to reveal effects of various background quiet and noisy environments and of poor telephone equipment. VAPC found competitive with and, in some respects, superior to other 4.8-kb/s codecs and other codecs of similar complexity.
Audio visual speech source separation via improved context dependent association model
NASA Astrophysics Data System (ADS)
Kazemi, Alireza; Boostani, Reza; Sobhanmanesh, Fariborz
2014-12-01
In this paper, we exploit the non-linear relation between a speech source and its associated lip video as a source of extra information to propose an improved audio-visual speech source separation (AVSS) algorithm. The audio-visual association is modeled using a neural associator which estimates the visual lip parameters from a temporal context of acoustic observation frames. We define an objective function based on mean square error (MSE) measure between estimated and target visual parameters. This function is minimized for estimation of the de-mixing vector/filters to separate the relevant source from linear instantaneous or time-domain convolutive mixtures. We have also proposed a hybrid criterion which uses AV coherency together with kurtosis as a non-Gaussianity measure. Experimental results are presented and compared in terms of visually relevant speech detection accuracy and output signal-to-interference ratio (SIR) of source separation. The suggested audio-visual model significantly improves relevant speech classification accuracy compared to existing GMM-based model and the proposed AVSS algorithm improves the speech separation quality compared to reference ICA- and AVSS-based methods.
A versatile pitch tracking algorithm: from human speech to killer whale vocalizations.
Shapiro, Ari Daniel; Wang, Chao
2009-07-01
In this article, a pitch tracking algorithm [named discrete logarithmic Fourier transformation-pitch detection algorithm (DLFT-PDA)], originally designed for human telephone speech, was modified for killer whale vocalizations. The multiple frequency components of some of these vocalizations demand a spectral (rather than temporal) approach to pitch tracking. The DLFT-PDA algorithm derives reliable estimations of pitch and the temporal change of pitch from the harmonic structure of the vocal signal. Scores from both estimations are combined in a dynamic programming search to find a smooth pitch track. The algorithm is capable of tracking killer whale calls that contain simultaneous low and high frequency components and compares favorably across most signal to noise ratio ranges to the peak-picking and sidewinder algorithms that have been used for tracking killer whale vocalizations previously.
Voice Conversion Using Pitch Shifting Algorithm by Time Stretching with PSOLA and Re-Sampling
NASA Astrophysics Data System (ADS)
Mousa, Allam
2010-01-01
Voice changing has many applications in the industry and commercial filed. This paper emphasizes voice conversion using a pitch shifting method which depends on detecting the pitch of the signal (fundamental frequency) using Simplified Inverse Filter Tracking (SIFT) and changing it according to the target pitch period using time stretching with Pitch Synchronous Over Lap Add Algorithm (PSOLA), then resampling the signal in order to have the same play rate. The same study was performed to see the effect of voice conversion when some Arabic speech signal is considered. Treatment of certain Arabic voiced vowels and the conversion between male and female speech has shown some expansion or compression in the resulting speech. Comparison in terms of pitch shifting is presented here. Analysis was performed for a single frame and a full segmentation of speech.
Modeling Driving Performance Using In-Vehicle Speech Data From a Naturalistic Driving Study.
Kuo, Jonny; Charlton, Judith L; Koppel, Sjaan; Rudin-Brown, Christina M; Cross, Suzanne
2016-09-01
We aimed to (a) describe the development and application of an automated approach for processing in-vehicle speech data from a naturalistic driving study (NDS), (b) examine the influence of child passenger presence on driving performance, and (c) model this relationship using in-vehicle speech data. Parent drivers frequently engage in child-related secondary behaviors, but the impact on driving performance is unknown. Applying automated speech-processing techniques to NDS audio data would facilitate the analysis of in-vehicle driver-child interactions and their influence on driving performance. Speech activity detection and speaker diarization algorithms were applied to audio data from a Melbourne-based NDS involving 42 families. Multilevel models were developed to evaluate the effect of speech activity and the presence of child passengers on driving performance. Speech activity was significantly associated with velocity and steering angle variability. Child passenger presence alone was not associated with changes in driving performance. However, speech activity in the presence of two child passengers was associated with the most variability in driving performance. The effects of in-vehicle speech on driving performance in the presence of child passengers appear to be heterogeneous, and multiple factors may need to be considered in evaluating their impact. This goal can potentially be achieved within large-scale NDS through the automated processing of observational data, including speech. Speech-processing algorithms enable new perspectives on driving performance to be gained from existing NDS data, and variables that were once labor-intensive to process can be readily utilized in future research. © 2016, Human Factors and Ergonomics Society.
Van Ackeren, Markus Johannes; Barbero, Francesca M; Mattioni, Stefania; Bottini, Roberto
2018-01-01
The occipital cortex of early blind individuals (EB) activates during speech processing, challenging the notion of a hard-wired neurobiology of language. But, at what stage of speech processing do occipital regions participate in EB? Here we demonstrate that parieto-occipital regions in EB enhance their synchronization to acoustic fluctuations in human speech in the theta-range (corresponding to syllabic rate), irrespective of speech intelligibility. Crucially, enhanced synchronization to the intelligibility of speech was selectively observed in primary visual cortex in EB, suggesting that this region is at the interface between speech perception and comprehension. Moreover, EB showed overall enhanced functional connectivity between temporal and occipital cortices that are sensitive to speech intelligibility and altered directionality when compared to the sighted group. These findings suggest that the occipital cortex of the blind adopts an architecture that allows the tracking of speech material, and therefore does not fully abstract from the reorganized sensory inputs it receives. PMID:29338838
Cason, Nia; Astésano, Corine; Schön, Daniele
2015-02-01
Following findings that musical rhythmic priming enhances subsequent speech perception, we investigated whether rhythmic priming for spoken sentences can enhance phonological processing - the building blocks of speech - and whether audio-motor training enhances this effect. Participants heard a metrical prime followed by a sentence (with a matching/mismatching prosodic structure), for which they performed a phoneme detection task. Behavioural (RT) data was collected from two groups: one who received audio-motor training, and one who did not. We hypothesised that 1) phonological processing would be enhanced in matching conditions, and 2) audio-motor training with the musical rhythms would enhance this effect. Indeed, providing a matching rhythmic prime context resulted in faster phoneme detection, thus revealing a cross-domain effect of musical rhythm on phonological processing. In addition, our results indicate that rhythmic audio-motor training enhances this priming effect. These results have important implications for rhythm-based speech therapies, and suggest that metrical rhythm in music and speech may rely on shared temporal processing brain resources. Copyright © 2015 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Nomura, Yukihiro; Lu, Jianming; Sekiya, Hiroo; Yahagi, Takashi
This paper presents a speech enhancement using the classification between the dominants of speech and noise. In our system, a new classification scheme between the dominants of speech and noise is proposed. The proposed classifications use the standard deviation of the spectrum of observation signal in each band. We introduce two oversubtraction factors for the dominants of speech and noise, respectively. And spectral subtraction is carried out after the classification. The proposed method is tested on several noise types from the Noisex-92 database. From the investigation of segmental SNR, Itakura-Saito distance measure, inspection of spectrograms and listening tests, the proposed system is shown to be effective to reduce background noise. Moreover, the enhanced speech using our system generates less musical noise and distortion than that of conventional systems.
Analysis of facial motion patterns during speech using a matrix factorization algorithm
Lucero, Jorge C.; Munhall, Kevin G.
2008-01-01
This paper presents an analysis of facial motion during speech to identify linearly independent kinematic regions. The data consists of three-dimensional displacement records of a set of markers located on a subject’s face while producing speech. A QR factorization with column pivoting algorithm selects a subset of markers with independent motion patterns. The subset is used as a basis to fit the motion of the other facial markers, which determines facial regions of influence of each of the linearly independent markers. Those regions constitute kinematic “eigenregions” whose combined motion produces the total motion of the face. Facial animations may be generated by driving the independent markers with collected displacement records. PMID:19062866
Programming Deep Brain Stimulation for Tremor and Dystonia: The Toronto Western Hospital Algorithms.
Picillo, Marina; Lozano, Andres M; Kou, Nancy; Munhoz, Renato Puppi; Fasano, Alfonso
2016-01-01
Deep brain stimulation (DBS) is an effective treatment for essential tremor (ET) and dystonia. After surgery, a number of extensive programming sessions are performed, mainly relying on neurologist's personal experience as no programming guidelines have been provided so far, with the exception of recommendations provided by groups of experts. Finally, fewer information is available for the management of DBS in ET and dystonia compared with Parkinson's disease. Our aim is to review the literature on initial and follow-up DBS programming procedures for ET and dystonia and integrate the results with our current practice at Toronto Western Hospital (TWH) to develop standardized DBS programming protocols. We conducted a literature search of PubMed from inception to July 2014 with the keywords "balance", "bradykinesia", "deep brain stimulation", "dysarthria", "dystonia", "gait disturbances", "initial programming", "loss of benefit", "micrographia", "speech", "speech difficulties" and "tremor". Seventy-six papers were considered for this review. Based on the literature review and our experience at TWH, we refined three algorithms for management of ET, including: (1) initial programming, (2) management of balance and speech issues and (3) loss of stimulation benefit. We also depicted algorithms for the management of dystonia, including: (1) initial programming and (2) management of stimulation-induced hypokinesia (shuffling gait, micrographia and speech impairment). We propose five algorithms tailored to an individualized approach to managing ET and dystonia patients with DBS. We encourage the application of these algorithms to supplement current standards of care in established as well as new DBS centers to test the clinical usefulness of these algorithms in supplementing the current standards of care. Copyright © 2016 Elsevier Inc. All rights reserved.
Carrillo, Facundo; Sigman, Mariano; Fernández Slezak, Diego; Ashton, Philip; Fitzgerald, Lily; Stroud, Jack; Nutt, David J; Carhart-Harris, Robin L
2018-04-01
Natural speech analytics has seen some improvements over recent years, and this has opened a window for objective and quantitative diagnosis in psychiatry. Here, we used a machine learning algorithm applied to natural speech to ask whether language properties measured before psilocybin for treatment-resistant can predict for which patients it will be effective and for which it will not. A baseline autobiographical memory interview was conducted and transcribed. Patients with treatment-resistant depression received 2 doses of psilocybin, 10 mg and 25 mg, 7 days apart. Psychological support was provided before, during and after all dosing sessions. Quantitative speech measures were applied to the interview data from 17 patients and 18 untreated age-matched healthy control subjects. A machine learning algorithm was used to classify between controls and patients and predict treatment response. Speech analytics and machine learning successfully differentiated depressed patients from healthy controls and identified treatment responders from non-responders with a significant level of 85% of accuracy (75% precision). Automatic natural language analysis was used to predict effective response to treatment with psilocybin, suggesting that these tools offer a highly cost-effective facility for screening individuals for treatment suitability and sensitivity. The sample size was small and replication is required to strengthen inferences on these results. Copyright © 2018 Elsevier B.V. All rights reserved.
A multilingual audiometer simulator software for training purposes.
Kompis, Martin; Steffen, Pascal; Caversaccio, Marco; Brugger, Urs; Oesch, Ivo
2012-04-01
A set of algorithms, which allows a computer to determine the answers of simulated patients during pure tone and speech audiometry, is presented. Based on these algorithms, a computer program for training in audiometry was written and found to be useful for teaching purposes. To develop a flexible audiometer simulator software as a teaching and training tool for pure tone and speech audiometry, both with and without masking. First a set of algorithms, which allows a computer to determine the answers of a simulated, hearing-impaired patient, was developed. Then, the software was implemented. Extensive use was made of simple, editable text files to define all texts in the user interface and all patient definitions. The software 'audiometer simulator' is available for free download. It can be used to train pure tone audiometry (both with and without masking), speech audiometry, measurement of the uncomfortable level, and simple simulation tests. Due to the use of text files, the user can alter or add patient definitions and all texts and labels shown on the screen. So far, English, French, German, and Portuguese user interfaces are available and the user can choose between German or French speech audiometry.
Start/End Delays of Voiced and Unvoiced Speech Signals
DOE Office of Scientific and Technical Information (OSTI.GOV)
Herrnstein, A
Recent experiments using low power EM-radar like sensors (e.g, GEMs) have demonstrated a new method for measuring vocal fold activity and the onset times of voiced speech, as vocal fold contact begins to take place. Similarly the end time of a voiced speech segment can be measured. Secondly it appears that in most normal uses of American English speech, unvoiced-speech segments directly precede or directly follow voiced-speech segments. For many applications, it is useful to know typical duration times of these unvoiced speech segments. A corpus, assembled earlier of spoken ''Timit'' words, phrases, and sentences and recorded using simultaneously measuredmore » acoustic and EM-sensor glottal signals, from 16 male speakers, was used for this study. By inspecting the onset (or end) of unvoiced speech, using the acoustic signal, and the onset (or end) of voiced speech using the EM sensor signal, the average duration times for unvoiced segments preceding onset of vocalization were found to be 300ms, and for following segments, 500ms. An unvoiced speech period is then defined in time, first by using the onset of the EM-sensed glottal signal, as the onset-time marker for the voiced speech segment and end marker for the unvoiced segment. Then, by subtracting 300ms from the onset time mark of voicing, the unvoiced speech segment start time is found. Similarly, the times for a following unvoiced speech segment can be found. While data of this nature have proven to be useful for work in our laboratory, a great deal of additional work remains to validate such data for use with general populations of users. These procedures have been useful for applying optimal processing algorithms over time segments of unvoiced, voiced, and non-speech acoustic signals. For example, these data appear to be of use in speaker validation, in vocoding, and in denoising algorithms.« less
Robust estimators for speech enhancement in real environments
NASA Astrophysics Data System (ADS)
Sandoval-Ibarra, Yuma; Diaz-Ramirez, Victor H.; Kober, Vitaly
2015-09-01
Common statistical estimators for speech enhancement rely on several assumptions about stationarity of speech signals and noise. These assumptions may not always valid in real-life due to nonstationary characteristics of speech and noise processes. We propose new estimators based on existing estimators by incorporation of computation of rank-order statistics. The proposed estimators are better adapted to non-stationary characteristics of speech signals and noise processes. Through computer simulations we show that the proposed estimators yield a better performance in terms of objective metrics than that of known estimators when speech signals are contaminated with airport, babble, restaurant, and train-station noise.
NASA Astrophysics Data System (ADS)
Tallal, Paula; Miller, Steve L.; Bedi, Gail; Byma, Gary; Wang, Xiaoqin; Nagarajan, Srikantan S.; Schreiner, Christoph; Jenkins, William M.; Merzenich, Michael M.
1996-01-01
A speech processing algorithm was developed to create more salient versions of the rapidly changing elements in the acoustic waveform of speech that have been shown to be deficiently processed by language-learning impaired (LLI) children. LLI children received extensive daily training, over a 4-week period, with listening exercises in which all speech was translated into this synthetic form. They also received daily training with computer "games" designed to adaptively drive improvements in temporal processing thresholds. Significant improvements in speech discrimination and language comprehension abilities were demonstrated in two independent groups of LLI children.
Live Speech Driven Head-and-Eye Motion Generators.
Le, Binh H; Ma, Xiaohan; Deng, Zhigang
2012-11-01
This paper describes a fully automated framework to generate realistic head motion, eye gaze, and eyelid motion simultaneously based on live (or recorded) speech input. Its central idea is to learn separate yet interrelated statistical models for each component (head motion, gaze, or eyelid motion) from a prerecorded facial motion data set: 1) Gaussian Mixture Models and gradient descent optimization algorithm are employed to generate head motion from speech features; 2) Nonlinear Dynamic Canonical Correlation Analysis model is used to synthesize eye gaze from head motion and speech features, and 3) nonnegative linear regression is used to model voluntary eye lid motion and log-normal distribution is used to describe involuntary eye blinks. Several user studies are conducted to evaluate the effectiveness of the proposed speech-driven head and eye motion generator using the well-established paired comparison methodology. Our evaluation results clearly show that this approach can significantly outperform the state-of-the-art head and eye motion generation algorithms. In addition, a novel mocap+video hybrid data acquisition technique is introduced to record high-fidelity head movement, eye gaze, and eyelid motion simultaneously.
NASA Astrophysics Data System (ADS)
Dobre, Robert A.; Negrescu, Cristian; Stanomir, Dumitru
2016-12-01
In many situations audio recordings can decide the fate of a trial when accepted as evidence. But until they can be taken into account they must be authenticated at first, but also the quality of the targeted content (speech in most cases) must be good enough to remove any doubt. In this scope two main directions of multimedia forensics come into play: content authentication and noise reduction. This paper presents an application that is included in the latter. If someone would like to conceal their conversation, the easiest way to do it would be to turn loud the nearest audio system. In this situation, if a microphone was placed close by, the recorded signal would be apparently useless because the speech signal would be masked by the loud music signal. The paper proposes an adaptive filters based solution to remove the musical content from a previously described signal mixture in order to recover the masked vocal signal. Two adaptive filtering algorithms were tested in the proposed solution: the Normalised Least Mean Squares (NLMS) and Recursive Least Squares (RLS). Their performances in the described situation were evaluated using Simulink, compared and included in the paper.
Hu, Yi; Loizou, Philipos C
2010-06-01
Attempts to develop noise-suppression algorithms that can significantly improve speech intelligibility in noise by cochlear implant (CI) users have met with limited success. This is partly because algorithms were sought that would work equally well in all listening situations. Accomplishing this has been quite challenging given the variability in the temporal/spectral characteristics of real-world maskers. A different approach is taken in the present study focused on the development of environment-specific noise suppression algorithms. The proposed algorithm selects a subset of the envelope amplitudes for stimulation based on the signal-to-noise ratio (SNR) of each channel. Binary classifiers, trained using data collected from a particular noisy environment, are first used to classify the mixture envelopes of each channel as either target-dominated (SNR>or=0 dB) or masker-dominated (SNR<0 dB). Only target-dominated channels are subsequently selected for stimulation. Results with CI listeners indicated substantial improvements (by nearly 44 percentage points at 5 dB SNR) in intelligibility with the proposed algorithm when tested with sentences embedded in three real-world maskers. The present study demonstrated that the environment-specific approach to noise reduction has the potential to restore speech intelligibility in noise to a level near to that attained in quiet.
A 4.8 kbps code-excited linear predictive coder
NASA Technical Reports Server (NTRS)
Tremain, Thomas E.; Campbell, Joseph P., Jr.; Welch, Vanoy C.
1988-01-01
A secure voice system STU-3 capable of providing end-to-end secure voice communications (1984) was developed. The terminal for the new system will be built around the standard LPC-10 voice processor algorithm. The performance of the present STU-3 processor is considered to be good, its response to nonspeech sounds such as whistles, coughs and impulse-like noises may not be completely acceptable. Speech in noisy environments also causes problems with the LPC-10 voice algorithm. In addition, there is always a demand for something better. It is hoped that LPC-10's 2.4 kbps voice performance will be complemented with a very high quality speech coder operating at a higher data rate. This new coder is one of a number of candidate algorithms being considered for an upgraded version of the STU-3 in late 1989. The problems of designing a code-excited linear predictive (CELP) coder to provide very high quality speech at a 4.8 kbps data rate that can be implemented on today's hardware are considered.
Burnett, Greg C.; Holzrichter, John F.; Ng, Lawrence C.
2002-01-01
Low power EM waves are used to detect motions of vocal tract tissues of the human speech system before, during, and after voiced speech. A voiced excitation function is derived. The excitation function provides speech production information to enhance speech characterization and to enable noise removal from human speech.
Speech enhancement using the modified phase-opponency model.
Deshmukh, Om D; Espy-Wilson, Carol Y; Carney, Laurel H
2007-06-01
In this paper we present a model called the Modified Phase-Opponency (MPO) model for single-channel speech enhancement when the speech is corrupted by additive noise. The MPO model is based on the auditory PO model, proposed for detection of tones in noise. The PO model includes a physiologically realistic mechanism for processing the information in neural discharge times and exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery by using a cross-auditory-nerve-fiber coincidence detection for extracting temporal cues. The MPO model alters the components of the PO model such that the basic functionality of the PO model is maintained but the properties of the model can be analyzed and modified independently. The MPO-based speech enhancement scheme does not need to estimate the noise characteristics nor does it assume that the noise satisfies any statistical model. The MPO technique leads to the lowest value of the LPC-based objective measures and the highest value of the perceptual evaluation of speech quality measure compared to other methods when the speech signals are corrupted by fluctuating noise. Combining the MPO speech enhancement technique with our aperiodicity, periodicity, and pitch detector further improves its performance.
Burnett, Greg C [Livermore, CA; Holzrichter, John F [Berkeley, CA; Ng, Lawrence C [Danville, CA
2006-08-08
The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.
Burnett, Greg C.; Holzrichter, John F.; Ng, Lawrence C.
2004-03-23
The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.
Burnett, Greg C.; Holzrichter, John F.; Ng, Lawrence C.
2006-02-14
The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.
Accurate visible speech synthesis based on concatenating variable length motion capture data.
Ma, Jiyong; Cole, Ron; Pellom, Bryan; Ward, Wayne; Wise, Barbara
2006-01-01
We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is desrcribed. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective evaluation and objective evaluation are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
Speech recognition: Acoustic-phonetic knowledge acquisition and representation
NASA Astrophysics Data System (ADS)
Zue, Victor W.
1988-09-01
The long-term research goal is to develop and implement speaker-independent continuous speech recognition systems. It is believed that the proper utilization of speech-specific knowledge is essential for such advanced systems. This research is thus directed toward the acquisition, quantification, and representation, of acoustic-phonetic and lexical knowledge, and the application of this knowledge to speech recognition algorithms. In addition, we are exploring new speech recognition alternatives based on artificial intelligence and connectionist techniques. We developed a statistical model for predicting the acoustic realization of stop consonants in various positions in the syllable template. A unification-based grammatical formalism was developed for incorporating this model into the lexical access algorithm. We provided an information-theoretic justification for the hierarchical structure of the syllable template. We analyzed segmented duration for vowels and fricatives in continuous speech. Based on contextual information, we developed durational models for vowels and fricatives that account for over 70 percent of the variance, using data from multiple, unknown speakers. We rigorously evaluated the ability of human spectrogram readers to identify stop consonants spoken by many talkers and in a variety of phonetic contexts. Incorporating the declarative knowledge used by the readers, we developed a knowledge-based system for stop identification. We achieved comparable system performance to that to the readers.
Methods of Improving Speech Intelligibility for Listeners with Hearing Resolution Deficit
2012-01-01
Abstract Methods developed for real-time time scale modification (TSM) of speech signal are presented. They are based on the non-uniform, speech rate depended SOLA algorithm (Synchronous Overlap and Add). Influence of the proposed method on the intelligibility of speech was investigated for two separate groups of listeners, i.e. hearing impaired children and elderly listeners. It was shown that for the speech with average rate equal to or higher than 6.48 vowels/s, all of the proposed methods have statistically significant impact on the improvement of speech intelligibility for hearing impaired children with reduced hearing resolution and one of the proposed methods significantly improves comprehension of speech in the group of elderly listeners with reduced hearing resolution. Virtual slides http://www.diagnosticpathology.diagnomx.eu/vs/2065486371761991 PMID:23009662
An integrated approach to improving noisy speech perception
NASA Astrophysics Data System (ADS)
Koval, Serguei; Stolbov, Mikhail; Smirnova, Natalia; Khitrov, Mikhail
2002-05-01
For a number of practical purposes and tasks, experts have to decode speech recordings of very poor quality. A combination of techniques is proposed to improve intelligibility and quality of distorted speech messages and thus facilitate their comprehension. Along with the application of noise cancellation and speech signal enhancement techniques removing and/or reducing various kinds of distortions and interference (primarily unmasking and normalization in time and frequency fields), the approach incorporates optimal listener expert tactics based on selective listening, nonstandard binaural listening, accounting for short-term and long-term human ear adaptation to noisy speech, as well as some methods of speech signal enhancement to support speech decoding during listening. The approach integrating the suggested techniques ensures high-quality ultimate results and has successfully been applied by Speech Technology Center experts and by numerous other users, mainly forensic institutions, to perform noisy speech records decoding for courts, law enforcement and emergency services, accident investigation bodies, etc.
An algorithm of improving speech emotional perception for hearing aid
NASA Astrophysics Data System (ADS)
Xi, Ji; Liang, Ruiyu; Fei, Xianju
2017-07-01
In this paper, a speech emotion recognition (SER) algorithm was proposed to improve the emotional perception of hearing-impaired people. The algorithm utilizes multiple kernel technology to overcome the drawback of SVM: slow training speed. Firstly, in order to improve the adaptive performance of Gaussian Radial Basis Function (RBF), the parameter determining the nonlinear mapping was optimized on the basis of Kernel target alignment. Then, the obtained Kernel Function was used as the basis kernel of Multiple Kernel Learning (MKL) with slack variable that could solve the over-fitting problem. However, the slack variable also brings the error into the result. Therefore, a soft-margin MKL was proposed to balance the margin against the error. Moreover, the relatively iterative algorithm was used to solve the combination coefficients and hyper-plane equations. Experimental results show that the proposed algorithm can acquire an accuracy of 90% for five kinds of emotions including happiness, sadness, anger, fear and neutral. Compared with KPCA+CCA and PIM-FSVM, the proposed algorithm has the highest accuracy.
[Improving speech comprehension using a new cochlear implant speech processor].
Müller-Deile, J; Kortmann, T; Hoppe, U; Hessel, H; Morsnowski, A
2009-06-01
The aim of this multicenter clinical field study was to assess the benefits of the new Freedom 24 sound processor for cochlear implant (CI) users implanted with the Nucleus 24 cochlear implant system. The study included 48 postlingually profoundly deaf experienced CI users who demonstrated speech comprehension performance with their current speech processor on the Oldenburg sentence test (OLSA) in quiet conditions of at least 80% correct scores and who were able to perform adaptive speech threshold testing using the OLSA in noisy conditions. Following baseline measures of speech comprehension performance with their current speech processor, subjects were upgraded to the Freedom 24 speech processor. After a take-home trial period of at least 2 weeks, subject performance was evaluated by measuring the speech reception threshold with the Freiburg multisyllabic word test and speech intelligibility with the Freiburg monosyllabic word test at 50 dB and 70 dB in the sound field. The results demonstrated highly significant benefits for speech comprehension with the new speech processor. Significant benefits for speech comprehension were also demonstrated with the new speech processor when tested in competing background noise.In contrast, use of the Abbreviated Profile of Hearing Aid Benefit (APHAB) did not prove to be a suitably sensitive assessment tool for comparative subjective self-assessment of hearing benefits with each processor. Use of the preprocessing algorithm known as adaptive dynamic range optimization (ADRO) in the Freedom 24 led to additional improvements over the standard upgrade map for speech comprehension in quiet and showed equivalent performance in noise. Through use of the preprocessing beam-forming algorithm BEAM, subjects demonstrated a highly significant improved signal-to-noise ratio for speech comprehension thresholds (i.e., signal-to-noise ratio for 50% speech comprehension scores) when tested with an adaptive procedure using the Oldenburg sentences in the clinical setting S(0)N(CI), with speech signal at 0 degrees and noise lateral to the CI at 90 degrees . With the convincing findings from our evaluations of this multicenter study cohort, a trial with the Freedom 24 sound processor for all suitable CI users is recommended. For evaluating the benefits of a new processor, the comparative assessment paradigm used in our study design would be considered ideal for use with individual patients.
2018-01-01
Abstract In real-world environments, humans comprehend speech by actively integrating prior knowledge (P) and expectations with sensory input. Recent studies have revealed effects of prior information in temporal and frontal cortical areas and have suggested that these effects are underpinned by enhanced encoding of speech-specific features, rather than a broad enhancement or suppression of cortical activity. However, in terms of the specific hierarchical stages of processing involved in speech comprehension, the effects of integrating bottom-up sensory responses and top-down predictions are still unclear. In addition, it is unclear whether the predictability that comes with prior information may differentially affect speech encoding relative to the perceptual enhancement that comes with that prediction. One way to investigate these issues is through examining the impact of P on indices of cortical tracking of continuous speech features. Here, we did this by presenting participants with degraded speech sentences that either were or were not preceded by a clear recording of the same sentences while recording non-invasive electroencephalography (EEG). We assessed the impact of prior information on an isolated index of cortical tracking that reflected phoneme-level processing. Our findings suggest the possibility that prior information affects the early encoding of natural speech in a dual manner. Firstly, the availability of prior information, as hypothesized, enhanced the perceived clarity of degraded speech, which was positively correlated with changes in phoneme-level encoding across subjects. In addition, P induced an overall reduction of this cortical measure, which we interpret as resulting from the increase in predictability. PMID:29662947
Burnett, Greg C.; Holzrichter, John F.; Ng, Lawrence C.
2006-04-25
The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.
Coding strategies for cochlear implants under adverse environments
NASA Astrophysics Data System (ADS)
Tahmina, Qudsia
Cochlear implants are electronic prosthetic devices that restores partial hearing in patients with severe to profound hearing loss. Although most coding strategies have significantly improved the perception of speech in quite listening conditions, there remains limitations on speech perception under adverse environments such as in background noise, reverberation and band-limited channels, and we propose strategies that improve the intelligibility of speech transmitted over the telephone networks, reverberated speech and speech in the presence of background noise. For telephone processed speech, we propose to examine the effects of adding low-frequency and high- frequency information to the band-limited telephone speech. Four listening conditions were designed to simulate the receiving frequency characteristics of telephone handsets. Results indicated improvement in cochlear implant and bimodal listening when telephone speech was augmented with high frequency information and therefore this study provides support for design of algorithms to extend the bandwidth towards higher frequencies. The results also indicated added benefit from hearing aids for bimodal listeners in all four types of listening conditions. Speech understanding in acoustically reverberant environments is always a difficult task for hearing impaired listeners. Reverberated sounds consists of direct sound, early reflections and late reflections. Late reflections are known to be detrimental to speech intelligibility. In this study, we propose a reverberation suppression strategy based on spectral subtraction to suppress the reverberant energies from late reflections. Results from listening tests for two reverberant conditions (RT60 = 0.3s and 1.0s) indicated significant improvement when stimuli was processed with SS strategy. The proposed strategy operates with little to no prior information on the signal and the room characteristics and therefore, can potentially be implemented in real-time CI speech processors. For speech in background noise, we propose a mechanism underlying the contribution of harmonics to the benefit of electroacoustic stimulations in cochlear implants. The proposed strategy is based on harmonic modeling and uses synthesis driven approach to synthesize the harmonics in voiced segments of speech. Based on objective measures, results indicated improvement in speech quality. This study warrants further work into development of algorithms to regenerate harmonics of voiced segments in the presence of noise.
Yoon, Sung Hoon; Nam, Kyoung Won; Yook, Sunhyun; Cho, Baek Hwan; Jang, Dong Pyo; Hong, Sung Hwa; Kim, In Young
2017-03-01
In an effort to improve hearing aid users' satisfaction, recent studies on trainable hearing aids have attempted to implement one or two environmental factors into training. However, it would be more beneficial to train the device based on the owner's personal preferences in a more expanded environmental acoustic conditions. Our study aimed at developing a trainable hearing aid algorithm that can reflect the user's individual preferences in a more extensive environmental acoustic conditions (ambient sound level, listening situation, and degree of noise suppression) and evaluated the perceptual benefit of the proposed algorithm. Ten normal hearing subjects participated in this study. Each subjects trained the algorithm to their personal preference and the trained data was used to record test sounds in three different settings to be utilized to evaluate the perceptual benefit of the proposed algorithm by performing the Comparison Mean Opinion Score test. Statistical analysis revealed that of the 10 subjects, four showed significant differences in amplification constant settings between the noise-only and speech-in-noise situation ( P <0.05) and one subject also showed significant difference between the speech-only and speech-in-noise situation ( P <0.05). Additionally, every subject preferred different β settings for beamforming in all different input sound levels. The positive findings from this study suggested that the proposed algorithm has potential to improve hearing aid users' personal satisfaction under various ambient situations.
Programming Deep Brain Stimulation for Parkinson's Disease: The Toronto Western Hospital Algorithms.
Picillo, Marina; Lozano, Andres M; Kou, Nancy; Puppi Munhoz, Renato; Fasano, Alfonso
2016-01-01
Deep brain stimulation (DBS) is an established and effective treatment for Parkinson's disease (PD). After surgery, a number of extensive programming sessions are performed to define the most optimal stimulation parameters. Programming sessions mainly rely only on neurologist's experience. As a result, patients often undergo inconsistent and inefficient stimulation changes, as well as unnecessary visits. We reviewed the literature on initial and follow-up DBS programming procedures and integrated our current practice at Toronto Western Hospital (TWH) to develop standardized DBS programming protocols. We propose four algorithms including the initial programming and specific algorithms tailored to symptoms experienced by patients following DBS: speech disturbances, stimulation-induced dyskinesia and gait impairment. We conducted a literature search of PubMed from inception to July 2014 with the keywords "deep brain stimulation", "festination", "freezing", "initial programming", "Parkinson's disease", "postural instability", "speech disturbances", and "stimulation induced dyskinesia". Seventy papers were considered for this review. Based on the literature review and our experience at TWH, we refined four algorithms for: (1) the initial programming stage, and management of symptoms following DBS, particularly addressing (2) speech disturbances, (3) stimulation-induced dyskinesia, and (4) gait impairment. We propose four algorithms tailored to an individualized approach to managing symptoms associated with DBS and disease progression in patients with PD. We encourage established as well as new DBS centers to test the clinical usefulness of these algorithms in supplementing the current standards of care. Copyright © 2016 Elsevier Inc. All rights reserved.
Korhonen, Petri; Kuk, Francis; Seper, Eric; Mørkebjerg, Martin; Roikjer, Majken
2017-01-01
Wind noise is a common problem reported by hearing aid wearers. The MarkeTrak VIII reported that 42% of hearing aid wearers are not satisfied with the performance of their hearing aids in situations where wind is present. The current study investigated the effect of a new wind noise attenuation (WNA) algorithm on subjective annoyance and speech recognition in the presence of wind. A single-blinded, repeated measures design was used. Fifteen experienced hearing aid wearers with bilaterally symmetrical (≤10 dB) mild-to-moderate sensorineural hearing loss participated in the study. Subjective rating for wind noise annoyance was measured for wind presented alone from 0° and 290° at wind speeds of 4, 5, 6, 7, and 10 m/sec. Phoneme identification performance was measured using Widex Office of Clinical Amplification Nonsense Syllable Test presented at 60, 65, 70, and 75 dB SPL from 270° in the presence of wind originating from 0° at a speed of 5 m/sec. The subjective annoyance from wind noise was reduced for wind originating from 0° at wind speeds from 4 to 7 m/sec. The largest improvement in phoneme identification with the WNA algorithm was 48.2% when speech was presented from 270° at 65 dB SPL and the wind originated from 0° azimuth at 5 m/sec. The WNA algorithm used in this study reduced subjective annoyance for wind speeds ranging from 4 to 7 m/sec. The algorithm was effective in improving speech identification in the presence of wind originating from 0° at 5 m/sec. These results suggest that the WNA algorithm used in the current study could expand the range of real-life situations where a hearing-impaired person can use the hearing aid optimally. American Academy of Audiology
Goldrick, Matthew; Keshet, Joseph; Gustafson, Erin; Heller, Jordana; Needle, Jeremy
2016-04-01
Traces of the cognitive mechanisms underlying speaking can be found within subtle variations in how we pronounce sounds. While speech errors have traditionally been seen as categorical substitutions of one sound for another, acoustic/articulatory analyses show they partially reflect the intended sound. When "pig" is mispronounced as "big," the resulting /b/ sound differs from correct productions of "big," moving towards intended "pig"-revealing the role of graded sound representations in speech production. Investigating the origins of such phenomena requires detailed estimation of speech sound distributions; this has been hampered by reliance on subjective, labor-intensive manual annotation. Computational methods can address these issues by providing for objective, automatic measurements. We develop a novel high-precision computational approach, based on a set of machine learning algorithms, for measurement of elicited speech. The algorithms are trained on existing manually labeled data to detect and locate linguistically relevant acoustic properties with high accuracy. Our approach is robust, is designed to handle mis-productions, and overall matches the performance of expert coders. It allows us to analyze a very large dataset of speech errors (containing far more errors than the total in the existing literature), illuminating properties of speech sound distributions previously impossible to reliably observe. We argue that this provides novel evidence that two sources both contribute to deviations in speech errors: planning processes specifying the targets of articulation and articulatory processes specifying the motor movements that execute this plan. These findings illustrate how a much richer picture of speech provides an opportunity to gain novel insights into language processing. Copyright © 2016 Elsevier B.V. All rights reserved.
Knipfer, Christian; Riemann, Max; Bocklet, Tobias; Noeth, Elmar; Schuster, Maria; Sokol, Biljana; Eitner, Stephan; Nkenke, Emeka; Stelzle, Florian
2014-01-01
Tooth loss and its prosthetic rehabilitation significantly affect speech intelligibility. However, little is known about the influence of speech deficiencies on oral health-related quality of life (OHRQoL). The aim of this study was to investigate whether speech intelligibility enhancement through prosthetic rehabilitation significantly influences OHRQoL in patients wearing complete maxillary dentures. Speech intelligibility by means of an automatic speech recognition system (ASR) was prospectively evaluated and compared with subjectively assessed Oral Health Impact Profile (OHIP) scores. Speech was recorded in 28 edentulous patients 1 week prior to the fabrication of new complete maxillary dentures and 6 months thereafter. Speech intelligibility was computed based on the word accuracy (WA) by means of an ASR and compared with a matched control group. One week before and 6 months after rehabilitation, patients assessed themselves for OHRQoL. Speech intelligibility improved significantly after 6 months. Subjects reported a significantly higher OHRQoL after maxillary rehabilitation with complete dentures. No significant correlation was found between the OHIP sum score or its subscales to the WA. Speech intelligibility enhancement achieved through the fabrication of new complete maxillary dentures might not be in the forefront of the patients' perception of their quality of life. For the improvement of OHRQoL in patients wearing complete maxillary dentures, food intake and mastication as well as freedom from pain play a more prominent role.
Evaluation of synthesized voice approach callouts /SYNCALL/
NASA Technical Reports Server (NTRS)
Simpson, C. A.
1981-01-01
The two basic approaches to the generation of 'synthesized' speech include a utilization of analog recorded human speech and a construction of speech entirely from algorithms applied to constants describing speech sounds. Given the availability of synthesized speech displays for man-machine systems, research is needed to study suggested applications for speech and design principles for speech displays. The present investigation is concerned with a study for which new performance measures were developed. A number of air carrier approach and landing accidents during low or impaired visibility have been associated with the absence of approach callouts. The study had the purpose to compare a pilot-not-flying (PNF) approach callout system to a system composed of PNF callouts augmented by an automatic synthesized voice callout system (SYNCALL). Pilots were found to favor the use of a SYNCALL system containing certain modifications.
A novel speech-processing strategy incorporating tonal information for cochlear implants.
Lan, N; Nie, K B; Gao, S K; Zeng, F G
2004-05-01
Good performance in cochlear implant users depends in large part on the ability of a speech processor to effectively decompose speech signals into multiple channels of narrow-band electrical pulses for stimulation of the auditory nerve. Speech processors that extract only envelopes of the narrow-band signals (e.g., the continuous interleaved sampling (CIS) processor) may not provide sufficient information to encode the tonal cues in languages such as Chinese. To improve the performance in cochlear implant users who speak tonal language, we proposed and developed a novel speech-processing strategy, which extracted both the envelopes of the narrow-band signals and the fundamental frequency (F0) of the speech signal, and used them to modulate both the amplitude and the frequency of the electrical pulses delivered to stimulation electrodes. We developed an algorithm to extract the fundatmental frequency and identified the general patterns of pitch variations of four typical tones in Chinese speech. The effectiveness of the extraction algorithm was verified with an artificial neural network that recognized the tonal patterns from the extracted F0 information. We then compared the novel strategy with the envelope-extraction CIS strategy in human subjects with normal hearing. The novel strategy produced significant improvement in perception of Chinese tones, phrases, and sentences. This novel processor with dynamic modulation of both frequency and amplitude is encouraging for the design of a cochlear implant device for sensorineurally deaf patients who speak tonal languages.
The Effectiveness of Clear Speech as a Masker
ERIC Educational Resources Information Center
Calandruccio, Lauren; Van Engen, Kristin; Dhar, Sumitrajit; Bradlow, Ann R.
2010-01-01
Purpose: It is established that speaking clearly is an effective means of enhancing intelligibility. Because any signal-processing scheme modeled after known acoustic-phonetic features of clear speech will likely affect both target and competing speech, it is important to understand how speech recognition is affected when a competing speech signal…
Left Lateralized Enhancement of Orofacial Somatosensory Processing Due to Speech Sounds
ERIC Educational Resources Information Center
Ito, Takayuki; Johns, Alexis R.; Ostry, David J.
2013-01-01
Purpose: Somatosensory information associated with speech articulatory movements affects the perception of speech sounds and vice versa, suggesting an intimate linkage between speech production and perception systems. However, it is unclear which cortical processes are involved in the interaction between speech sounds and orofacial somatosensory…
Binaural enhancement for bilateral cochlear implant users.
Brown, Christopher A
2014-01-01
Bilateral cochlear implant (BCI) users receive limited binaural cues and, thus, show little improvement to speech intelligibility from spatial cues. The feasibility of a method for enhancing the binaural cues available to BCI users is investigated. This involved extending interaural differences of levels, which typically are restricted to high frequencies, into the low-frequency region. Speech intelligibility was measured in BCI users listening over headphones and with direct stimulation, with a target talker presented to one side of the head in the presence of a masker talker on the other side. Spatial separation was achieved by applying either naturally occurring binaural cues or enhanced cues. In this listening configuration, BCI patients showed greater speech intelligibility with the enhanced binaural cues than with naturally occurring binaural cues. In some situations, it is possible for BCI users to achieve greater speech intelligibility when binaural cues are enhanced by applying interaural differences of levels in the low-frequency region.
Application of artifical intelligence principles to the analysis of "crazy" speech.
Garfield, D A; Rapp, C
1994-04-01
Artificial intelligence computer simulation methods can be used to investigate psychotic or "crazy" speech. Here, symbolic reasoning algorithms establish semantic networks that schematize speech. These semantic networks consist of two main structures: case frames and object taxonomies. Node-based reasoning rules apply to object taxonomies and pathway-based reasoning rules apply to case frames. Normal listeners may recognize speech as "crazy talk" based on violations of node- and pathway-based reasoning rules. In this article, three separate segments of schizophrenic speech illustrate violations of these rules. This artificial intelligence approach is compared and contrasted with other neurolinguistic approaches and is discussed as a conceptual link between neurobiological and psychodynamic understandings of psychopathology.
Acoustic assessment of speech privacy curtains in two nursing units
Pope, Diana S.; Miller-Klein, Erik T.
2016-01-01
Hospitals have complex soundscapes that create challenges to patient care. Extraneous noise and high reverberation rates impair speech intelligibility, which leads to raised voices. In an unintended spiral, the increasing noise may result in diminished speech privacy, as people speak loudly to be heard over the din. The products available to improve hospital soundscapes include construction materials that absorb sound (acoustic ceiling tiles, carpet, wall insulation) and reduce reverberation rates. Enhanced privacy curtains are now available and offer potential for a relatively simple way to improve speech privacy and speech intelligibility by absorbing sound at the hospital patient's bedside. Acoustic assessments were performed over 2 days on two nursing units with a similar design in the same hospital. One unit was built with the 1970s’ standard hospital construction and the other was newly refurbished (2013) with sound-absorbing features. In addition, we determined the effect of an enhanced privacy curtain versus standard privacy curtains using acoustic measures of speech privacy and speech intelligibility indexes. Privacy curtains provided auditory protection for the patients. In general, that protection was increased by the use of enhanced privacy curtains. On an average, the enhanced curtain improved sound absorption from 20% to 30%; however, there was considerable variability, depending on the configuration of the rooms tested. Enhanced privacy curtains provide measureable improvement to the acoustics of patient rooms but cannot overcome larger acoustic design issues. To shorten reverberation time, additional absorption, and compact and more fragmented nursing unit floor plate shapes should be considered. PMID:26780959
Acoustic assessment of speech privacy curtains in two nursing units.
Pope, Diana S; Miller-Klein, Erik T
2016-01-01
Hospitals have complex soundscapes that create challenges to patient care. Extraneous noise and high reverberation rates impair speech intelligibility, which leads to raised voices. In an unintended spiral, the increasing noise may result in diminished speech privacy, as people speak loudly to be heard over the din. The products available to improve hospital soundscapes include construction materials that absorb sound (acoustic ceiling tiles, carpet, wall insulation) and reduce reverberation rates. Enhanced privacy curtains are now available and offer potential for a relatively simple way to improve speech privacy and speech intelligibility by absorbing sound at the hospital patient's bedside. Acoustic assessments were performed over 2 days on two nursing units with a similar design in the same hospital. One unit was built with the 1970s' standard hospital construction and the other was newly refurbished (2013) with sound-absorbing features. In addition, we determined the effect of an enhanced privacy curtain versus standard privacy curtains using acoustic measures of speech privacy and speech intelligibility indexes. Privacy curtains provided auditory protection for the patients. In general, that protection was increased by the use of enhanced privacy curtains. On an average, the enhanced curtain improved sound absorption from 20% to 30%; however, there was considerable variability, depending on the configuration of the rooms tested. Enhanced privacy curtains provide measureable improvement to the acoustics of patient rooms but cannot overcome larger acoustic design issues. To shorten reverberation time, additional absorption, and compact and more fragmented nursing unit floor plate shapes should be considered.
Zhang, Xiaoyang; Xue, Lei; Zhang, Zhi; Zhang, Yiwen
2016-01-01
Health problems about children have been attracting much attention of parents and even the whole society all the time, among which, child-language development is a hot research topic. The experts and scholars have studied and found that the guardians taking appropriate intervention in children at the early stage can promote children's language and cognitive ability development effectively, and carry out analysis of quantity. The intervention of Artificial Intelligence Technology has effect on the autistic spectrum disorders of children obviously. This paper presents a speech signal analysis system for children, with preprocessing of the speaker speech signal, subsequent calculation of the number in the speech of guardians and children, and some other characteristic parameters or indicators (e.g cognizable syllable number, the continuity of the language). With these quantitative analysis tool and parameters, we can evaluate and analyze the quality of children's language and cognitive ability objectively and quantitatively to provide the basis for decision-making criteria for parents. Thereby, they can adopt appropriate measures for children to promote the development of children's language and cognitive status. In this paper, according to the existing study of children's language development, we put forward several indicators in the process of automatic measurement for language development which influence the formation of children's language. From the experimental results we can see that after the pretreatment (including signal enhancement, speech activity detection), both divergence algorithm calculation results and the later words count are quite satisfactory compared with the actual situation.
Parbery-Clark, Alexandra; Anderson, Samira; Hittner, Emily; Kraus, Nina
2012-01-01
Older adults frequently complain that while they can hear a person talking, they cannot understand what is being said; this difficulty is exacerbated by background noise. Peripheral hearing loss cannot fully account for this age-related decline in speech-in-noise ability, as declines in central processing also contribute to this problem. Given that musicians have enhanced speech-in-noise perception, we aimed to define the effects of musical experience on subcortical responses to speech and speech-in-noise perception in middle-aged adults. Results reveal that musicians have enhanced neural encoding of speech in quiet and noisy settings. Enhancements include faster neural response timing, higher neural response consistency, more robust encoding of speech harmonics, and greater neural precision. Taken together, we suggest that musical experience provides perceptual benefits in an aging population by strengthening the underlying neural pathways necessary for the accurate representation of important temporal and spectral features of sound. PMID:23189051
Bahrick, Lorraine E.; Lickliter, Robert; Castellanos, Irina
2014-01-01
Although research has demonstrated impressive face perception skills of young infants, little attention has focused on conditions that enhance versus impair infant face perception. The present studies tested the prediction, generated from the Intersensory Redundancy Hypothesis (IRH), that face discrimination, which relies on detection of visual featural information, would be impaired in the context of intersensory redundancy provided by audiovisual speech, and enhanced in the absence of intersensory redundancy (unimodal visual and asynchronous audiovisual speech) in early development. Later in development, following improvements in attention, faces should be discriminated in both redundant audiovisual and nonredundant stimulation. Results supported these predictions. Two-month-old infants discriminated a novel face in unimodal visual and asynchronous audiovisual speech but not in synchronous audiovisual speech. By 3 months, face discrimination was evident even during synchronous audiovisual speech. These findings indicate that infant face perception is enhanced and emerges developmentally earlier following unimodal visual than synchronous audiovisual exposure and that intersensory redundancy generated by naturalistic audiovisual speech can interfere with face processing. PMID:23244407
Communication Supports for People with Motor Speech Disorders
ERIC Educational Resources Information Center
Hanson, Elizabeth K.; Fager, Susan K.
2017-01-01
Communication supports for people with motor speech disorders can include strategies and technologies to supplement natural speech efforts, resolve communication breakdowns, and replace natural speech when necessary to enhance participation in all communicative contexts. This article emphasizes communication supports that can enhance…
Biological impact of preschool music classes on processing speech in noise
Strait, Dana L.; Parbery-Clark, Alexandra; O’Connell, Samantha; Kraus, Nina
2013-01-01
Musicians have increased resilience to the effects of noise on speech perception and its neural underpinnings. We do not know, however, how early in life these enhancements arise. We compared auditory brainstem responses to speech in noise in 32 preschool children, half of whom were engaged in music training. Thirteen children returned for testing one year later, permitting the first longitudinal assessment of subcortical auditory function with music training. Results indicate emerging neural enhancements in musically trained preschoolers for processing speech in noise. Longitudinal outcomes reveal that children enrolled in music classes experience further increased neural resilience to background noise following one year of continued training compared to nonmusician peers. Together, these data reveal enhanced development of neural mechanisms undergirding speech-in-noise perception in preschoolers undergoing music training and may indicate a biological impact of music training on auditory function during early childhood. PMID:23872199
Biological impact of preschool music classes on processing speech in noise.
Strait, Dana L; Parbery-Clark, Alexandra; O'Connell, Samantha; Kraus, Nina
2013-10-01
Musicians have increased resilience to the effects of noise on speech perception and its neural underpinnings. We do not know, however, how early in life these enhancements arise. We compared auditory brainstem responses to speech in noise in 32 preschool children, half of whom were engaged in music training. Thirteen children returned for testing one year later, permitting the first longitudinal assessment of subcortical auditory function with music training. Results indicate emerging neural enhancements in musically trained preschoolers for processing speech in noise. Longitudinal outcomes reveal that children enrolled in music classes experience further increased neural resilience to background noise following one year of continued training compared to nonmusician peers. Together, these data reveal enhanced development of neural mechanisms undergirding speech-in-noise perception in preschoolers undergoing music training and may indicate a biological impact of music training on auditory function during early childhood. Copyright © 2013 Elsevier Ltd. All rights reserved.
ERIC Educational Resources Information Center
Megnin-Viggars, Odette; Goswami, Usha
2013-01-01
Visual speech inputs can enhance auditory speech information, particularly in noisy or degraded conditions. The natural statistics of audiovisual speech highlight the temporal correspondence between visual and auditory prosody, with lip, jaw, cheek and head movements conveying information about the speech envelope. Low-frequency spatial and…
Contributions of local speech encoding and functional connectivity to audio-visual speech perception
Giordano, Bruno L; Ince, Robin A A; Gross, Joachim; Schyns, Philippe G; Panzeri, Stefano; Kayser, Christoph
2017-01-01
Seeing a speaker’s face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data obtained while human participants listened to speech of varying acoustic SNR and visual context. During high acoustic SNR speech encoding by temporally entrained brain activity was strong in temporal and inferior frontal cortex, while during low SNR strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioral benefit arising from seeing the speaker’s face was not predicted by changes in local encoding but rather by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments. DOI: http://dx.doi.org/10.7554/eLife.24763.001 PMID:28590903
"Who" is saying "what"? Brain-based decoding of human voice and speech.
Formisano, Elia; De Martino, Federico; Bonte, Milene; Goebel, Rainer
2008-11-07
Can we decipher speech content ("what" is being said) and speaker identity ("who" is saying it) from observations of brain activity of a listener? Here, we combine functional magnetic resonance imaging with a data-mining algorithm and retrieve what and whom a person is listening to from the neural fingerprints that speech and voice signals elicit in the listener's auditory cortex. These cortical fingerprints are spatially distributed and insensitive to acoustic variations of the input so as to permit the brain-based recognition of learned speech from unknown speakers and of learned voices from previously unheard utterances. Our findings unravel the detailed cortical layout and computational properties of the neural populations at the basis of human speech recognition and speaker identification.
Desmond, Jill M; Collins, Leslie M; Throckmorton, Chandra S
2014-06-01
Many cochlear implant (CI) listeners experience decreased speech recognition in reverberant environments [Kokkinakis et al., J. Acoust. Soc. Am. 129(5), 3221-3232 (2011)], which may be caused by a combination of self- and overlap-masking [Bolt and MacDonald, J. Acoust. Soc. Am. 21(6), 577-580 (1949)]. Determining the extent to which these effects decrease speech recognition for CI listeners may influence reverberation mitigation algorithms. This study compared speech recognition with ideal self-masking mitigation, with ideal overlap-masking mitigation, and with no mitigation. Under these conditions, mitigating either self- or overlap-masking resulted in significant improvements in speech recognition for both normal hearing subjects utilizing an acoustic model and for CI listeners using their own devices.
Somanath, Keerthan; Mau, Ted
2016-11-01
(1) To develop an automated algorithm to analyze electroglottographic (EGG) signal in continuous dysphonic speech, and (2) to identify EGG waveform parameters that correlate with the auditory-perceptual quality of strain in the speech of patients with adductor spasmodic dysphonia (ADSD). Software development with application in a prospective controlled study. EGG was recorded from 12 normal speakers and 12 subjects with ADSD reading excerpts from the Rainbow Passage. Data were processed by a new algorithm developed with the specific goal of analyzing continuous dysphonic speech. The contact quotient, pulse width, a new parameter peak skew, and various contact closing slope quotient and contact opening slope quotient measures were extracted. EGG parameters were compared between normal and ADSD speech. Within the ADSD group, intra-subject comparison was also made between perceptually strained syllables and unstrained syllables. The opening slope quotient SO7525 distinguished strained syllables from unstrained syllables in continuous speech within individual subjects with ADSD. The standard deviations, but not the means, of contact quotient, EGGW50, peak skew, and SO7525 were different between normal and ADSD speakers. The strain-stress pattern in continuous speech can be visualized as color gradients based on the variation of EGG parameter values. EGG parameters may provide a within-subject measure of vocal strain and serve as a marker for treatment response. The addition of EGG to multidimensional assessment may lead to improved characterization of the voice disturbance in ADSD. Copyright © 2016 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
Somanath, Keerthan; Mau, Ted
2016-01-01
Objectives (1) To develop an automated algorithm to analyze electroglottographic (EGG) signal in continuous, dysphonic speech, and (2) to identify EGG waveform parameters that correlate with the auditory-perceptual quality of strain in the speech of patients with adductor spasmodic dysphonia (ADSD). Study Design Software development with application in a prospective controlled study. Methods EGG was recorded from 12 normal speakers and 12 subjects with ADSD reading excerpts from the Rainbow Passage. Data were processed by a new algorithm developed with the specific goal of analyzing continuous dysphonic speech. The contact quotient (CQ), pulse width (EGGW), a new parameter peak skew, and various contact closing slope quotient (SC) and contact opening slope quotient (SO) measures were extracted. EGG parameters were compared between normal and ADSD speech. Within the ADSD group, intra-subject comparison was also made between perceptually strained syllables and unstrained syllables. Results The opening slope quotient SO7525 distinguished strained syllables from unstrained syllables in continuous speech within individual ADSD subjects. The standard deviations, but not the means, of CQ, EGGW50, peak skew, and SO7525 were different between normal and ADSD speakers. The strain-stress pattern in continuous speech can be visualized as color gradients based on the variation of EGG parameter values. Conclusions EGG parameters may provide a within-subject measure of vocal strain and serve as a marker for treatment response. The addition of EGG to multi-dimensional assessment may lead to improved characterization of the voice disturbance in ADSD. PMID:26739857
Crosse, Michael J; Lalor, Edmund C
2014-04-01
Visual speech can greatly enhance a listener's comprehension of auditory speech when they are presented simultaneously. Efforts to determine the neural underpinnings of this phenomenon have been hampered by the limited temporal resolution of hemodynamic imaging and the fact that EEG and magnetoencephalographic data are usually analyzed in response to simple, discrete stimuli. Recent research has shown that neuronal activity in human auditory cortex tracks the envelope of natural speech. Here, we exploit this finding by estimating a linear forward-mapping between the speech envelope and EEG data and show that the latency at which the envelope of natural speech is represented in cortex is shortened by >10 ms when continuous audiovisual speech is presented compared with audio-only speech. In addition, we use a reverse-mapping approach to reconstruct an estimate of the speech stimulus from the EEG data and, by comparing the bimodal estimate with the sum of the unimodal estimates, find no evidence of any nonlinear additive effects in the audiovisual speech condition. These findings point to an underlying mechanism that could account for enhanced comprehension during audiovisual speech. Specifically, we hypothesize that low-level acoustic features that are temporally coherent with the preceding visual stream may be synthesized into a speech object at an earlier latency, which may provide an extended period of low-level processing before extraction of semantic information.
Kim, Jongin; Park, Hyeong-jun
2016-01-01
The purpose of this study is to classify EEG data on imagined speech in a single trial. We recorded EEG data while five subjects imagined different vowels, /a/, /e/, /i/, /o/, and /u/. We divided each single trial dataset into thirty segments and extracted features (mean, variance, standard deviation, and skewness) from all segments. To reduce the dimension of the feature vector, we applied a feature selection algorithm based on the sparse regression model. These features were classified using a support vector machine with a radial basis function kernel, an extreme learning machine, and two variants of an extreme learning machine with different kernels. Because each single trial consisted of thirty segments, our algorithm decided the label of the single trial by selecting the most frequent output among the outputs of the thirty segments. As a result, we observed that the extreme learning machine and its variants achieved better classification rates than the support vector machine with a radial basis function kernel and linear discrimination analysis. Thus, our results suggested that EEG responses to imagined speech could be successfully classified in a single trial using an extreme learning machine with a radial basis function and linear kernel. This study with classification of imagined speech might contribute to the development of silent speech BCI systems. PMID:28097128
Lu, Lingxi; Bao, Xiaohan; Chen, Jing; Qu, Tianshu; Wu, Xihong; Li, Liang
2018-05-01
Under a noisy "cocktail-party" listening condition with multiple people talking, listeners can use various perceptual/cognitive unmasking cues to improve recognition of the target speech against informational speech-on-speech masking. One potential unmasking cue is the emotion expressed in a speech voice, by means of certain acoustical features. However, it was unclear whether emotionally conditioning a target-speech voice that has none of the typical acoustical features of emotions (i.e., an emotionally neutral voice) can be used by listeners for enhancing target-speech recognition under speech-on-speech masking conditions. In this study we examined the recognition of target speech against a two-talker speech masker both before and after the emotionally neutral target voice was paired with a loud female screaming sound that has a marked negative emotional valence. The results showed that recognition of the target speech (especially the first keyword in a target sentence) was significantly improved by emotionally conditioning the target speaker's voice. Moreover, the emotional unmasking effect was independent of the unmasking effect of the perceived spatial separation between the target speech and the masker. Also, (skin conductance) electrodermal responses became stronger after emotional learning when the target speech and masker were perceptually co-located, suggesting an increase of listening efforts when the target speech was informationally masked. These results indicate that emotionally conditioning the target speaker's voice does not change the acoustical parameters of the target-speech stimuli, but the emotionally conditioned vocal features can be used as cues for unmasking target speech.
Dual-learning systems during speech category learning
Chandrasekaran, Bharath; Yi, Han-Gyol; Maddox, W. Todd
2013-01-01
Dual-systems models of visual category learning posit the existence of an explicit, hypothesis-testing ‘reflective’ system, as well as an implicit, procedural-based ‘reflexive’ system. The reflective and reflexive learning systems are competitive and neurally dissociable. Relatively little is known about the role of these domain-general learning systems in speech category learning. Given the multidimensional, redundant, and variable nature of acoustic cues in speech categories, our working hypothesis is that speech categories are learned reflexively. To this end, we examined the relative contribution of these learning systems to speech learning in adults. Native English speakers learned to categorize Mandarin tone categories over 480 trials. The training protocol involved trial-by-trial feedback and multiple talkers. Experiment 1 and 2 examined the effect of manipulating the timing (immediate vs. delayed) and information content (full vs. minimal) of feedback. Dual-systems models of visual category learning predict that delayed feedback and providing rich, informational feedback enhance reflective learning, while immediate and minimally informative feedback enhance reflexive learning. Across the two experiments, our results show feedback manipulations that targeted reflexive learning enhanced category learning success. In Experiment 3, we examined the role of trial-to-trial talker information (mixed vs. blocked presentation) on speech category learning success. We hypothesized that the mixed condition would enhance reflexive learning by not allowing an association between talker-related acoustic cues and speech categories. Our results show that the mixed talker condition led to relatively greater accuracies. Our experiments demonstrate that speech categories are optimally learned by training methods that target the reflexive learning system. PMID:24002965
A Discriminative Approach to EEG Seizure Detection
Johnson, Ashley N.; Sow, Daby; Biem, Alain
2011-01-01
Seizures are abnormal sudden discharges in the brain with signatures represented in electroencephalograms (EEG). The efficacy of the application of speech processing techniques to discriminate between seizure and non-seizure states in EEGs is reported. The approach accounts for the challenges of unbalanced datasets (seizure and non-seizure), while also showing a system capable of real-time seizure detection. The Minimum Classification Error (MCE) algorithm, which is a discriminative learning algorithm with wide-use in speech processing, is applied and compared with conventional classification techniques that have already been applied to the discrimination between seizure and non-seizure states in the literature. The system is evaluated on 22 pediatric patients multi-channel EEG recordings. Experimental results show that the application of speech processing techniques and MCE compare favorably with conventional classification techniques in terms of classification performance, while requiring less computational overhead. The results strongly suggests the possibility of deploying the designed system at the bedside. PMID:22195192
Audiovisual Asynchrony Detection in Human Speech
ERIC Educational Resources Information Center
Maier, Joost X.; Di Luca, Massimiliano; Noppeney, Uta
2011-01-01
Combining information from the visual and auditory senses can greatly enhance intelligibility of natural speech. Integration of audiovisual speech signals is robust even when temporal offsets are present between the component signals. In the present study, we characterized the temporal integration window for speech and nonspeech stimuli with…
Determination of the Potential Benefit of Time-Frequency Gain Manipulation
Anzalone, Michael C.; Calandruccio, Lauren; Doherty, Karen A.; Carney, Laurel H.
2008-01-01
Objective The purpose of this study was to determine the maximum benefit provided by a time-frequency gain-manipulation algorithm for noise-reduction (NR) based on an ideal detector of speech energy. The amount of detected energy necessary to show benefit using this type of NR algorithm was examined, as well as the necessary speed and frequency resolution of the gain manipulation. Design NR was performed using time-frequency gain manipulation, wherein the gains of individual frequency bands depended on the absence or presence of speech energy within each band. Three different experiments were performed: (1) NR using ideal detectors, (2) NR with nonideal detectors, and (3) NR with ideal detectors and different processing speeds and frequency resolutions. All experiments were performed using the Hearing-in-Noise test (HINT). A total of 6 listeners with normal hearing and 14 listeners with hearing loss were tested. Results HINT thresholds improved for all listeners with NR based on the ideal detectors used in Experiment I. The nonideal detectors of Experiment II required detection of at least 90% of the speech energy before an improvement was seen in HINT thresholds. The results of Experiment III demonstrated that relatively high temporal resolution (<100 msec) was required by the NR algorithm to improve HINT thresholds. Conclusions The results indicated that a single-microphone NR system based on time-frequency gain manipulation improved the HINT thresholds of listeners. However, to obtain benefit in speech intelligibility, the detectors used in such a strategy were required to detect an unrealistically high percentage of the speech energy and to perform the gain manipulations on a fast temporal basis. PMID:16957499
Buechner, Andreas; Dyballa, Karl-Heinz; Hehrmann, Phillipp; Fredelake, Stefan; Lenarz, Thomas
2014-01-01
Objective To investigate the performance of monaural and binaural beamforming technology with an additional noise reduction algorithm, in cochlear implant recipients. Method This experimental study was conducted as a single subject repeated measures design within a large German cochlear implant centre. Twelve experienced users of an Advanced Bionics HiRes90K or CII implant with a Harmony speech processor were enrolled. The cochlear implant processor of each subject was connected to one of two bilaterally placed state-of-the-art hearing aids (Phonak Ambra) providing three alternative directional processing options: an omnidirectional setting, an adaptive monaural beamformer, and a binaural beamformer. A further noise reduction algorithm (ClearVoice) was applied to the signal on the cochlear implant processor itself. The speech signal was presented from 0° and speech shaped noise presented from loudspeakers placed at ±70°, ±135° and 180°. The Oldenburg sentence test was used to determine the signal-to-noise ratio at which subjects scored 50% correct. Results Both the adaptive and binaural beamformer were significantly better than the omnidirectional condition (5.3 dB±1.2 dB and 7.1 dB±1.6 dB (p<0.001) respectively). The best score was achieved with the binaural beamformer in combination with the ClearVoice noise reduction algorithm, with a significant improvement in SRT of 7.9 dB±2.4 dB (p<0.001) over the omnidirectional alone condition. Conclusions The study showed that the binaural beamformer implemented in the Phonak Ambra hearing aid could be used in conjunction with a Harmony speech processor to produce substantial average improvements in SRT of 7.1 dB. The monaural, adaptive beamformer provided an averaged SRT improvement of 5.3 dB. PMID:24755864
Schädler, Marc R; Warzybok, Anna; Kollmeier, Birger
2018-01-01
The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than -20 dB could not be predicted.
Schädler, Marc R.; Warzybok, Anna; Kollmeier, Birger
2018-01-01
The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than −20 dB could not be predicted. PMID:29692200
Bradlow, Ann R; Alexander, Jennifer A
2007-04-01
Previous research has shown that speech recognition differences between native and proficient non-native listeners emerge under suboptimal conditions. Current evidence has suggested that the key deficit that underlies this disproportionate effect of unfavorable listening conditions for non-native listeners is their less effective use of compensatory information at higher levels of processing to recover from information loss at the phoneme identification level. The present study investigated whether this non-native disadvantage could be overcome if enhancements at various levels of processing were presented in combination. Native and non-native listeners were presented with English sentences in which the final word varied in predictability and which were produced in either plain or clear speech. Results showed that, relative to the low-predictability-plain-speech baseline condition, non-native listener final word recognition improved only when both semantic and acoustic enhancements were available (high-predictability-clear-speech). In contrast, the native listeners benefited from each source of enhancement separately and in combination. These results suggests that native and non-native listeners apply similar strategies for speech-in-noise perception: The crucial difference is in the signal clarity required for contextual information to be effective, rather than in an inability of non-native listeners to take advantage of this contextual information per se.
Phase-Locked Responses to Speech in Human Auditory Cortex are Enhanced During Comprehension
Peelle, Jonathan E.; Gross, Joachim; Davis, Matthew H.
2013-01-01
A growing body of evidence shows that ongoing oscillations in auditory cortex modulate their phase to match the rhythm of temporally regular acoustic stimuli, increasing sensitivity to relevant environmental cues and improving detection accuracy. In the current study, we test the hypothesis that nonsensory information provided by linguistic content enhances phase-locked responses to intelligible speech in the human brain. Sixteen adults listened to meaningful sentences while we recorded neural activity using magnetoencephalography. Stimuli were processed using a noise-vocoding technique to vary intelligibility while keeping the temporal acoustic envelope consistent. We show that the acoustic envelopes of sentences contain most power between 4 and 7 Hz and that it is in this frequency band that phase locking between neural activity and envelopes is strongest. Bilateral oscillatory neural activity phase-locked to unintelligible speech, but this cerebro-acoustic phase locking was enhanced when speech was intelligible. This enhanced phase locking was left lateralized and localized to left temporal cortex. Together, our results demonstrate that entrainment to connected speech does not only depend on acoustic characteristics, but is also affected by listeners’ ability to extract linguistic information. This suggests a biological framework for speech comprehension in which acoustic and linguistic cues reciprocally aid in stimulus prediction. PMID:22610394
Phase-locked responses to speech in human auditory cortex are enhanced during comprehension.
Peelle, Jonathan E; Gross, Joachim; Davis, Matthew H
2013-06-01
A growing body of evidence shows that ongoing oscillations in auditory cortex modulate their phase to match the rhythm of temporally regular acoustic stimuli, increasing sensitivity to relevant environmental cues and improving detection accuracy. In the current study, we test the hypothesis that nonsensory information provided by linguistic content enhances phase-locked responses to intelligible speech in the human brain. Sixteen adults listened to meaningful sentences while we recorded neural activity using magnetoencephalography. Stimuli were processed using a noise-vocoding technique to vary intelligibility while keeping the temporal acoustic envelope consistent. We show that the acoustic envelopes of sentences contain most power between 4 and 7 Hz and that it is in this frequency band that phase locking between neural activity and envelopes is strongest. Bilateral oscillatory neural activity phase-locked to unintelligible speech, but this cerebro-acoustic phase locking was enhanced when speech was intelligible. This enhanced phase locking was left lateralized and localized to left temporal cortex. Together, our results demonstrate that entrainment to connected speech does not only depend on acoustic characteristics, but is also affected by listeners' ability to extract linguistic information. This suggests a biological framework for speech comprehension in which acoustic and linguistic cues reciprocally aid in stimulus prediction.
Cooke, Martin; Aubanel, Vincent
2017-01-01
Algorithmic modifications to the durational structure of speech designed to avoid intervals of intense masking lead to increases in intelligibility, but the basis for such gains is not clear. The current study addressed the possibility that the reduced information load produced by speech rate slowing might explain some or all of the benefits of durational modifications. The study also investigated the influence of masker stationarity on the effectiveness of durational changes. Listeners identified keywords in sentences that had undergone linear and nonlinear speech rate changes resulting in overall temporal lengthening in the presence of stationary and fluctuating maskers. Relative to unmodified speech, a slower speech rate produced no intelligibility gains for the stationary masker, suggesting that a reduction in information rate does not underlie intelligibility benefits of durationally modified speech. However, both linear and nonlinear modifications led to substantial intelligibility increases in fluctuating noise. One possibility is that overall increases in speech duration provide no new phonetic information in stationary masking conditions, but that temporal fluctuations in the background increase the likelihood of glimpsing additional salient speech cues. Alternatively, listeners may have benefitted from an increase in the difference in speech rates between the target and background. PMID:28618803
Onojima, Takayuki; Kitajo, Keiichi; Mizuhara, Hiroaki
2017-01-01
Neural oscillation is attracting attention as an underlying mechanism for speech recognition. Speech intelligibility is enhanced by the synchronization of speech rhythms and slow neural oscillation, which is typically observed as human scalp electroencephalography (EEG). In addition to the effect of neural oscillation, it has been proposed that speech recognition is enhanced by the identification of a speaker's motor signals, which are used for speech production. To verify the relationship between the effect of neural oscillation and motor cortical activity, we measured scalp EEG, and simultaneous EEG and functional magnetic resonance imaging (fMRI) during a speech recognition task in which participants were required to recognize spoken words embedded in noise sound. We proposed an index to quantitatively evaluate the EEG phase effect on behavioral performance. The results showed that the delta and theta EEG phase before speech inputs modulated the participant's response time when conducting speech recognition tasks. The simultaneous EEG-fMRI experiment showed that slow EEG activity was correlated with motor cortical activity. These results suggested that the effect of the slow oscillatory phase was associated with the activity of the motor cortex during speech recognition.
Emotion to emotion speech conversion in phoneme level
NASA Astrophysics Data System (ADS)
Bulut, Murtaza; Yildirim, Serdar; Busso, Carlos; Lee, Chul Min; Kazemzadeh, Ebrahim; Lee, Sungbok; Narayanan, Shrikanth
2004-10-01
Having an ability to synthesize emotional speech can make human-machine interaction more natural in spoken dialogue management. This study investigates the effectiveness of prosodic and spectral modification in phoneme level on emotion-to-emotion speech conversion. The prosody modification is performed with the TD-PSOLA algorithm (Moulines and Charpentier, 1990). We also transform the spectral envelopes of source phonemes to match those of target phonemes using LPC-based spectral transformation approach (Kain, 2001). Prosodic speech parameters (F0, duration, and energy) for target phonemes are estimated from the statistics obtained from the analysis of an emotional speech database of happy, angry, sad, and neutral utterances collected from actors. Listening experiments conducted with native American English speakers indicate that the modification of prosody only or spectrum only is not sufficient to elicit targeted emotions. The simultaneous modification of both prosody and spectrum results in higher acceptance rates of target emotions, suggesting that not only modeling speech prosody but also modeling spectral patterns that reflect underlying speech articulations are equally important to synthesize emotional speech with good quality. We are investigating suprasegmental level modifications for further improvement in speech quality and expressiveness.
Musicians change their tune: how hearing loss alters the neural code.
Parbery-Clark, Alexandra; Anderson, Samira; Kraus, Nina
2013-08-01
Individuals with sensorineural hearing loss have difficulty understanding speech, especially in background noise. This deficit remains even when audibility is restored through amplification, suggesting that mechanisms beyond a reduction in peripheral sensitivity contribute to the perceptual difficulties associated with hearing loss. Given that normal-hearing musicians have enhanced auditory perceptual skills, including speech-in-noise perception, coupled with heightened subcortical responses to speech, we aimed to determine whether similar advantages could be observed in middle-aged adults with hearing loss. Results indicate that musicians with hearing loss, despite self-perceptions of average performance for understanding speech in noise, have a greater ability to hear in noise relative to nonmusicians. This is accompanied by more robust subcortical encoding of sound (e.g., stimulus-to-response correlations and response consistency) as well as more resilient neural responses to speech in the presence of background noise (e.g., neural timing). Musicians with hearing loss also demonstrate unique neural signatures of spectral encoding relative to nonmusicians: enhanced neural encoding of the speech-sound's fundamental frequency but not of its upper harmonics. This stands in contrast to previous outcomes in normal-hearing musicians, who have enhanced encoding of the harmonics but not the fundamental frequency. Taken together, our data suggest that although hearing loss modifies a musician's spectral encoding of speech, the musician advantage for perceiving speech in noise persists in a hearing-impaired population by adaptively strengthening underlying neural mechanisms for speech-in-noise perception. Copyright © 2013 Elsevier B.V. All rights reserved.
Pamplona, María Del Carmen; Ysunza, Pablo Antonio; Morales, Santiago
2017-02-01
Children with cleft palate frequently show speech disorders known as compensatory articulation. Compensatory articulation requires a prolonged period of speech intervention that should include reinforcement at home. However, frequently relatives do not know how to work with their children at home. To study whether the use of audiovisual materials especially designed for complementing speech pathology treatment in children with compensatory articulation can be effective for stimulating articulation practice at home and consequently enhancing speech normalization in children with cleft palate. Eighty-two patients with compensatory articulation were studied. Patients were randomly divided into two groups. Both groups received speech pathology treatment aimed to correct articulation placement. In addition, patients from the active group received a set of audiovisual materials to be used at home. Parents were instructed about strategies and ideas about how to use the materials with their children. Severity of compensatory articulation was compared at the onset and at the end of the speech intervention. After the speech therapy period, the group of patients using audiovisual materials at home demonstrated significantly greater improvement in articulation, as compared with the patients receiving speech pathology treatment on - site without audiovisual supporting materials. The results of this study suggest that audiovisual materials especially designed for practicing adequate articulation placement at home can be effective for reinforcing and enhancing speech pathology treatment of patients with cleft palate and compensatory articulation. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Use of Computer Speech Technologies To Enhance Learning.
ERIC Educational Resources Information Center
Ferrell, Joe
1999-01-01
Discusses the design of an innovative learning system that uses new technologies for the man-machine interface, incorporating a combination of Automatic Speech Recognition (ASR) and Text To Speech (TTS) synthesis. Highlights include using speech technologies to mimic the attributes of the ideal tutor and design features. (AEF)
A Program Structure for Event-Based Speech Synthesis by Rules within a Flexible Segmental Framework.
ERIC Educational Resources Information Center
Hill, David R.
1978-01-01
A program structure based on recently developed techniques for operating system simulation has the required flexibility for use as a speech synthesis algorithm research framework. This program makes synthesis possible with less rigid time and frequency-component structure than simpler schemes. It also meets real-time operation and memory-size…
The Effects of Digital Noise Reduction on the Acceptance of Background Noise
Mueller, H. Gustav; Weber, Jennifer; Hornsby, Benjamin W. Y.
2006-01-01
Modern hearing aids commonly employ digital noise reduction (DNR) algorithms. The potential benefit of these algorithms is to provide improved speech understanding in noise or, at the least, to provide relaxed listening or increased ease of listening. In this study, 22 adults were fitted with 16-channel wide-dynamic-range compression hearing aids containing DNR processing. The DNR includes both modulation-based and Wiener-filter-type algorithms working simultaneously. Both speech intelligibility and acceptable noise level (ANL) were assessed using the Hearing in Noise Test (HINT) with DNR on and DNR off. The ANL was also assessed without hearing aids. The results showed a significant mean improvement for the ANL (4.2 dB) for the DNR-on condition when compared to DNR-off condition. Moreover, there was a significant correlation between the magnitude of ANL improvement (relative to DNR on) and the DNR-off ANL. There was no significant mean improvement for the HINT for the DNR-on condition, and on an individual basis, the HINT score did not significantly correlate with either aided ANL (DNR on or DNR off). These findings suggest that at least within the constraints of the DNR algorithms and test conditions employed in this study, DNR can significantly improve the clinically measured ANL, which may result in improved ease of listening for speech-in-noise situations. PMID:16959732
Real-time loudness normalisation with combined cochlear implant and hearing aid stimulation
Van Eeckhoutte, Maaike; Van Deun, Lieselot; Francart, Tom
2018-01-01
Background People who use a cochlear implant together with a contralateral hearing aid—so-called bimodal listeners—have poor localisation abilities and sounds are often not balanced in loudness across ears. In order to address the latter, a loudness balancing algorithm was created, which equalises the loudness growth functions for the two ears. The algorithm uses loudness models in order to continuously adjust the two signals to loudness targets. Previous tests demonstrated improved binaural balance, improved localisation, and better speech intelligibility in quiet for soft phonemes. In those studies, however, all stimuli were preprocessed so spontaneous head movements and individual head-related transfer functions were not taken into account. Furthermore, the hearing aid processing was linear. Study design In the present study, we simplified the acoustical loudness model and implemented the algorithm in a real-time system. We tested bimodal listeners on speech perception and on sound localisation, both in normal loudness growth configuration and in a configuration with a modified loudness growth function. We also used linear and compressive hearing aids. Results The comparison between the original acoustical loudness model and the new simplified model showed loudness differences below 3% for almost all tested speech-like stimuli and levels. We found no effect of balancing the loudness growth across ears for speech perception ability in quiet and in noise. We found some small improvements in localisation performance. Further investigation with a larger sample size is required. PMID:29617421
Zhang, Xiaoyang; Xue, Lei; Zhang, Zhi; Zhang, Yiwen
2016-01-01
Background: Health problems about children have been attracting much attention of parents and even the whole society all the time, among which, child-language development is a hot research topic. The experts and scholars have studied and found that the guardians taking appropriate intervention in children at the early stage can promote children’s language and cognitive ability development effectively, and carry out analysis of quantity. The intervention of Artificial Intelligence Technology has effect on the autistic spectrum disorders of children obviously. Objective and Methods: This paper presents a speech signal analysis system for children, with preprocessing of the speaker speech signal, subsequent calculation of the number in the speech of guardians and children, and some other characteristic parameters or indicators (e.g cognizable syllable number, the continuity of the language). Results: With these quantitative analysis tool and parameters, we can evaluate and analyze the quality of children’s language and cognitive ability objectively and quantitatively to provide the basis for decision-making criteria for parents. Thereby, they can adopt appropriate measures for children to promote the development of children's language and cognitive status. Conclusion: In this paper, according to the existing study of children’s language development, we put forward several indicators in the process of automatic measurement for language development which influence the formation of children’s language. From the experimental results we can see that after the pretreatment (including signal enhancement, speech activity detection), both divergence algorithm calculation results and the later words count are quite satisfactory compared with the actual situation. PMID:27583037
How should a speech recognizer work?
Scharenborg, Odette; Norris, Dennis; Bosch, Louis; McQueen, James M
2005-11-12
Although researchers studying human speech recognition (HSR) and automatic speech recognition (ASR) share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of communication follows largely from the fact that research in these related fields has focused on the mechanics of how speech can be recognized. In Marr's (1982) terms, emphasis has been on the algorithmic and implementational levels rather than on the computational level. In this article, we provide a computational-level analysis of the task of speech recognition, which reveals the close parallels between research concerned with HSR and ASR. We illustrate this relation by presenting a new computational model of human spoken-word recognition, built using techniques from the field of ASR that, in contrast to current existing models of HSR, recognizes words from real speech input. 2005 Lawrence Erlbaum Associates, Inc.
Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers
NASA Astrophysics Data System (ADS)
Caballero Morales, Santiago Omar; Cox, Stephen J.
2009-12-01
Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.
Musical training during early childhood enhances the neural encoding of speech in noise
Strait, Dana L.; Parbery-Clark, Alexandra; Hittner, Emily; Kraus, Nina
2012-01-01
For children, learning often occurs in the presence of background noise. As such, there is growing desire to improve a child’s access to a target signal in noise. Given adult musicians’ perceptual and neural speech-in-noise enhancements, we asked whether similar effects are present in musically-trained children. We assessed the perception and subcortical processing of speech in noise and related cognitive abilities in musician and nonmusician children that were matched for a variety of overarching factors. Outcomes reveal that musicians’ advantages for processing speech in noise are present during pivotal developmental years. Supported by correlations between auditory working memory and attention and auditory brainstem response properties, we propose that musicians’ perceptual and neural enhancements are driven in a top-down manner by strengthened cognitive abilities with training. Our results may be considered by professionals involved in the remediation of language-based learning deficits, which are often characterized by poor speech perception in noise. PMID:23102977
Motor excitability during visual perception of known and unknown spoken languages.
Swaminathan, Swathi; MacSweeney, Mairéad; Boyles, Rowan; Waters, Dafydd; Watkins, Kate E; Möttönen, Riikka
2013-07-01
It is possible to comprehend speech and discriminate languages by viewing a speaker's articulatory movements. Transcranial magnetic stimulation studies have shown that viewing speech enhances excitability in the articulatory motor cortex. Here, we investigated the specificity of this enhanced motor excitability in native and non-native speakers of English. Both groups were able to discriminate between speech movements related to a known (i.e., English) and unknown (i.e., Hebrew) language. The motor excitability was higher during observation of a known language than an unknown language or non-speech mouth movements, suggesting that motor resonance is enhanced specifically during observation of mouth movements that convey linguistic information. Surprisingly, however, the excitability was equally high during observation of a static face. Moreover, the motor excitability did not differ between native and non-native speakers. These findings suggest that the articulatory motor cortex processes several kinds of visual cues during speech communication. Crown Copyright © 2013. Published by Elsevier Inc. All rights reserved.
Multimodal Speech Capture System for Speech Rehabilitation and Learning.
Sebkhi, Nordine; Desai, Dhyey; Islam, Mohammad; Lu, Jun; Wilson, Kimberly; Ghovanloo, Maysam
2017-11-01
Speech-language pathologists (SLPs) are trained to correct articulation of people diagnosed with motor speech disorders by analyzing articulators' motion and assessing speech outcome while patients speak. To assist SLPs in this task, we are presenting the multimodal speech capture system (MSCS) that records and displays kinematics of key speech articulators, the tongue and lips, along with voice, using unobtrusive methods. Collected speech modalities, tongue motion, lips gestures, and voice are visualized not only in real-time to provide patients with instant feedback but also offline to allow SLPs to perform post-analysis of articulators' motion, particularly the tongue, with its prominent but hardly visible role in articulation. We describe the MSCS hardware and software components, and demonstrate its basic visualization capabilities by a healthy individual repeating the words "Hello World." A proof-of-concept prototype has been successfully developed for this purpose, and will be used in future clinical studies to evaluate its potential impact on accelerating speech rehabilitation by enabling patients to speak naturally. Pattern matching algorithms to be applied to the collected data can provide patients with quantitative and objective feedback on their speech performance, unlike current methods that are mostly subjective, and may vary from one SLP to another.
NASA Astrophysics Data System (ADS)
Jelinek, H. J.
1986-01-01
This is the Final Report of Electronic Design Associates on its Phase I SBIR project. The purpose of this project is to develop a method for correcting helium speech, as experienced in diver-surface communication. The goal of the Phase I study was to design, prototype, and evaluate a real time helium speech corrector system based upon digital signal processing techniques. The general approach was to develop hardware (an IBM PC board) to digitize helium speech and software (a LAMBDA computer based simulation) to translate the speech. As planned in the study proposal, this initial prototype may now be used to assess expected performance from a self contained real time system which uses an identical algorithm. The Final Report details the work carried out to produce the prototype system. Four major project tasks were: a signal processing scheme for converting helium speech to normal sounding speech was generated. The signal processing scheme was simulated on a general purpose (LAMDA) computer. Actual helium speech was supplied to the simulation and the converted speech was generated. An IBM-PC based 14 bit data Input/Output board was designed and built. A bibliography of references on speech processing was generated.
Shahin, Antoine J; Shen, Stanley; Kerlin, Jess R
2017-01-01
We examined the relationship between tolerance for audiovisual onset asynchrony (AVOA) and the spectrotemporal fidelity of the spoken words and the speaker's mouth movements. In two experiments that only varied in the temporal order of sensory modality, visual speech leading (exp1) or lagging (exp2) acoustic speech, participants watched intact and blurred videos of a speaker uttering trisyllabic words and nonwords that were noise vocoded with 4-, 8-, 16-, and 32-channels. They judged whether the speaker's mouth movements and the speech sounds were in-sync or out-of-sync . Individuals perceived synchrony (tolerated AVOA) on more trials when the acoustic speech was more speech-like (8 channels and higher vs. 4 channels), and when visual speech was intact than blurred (exp1 only). These findings suggest that enhanced spectrotemporal fidelity of the audiovisual (AV) signal prompts the brain to widen the window of integration promoting the fusion of temporally distant AV percepts.
Infant-Directed Speech Enhances Attention to Speech in Deaf Infants with Cochlear Implants
ERIC Educational Resources Information Center
Wang, Yuanyuan; Bergeson, Tonya R.; Houston, Derek M.
2017-01-01
Purpose: Both theoretical models of infant language acquisition and empirical studies posit important roles for attention to speech in early language development. However, deaf infants with cochlear implants (CIs) show reduced attention to speech as compared with their peers with normal hearing (NH; Horn, Davis, Pisoni, & Miyamoto, 2005;…
ERIC Educational Resources Information Center
Snyder, Gregory J.; Hough, Monica Strauss; Blanchet, Paul; Ivy, Lennette J.; Waddell, Dwight
2009-01-01
Purpose: Relatively recent research documents that visual choral speech, which represents an externally generated form of synchronous visual speech feedback, significantly enhanced fluency in those who stutter. As a consequence, it was hypothesized that self-generated synchronous and asynchronous visual speech feedback would likewise enhance…
ERIC Educational Resources Information Center
Lockart, Rebekah; McLeod, Sharynne
2013-01-01
Purpose: To investigate speech-language pathology students' ability to identify errors and transcribe typical and atypical speech in Cantonese, a nonnative language. Method: Thirty-three English-speaking speech-language pathology students completed 3 tasks in an experimental within-subjects design. Results: Task 1 (baseline) involved transcribing…
The role of left inferior frontal cortex during audiovisual speech perception in infants.
Altvater-Mackensen, Nicole; Grossmann, Tobias
2016-06-01
In the first year of life, infants' speech perception attunes to their native language. While the behavioral changes associated with native language attunement are fairly well mapped, the underlying mechanisms and neural processes are still only poorly understood. Using fNIRS and eye tracking, the current study investigated 6-month-old infants' processing of audiovisual speech that contained matching or mismatching auditory and visual speech cues. Our results revealed that infants' speech-sensitive brain responses in inferior frontal brain regions were lateralized to the left hemisphere. Critically, our results further revealed that speech-sensitive left inferior frontal regions showed enhanced responses to matching when compared to mismatching audiovisual speech, and that infants with a preference to look at the speaker's mouth showed an enhanced left inferior frontal response to speech compared to infants with a preference to look at the speaker's eyes. These results suggest that left inferior frontal regions play a crucial role in associating information from different modalities during native language attunement, fostering the formation of multimodal phonological categories. Copyright © 2016 Elsevier Inc. All rights reserved.
Automatic speech recognition using a predictive echo state network classifier.
Skowronski, Mark D; Harris, John G
2007-04-01
We have combined an echo state network (ESN) with a competitive state machine framework to create a classification engine called the predictive ESN classifier. We derive the expressions for training the predictive ESN classifier and show that the model was significantly more noise robust compared to a hidden Markov model in noisy speech classification experiments by 8+/-1 dB signal-to-noise ratio. The simple training algorithm and noise robustness of the predictive ESN classifier make it an attractive classification engine for automatic speech recognition.
A Mis-recognized Medical Vocabulary Correction System for Speech-based Electronic Medical Record
Seo, Hwa Jeong; Kim, Ju Han; Sakabe, Nagamasa
2002-01-01
Speech recognition as an input tool for electronic medical record (EMR) enables efficient data entry at the point of care. However, the recognition accuracy for medical vocabulary is much poorer than that for doctor-patient dialogue. We developed a mis-recognized medical vocabulary correction system based on syllable-by-syllable comparison of speech text against medical vocabulary database. Using specialty medical vocabulary, the algorithm detects and corrects mis-recognized medical vocabularies in narrative text. Our preliminary evaluation showed 94% of accuracy in mis-recognized medical vocabulary correction.
Comparing Binaural Pre-processing Strategies III
Warzybok, Anna; Ernst, Stephan M. A.
2015-01-01
A comprehensive evaluation of eight signal pre-processing strategies, including directional microphones, coherence filters, single-channel noise reduction, binaural beamformers, and their combinations, was undertaken with normal-hearing (NH) and hearing-impaired (HI) listeners. Speech reception thresholds (SRTs) were measured in three noise scenarios (multitalker babble, cafeteria noise, and single competing talker). Predictions of three common instrumental measures were compared with the general perceptual benefit caused by the algorithms. The individual SRTs measured without pre-processing and individual benefits were objectively estimated using the binaural speech intelligibility model. Ten listeners with NH and 12 HI listeners participated. The participants varied in age and pure-tone threshold levels. Although HI listeners required a better signal-to-noise ratio to obtain 50% intelligibility than listeners with NH, no differences in SRT benefit from the different algorithms were found between the two groups. With the exception of single-channel noise reduction, all algorithms showed an improvement in SRT of between 2.1 dB (in cafeteria noise) and 4.8 dB (in single competing talker condition). Model predictions with binaural speech intelligibility model explained 83% of the measured variance of the individual SRTs in the no pre-processing condition. Regarding the benefit from the algorithms, the instrumental measures were not able to predict the perceptual data in all tested noise conditions. The comparable benefit observed for both groups suggests a possible application of noise reduction schemes for listeners with different hearing status. Although the model can predict the individual SRTs without pre-processing, further development is necessary to predict the benefits obtained from the algorithms at an individual level. PMID:26721922
Visual input enhances selective speech envelope tracking in auditory cortex at a "cocktail party".
Zion Golumbic, Elana; Cogan, Gregory B; Schroeder, Charles E; Poeppel, David
2013-01-23
Our ability to selectively attend to one auditory signal amid competing input streams, epitomized by the "Cocktail Party" problem, continues to stimulate research from various approaches. How this demanding perceptual feat is achieved from a neural systems perspective remains unclear and controversial. It is well established that neural responses to attended stimuli are enhanced compared with responses to ignored ones, but responses to ignored stimuli are nonetheless highly significant, leading to interference in performance. We investigated whether congruent visual input of an attended speaker enhances cortical selectivity in auditory cortex, leading to diminished representation of ignored stimuli. We recorded magnetoencephalographic signals from human participants as they attended to segments of natural continuous speech. Using two complementary methods of quantifying the neural response to speech, we found that viewing a speaker's face enhances the capacity of auditory cortex to track the temporal speech envelope of that speaker. This mechanism was most effective in a Cocktail Party setting, promoting preferential tracking of the attended speaker, whereas without visual input no significant attentional modulation was observed. These neurophysiological results underscore the importance of visual input in resolving perceptual ambiguity in a noisy environment. Since visual cues in speech precede the associated auditory signals, they likely serve a predictive role in facilitating auditory processing of speech, perhaps by directing attentional resources to appropriate points in time when to-be-attended acoustic input is expected to arrive.
François, Clément; Schön, Daniele
2014-02-01
There is increasing evidence that humans and other nonhuman mammals are sensitive to the statistical structure of auditory input. Indeed, neural sensitivity to statistical regularities seems to be a fundamental biological property underlying auditory learning. In the case of speech, statistical regularities play a crucial role in the acquisition of several linguistic features, from phonotactic to more complex rules such as morphosyntactic rules. Interestingly, a similar sensitivity has been shown with non-speech streams: sequences of sounds changing in frequency or timbre can be segmented on the sole basis of conditional probabilities between adjacent sounds. We recently ran a set of cross-sectional and longitudinal experiments showing that merging music and speech information in song facilitates stream segmentation and, further, that musical practice enhances sensitivity to statistical regularities in speech at both neural and behavioral levels. Based on recent findings showing the involvement of a fronto-temporal network in speech segmentation, we defend the idea that enhanced auditory learning observed in musicians originates via at least three distinct pathways: enhanced low-level auditory processing, enhanced phono-articulatory mapping via the left Inferior Frontal Gyrus and Pre-Motor cortex and increased functional connectivity within the audio-motor network. Finally, we discuss how these data predict a beneficial use of music for optimizing speech acquisition in both normal and impaired populations. Copyright © 2013 Elsevier B.V. All rights reserved.
ERIC Educational Resources Information Center
Ural, A. Engin; Yuret, Deniz; Ketrez, F. Nihan; Kocbas, Dilara; Kuntay, Aylin C.
2009-01-01
The syntactic bootstrapping mechanism of verb learning was evaluated against child-directed speech in Turkish, a language with rich morphology, nominal ellipsis and free word order. Machine-learning algorithms were run on transcribed caregiver speech directed to two Turkish learners (one hour every two weeks between 0;9 to 1;10) of different…
Analysis of 3-D Tongue Motion From Tagged and Cine Magnetic Resonance Images
Woo, Jonghye; Lee, Junghoon; Murano, Emi Z.; Stone, Maureen; Prince, Jerry L.
2016-01-01
Purpose Measuring tongue deformation and internal muscle motion during speech has been a challenging task because the tongue deforms in 3 dimensions, contains interdigitated muscles, and is largely hidden within the vocal tract. In this article, a new method is proposed to analyze tagged and cine magnetic resonance images of the tongue during speech in order to estimate 3-dimensional tissue displacement and deformation over time. Method The method involves computing 2-dimensional motion components using a standard tag-processing method called harmonic phase, constructing superresolution tongue volumes using cine magnetic resonance images, segmenting the tongue region using a random-walker algorithm, and estimating 3-dimensional tongue motion using an incompressible deformation estimation algorithm. Results Evaluation of the method is presented with a control group and a group of people who had received a glossectomy carrying out a speech task. A 2-step principal-components analysis is then used to reveal the unique motion patterns of the subjects. Azimuth motion angles and motion on the mirrored hemi-tongues are analyzed. Conclusion Tests of the method with a various collection of subjects show its capability of capturing patient motion patterns and indicate its potential value in future speech studies. PMID:27295428
Development of a two wheeled self balancing robot with speech recognition and navigation algorithm
NASA Astrophysics Data System (ADS)
Rahman, Md. Muhaimin; Ashik-E-Rasul, Haq, Nowab. Md. Aminul; Hassan, Mehedi; Hasib, Irfan Mohammad Al; Hassan, K. M. Rafidh
2016-07-01
This paper is aimed to discuss modeling, construction and development of navigation algorithm of a two wheeled self balancing mobile robot in an enclosure. In this paper, we have discussed the design of two of the main controller algorithms, namely PID algorithms, on the robot model. Simulation is performed in the SIMULINK environment. The controller is developed primarily for self-balancing of the robot and also it's positioning. As for the navigation in an enclosure, template matching algorithm is proposed for precise measurement of the robot position. The navigation system needs to be calibrated before navigation process starts. Almost all of the earlier template matching algorithms that can be found in the open literature can only trace the robot. But the proposed algorithm here can also locate the position of other objects in an enclosure, like furniture, tables etc. This will enable the robot to know the exact location of every stationary object in the enclosure. Moreover, some additional features, such as Speech Recognition and Object Detection, are added. For Object Detection, the single board Computer Raspberry Pi is used. The system is programmed to analyze images captured via the camera, which are then processed through background subtraction, followed by active noise reduction.
Musical intervention enhances infants’ neural processing of temporal structure in music and speech
Zhao, T. Christina; Kuhl, Patricia K.
2016-01-01
Individuals with music training in early childhood show enhanced processing of musical sounds, an effect that generalizes to speech processing. However, the conclusions drawn from previous studies are limited due to the possible confounds of predisposition and other factors affecting musicians and nonmusicians. We used a randomized design to test the effects of a laboratory-controlled music intervention on young infants’ neural processing of music and speech. Nine-month-old infants were randomly assigned to music (intervention) or play (control) activities for 12 sessions. The intervention targeted temporal structure learning using triple meter in music (e.g., waltz), which is difficult for infants, and it incorporated key characteristics of typical infant music classes to maximize learning (e.g., multimodal, social, and repetitive experiences). Controls had similar multimodal, social, repetitive play, but without music. Upon completion, infants’ neural processing of temporal structure was tested in both music (tones in triple meter) and speech (foreign syllable structure). Infants’ neural processing was quantified by the mismatch response (MMR) measured with a traditional oddball paradigm using magnetoencephalography (MEG). The intervention group exhibited significantly larger MMRs in response to music temporal structure violations in both auditory and prefrontal cortical regions. Identical results were obtained for temporal structure changes in speech. The intervention thus enhanced temporal structure processing not only in music, but also in speech, at 9 mo of age. We argue that the intervention enhanced infants’ ability to extract temporal structure information and to predict future events in time, a skill affecting both music and speech processing. PMID:27114512
Musical intervention enhances infants' neural processing of temporal structure in music and speech.
Zhao, T Christina; Kuhl, Patricia K
2016-05-10
Individuals with music training in early childhood show enhanced processing of musical sounds, an effect that generalizes to speech processing. However, the conclusions drawn from previous studies are limited due to the possible confounds of predisposition and other factors affecting musicians and nonmusicians. We used a randomized design to test the effects of a laboratory-controlled music intervention on young infants' neural processing of music and speech. Nine-month-old infants were randomly assigned to music (intervention) or play (control) activities for 12 sessions. The intervention targeted temporal structure learning using triple meter in music (e.g., waltz), which is difficult for infants, and it incorporated key characteristics of typical infant music classes to maximize learning (e.g., multimodal, social, and repetitive experiences). Controls had similar multimodal, social, repetitive play, but without music. Upon completion, infants' neural processing of temporal structure was tested in both music (tones in triple meter) and speech (foreign syllable structure). Infants' neural processing was quantified by the mismatch response (MMR) measured with a traditional oddball paradigm using magnetoencephalography (MEG). The intervention group exhibited significantly larger MMRs in response to music temporal structure violations in both auditory and prefrontal cortical regions. Identical results were obtained for temporal structure changes in speech. The intervention thus enhanced temporal structure processing not only in music, but also in speech, at 9 mo of age. We argue that the intervention enhanced infants' ability to extract temporal structure information and to predict future events in time, a skill affecting both music and speech processing.
An Algorithm for Controlled Integration of Sound and Text.
ERIC Educational Resources Information Center
Wohlert, Harry S.; McCormick, Martin
1985-01-01
A serious drawback in introducing sound into computer programs for teaching foreign language speech has been the lack of an algorithm to turn off the cassette recorder immediately to keep screen text and audio in synchronization. This article describes a program which solves that problem. (SED)
Drijvers, Linda; Özyürek, Asli; Jensen, Ole
2018-05-01
During face-to-face communication, listeners integrate speech with gestures. The semantic information conveyed by iconic gestures (e.g., a drinking gesture) can aid speech comprehension in adverse listening conditions. In this magnetoencephalography (MEG) study, we investigated the spatiotemporal neural oscillatory activity associated with gestural enhancement of degraded speech comprehension. Participants watched videos of an actress uttering clear or degraded speech, accompanied by a gesture or not and completed a cued-recall task after watching every video. When gestures semantically disambiguated degraded speech comprehension, an alpha and beta power suppression and a gamma power increase revealed engagement and active processing in the hand-area of the motor cortex, the extended language network (LIFG/pSTS/STG/MTG), medial temporal lobe, and occipital regions. These observed low- and high-frequency oscillatory modulations in these areas support general unification, integration and lexical access processes during online language comprehension, and simulation of and increased visual attention to manual gestures over time. All individual oscillatory power modulations associated with gestural enhancement of degraded speech comprehension predicted a listener's correct disambiguation of the degraded verb after watching the videos. Our results thus go beyond the previously proposed role of oscillatory dynamics in unimodal degraded speech comprehension and provide first evidence for the role of low- and high-frequency oscillations in predicting the integration of auditory and visual information at a semantic level. © 2018 The Authors Human Brain Mapping Published by Wiley Periodicals, Inc.
Özyürek, Asli; Jensen, Ole
2018-01-01
Abstract During face‐to‐face communication, listeners integrate speech with gestures. The semantic information conveyed by iconic gestures (e.g., a drinking gesture) can aid speech comprehension in adverse listening conditions. In this magnetoencephalography (MEG) study, we investigated the spatiotemporal neural oscillatory activity associated with gestural enhancement of degraded speech comprehension. Participants watched videos of an actress uttering clear or degraded speech, accompanied by a gesture or not and completed a cued‐recall task after watching every video. When gestures semantically disambiguated degraded speech comprehension, an alpha and beta power suppression and a gamma power increase revealed engagement and active processing in the hand‐area of the motor cortex, the extended language network (LIFG/pSTS/STG/MTG), medial temporal lobe, and occipital regions. These observed low‐ and high‐frequency oscillatory modulations in these areas support general unification, integration and lexical access processes during online language comprehension, and simulation of and increased visual attention to manual gestures over time. All individual oscillatory power modulations associated with gestural enhancement of degraded speech comprehension predicted a listener's correct disambiguation of the degraded verb after watching the videos. Our results thus go beyond the previously proposed role of oscillatory dynamics in unimodal degraded speech comprehension and provide first evidence for the role of low‐ and high‐frequency oscillations in predicting the integration of auditory and visual information at a semantic level. PMID:29380945
Effects of Familiarity and Feeding on Newborn Speech-Voice Recognition
ERIC Educational Resources Information Center
Valiante, A. Grace; Barr, Ronald G.; Zelazo, Philip R.; Brant, Rollin; Young, Simon N.
2013-01-01
Newborn infants preferentially orient to familiar over unfamiliar speech sounds. They are also better at remembering unfamiliar speech sounds for short periods of time if learning and retention occur after a feed than before. It is unknown whether short-term memory for speech is enhanced when the sound is familiar (versus unfamiliar) and, if so,…
ERIC Educational Resources Information Center
Gordon-Salant, Sandra; Fitzgibbons, Peter J.; Friedman, Sarah A.
2007-01-01
Purpose: The goal of this experiment was to determine whether selective slowing of speech segments improves recognition performance by young and elderly listeners. The hypotheses were (a) the benefits of time expansion occur for rapid speech but not for natural-rate speech, (b) selective time expansion of consonants produces greater score…
ERIC Educational Resources Information Center
Anderson, Karen L.; Goldstein, Howard
2004-01-01
Children typically learn in classroom environments that have background noise and reverberation that interfere with accurate speech perception. Amplification technology can enhance the speech perception of students who are hard of hearing. Purpose: This study used a single-subject alternating treatments design to compare the speech recognition…
Auditory cortical change detection in adults with Asperger syndrome.
Lepistö, Tuulia; Nieminen-von Wendt, Taina; von Wendt, Lennart; Näätänen, Risto; Kujala, Teija
2007-03-06
The present study investigated whether auditory deficits reported in children with Asperger syndrome (AS) are also present in adulthood. To this end, event-related potentials (ERPs) were recorded from adults with AS for duration, pitch, and phonetic changes in vowels, and for acoustically matched non-speech stimuli. These subjects had enhanced mismatch negativity (MMN) amplitudes particularly for pitch and duration deviants, indicating enhanced sound-discrimination abilities. Furthermore, as reflected by the P3a, their involuntary orienting was enhanced for changes in non-speech sounds, but tended to be deficient for changes in speech sounds. The results are consistent with those reported earlier in children with AS, except for the duration-MMN, which was diminished in children and enhanced in adults.
Hutka, Stefanie; Bidelman, Gavin M; Moreno, Sylvain
2015-05-01
Psychophysiological evidence supports a music-language association, such that experience in one domain can impact processing required in the other domain. We investigated the bidirectionality of this association by measuring event-related potentials (ERPs) in native English-speaking musicians, native tone language (Cantonese) nonmusicians, and native English-speaking nonmusician controls. We tested the degree to which pitch expertise stemming from musicianship or tone language experience similarly enhances the neural encoding of auditory information necessary for speech and music processing. Early cortical discriminatory processing for music and speech sounds was characterized using the mismatch negativity (MMN). Stimuli included 'large deviant' and 'small deviant' pairs of sounds that differed minimally in pitch (fundamental frequency, F0; contrastive musical tones) or timbre (first formant, F1; contrastive speech vowels). Behavioural F0 and F1 difference limen tasks probed listeners' perceptual acuity for these same acoustic features. Musicians and Cantonese speakers performed comparably in pitch discrimination; only musicians showed an additional advantage on timbre discrimination performance and an enhanced MMN responses to both music and speech. Cantonese language experience was not associated with enhancements on neural measures, despite enhanced behavioural pitch acuity. These data suggest that while both musicianship and tone language experience enhance some aspects of auditory acuity (behavioural pitch discrimination), musicianship confers farther-reaching enhancements to auditory function, tuning both pitch and timbre-related brain processes. Copyright © 2015 Elsevier Ltd. All rights reserved.
A study of speech emotion recognition based on hybrid algorithm
NASA Astrophysics Data System (ADS)
Zhu, Ju-xia; Zhang, Chao; Lv, Zhao; Rao, Yao-quan; Wu, Xiao-pei
2011-10-01
To effectively improve the recognition accuracy of the speech emotion recognition system, a hybrid algorithm which combines Continuous Hidden Markov Model (CHMM), All-Class-in-One Neural Network (ACON) and Support Vector Machine (SVM) is proposed. In SVM and ACON methods, some global statistics are used as emotional features, while in CHMM method, instantaneous features are employed. The recognition rate by the proposed method is 92.25%, with the rejection rate to be 0.78%. Furthermore, it obtains the relative increasing of 8.53%, 4.69% and 0.78% compared with ACON, CHMM and SVM methods respectively. The experiment result confirms the efficiency of distinguishing anger, happiness, neutral and sadness emotional states.
Adaptive Noise Suppression Using Digital Signal Processing
NASA Technical Reports Server (NTRS)
Kozel, David; Nelson, Richard
1996-01-01
A signal to noise ratio dependent adaptive spectral subtraction algorithm is developed to eliminate noise from noise corrupted speech signals. The algorithm determines the signal to noise ratio and adjusts the spectral subtraction proportion appropriately. After spectra subtraction low amplitude signals are squelched. A single microphone is used to obtain both eh noise corrupted speech and the average noise estimate. This is done by determining if the frame of data being sampled is a voiced or unvoiced frame. During unvoice frames an estimate of the noise is obtained. A running average of the noise is used to approximate the expected value of the noise. Applications include the emergency egress vehicle and the crawler transporter.
Cross-modal enhancement of speech detection in young and older adults: does signal content matter?
Tye-Murray, Nancy; Spehar, Brent; Myerson, Joel; Sommers, Mitchell S; Hale, Sandra
2011-01-01
The purpose of the present study was to examine the effects of age and visual content on cross-modal enhancement of auditory speech detection. Visual content consisted of three clearly distinct types of visual information: an unaltered video clip of a talker's face, a low-contrast version of the same clip, and a mouth-like Lissajous figure. It was hypothesized that both young and older adults would exhibit reduced enhancement as visual content diverged from the original clip of the talker's face, but that the decrease would be greater for older participants. Nineteen young adults and 19 older adults were asked to detect a single spoken syllable (/ba/) in speech-shaped noise, and the level of the signal was adaptively varied to establish the signal-to-noise ratio (SNR) at threshold. There was an auditory-only baseline condition and three audiovisual conditions in which the syllable was accompanied by one of the three visual signals (the unaltered clip of the talker's face, the low-contrast version of that clip, or the Lissajous figure). For each audiovisual condition, the SNR at threshold was compared with the SNR at threshold for the auditory-only condition to measure the amount of cross-modal enhancement. Young adults exhibited significant cross-modal enhancement with all three types of visual stimuli, with the greatest amount of enhancement observed for the unaltered clip of the talker's face. Older adults, in contrast, exhibited significant cross-modal enhancement only with the unaltered face. Results of this study suggest that visual signal content affects cross-modal enhancement of speech detection in both young and older adults. They also support a hypothesized age-related deficit in processing low-contrast visual speech stimuli, even in older adults with normal contrast sensitivity.
Network Speech Systems Technology Program.
1980-09-30
ognized that the lumped-speaker approximation could be extended even more generally to include cases of combined circuit-switched speech and packet...based on these tables. The first function is an im- portant element of the more general task of system control for a switched network, which in...programs are in preparation, as described below, for both steady-state evaluation and dynamic performance simulation of the algorithm in general
Speech responses and dual-task performance - Better time-sharing or asymmetric transfer?
NASA Technical Reports Server (NTRS)
Vidulich, Michael A.
1988-01-01
The value of speech controls in a dual-task experiment that also evaluated asymmetric transfer effects is considered. There was no evidence of asymmetric transfer in spite of significant effects supporting the advantage of mixing manual and speech responses. The data suggest that speech controls can be used to enhance performance in operational multiple-task environments.
ERIC Educational Resources Information Center
Hula, Shannon N. Austermann; Robin, Donald A.; Maas, Edwin; Ballard, Kirrie J.; Schmidt, Richard A.
2008-01-01
Purpose: Two studies examined speech skill learning in persons with apraxia of speech (AOS). Motor-learning research shows that delaying or reducing the frequency of feedback promotes retention and transfer of skills. By contrast, immediate or frequent feedback promotes temporary performance enhancement but interferes with retention and transfer.…
Temporal plasticity in auditory cortex improves neural discrimination of speech sounds
Engineer, Crystal T.; Shetake, Jai A.; Engineer, Navzer D.; Vrana, Will A.; Wolf, Jordan T.; Kilgard, Michael P.
2017-01-01
Background Many individuals with language learning impairments exhibit temporal processing deficits and degraded neural responses to speech sounds. Auditory training can improve both the neural and behavioral deficits, though significant deficits remain. Recent evidence suggests that vagus nerve stimulation (VNS) paired with rehabilitative therapies enhances both cortical plasticity and recovery of normal function. Objective/Hypothesis We predicted that pairing VNS with rapid tone trains would enhance the primary auditory cortex (A1) response to unpaired novel speech sounds. Methods VNS was paired with tone trains 300 times per day for 20 days in adult rats. Responses to isolated speech sounds, compressed speech sounds, word sequences, and compressed word sequences were recorded in A1 following the completion of VNS-tone train pairing. Results Pairing VNS with rapid tone trains resulted in stronger, faster, and more discriminable A1 responses to speech sounds presented at conversational rates. Conclusion This study extends previous findings by documenting that VNS paired with rapid tone trains altered the neural response to novel unpaired speech sounds. Future studies are necessary to determine whether pairing VNS with appropriate auditory stimuli could potentially be used to improve both neural responses to speech sounds and speech perception in individuals with receptive language disorders. PMID:28131520
Kumar, U A; Jayaram, M
2013-07-01
The purpose of this study was to evaluate the effect of lengthening of voice onset time and burst duration of selected speech stimuli on perception by individuals with auditory dys-synchrony. This is the second of a series of articles reporting the effect of signal enhancing strategies on speech perception by such individuals. Two experiments were conducted: (1) assessment of the 'just-noticeable difference' for voice onset time and burst duration of speech sounds; and (2) assessment of speech identification scores when speech sounds were modified by lengthening the voice onset time and the burst duration in units of one just-noticeable difference, both in isolation and in combination with each other plus transition duration modification. Lengthening of voice onset time as well as burst duration improved perception of voicing. However, the effect of voice onset time modification was greater than that of burst duration modification. Although combined lengthening of voice onset time, burst duration and transition duration resulted in improved speech perception, the improvement was less than that due to lengthening of transition duration alone. These results suggest that innovative speech processing strategies that enhance temporal cues may benefit individuals with auditory dys-synchrony.
An optimization method for speech enhancement based on deep neural network
NASA Astrophysics Data System (ADS)
Sun, Haixia; Li, Sikun
2017-06-01
Now, this document puts forward a deep neural network (DNN) model with more credible data set and more robust structure. First, we take two regularization skills, dropout and sparsity constraint to strengthen the generalization ability of the model. In this way, not only the model is able to reach the consistency between the pre-training model and the fine-tuning model, but also it reduce resource consumption. Then network compression by weights sharing and quantization is allowed to reduce storage cost. In the end, we evaluate the quality of the reconstructed speech according to different criterion. The result proofs that the improved framework has good performance on speech enhancement and meets the requirement of speech processing.
Visual Input Enhances Selective Speech Envelope Tracking in Auditory Cortex at a ‘Cocktail Party’
Golumbic, Elana Zion; Cogan, Gregory B.; Schroeder, Charles E.; Poeppel, David
2013-01-01
Our ability to selectively attend to one auditory signal amidst competing input streams, epitomized by the ‘Cocktail Party’ problem, continues to stimulate research from various approaches. How this demanding perceptual feat is achieved from a neural systems perspective remains unclear and controversial. It is well established that neural responses to attended stimuli are enhanced compared to responses to ignored ones, but responses to ignored stimuli are nonetheless highly significant, leading to interference in performance. We investigated whether congruent visual input of an attended speaker enhances cortical selectivity in auditory cortex, leading to diminished representation of ignored stimuli. We recorded magnetoencephalographic (MEG) signals from human participants as they attended to segments of natural continuous speech. Using two complementary methods of quantifying the neural response to speech, we found that viewing a speaker’s face enhances the capacity of auditory cortex to track the temporal speech envelope of that speaker. This mechanism was most effective in a ‘Cocktail Party’ setting, promoting preferential tracking of the attended speaker, whereas without visual input no significant attentional modulation was observed. These neurophysiological results underscore the importance of visual input in resolving perceptual ambiguity in a noisy environment. Since visual cues in speech precede the associated auditory signals, they likely serve a predictive role in facilitating auditory processing of speech, perhaps by directing attentional resources to appropriate points in time when to-be-attended acoustic input is expected to arrive. PMID:23345218
Kernel-Based Sensor Fusion With Application to Audio-Visual Voice Activity Detection
NASA Astrophysics Data System (ADS)
Dov, David; Talmon, Ronen; Cohen, Israel
2016-12-01
In this paper, we address the problem of multiple view data fusion in the presence of noise and interferences. Recent studies have approached this problem using kernel methods, by relying particularly on a product of kernels constructed separately for each view. From a graph theory point of view, we analyze this fusion approach in a discrete setting. More specifically, based on a statistical model for the connectivity between data points, we propose an algorithm for the selection of the kernel bandwidth, a parameter, which, as we show, has important implications on the robustness of this fusion approach to interferences. Then, we consider the fusion of audio-visual speech signals measured by a single microphone and by a video camera pointed to the face of the speaker. Specifically, we address the task of voice activity detection, i.e., the detection of speech and non-speech segments, in the presence of structured interferences such as keyboard taps and office noise. We propose an algorithm for voice activity detection based on the audio-visual signal. Simulation results show that the proposed algorithm outperforms competing fusion and voice activity detection approaches. In addition, we demonstrate that a proper selection of the kernel bandwidth indeed leads to improved performance.
Leybaert, Jacqueline; LaSasso, Carol J.
2010-01-01
Nearly 300 million people worldwide have moderate to profound hearing loss. Hearing impairment, if not adequately managed, has strong socioeconomic and affective impact on individuals. Cochlear implants have become the most effective vehicle for helping profoundly deaf children and adults to understand spoken language, to be sensitive to environmental sounds, and, to some extent, to listen to music. The auditory information delivered by the cochlear implant remains non-optimal for speech perception because it delivers a spectrally degraded signal and lacks some of the fine temporal acoustic structure. In this article, we discuss research revealing the multimodal nature of speech perception in normally-hearing individuals, with important inter-subject variability in the weighting of auditory or visual information. We also discuss how audio-visual training, via Cued Speech, can improve speech perception in cochlear implantees, particularly in noisy contexts. Cued Speech is a system that makes use of visual information from speechreading combined with hand shapes positioned in different places around the face in order to deliver completely unambiguous information about the syllables and the phonemes of spoken language. We support our view that exposure to Cued Speech before or after the implantation could be important in the aural rehabilitation process of cochlear implantees. We describe five lines of research that are converging to support the view that Cued Speech can enhance speech perception in individuals with cochlear implants. PMID:20724357
Neural networks supporting audiovisual integration for speech: A large-scale lesion study.
Hickok, Gregory; Rogalsky, Corianne; Matchin, William; Basilakos, Alexandra; Cai, Julia; Pillay, Sara; Ferrill, Michelle; Mickelsen, Soren; Anderson, Steven W; Love, Tracy; Binder, Jeffrey; Fridriksson, Julius
2018-06-01
Auditory and visual speech information are often strongly integrated resulting in perceptual enhancements for audiovisual (AV) speech over audio alone and sometimes yielding compelling illusory fusion percepts when AV cues are mismatched, the McGurk-MacDonald effect. Previous research has identified three candidate regions thought to be critical for AV speech integration: the posterior superior temporal sulcus (STS), early auditory cortex, and the posterior inferior frontal gyrus. We assess the causal involvement of these regions (and others) in the first large-scale (N = 100) lesion-based study of AV speech integration. Two primary findings emerged. First, behavioral performance and lesion maps for AV enhancement and illusory fusion measures indicate that classic metrics of AV speech integration are not necessarily measuring the same process. Second, lesions involving superior temporal auditory, lateral occipital visual, and multisensory zones in the STS are the most disruptive to AV speech integration. Further, when AV speech integration fails, the nature of the failure-auditory vs visual capture-can be predicted from the location of the lesions. These findings show that AV speech processing is supported by unimodal auditory and visual cortices as well as multimodal regions such as the STS at their boundary. Motor related frontal regions do not appear to play a role in AV speech integration. Copyright © 2018 Elsevier Ltd. All rights reserved.
Gao, Yayue; Wang, Qian; Ding, Yu; Wang, Changming; Li, Haifeng; Wu, Xihong; Qu, Tianshu; Li, Liang
2017-01-01
Human listeners are able to selectively attend to target speech in a noisy environment with multiple-people talking. Using recordings of scalp electroencephalogram (EEG), this study investigated how selective attention facilitates the cortical representation of target speech under a simulated “cocktail-party” listening condition with speech-on-speech masking. The result shows that the cortical representation of target-speech signals under the multiple-people talking condition was specifically improved by selective attention relative to the non-selective-attention listening condition, and the beta-band activity was most strongly modulated by selective attention. Moreover, measured with the Granger Causality value, selective attention to the single target speech in the mixed-speech complex enhanced the following four causal connectivities for the beta-band oscillation: the ones (1) from site FT7 to the right motor area, (2) from the left frontal area to the right motor area, (3) from the central frontal area to the right motor area, and (4) from the central frontal area to the right frontal area. However, the selective-attention-induced change in beta-band causal connectivity from the central frontal area to the right motor area, but not other beta-band causal connectivities, was significantly correlated with the selective-attention-induced change in the cortical beta-band representation of target speech. These findings suggest that under the “cocktail-party” listening condition, the beta-band oscillation in EEGs to target speech is specifically facilitated by selective attention to the target speech that is embedded in the mixed-speech complex. The selective attention-induced unmasking of target speech may be associated with the improved beta-band functional connectivity from the central frontal area to the right motor area, suggesting a top-down attentional modulation of the speech-motor process. PMID:28239344
Gao, Yayue; Wang, Qian; Ding, Yu; Wang, Changming; Li, Haifeng; Wu, Xihong; Qu, Tianshu; Li, Liang
2017-01-01
Human listeners are able to selectively attend to target speech in a noisy environment with multiple-people talking. Using recordings of scalp electroencephalogram (EEG), this study investigated how selective attention facilitates the cortical representation of target speech under a simulated "cocktail-party" listening condition with speech-on-speech masking. The result shows that the cortical representation of target-speech signals under the multiple-people talking condition was specifically improved by selective attention relative to the non-selective-attention listening condition, and the beta-band activity was most strongly modulated by selective attention. Moreover, measured with the Granger Causality value, selective attention to the single target speech in the mixed-speech complex enhanced the following four causal connectivities for the beta-band oscillation: the ones (1) from site FT7 to the right motor area, (2) from the left frontal area to the right motor area, (3) from the central frontal area to the right motor area, and (4) from the central frontal area to the right frontal area. However, the selective-attention-induced change in beta-band causal connectivity from the central frontal area to the right motor area, but not other beta-band causal connectivities, was significantly correlated with the selective-attention-induced change in the cortical beta-band representation of target speech. These findings suggest that under the "cocktail-party" listening condition, the beta-band oscillation in EEGs to target speech is specifically facilitated by selective attention to the target speech that is embedded in the mixed-speech complex. The selective attention-induced unmasking of target speech may be associated with the improved beta-band functional connectivity from the central frontal area to the right motor area, suggesting a top-down attentional modulation of the speech-motor process.
Speech training alters tone frequency tuning in rat primary auditory cortex
Engineer, Crystal T.; Perez, Claudia A.; Carraway, Ryan S.; Chang, Kevin Q.; Roland, Jarod L.; Kilgard, Michael P.
2013-01-01
Previous studies in both humans and animals have documented improved performance following discrimination training. This enhanced performance is often associated with cortical response changes. In this study, we tested the hypothesis that long-term speech training on multiple tasks can improve primary auditory cortex (A1) responses compared to rats trained on a single speech discrimination task or experimentally naïve rats. Specifically, we compared the percent of A1 responding to trained sounds, the responses to both trained and untrained sounds, receptive field properties of A1 neurons, and the neural discrimination of pairs of speech sounds in speech trained and naïve rats. Speech training led to accurate discrimination of consonant and vowel sounds, but did not enhance A1 response strength or the neural discrimination of these sounds. Speech training altered tone responses in rats trained on six speech discrimination tasks but not in rats trained on a single speech discrimination task. Extensive speech training resulted in broader frequency tuning, shorter onset latencies, a decreased driven response to tones, and caused a shift in the frequency map to favor tones in the range where speech sounds are the loudest. Both the number of trained tasks and the number of days of training strongly predict the percent of A1 responding to a low frequency tone. Rats trained on a single speech discrimination task performed less accurately than rats trained on multiple tasks and did not exhibit A1 response changes. Our results indicate that extensive speech training can reorganize the A1 frequency map, which may have downstream consequences on speech sound processing. PMID:24344364
New Ideas for Speech Recognition and Related Technologies
DOE Office of Scientific and Technical Information (OSTI.GOV)
Holzrichter, J F
The ideas relating to the use of organ motion sensors for the purposes of speech recognition were first described by.the author in spring 1994. During the past year, a series of productive collaborations between the author, Tom McEwan and Larry Ng ensued and have lead to demonstrations, new sensor ideas, and algorithmic descriptions of a large number of speech recognition concepts. This document summarizes the basic concepts of recognizing speech once organ motions have been obtained. Micro power radars and their uses for the measurement of body organ motions, such as those of the heart and lungs, have been demonstratedmore » by Tom McEwan over the past two years. McEwan and I conducted a series of experiments, using these instruments, on vocal organ motions beginning in late spring, during which we observed motions of vocal folds (i.e., cords), tongue, jaw, and related organs that are very useful for speech recognition and other purposes. These will be reviewed in a separate paper. Since late summer 1994, Lawrence Ng and I have worked to make many of the initial recognition ideas more rigorous and to investigate the applications of these new ideas to new speech recognition algorithms, to speech coding, and to speech synthesis. I introduce some of those ideas in section IV of this document, and we describe them more completely in the document following this one, UCRL-UR-120311. For the design and operation of micro-power radars and their application to body organ motions, the reader may contact Tom McEwan directly. The capability for using EM sensors (i.e., radar units) to measure body organ motions and positions has been available for decades. Impediments to their use appear to have been size, excessive power, lack of resolution, and lack of understanding of the value of organ motion measurements, especially as applied to speech related technologies. However, with the invention of very low power, portable systems as demonstrated by McEwan at LLNL researchers have begun to think differently about practical applications of such radars. In particular, his demonstrations of heart and lung motions have opened up many new areas of application for human and animal measurements.« less
Park, Hyojin; Kayser, Christoph; Thut, Gregor; Gross, Joachim
2016-01-01
During continuous speech, lip movements provide visual temporal signals that facilitate speech processing. Here, using MEG we directly investigated how these visual signals interact with rhythmic brain activity in participants listening to and seeing the speaker. First, we investigated coherence between oscillatory brain activity and speaker’s lip movements and demonstrated significant entrainment in visual cortex. We then used partial coherence to remove contributions of the coherent auditory speech signal from the lip-brain coherence. Comparing this synchronization between different attention conditions revealed that attending visual speech enhances the coherence between activity in visual cortex and the speaker’s lips. Further, we identified a significant partial coherence between left motor cortex and lip movements and this partial coherence directly predicted comprehension accuracy. Our results emphasize the importance of visually entrained and attention-modulated rhythmic brain activity for the enhancement of audiovisual speech processing. DOI: http://dx.doi.org/10.7554/eLife.14521.001 PMID:27146891
Başkent, Deniz; Eiler, Cheryl L; Edwards, Brent
2007-06-01
To present a comprehensive analysis of the feasibility of genetic algorithms (GA) for finding the best fit of hearing aids or cochlear implants for individual users in clinical or research settings, where the algorithm is solely driven by subjective human input. Due to varying pathology, the best settings of an auditory device differ for each user. It is also likely that listening preferences vary at the same time. The settings of a device customized for a particular user can only be evaluated by the user. When optimization algorithms are used for fitting purposes, this situation poses a difficulty for a systematic and quantitative evaluation of the suitability of the fitting parameters produced by the algorithm. In the present study, an artificial listening environment was generated by distorting speech using a noiseband vocoder. The settings produced by the GA for this listening problem could objectively be evaluated by measuring speech recognition and comparing the performance to the best vocoder condition where speech was least distorted. Nine normal-hearing subjects participated in the study. The parameters to be optimized were the number of vocoder channels, the shift between the input frequency range and the synthesis frequency range, and the compression-expansion of the input frequency range over the synthesis frequency range. The subjects listened to pairs of sentences processed with the vocoder, and entered a preference for the sentence with better intelligibility. The GA modified the solutions iteratively according to the subject preferences. The program converged when the user ranked the same set of parameters as the best in three consecutive steps. The results produced by the GA were analyzed for quality by measuring speech intelligibility, for test-retest reliability by running the GA three times with each subject, and for convergence properties. Speech recognition scores averaged across subjects were similar for the best vocoder solution and for the solutions produced by the GA. The average number of iterations was 8 and the average convergence time was 25.5 minutes. The settings produced by different GA runs for the same subject were slightly different; however, speech recognition scores measured with these settings were similar. Individual data from subjects showed that in each run, a small number of GA solutions produced poorer speech intelligibility than for the best setting. This was probably a result of the combination of the inherent randomness of the GA, the convergence criterion used in the present study, and possible errors that the users might have made during the paired comparisons. On the other hand, the effect of these errors was probably small compared to the other two factors, as a comparison between subjective preferences and objective measures showed that for many subjects the two were in good agreement. The results showed that the GA was able to produce good solutions by using listener preferences in a relatively short time. For practical applications, the program can be made more robust by running the GA twice or by not using an automatic stopping criterion, and it can be made faster by optimizing the number of the paired comparisons completed in each iteration.
Design of a robust baseband LPC coder for speech transmission over 9.6 kbit/s noisy channels
NASA Astrophysics Data System (ADS)
Viswanathan, V. R.; Russell, W. H.; Higgins, A. L.
1982-04-01
This paper describes the design of a baseband Linear Predictive Coder (LPC) which transmits speech over 9.6 kbit/sec synchronous channels with random bit errors of up to 1%. Presented are the results of our investigation of a number of aspects of the baseband LPC coder with the goal of maximizing the quality of the transmitted speech. Important among these aspects are: bandwidth of the baseband, coding of the baseband residual, high-frequency regeneration, and error protection of important transmission parameters. The paper discusses these and other issues, presents the results of speech-quality tests conducted during the various stages of optimization, and describes the details of the optimized speech coder. This optimized speech coding algorithm has been implemented as a real-time full-duplex system on an array processor. Informal listening tests of the real-time coder have shown that the coder produces good speech quality in the absence of channel bit errors and introduces only a slight degradation in quality for channel bit error rates of up to 1%.
High school music classes enhance the neural processing of speech.
Tierney, Adam; Krizman, Jennifer; Skoe, Erika; Johnston, Kathleen; Kraus, Nina
2013-01-01
Should music be a priority in public education? One argument for teaching music in school is that private music instruction relates to enhanced language abilities and neural function. However, the directionality of this relationship is unclear and it is unknown whether school-based music training can produce these enhancements. Here we show that 2 years of group music classes in high school enhance the neural encoding of speech. To tease apart the relationships between music and neural function, we tested high school students participating in either music or fitness-based training. These groups were matched at the onset of training on neural timing, reading ability, and IQ. Auditory brainstem responses were collected to a synthesized speech sound presented in background noise. After 2 years of training, the neural responses of the music training group were earlier than at pre-training, while the neural timing of students in the fitness training group was unchanged. These results represent the strongest evidence to date that in-school music education can cause enhanced speech encoding. The neural benefits of musical training are, therefore, not limited to expensive private instruction early in childhood but can be elicited by cost-effective group instruction during adolescence.
Time-Frequency Masking for Speech Separation and Its Potential for Hearing Aid Design
Wang, DeLiang
2008-01-01
A new approach to the separation of speech from speech-in-noise mixtures is the use of time-frequency (T-F) masking. Originated in the field of computational auditory scene analysis, T-F masking performs separation in the time-frequency domain. This article introduces the T-F masking concept and reviews T-F masking algorithms that separate target speech from either monaural or binaural mixtures, as well as microphone-array recordings. The review emphasizes techniques that are promising for hearing aid design. This article also surveys recent studies that evaluate the perceptual effects of T-F masking techniques, particularly their effectiveness in improving human speech recognition in noise. An assessment is made of the potential benefits of T-F masking methods for the hearing impaired in light of the processing constraints of hearing aids. Finally, several issues pertinent to T-F masking are discussed. PMID:18974204
Nonlinear frequency compression: effects on sound quality ratings of speech and music.
Parsa, Vijay; Scollie, Susan; Glista, Danielle; Seelisch, Andreas
2013-03-01
Frequency lowering technologies offer an alternative amplification solution for severe to profound high frequency hearing losses. While frequency lowering technologies may improve audibility of high frequency sounds, the very nature of this processing can affect the perceived sound quality. This article reports the results from two studies that investigated the impact of a nonlinear frequency compression (NFC) algorithm on perceived sound quality. In the first study, the cutoff frequency and compression ratio parameters of the NFC algorithm were varied, and their effect on the speech quality was measured subjectively with 12 normal hearing adults, 12 normal hearing children, 13 hearing impaired adults, and 9 hearing impaired children. In the second study, 12 normal hearing and 8 hearing impaired adult listeners rated the quality of speech in quiet, speech in noise, and music after processing with a different set of NFC parameters. Results showed that the cutoff frequency parameter had more impact on sound quality ratings than the compression ratio, and that the hearing impaired adults were more tolerant to increased frequency compression than normal hearing adults. No statistically significant differences were found in the sound quality ratings of speech-in-noise and music stimuli processed through various NFC settings by hearing impaired listeners. These findings suggest that there may be an acceptable range of NFC settings for hearing impaired individuals where sound quality is not adversely affected. These results may assist an Audiologist in clinical NFC hearing aid fittings for achieving a balance between high frequency audibility and sound quality.
Relationship between listeners' nonnative speech recognition and categorization abilities
Atagi, Eriko; Bent, Tessa
2015-01-01
Enhancement of the perceptual encoding of talker characteristics (indexical information) in speech can facilitate listeners' recognition of linguistic content. The present study explored this indexical-linguistic relationship in nonnative speech processing by examining listeners' performance on two tasks: nonnative accent categorization and nonnative speech-in-noise recognition. Results indicated substantial variability across listeners in their performance on both the accent categorization and nonnative speech recognition tasks. Moreover, listeners' accent categorization performance correlated with their nonnative speech-in-noise recognition performance. These results suggest that having more robust indexical representations for nonnative accents may allow listeners to more accurately recognize the linguistic content of nonnative speech. PMID:25618098
The Roles of Suprasegmental Features in Predicting English Oral Proficiency with an Automated System
ERIC Educational Resources Information Center
Kang, Okim; Johnson, David
2018-01-01
Suprasegmental features have received growing attention in the field of oral assessment. In this article we describe a set of computer algorithms that automatically scores the oral proficiency of non-native speakers using unconstrained English speech. The algorithms employ machine learning and 11 suprasegmental measures divided into four groups…
Two-voice fundamental frequency estimation
NASA Astrophysics Data System (ADS)
de Cheveigné, Alain
2002-05-01
An algorithm is presented that estimates the fundamental frequencies of two concurrent voices or instruments. The algorithm models each voice as a periodic function of time, and jointly estimates both periods by cancellation according to a previously proposed method [de Cheveigné and Kawahara, Speech Commun. 27, 175-185 (1999)]. The new algorithm improves on the old in several respects; it allows an unrestricted search range, effectively avoids harmonic and subharmonic errors, is more accurate (it uses two-dimensional parabolic interpolation), and is computationally less costly. It remains subject to unavoidable errors when periods are in certain simple ratios and the task is inherently ambiguous. The algorithm is evaluated on a small database including speech, singing voice, and instrumental sounds. It can be extended in several ways; to decide the number of voices, to handle amplitude variations, and to estimate more than two voices (at the expense of increased processing cost and decreased reliability). It makes no use of instrument models, learned or otherwise, although it could usefully be combined with such models. [Work supported by the Cognitique programme of the French Ministry of Research and Technology.
Magnified Neural Envelope Coding Predicts Deficits in Speech Perception in Noise.
Millman, Rebecca E; Mattys, Sven L; Gouws, André D; Prendergast, Garreth
2017-08-09
Verbal communication in noisy backgrounds is challenging. Understanding speech in background noise that fluctuates in intensity over time is particularly difficult for hearing-impaired listeners with a sensorineural hearing loss (SNHL). The reduction in fast-acting cochlear compression associated with SNHL exaggerates the perceived fluctuations in intensity in amplitude-modulated sounds. SNHL-induced changes in the coding of amplitude-modulated sounds may have a detrimental effect on the ability of SNHL listeners to understand speech in the presence of modulated background noise. To date, direct evidence for a link between magnified envelope coding and deficits in speech identification in modulated noise has been absent. Here, magnetoencephalography was used to quantify the effects of SNHL on phase locking to the temporal envelope of modulated noise (envelope coding) in human auditory cortex. Our results show that SNHL enhances the amplitude of envelope coding in posteromedial auditory cortex, whereas it enhances the fidelity of envelope coding in posteromedial and posterolateral auditory cortex. This dissociation was more evident in the right hemisphere, demonstrating functional lateralization in enhanced envelope coding in SNHL listeners. However, enhanced envelope coding was not perceptually beneficial. Our results also show that both hearing thresholds and, to a lesser extent, magnified cortical envelope coding in left posteromedial auditory cortex predict speech identification in modulated background noise. We propose a framework in which magnified envelope coding in posteromedial auditory cortex disrupts the segregation of speech from background noise, leading to deficits in speech perception in modulated background noise. SIGNIFICANCE STATEMENT People with hearing loss struggle to follow conversations in noisy environments. Background noise that fluctuates in intensity over time poses a particular challenge. Using magnetoencephalography, we demonstrate anatomically distinct cortical representations of modulated noise in normal-hearing and hearing-impaired listeners. This work provides the first link among hearing thresholds, the amplitude of cortical representations of modulated sounds, and the ability to understand speech in modulated background noise. In light of previous work, we propose that magnified cortical representations of modulated sounds disrupt the separation of speech from modulated background noise in auditory cortex. Copyright © 2017 Millman et al.
The Downside of Greater Lexical Influences: Selectively Poorer Speech Perception in Noise
ERIC Educational Resources Information Center
Lam, Boji P. W.; Xie, Zilong; Tessmer, Rachel; Chandrasekaran, Bharath
2017-01-01
Purpose: Although lexical information influences phoneme perception, the extent to which reliance on lexical information enhances speech processing in challenging listening environments is unclear. We examined the extent to which individual differences in lexical influences on phonemic processing impact speech processing in maskers containing…
Neural Development of Networks for Audiovisual Speech Comprehension
ERIC Educational Resources Information Center
Dick, Anthony Steven; Solodkin, Ana; Small, Steven L.
2010-01-01
Everyday conversation is both an auditory and a visual phenomenon. While visual speech information enhances comprehension for the listener, evidence suggests that the ability to benefit from this information improves with development. A number of brain regions have been implicated in audiovisual speech comprehension, but the extent to which the…
Reflections on Clinical Learning in Novice Speech-Language Therapy Students
ERIC Educational Resources Information Center
Hill, Anne E.; Davidson, Bronwyn J.; Theodoros, Deborah G.
2012-01-01
Background: Reflective practice is reported to enhance clinical reasoning and therefore to maximize client outcomes. The inclusion of targeted reflective practice in academic programmes in speech-language therapy has not been consistent, although providing opportunities for speech-language therapy students to reflect during their clinical practice…
Reduced efficiency of audiovisual integration for nonnative speech.
Yi, Han-Gyol; Phelps, Jasmine E B; Smiljanic, Rajka; Chandrasekaran, Bharath
2013-11-01
The role of visual cues in native listeners' perception of speech produced by nonnative speakers has not been extensively studied. Native perception of English sentences produced by native English and Korean speakers in audio-only and audiovisual conditions was examined. Korean speakers were rated as more accented in audiovisual than in the audio-only condition. Visual cues enhanced word intelligibility for native English speech but less so for Korean-accented speech. Reduced intelligibility of Korean-accented audiovisual speech was associated with implicit visual biases, suggesting that listener-related factors partially influence the efficiency of audiovisual integration for nonnative speech perception.
ERIC Educational Resources Information Center
Rees, Rachel; Bladel, Judith
2013-01-01
Many studies have shown that French Cued Speech (CS) can enhance lipreading and the development of phonological awareness and literacy in deaf children but, as yet, there is little evidence that these findings can be generalized to English CS. This study investigated the possible effects of English CS on the speech perception, phonological…
Kong, Ying-Yee; Mullangi, Ala; Ding, Nai
2014-01-01
This study investigates how top-down attention modulates neural tracking of the speech envelope in different listening conditions. In the quiet conditions, a single speech stream was presented and the subjects paid attention to the speech stream (active listening) or watched a silent movie instead (passive listening). In the competing speaker (CS) conditions, two speakers of opposite genders were presented diotically. Ongoing electroencephalographic (EEG) responses were measured in each condition and cross-correlated with the speech envelope of each speaker at different time lags. In quiet, active and passive listening resulted in similar neural responses to the speech envelope. In the CS conditions, however, the shape of the cross-correlation function was remarkably different between the attended and unattended speech. The cross-correlation with the attended speech showed stronger N1 and P2 responses but a weaker P1 response compared with the cross-correlation with the unattended speech. Furthermore, the N1 response to the attended speech in the CS condition was enhanced and delayed compared with the active listening condition in quiet, while the P2 response to the unattended speaker in the CS condition was attenuated compared with the passive listening in quiet. Taken together, these results demonstrate that top-down attention differentially modulates envelope-tracking neural activity at different time lags and suggest that top-down attention can both enhance the neural responses to the attended sound stream and suppress the responses to the unattended sound stream. PMID:25124153
Speech Rhythms and Multiplexed Oscillatory Sensory Coding in the Human Brain
Gross, Joachim; Hoogenboom, Nienke; Thut, Gregor; Schyns, Philippe; Panzeri, Stefano; Belin, Pascal; Garrod, Simon
2013-01-01
Cortical oscillations are likely candidates for segmentation and coding of continuous speech. Here, we monitored continuous speech processing with magnetoencephalography (MEG) to unravel the principles of speech segmentation and coding. We demonstrate that speech entrains the phase of low-frequency (delta, theta) and the amplitude of high-frequency (gamma) oscillations in the auditory cortex. Phase entrainment is stronger in the right and amplitude entrainment is stronger in the left auditory cortex. Furthermore, edges in the speech envelope phase reset auditory cortex oscillations thereby enhancing their entrainment to speech. This mechanism adapts to the changing physical features of the speech envelope and enables efficient, stimulus-specific speech sampling. Finally, we show that within the auditory cortex, coupling between delta, theta, and gamma oscillations increases following speech edges. Importantly, all couplings (i.e., brain-speech and also within the cortex) attenuate for backward-presented speech, suggesting top-down control. We conclude that segmentation and coding of speech relies on a nested hierarchy of entrained cortical oscillations. PMID:24391472
Civier, Oren; Tasko, Stephen M; Guenther, Frank H
2010-09-01
This paper investigates the hypothesis that stuttering may result in part from impaired readout of feedforward control of speech, which forces persons who stutter (PWS) to produce speech with a motor strategy that is weighted too much toward auditory feedback control. Over-reliance on feedback control leads to production errors which if they grow large enough, can cause the motor system to "reset" and repeat the current syllable. This hypothesis is investigated using computer simulations of a "neurally impaired" version of the DIVA model, a neural network model of speech acquisition and production. The model's outputs are compared to published acoustic data from PWS' fluent speech, and to combined acoustic and articulatory movement data collected from the dysfluent speech of one PWS. The simulations mimic the errors observed in the PWS subject's speech, as well as the repairs of these errors. Additional simulations were able to account for enhancements of fluency gained by slowed/prolonged speech and masking noise. Together these results support the hypothesis that many dysfluencies in stuttering are due to a bias away from feedforward control and toward feedback control. The reader will be able to (a) describe the contribution of auditory feedback control and feedforward control to normal and stuttered speech production, (b) summarize the neural modeling approach to speech production and its application to stuttering, and (c) explain how the DIVA model accounts for enhancements of fluency gained by slowed/prolonged speech and masking noise.
Parbery-Clark, Alexandra; Strait, Dana L.; Anderson, Samira; Hittner, Emily; Kraus, Nina
2011-01-01
Much of our daily communication occurs in the presence of background noise, compromising our ability to hear. While understanding speech in noise is a challenge for everyone, it becomes increasingly difficult as we age. Although aging is generally accompanied by hearing loss, this perceptual decline cannot fully account for the difficulties experienced by older adults for hearing in noise. Decreased cognitive skills concurrent with reduced perceptual acuity are thought to contribute to the difficulty older adults experience understanding speech in noise. Given that musical experience positively impacts speech perception in noise in young adults (ages 18–30), we asked whether musical experience benefits an older cohort of musicians (ages 45–65), potentially offsetting the age-related decline in speech-in-noise perceptual abilities and associated cognitive function (i.e., working memory). Consistent with performance in young adults, older musicians demonstrated enhanced speech-in-noise perception relative to nonmusicians along with greater auditory, but not visual, working memory capacity. By demonstrating that speech-in-noise perception and related cognitive function are enhanced in older musicians, our results imply that musical training may reduce the impact of age-related auditory decline. PMID:21589653
Parbery-Clark, Alexandra; Strait, Dana L; Anderson, Samira; Hittner, Emily; Kraus, Nina
2011-05-11
Much of our daily communication occurs in the presence of background noise, compromising our ability to hear. While understanding speech in noise is a challenge for everyone, it becomes increasingly difficult as we age. Although aging is generally accompanied by hearing loss, this perceptual decline cannot fully account for the difficulties experienced by older adults for hearing in noise. Decreased cognitive skills concurrent with reduced perceptual acuity are thought to contribute to the difficulty older adults experience understanding speech in noise. Given that musical experience positively impacts speech perception in noise in young adults (ages 18-30), we asked whether musical experience benefits an older cohort of musicians (ages 45-65), potentially offsetting the age-related decline in speech-in-noise perceptual abilities and associated cognitive function (i.e., working memory). Consistent with performance in young adults, older musicians demonstrated enhanced speech-in-noise perception relative to nonmusicians along with greater auditory, but not visual, working memory capacity. By demonstrating that speech-in-noise perception and related cognitive function are enhanced in older musicians, our results imply that musical training may reduce the impact of age-related auditory decline.
Musical Training during Early Childhood Enhances the Neural Encoding of Speech in Noise
ERIC Educational Resources Information Center
Strait, Dana L.; Parbery-Clark, Alexandra; Hittner, Emily; Kraus, Nina
2012-01-01
For children, learning often occurs in the presence of background noise. As such, there is growing desire to improve a child's access to a target signal in noise. Given adult musicians' perceptual and neural speech-in-noise enhancements, we asked whether similar effects are present in musically-trained children. We assessed the perception and…
ERIC Educational Resources Information Center
Mayer, Jennifer L.; Hannent, Ian; Heaton, Pamela F.
2016-01-01
Whilst enhanced perception has been widely reported in individuals with Autism Spectrum Disorders (ASDs), relatively little is known about the developmental trajectory and impact of atypical auditory processing on speech perception in intellectually high-functioning adults with ASD. This paper presents data on perception of complex tones and…
NASA Technical Reports Server (NTRS)
Kumar, P.; Lin, F. Y.; Vaishampayan, V.; Farvardin, N.
1986-01-01
A complete documentation of the software developed in the Communication and Signal Processing Laboratory (CSPL) during the period of July 1985 to March 1986 is provided. Utility programs and subroutines that were developed for a user-friendly image and speech processing environment are described. Additional programs for data compression of image and speech type signals are included. Also, programs for the zero-memory and block transform quantization in the presence of channel noise are described. Finally, several routines for simulating the perfromance of image compression algorithms are included.
Abdeltawwab, Mohamed M; Khater, Ahmed; El-Anwar, Mohammad W
2016-01-01
The combination of acoustic and electric stimulation as a way to enhance speech recognition performance in cochlear implant (CI) users has generated considerable interest in the recent years. The purpose of this study was to evaluate the bimodal advantage of the FS4 speech processing strategy in combination with hearing aids (HA) as a means to improve low-frequency resolution in CI patients. Nineteen postlingual CI adults were selected to participate in this study. All patients wore implants on one side and HA on the contralateral side with residual hearing. Monosyllabic word recognition, speech in noise, and emotion and talker identification were assessed using CI with fine structure processing/FS4 and high-definition continuous interleaved sampling strategies, HA alone, and a combination of CI and HA. The bimodal stimulation showed improvement in speech performance and emotion identification for the question/statement/order tasks, which was statistically significant compared to patients with CI alone, but there were no significant statistical differences in intragender talker discrimination and emotion identification for the happy/angry/neutral tasks. The poorest performance was obtained with HA only, and it was statistically significant compared to the other modalities. The bimodal stimulation showed enhanced speech performance in CI patients, and it improves the limitations provided by electric or acoustic stimulation alone. © 2016 S. Karger AG, Basel.
Private Speech Moderates the Effects of Effortful Control on Emotionality
ERIC Educational Resources Information Center
Day, Kimberly L.; Smith, Cynthia L.; Neal, Amy; Dunsmore, Julie C.
2018-01-01
Research Findings: In addition to being a regulatory strategy, children's private speech may enhance or interfere with their effortful control used to regulate emotion. The goal of the current study was to investigate whether children's private speech during a selective attention task moderated the relations of their effortful control to their…
Processing and Comprehension of Accented Speech by Monolingual and Bilingual Children
ERIC Educational Resources Information Center
McDonald, Margarethe; Gross, Megan; Buac, Milijana; Batko, Michelle; Kaushanskaya, Margarita
2018-01-01
This study tested the effect of Spanish-accented speech on sentence comprehension in children with different degrees of Spanish experience. The hypothesis was that earlier acquisition of Spanish would be associated with enhanced comprehension of Spanish-accented speech. Three groups of 5-6-year-old children were tested: monolingual…
Muthusamy, Hariharan; Polat, Kemal; Yaacob, Sazali
2015-01-01
In the recent years, many research works have been published using speech related features for speech emotion recognition, however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstralcoefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features and relative wavelet packet energy and entropy features were extracted from the emotional speech (ES) signals and its glottal waveforms(GW). Particle swarm optimization based clustering (PSOC) and wrapper based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features respectively. Three different emotional speech databases were utilized to gauge the proposed method. Extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted and the results show that the proposed method significantly improves the speech emotion recognition performance compared to previous works published in the literature. PMID:25799141
Enhancing Communication in Noisy Environments
2009-10-01
derived from the ITD and ILD cues, which are binaural . ITD depends on the azimuthal position of the source. Similarly, ILD refers to the fact...4.4 dB No Perceptual Binaural Speech Enhancement [42] 4.5 dB Yes Fuzzy Cocktail Party Processor [25] 7.5 dB Yes Binaural segregation [43] 8.9 dB No...modulation. IEEE Transactions on Neural Networks. 15 (2004): 1135-50. [42] Dong R. Perceptual Binaural Speech Enhancement in Noisy Environments. M.A.Sc
Sobol-Shikler, Tal; Robinson, Peter
2010-07-01
We present a classification algorithm for inferring affective states (emotions, mental states, attitudes, and the like) from their nonverbal expressions in speech. It is based on the observations that affective states can occur simultaneously and different sets of vocal features, such as intonation and speech rate, distinguish between nonverbal expressions of different affective states. The input to the inference system was a large set of vocal features and metrics that were extracted from each utterance. The classification algorithm conducted independent pairwise comparisons between nine affective-state groups. The classifier used various subsets of metrics of the vocal features and various classification algorithms for different pairs of affective-state groups. Average classification accuracy of the 36 pairwise machines was 75 percent, using 10-fold cross validation. The comparison results were consolidated into a single ranked list of the nine affective-state groups. This list was the output of the system and represented the inferred combination of co-occurring affective states for the analyzed utterance. The inference accuracy of the combined machine was 83 percent. The system automatically characterized over 500 affective state concepts from the Mind Reading database. The inference of co-occurring affective states was validated by comparing the inferred combinations to the lexical definitions of the labels of the analyzed sentences. The distinguishing capabilities of the system were comparable to human performance.
Automatic speech recognition research at NASA-Ames Research Center
NASA Technical Reports Server (NTRS)
Coler, Clayton R.; Plummer, Robert P.; Huff, Edward M.; Hitchcock, Myron H.
1977-01-01
A trainable acoustic pattern recognizer manufactured by Scope Electronics is presented. The voice command system VCS encodes speech by sampling 16 bandpass filters with center frequencies in the range from 200 to 5000 Hz. Variations in speaking rate are compensated for by a compression algorithm that subdivides each utterance into eight subintervals in such a way that the amount of spectral change within each subinterval is the same. The recorded filter values within each subinterval are then reduced to a 15-bit representation, giving a 120-bit encoding for each utterance. The VCS incorporates a simple recognition algorithm that utilizes five training samples of each word in a vocabulary of up to 24 words. The recognition rate of approximately 85 percent correct for untrained speakers and 94 percent correct for trained speakers was not considered adequate for flight systems use. Therefore, the built-in recognition algorithm was disabled, and the VCS was modified to transmit 120-bit encodings to an external computer for recognition.
Spectro-temporal cues enhance modulation sensitivity in cochlear implant users
Zheng, Yi; Escabí, Monty; Litovsky, Ruth Y.
2018-01-01
Although speech understanding is highly variable amongst cochlear implants (CIs) subjects, the remarkably high speech recognition performance of many CI users is unexpected and not well understood. Numerous factors, including neural health and degradation of the spectral information in the speech signal of CIs, likely contribute to speech understanding. We studied the ability to use spectro-temporal modulations, which may be critical for speech understanding and discrimination, and hypothesize that CI users adopt a different perceptual strategy than normal-hearing (NH) individuals, whereby they rely more heavily on joint spectro-temporal cues to enhance detection of auditory cues. Modulation detection sensitivity was studied in CI users and NH subjects using broadband “ripple” stimuli that were modulated spectrally, temporally, or jointly, i.e., spectro-temporally. The spectro-temporal modulation transfer functions of CI users and NH subjects was decomposed into spectral and temporal dimensions and compared to those subjects’ spectral-only and temporal-only modulation transfer functions. In CI users, the joint spectro-temporal sensitivity was better than that predicted by spectral-only and temporal-only sensitivity, indicating a heightened spectro-temporal sensitivity. Such an enhancement through the combined integration of spectral and temporal cues was not observed in NH subjects. The unique use of spectro-temporal cues by CI patients can yield benefits for use of cues that are important for speech understanding. This finding has implications for developing sound processing strategies that may rely on joint spectro-temporal modulations to improve speech comprehension of CI users, and the findings of this study may be valuable for developing clinical assessment tools to optimize CI processor performance. PMID:28601530
Perceptual learning of degraded speech by minimizing prediction error.
Sohoglu, Ediz; Davis, Matthew H
2016-03-22
Human perception is shaped by past experience on multiple timescales. Sudden and dramatic changes in perception occur when prior knowledge or expectations match stimulus content. These immediate effects contrast with the longer-term, more gradual improvements that are characteristic of perceptual learning. Despite extensive investigation of these two experience-dependent phenomena, there is considerable debate about whether they result from common or dissociable neural mechanisms. Here we test single- and dual-mechanism accounts of experience-dependent changes in perception using concurrent magnetoencephalographic and EEG recordings of neural responses evoked by degraded speech. When speech clarity was enhanced by prior knowledge obtained from matching text, we observed reduced neural activity in a peri-auditory region of the superior temporal gyrus (STG). Critically, longer-term improvements in the accuracy of speech recognition following perceptual learning resulted in reduced activity in a nearly identical STG region. Moreover, short-term neural changes caused by prior knowledge and longer-term neural changes arising from perceptual learning were correlated across subjects with the magnitude of learning-induced changes in recognition accuracy. These experience-dependent effects on neural processing could be dissociated from the neural effect of hearing physically clearer speech, which similarly enhanced perception but increased rather than decreased STG responses. Hence, the observed neural effects of prior knowledge and perceptual learning cannot be attributed to epiphenomenal changes in listening effort that accompany enhanced perception. Instead, our results support a predictive coding account of speech perception; computational simulations show how a single mechanism, minimization of prediction error, can drive immediate perceptual effects of prior knowledge and longer-term perceptual learning of degraded speech.
Perceptual learning of degraded speech by minimizing prediction error
Sohoglu, Ediz
2016-01-01
Human perception is shaped by past experience on multiple timescales. Sudden and dramatic changes in perception occur when prior knowledge or expectations match stimulus content. These immediate effects contrast with the longer-term, more gradual improvements that are characteristic of perceptual learning. Despite extensive investigation of these two experience-dependent phenomena, there is considerable debate about whether they result from common or dissociable neural mechanisms. Here we test single- and dual-mechanism accounts of experience-dependent changes in perception using concurrent magnetoencephalographic and EEG recordings of neural responses evoked by degraded speech. When speech clarity was enhanced by prior knowledge obtained from matching text, we observed reduced neural activity in a peri-auditory region of the superior temporal gyrus (STG). Critically, longer-term improvements in the accuracy of speech recognition following perceptual learning resulted in reduced activity in a nearly identical STG region. Moreover, short-term neural changes caused by prior knowledge and longer-term neural changes arising from perceptual learning were correlated across subjects with the magnitude of learning-induced changes in recognition accuracy. These experience-dependent effects on neural processing could be dissociated from the neural effect of hearing physically clearer speech, which similarly enhanced perception but increased rather than decreased STG responses. Hence, the observed neural effects of prior knowledge and perceptual learning cannot be attributed to epiphenomenal changes in listening effort that accompany enhanced perception. Instead, our results support a predictive coding account of speech perception; computational simulations show how a single mechanism, minimization of prediction error, can drive immediate perceptual effects of prior knowledge and longer-term perceptual learning of degraded speech. PMID:26957596
Selective attention in anxiety: distraction and enhancement in visual search.
Rinck, Mike; Becker, Eni S; Kellermann, Jana; Roth, Walton T
2003-01-01
According to cognitive models of anxiety, anxiety patients exhibit an attentional bias towards threat, manifested as greater distractibility by threat stimuli and enhanced detection of them. Both phenomena were studied in two experiments, using a modified visual search task, in which participants were asked to find single target words (GAD-related, speech-related, neutral, or positive) hidden in matrices made up of distractor words (also GAD-related, speech-related, neutral, or positive). Generalized anxiety disorder (GAD) patients, social phobia (SP) patients afraid of giving speeches, and healthy controls participated in the visual search task. GAD patients were slowed by GAD-related distractor words but did not show statistically reliable evidence of enhanced detection of GAD-related target words. SP patients showed neither distraction nor enhancement effects. These results extend previous findings of attentional biases observed with other experimental paradigms. Copyright 2003 Wiley-Liss, Inc.
Underwater speech communications with a modulated laser
NASA Astrophysics Data System (ADS)
Woodward, B.; Sari, H.
2008-04-01
A novel speech communications system using a modulated laser beam has been developed for short-range applications in which high directionality is an exploitable feature. Although it was designed for certain underwater applications, such as speech communications between divers or between a diver and the surface, it may equally be used for air applications. With some modification it could be used for secure diver-to-diver communications in the situation where untethered divers are swimming close together and do not want their conversations monitored by intruders. Unlike underwater acoustic communications, where the transmitted speech may be received at ranges of hundreds of metres omnidirectionally, a laser communication link is very difficult to intercept and also obviates the need for cables that become snagged or broken. Further applications include the transmission of speech and data, including the short message service (SMS), from a fixed installation such as a sea-bed habitat; and data transmission to and from an autonomous underwater vehicle (AUV), particularly during docking manoeuvres. The performance of the system has been assessed subjectively by listening tests, which revealed that the speech was intelligible, although of poor quality due to the speech algorithm used.
Identification of speech transients using variable frame rate analysis and wavelet packets.
Rasetshwane, Daniel M; Boston, J Robert; Li, Ching-Chung
2006-01-01
Speech transients are important cues for identifying and discriminating speech sounds. Yoo et al. and Tantibundhit et al. were successful in identifying speech transients and, emphasizing them, improving the intelligibility of speech in noise. However, their methods are computationally intensive and unsuitable for real-time applications. This paper presents a method to identify and emphasize speech transients that combines subband decomposition by the wavelet packet transform with variable frame rate (VFR) analysis and unvoiced consonant detection. The VFR analysis is applied to each wavelet packet to define a transitivity function that describes the extent to which the wavelet coefficients of that packet are changing. Unvoiced consonant detection is used to identify unvoiced consonant intervals and the transitivity function is amplified during these intervals. The wavelet coefficients are multiplied by the transitivity function for that packet, amplifying the coefficients localized at times when they are changing and attenuating coefficients at times when they are steady. Inverse transform of the modified wavelet packet coefficients produces a signal corresponding to speech transients similar to the transients identified by Yoo et al. and Tantibundhit et al. A preliminary implementation of the algorithm runs more efficiently.
[A research in speech endpoint detection based on boxes-coupling generalization dimension].
Wang, Zimei; Yang, Cuirong; Wu, Wei; Fan, Yingle
2008-06-01
In this paper, a new calculating method of generalized dimension, based on boxes-coupling principle, is proposed to overcome the edge effects and to improve the capability of the speech endpoint detection which is based on the original calculating method of generalized dimension. This new method has been applied to speech endpoint detection. Firstly, the length of overlapping border was determined, and through calculating the generalized dimension by covering the speech signal with overlapped boxes, three-dimension feature vectors including the box dimension, the information dimension and the correlation dimension were obtained. Secondly, in the light of the relation between feature distance and similarity degree, feature extraction was conducted by use of common distance. Lastly, bi-threshold method was used to classify the speech signals. The results of experiment indicated that, by comparison with the original generalized dimension (OGD) and the spectral entropy (SE) algorithm, the proposed method is more robust and effective for detecting the speech signals which contain different kinds of noise in different signal noise ratio (SNR), especially in low SNR.
LANDMARK-BASED SPEECH RECOGNITION: REPORT OF THE 2004 JOHNS HOPKINS SUMMER WORKSHOP.
Hasegawa-Johnson, Mark; Baker, James; Borys, Sarah; Chen, Ken; Coogan, Emily; Greenberg, Steven; Juneja, Amit; Kirchhoff, Katrin; Livescu, Karen; Mohan, Srividya; Muller, Jennifer; Sonmez, Kemal; Wang, Tianyu
2005-01-01
Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.
Inferring imagined speech using EEG signals: a new approach using Riemannian manifold features
NASA Astrophysics Data System (ADS)
Nguyen, Chuong H.; Karavas, George K.; Artemiadis, Panagiotis
2018-02-01
Objective. In this paper, we investigate the suitability of imagined speech for brain-computer interface (BCI) applications. Approach. A novel method based on covariance matrix descriptors, which lie in Riemannian manifold, and the relevance vector machines classifier is proposed. The method is applied on electroencephalographic (EEG) signals and tested in multiple subjects. Main results. The method is shown to outperform other approaches in the field with respect to accuracy and robustness. The algorithm is validated on various categories of speech, such as imagined pronunciation of vowels, short words and long words. The classification accuracy of our methodology is in all cases significantly above chance level, reaching a maximum of 70% for cases where we classify three words and 95% for cases of two words. Significance. The results reveal certain aspects that may affect the success of speech imagery classification from EEG signals, such as sound, meaning and word complexity. This can potentially extend the capability of utilizing speech imagery in future BCI applications. The dataset of speech imagery collected from total 15 subjects is also published.
Review of Speech-to-Text Recognition Technology for Enhancing Learning
ERIC Educational Resources Information Center
Shadiev, Rustam; Hwang, Wu-Yuin; Chen, Nian-Shing; Huang, Yueh-Min
2014-01-01
This paper reviewed literature from 1999 to 2014 inclusively on how Speech-to-Text Recognition (STR) technology has been applied to enhance learning. The first aim of this review is to understand how STR technology has been used to support learning over the past fifteen years, and the second is to analyze all research evidence to understand how…
ERIC Educational Resources Information Center
Mecrow, Carol; Beckwith, Jennie; Klee, Thomas
2010-01-01
Background: Increased demand for access to specialist services for providing support to children with speech, language and communication needs prompted a local service review of how best to allocate limited resources. This study arose as a consequence of a wish to evaluate the effectiveness of an enhanced consultative approach to delivering speech…
A Framework for Speech Activity Detection Using Adaptive Auditory Receptive Fields.
Carlin, Michael A; Elhilali, Mounya
2015-12-01
One of the hallmarks of sound processing in the brain is the ability of the nervous system to adapt to changing behavioral demands and surrounding soundscapes. It can dynamically shift sensory and cognitive resources to focus on relevant sounds. Neurophysiological studies indicate that this ability is supported by adaptively retuning the shapes of cortical spectro-temporal receptive fields (STRFs) to enhance features of target sounds while suppressing those of task-irrelevant distractors. Because an important component of human communication is the ability of a listener to dynamically track speech in noisy environments, the solution obtained by auditory neurophysiology implies a useful adaptation strategy for speech activity detection (SAD). SAD is an important first step in a number of automated speech processing systems, and performance is often reduced in highly noisy environments. In this paper, we describe how task-driven adaptation is induced in an ensemble of neurophysiological STRFs, and show how speech-adapted STRFs reorient themselves to enhance spectro-temporal modulations of speech while suppressing those associated with a variety of nonspeech sounds. We then show how an adapted ensemble of STRFs can better detect speech in unseen noisy environments compared to an unadapted ensemble and a noise-robust baseline. Finally, we use a stimulus reconstruction task to demonstrate how the adapted STRF ensemble better captures the spectrotemporal modulations of attended speech in clean and noisy conditions. Our results suggest that a biologically plausible adaptation framework can be applied to speech processing systems to dynamically adapt feature representations for improving noise robustness.
Data-Driven Subclassification of Speech Sound Disorders in Preschool Children
Vick, Jennell C.; Campbell, Thomas F.; Shriberg, Lawrence D.; Green, Jordan R.; Truemper, Klaus; Rusiewicz, Heather Leavy; Moore, Christopher A.
2015-01-01
Purpose The purpose of the study was to determine whether distinct subgroups of preschool children with speech sound disorders (SSD) could be identified using a subgroup discovery algorithm (SUBgroup discovery via Alternate Random Processes, or SUBARP). Of specific interest was finding evidence of a subgroup of SSD exhibiting performance consistent with atypical speech motor control. Method Ninety-seven preschool children with SSD completed speech and nonspeech tasks. Fifty-three kinematic, acoustic, and behavioral measures from these tasks were input to SUBARP. Results Two distinct subgroups were identified from the larger sample. The 1st subgroup (76%; population prevalence estimate = 67.8%–84.8%) did not have characteristics that would suggest atypical speech motor control. The 2nd subgroup (10.3%; population prevalence estimate = 4.3%– 16.5%) exhibited significantly higher variability in measures of articulatory kinematics and poor ability to imitate iambic lexical stress, suggesting atypical speech motor control. Both subgroups were consistent with classes of SSD in the Speech Disorders Classification System (SDCS; Shriberg et al., 2010a). Conclusion Characteristics of children in the larger subgroup were consistent with the proportionally large SDCS class termed speech delay; characteristics of children in the smaller subgroup were consistent with the SDCS subtype termed motor speech disorder—not otherwise specified. The authors identified candidate measures to identify children in each of these groups. PMID:25076005
A robotic voice simulator and the interactive training for hearing-impaired people.
Sawada, Hideyuki; Kitani, Mitsuki; Hayashi, Yasumori
2008-01-01
A talking and singing robot which adaptively learns the vocalization skill by means of an auditory feedback learning algorithm is being developed. The robot consists of motor-controlled vocal organs such as vocal cords, a vocal tract and a nasal cavity to generate a natural voice imitating a human vocalization. In this study, the robot is applied to the training system of speech articulation for the hearing-impaired, because the robot is able to reproduce their vocalization and to teach them how it is to be improved to generate clear speech. The paper briefly introduces the mechanical construction of the robot and how it autonomously acquires the vocalization skill in the auditory feedback learning by listening to human speech. Then the training system is described, together with the evaluation of the speech training by auditory impaired people.
Vector Sum Excited Linear Prediction (VSELP) speech coding at 4.8 kbps
NASA Technical Reports Server (NTRS)
Gerson, Ira A.; Jasiuk, Mark A.
1990-01-01
Code Excited Linear Prediction (CELP) speech coders exhibit good performance at data rates as low as 4800 bps. The major drawback to CELP type coders is their larger computational requirements. The Vector Sum Excited Linear Prediction (VSELP) speech coder utilizes a codebook with a structure which allows for a very efficient search procedure. Other advantages of the VSELP codebook structure is discussed and a detailed description of a 4.8 kbps VSELP coder is given. This coder is an improved version of the VSELP algorithm, which finished first in the NSA's evaluation of the 4.8 kbps speech coders. The coder uses a subsample resolution single tap long term predictor, a single VSELP excitation codebook, a novel gain quantizer which is robust to channel errors, and a new adaptive pre/postfilter arrangement.
The effects of gated speech on the fluency of speakers who stutter
Howell, Peter
2007-01-01
It is known that the speech of people who stutter improves when the speaker’s own vocalization is changed while the participant is speaking. One explanation of these effects is the disruptive rhythm hypothesis (DRH). DRH maintains that the manipulated sound only needs to disturb timing to affect speech control. The experiment investigated whether speech that was gated on and off (interrupted) affected the speech control of speakers who stutter. Eight children who stutter read a passage when they heard their voice normally and when the speech was gated. Fluency was enhanced (fewer errors were made and time to read a set passage was reduced) when speech was interrupted in this way. The results support the DRH. PMID:17726328
The effects of gated speech on the fluency of speakers who stutter.
Howell, Peter
2007-01-01
It is known that the speech of people who stutter improves when the speaker's own vocalization is changed while the participant is speaking. One explanation of these effects is the disruptive rhythm hypothesis (DRH). The DRH maintains that the manipulated sound only needs to disturb timing to affect speech control. The experiment investigated whether speech that was gated on and off (interrupted) affected the speech control of speakers who stutter. Eight children who stutter read a passage when they heard their voice normally and when the speech was gated. Fluency was enhanced (fewer errors were made and time to read a set passage was reduced) when speech was interrupted in this way. The results support the DRH. Copyright 2007 S. Karger AG, Basel.
Walking the talk--speech activates the leg motor cortex.
Liuzzi, Gianpiero; Ellger, Tanja; Flöel, Agnes; Breitenstein, Caterina; Jansen, Andreas; Knecht, Stefan
2008-09-01
Speech may have evolved from earlier modes of communication based on gestures. Consistent with such a motor theory of speech, cortical orofacial and hand motor areas are activated by both speech production and speech perception. However, the extent of speech-related activation of the motor cortex remains unclear. Therefore, we examined if reading and listening to continuous prose also activates non-brachiofacial motor representations like the leg motor cortex. We found corticospinal excitability of bilateral leg muscle representations to be enhanced by speech production and silent reading. Control experiments showed that speech production yielded stronger facilitation of the leg motor system than non-verbal tongue-mouth mobilization and silent reading more than a visuo-attentional task thus indicating speech-specificity of the effect. In the frame of the motor theory of speech this finding suggests that the system of gestural communication, from which speech may have evolved, is not confined to the hand but includes gestural movements of other body parts as well.
ERIC Educational Resources Information Center
Harrison, Emily; Wood, Clare; Holliman, Andrew J.; Vousden, Janet I.
2018-01-01
Despite empirical evidence of a relationship between sensitivity to speech rhythm and reading, there have been few studies that have examined the impact of rhythmic training on reading attainment, and no intervention study has focused on speech rhythm sensitivity specifically to enhance reading skills. Seventy-three typically developing 4- to…
Word Learning in Clear and Plain Speech in Quiet and Noisy Listening Conditions
ERIC Educational Resources Information Center
Riley, Kristine Marie Grohne
2010-01-01
Previous research demonstrates enhanced speech perception abilities for typically hearing and hearing-impaired listeners when speakers use clear versus plain speech, particularly in the presence of background noise. To date, very few studies have investigated the effects of noise on word learning and no studies have examined the effects of clear…
Effects of Audio-Visual Integration on the Detection of Masked Speech and Non-Speech Sounds
ERIC Educational Resources Information Center
Eramudugolla, Ranmalee; Henderson, Rachel; Mattingley, Jason B.
2011-01-01
Integration of simultaneous auditory and visual information about an event can enhance our ability to detect that event. This is particularly evident in the perception of speech, where the articulatory gestures of the speaker's lips and face can significantly improve the listener's detection and identification of the message, especially when that…
The analysis and detection of hypernasality based on a formant extraction algorithm
NASA Astrophysics Data System (ADS)
Qian, Jiahui; Fu, Fanglin; Liu, Xinyi; He, Ling; Yin, Heng; Zhang, Han
2017-08-01
In the clinical practice, the effective assessment of cleft palate speech disorders is important. For hypernasal speech, the resonance between nasal cavity and oral cavity causes an additional nasal formant. Thus, the formant frequency is a crucial cue for the judgment of hypernasality in cleft palate speech. Due to the existence of nasal formant, the peak merger occurs to the spectrum of nasal speech more often. However, the peak merger could not be solved by classical linear prediction coefficient root extraction method. In this paper, a method is proposed to detect the additional nasal formant in low-frequency region and obtain the formant frequency. The experiment results show that the proposed method could locate the nasal formant preferably. Moreover, the formants are regarded as the extraction features to proceed the detection of hypernasality. 436 phonemes, which are collected from Hospital of Stomatology, are used to carry out the experiment. The detection accuracy of hypernasality in cleft palate speech is 95.2%.
Strahl, Stefan; Mertins, Alfred
2008-07-18
Evidence that neurosensory systems use sparse signal representations as well as improved performance of signal processing algorithms using sparse signal models raised interest in sparse signal coding in the last years. For natural audio signals like speech and environmental sounds, gammatone atoms have been derived as expansion functions that generate a nearly optimal sparse signal model (Smith, E., Lewicki, M., 2006. Efficient auditory coding. Nature 439, 978-982). Furthermore, gammatone functions are established models for the human auditory filters. Thus far, a practical application of a sparse gammatone signal model has been prevented by the fact that deriving the sparsest representation is, in general, computationally intractable. In this paper, we applied an accelerated version of the matching pursuit algorithm for gammatone dictionaries allowing real-time and large data set applications. We show that a sparse signal model in general has advantages in audio coding and that a sparse gammatone signal model encodes speech more efficiently in terms of sparseness than a sparse modified discrete cosine transform (MDCT) signal model. We also show that the optimal gammatone parameters derived for English speech do not match the human auditory filters, suggesting for signal processing applications to derive the parameters individually for each applied signal class instead of using psychometrically derived parameters. For brain research, it means that care should be taken with directly transferring findings of optimality for technical to biological systems.
Automated analysis of free speech predicts psychosis onset in high-risk youths
Bedi, Gillinder; Carrillo, Facundo; Cecchi, Guillermo A; Slezak, Diego Fernández; Sigman, Mariano; Mota, Natália B; Ribeiro, Sidarta; Javitt, Daniel C; Copelli, Mauro; Corcoran, Cheryl M
2015-01-01
Background/Objectives: Psychiatry lacks the objective clinical tests routinely used in other specializations. Novel computerized methods to characterize complex behaviors such as speech could be used to identify and predict psychiatric illness in individuals. AIMS: In this proof-of-principle study, our aim was to test automated speech analyses combined with Machine Learning to predict later psychosis onset in youths at clinical high-risk (CHR) for psychosis. Methods: Thirty-four CHR youths (11 females) had baseline interviews and were assessed quarterly for up to 2.5 years; five transitioned to psychosis. Using automated analysis, transcripts of interviews were evaluated for semantic and syntactic features predicting later psychosis onset. Speech features were fed into a convex hull classification algorithm with leave-one-subject-out cross-validation to assess their predictive value for psychosis outcome. The canonical correlation between the speech features and prodromal symptom ratings was computed. Results: Derived speech features included a Latent Semantic Analysis measure of semantic coherence and two syntactic markers of speech complexity: maximum phrase length and use of determiners (e.g., which). These speech features predicted later psychosis development with 100% accuracy, outperforming classification from clinical interviews. Speech features were significantly correlated with prodromal symptoms. Conclusions: Findings support the utility of automated speech analysis to measure subtle, clinically relevant mental state changes in emergent psychosis. Recent developments in computer science, including natural language processing, could provide the foundation for future development of objective clinical tests for psychiatry. PMID:27336038
Should visual speech cues (speechreading) be considered when fitting hearing aids?
NASA Astrophysics Data System (ADS)
Grant, Ken
2002-05-01
When talker and listener are face-to-face, visual speech cues become an important part of the communication environment, and yet, these cues are seldom considered when designing hearing aids. Models of auditory-visual speech recognition highlight the importance of complementary versus redundant speech information for predicting auditory-visual recognition performance. Thus, for hearing aids to work optimally when visual speech cues are present, it is important to know whether the cues provided by amplification and the cues provided by speechreading complement each other. In this talk, data will be reviewed that show nonmonotonicity between auditory-alone speech recognition and auditory-visual speech recognition, suggesting that efforts designed solely to improve auditory-alone recognition may not always result in improved auditory-visual recognition. Data will also be presented showing that one of the most important speech cues for enhancing auditory-visual speech recognition performance, voicing, is often the cue that benefits least from amplification.
Relative Salience of Speech Rhythm and Speech Rate on Perceived Foreign Accent in a Second Language.
Polyanskaya, Leona; Ordin, Mikhail; Busa, Maria Grazia
2017-09-01
We investigated the independent contribution of speech rate and speech rhythm to perceived foreign accent. To address this issue we used a resynthesis technique that allows neutralizing segmental and tonal idiosyncrasies between identical sentences produced by French learners of English at different proficiency levels and maintaining the idiosyncrasies pertaining to prosodic timing patterns. We created stimuli that (1) preserved the idiosyncrasies in speech rhythm while controlling for the differences in speech rate between the utterances; (2) preserved the idiosyncrasies in speech rate while controlling for the differences in speech rhythm between the utterances; and (3) preserved the idiosyncrasies both in speech rate and speech rhythm. All the stimuli were created in intoned (with imposed intonational contour) and flat (with monotonized, constant F0) conditions. The original and the resynthesized sentences were rated by native speakers of English for degree of foreign accent. We found that both speech rate and speech rhythm influence the degree of perceived foreign accent, but the effect of speech rhythm is larger than that of speech rate. We also found that intonation enhances the perception of fine differences in rhythmic patterns but reduces the perceptual salience of fine differences in speech rate.
Binaural model-based dynamic-range compression.
Ernst, Stephan M A; Kortlang, Steffen; Grimm, Giso; Bisitz, Thomas; Kollmeier, Birger; Ewert, Stephan D
2018-01-26
Binaural cues such as interaural level differences (ILDs) are used to organise auditory perception and to segregate sound sources in complex acoustical environments. In bilaterally fitted hearing aids, dynamic-range compression operating independently at each ear potentially alters these ILDs, thus distorting binaural perception and sound source segregation. A binaurally-linked model-based fast-acting dynamic compression algorithm designed to approximate the normal-hearing basilar membrane (BM) input-output function in hearing-impaired listeners is suggested. A multi-center evaluation in comparison with an alternative binaural and two bilateral fittings was performed to assess the effect of binaural synchronisation on (a) speech intelligibility and (b) perceived quality in realistic conditions. 30 and 12 hearing impaired (HI) listeners were aided individually with the algorithms for both experimental parts, respectively. A small preference towards the proposed model-based algorithm in the direct quality comparison was found. However, no benefit of binaural-synchronisation regarding speech intelligibility was found, suggesting a dominant role of the better ear in all experimental conditions. The suggested binaural synchronisation of compression algorithms showed a limited effect on the tested outcome measures, however, linking could be situationally beneficial to preserve a natural binaural perception of the acoustical environment.
Nittrouer, Susan; Tarr, Eric; Wucinich, Taylor; Moberly, Aaron C.; Lowenstein, Joanna H.
2015-01-01
Broadened auditory filters associated with sensorineural hearing loss have clearly been shown to diminish speech recognition in noise for adults, but far less is known about potential effects for children. This study examined speech recognition in noise for adults and children using simulated auditory filters of different widths. Specifically, 5 groups (20 listeners each) of adults or children (5 and 7 yrs), were asked to recognize sentences in speech-shaped noise. Seven-year-olds listened at 0 dB signal-to-noise ratio (SNR) only; 5-yr-olds listened at +3 or 0 dB SNR; and adults listened at 0 or −3 dB SNR. Sentence materials were processed both to smear the speech spectrum (i.e., simulate broadened filters), and to enhance the spectrum (i.e., simulate narrowed filters). Results showed: (1) Spectral smearing diminished recognition for listeners of all ages; (2) spectral enhancement did not improve recognition, and in fact diminished it somewhat; and (3) interactions were observed between smearing and SNR, but only for adults. That interaction made age effects difficult to gauge. Nonetheless, it was concluded that efforts to diagnose the extent of broadening of auditory filters and to develop techniques to correct this condition could benefit patients with hearing loss, especially children. PMID:25920851
NASA Astrophysics Data System (ADS)
Lecun, Yann; Bengio, Yoshua; Hinton, Geoffrey
2015-05-01
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Deep Learning Based Binaural Speech Separation in Reverberant Environments.
Zhang, Xueliang; Wang, DeLiang
2017-05-01
Speech signal is usually degraded by room reverberation and additive noises in real environments. This paper focuses on separating target speech signal in reverberant conditions from binaural inputs. Binaural separation is formulated as a supervised learning problem, and we employ deep learning to map from both spatial and spectral features to a training target. With binaural inputs, we first apply a fixed beamformer and then extract several spectral features. A new spatial feature is proposed and extracted to complement the spectral features. The training target is the recently suggested ideal ratio mask. Systematic evaluations and comparisons show that the proposed system achieves very good separation performance and substantially outperforms related algorithms under challenging multi-source and reverberant environments.
LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey
2015-05-28
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Predictive top-down integration of prior knowledge during speech perception.
Sohoglu, Ediz; Peelle, Jonathan E; Carlyon, Robert P; Davis, Matthew H
2012-06-20
A striking feature of human perception is that our subjective experience depends not only on sensory information from the environment but also on our prior knowledge or expectations. The precise mechanisms by which sensory information and prior knowledge are integrated remain unclear, with longstanding disagreement concerning whether integration is strictly feedforward or whether higher-level knowledge influences sensory processing through feedback connections. Here we used concurrent EEG and MEG recordings to determine how sensory information and prior knowledge are integrated in the brain during speech perception. We manipulated listeners' prior knowledge of speech content by presenting matching, mismatching, or neutral written text before a degraded (noise-vocoded) spoken word. When speech conformed to prior knowledge, subjective perceptual clarity was enhanced. This enhancement in clarity was associated with a spatiotemporal profile of brain activity uniquely consistent with a feedback process: activity in the inferior frontal gyrus was modulated by prior knowledge before activity in lower-level sensory regions of the superior temporal gyrus. In parallel, we parametrically varied the level of speech degradation, and therefore the amount of sensory detail, so that changes in neural responses attributable to sensory information and prior knowledge could be directly compared. Although sensory detail and prior knowledge both enhanced speech clarity, they had an opposite influence on the evoked response in the superior temporal gyrus. We argue that these data are best explained within the framework of predictive coding in which sensory activity is compared with top-down predictions and only unexplained activity propagated through the cortical hierarchy.
Voice Quality Modelling for Expressive Speech Synthesis
Socoró, Joan Claudi
2014-01-01
This paper presents the perceptual experiments that were carried out in order to validate the methodology of transforming expressive speech styles using voice quality (VoQ) parameters modelling, along with the well-known prosody (F 0, duration, and energy), from a neutral style into a number of expressive ones. The main goal was to validate the usefulness of VoQ in the enhancement of expressive synthetic speech in terms of speech quality and style identification. A harmonic plus noise model (HNM) was used to modify VoQ and prosodic parameters that were extracted from an expressive speech corpus. Perception test results indicated the improvement of obtained expressive speech styles using VoQ modelling along with prosodic characteristics. PMID:24587738
NASA Astrophysics Data System (ADS)
Přibil, Jiří; Přibilová, Anna; Frollo, Ivan
2017-12-01
The paper focuses on two methods of evaluation of successfulness of speech signal enhancement recorded in the open-air magnetic resonance imager during phonation for the 3D human vocal tract modeling. The first approach enables to obtain a comparison based on statistical analysis by ANOVA and hypothesis tests. The second method is based on classification by Gaussian mixture models (GMM). The performed experiments have confirmed that the proposed ANOVA and GMM classifiers for automatic evaluation of the speech quality are functional and produce fully comparable results with the standard evaluation based on the listening test method.
Spectro-temporal cues enhance modulation sensitivity in cochlear implant users.
Zheng, Yi; Escabí, Monty; Litovsky, Ruth Y
2017-08-01
Although speech understanding is highly variable amongst cochlear implants (CIs) subjects, the remarkably high speech recognition performance of many CI users is unexpected and not well understood. Numerous factors, including neural health and degradation of the spectral information in the speech signal of CIs, likely contribute to speech understanding. We studied the ability to use spectro-temporal modulations, which may be critical for speech understanding and discrimination, and hypothesize that CI users adopt a different perceptual strategy than normal-hearing (NH) individuals, whereby they rely more heavily on joint spectro-temporal cues to enhance detection of auditory cues. Modulation detection sensitivity was studied in CI users and NH subjects using broadband "ripple" stimuli that were modulated spectrally, temporally, or jointly, i.e., spectro-temporally. The spectro-temporal modulation transfer functions of CI users and NH subjects was decomposed into spectral and temporal dimensions and compared to those subjects' spectral-only and temporal-only modulation transfer functions. In CI users, the joint spectro-temporal sensitivity was better than that predicted by spectral-only and temporal-only sensitivity, indicating a heightened spectro-temporal sensitivity. Such an enhancement through the combined integration of spectral and temporal cues was not observed in NH subjects. The unique use of spectro-temporal cues by CI patients can yield benefits for use of cues that are important for speech understanding. This finding has implications for developing sound processing strategies that may rely on joint spectro-temporal modulations to improve speech comprehension of CI users, and the findings of this study may be valuable for developing clinical assessment tools to optimize CI processor performance. Copyright © 2017 Elsevier B.V. All rights reserved.
Neural Processing of Congruent and Incongruent Audiovisual Speech in School-Age Children and Adults
ERIC Educational Resources Information Center
Heikkilä, Jenni; Tiippana, Kaisa; Loberg, Otto; Leppänen, Paavo H. T.
2018-01-01
Seeing articulatory gestures enhances speech perception. Perception of auditory speech can even be changed by incongruent visual gestures, which is known as the McGurk effect (e.g., dubbing a voice saying /mi/ onto a face articulating /ni/, observers often hear /ni/). In children, the McGurk effect is weaker than in adults, but no previous…
Integrated Spacesuit Audio System Enhances Speech Quality and Reduces Noise
NASA Technical Reports Server (NTRS)
Huang, Yiteng Arden; Chen, Jingdong; Chen, Shaoyan Sharyl
2009-01-01
A new approach has been proposed for increasing astronaut comfort and speech capture. Currently, the special design of a spacesuit forms an extreme acoustic environment making it difficult to capture clear speech without compromising comfort. The proposed Integrated Spacesuit Audio (ISA) system is to incorporate the microphones into the helmet and use software to extract voice signals from background noise.
ERIC Educational Resources Information Center
Boucher, Victor J.
2006-01-01
Language learning requires a capacity to recall novel series of speech sounds. Research shows that prosodic marks create grouping effects enhancing serial recall. However, any restriction on memory affecting the reproduction of prosody would limit the set of patterns that could be learned and subsequently used in speech. By implication, grouping…
Sample-based engine noise synthesis using an enhanced pitch-synchronous overlap-and-add method.
Jagla, Jan; Maillard, Julien; Martin, Nadine
2012-11-01
An algorithm for the real time synthesis of internal combustion engine noise is presented. Through the analysis of a recorded engine noise signal of continuously varying engine speed, a dataset of sound samples is extracted allowing the real time synthesis of the noise induced by arbitrary evolutions of engine speed. The sound samples are extracted from a recording spanning the entire engine speed range. Each sample is delimitated such as to contain the sound emitted during one cycle of the engine plus the necessary overlap to ensure smooth transitions during the synthesis. The proposed approach, an extension of the PSOLA method introduced for speech processing, takes advantage of the specific periodicity of engine noise signals to locate the extraction instants of the sound samples. During the synthesis stage, the sound samples corresponding to the target engine speed evolution are concatenated with an overlap and add algorithm. It is shown that this method produces high quality audio restitution with a low computational load. It is therefore well suited for real time applications.
Representations, Approximations, and Algorithms for Mathematical Speech Processing
1998-06-16
location on the basilar membrane was very low (i.e., any given location responded well to a broad range of frequencies ); so theorists had trouble...are variants of the signal-to- noise ratio (SNR). SNR measures compare the energy of the signal with the energy of the noise (defined as the difference...segment m and frequency band j, and 0"^ • and cr^mj- are the variances for band j and segment m of the original speech and noise , respectively
Perceptual restoration of degraded speech is preserved with advancing age.
Saija, Jefta D; Akyürek, Elkan G; Andringa, Tjeerd C; Başkent, Deniz
2014-02-01
Cognitive skills, such as processing speed, memory functioning, and the ability to divide attention, are known to diminish with aging. The present study shows that, despite these changes, older adults can successfully compensate for degradations in speech perception. Critically, the older participants of this study were not pre-selected for high performance on cognitive tasks, but only screened for normal hearing. We measured the compensation for speech degradation using phonemic restoration, where intelligibility of degraded speech is enhanced using top-down repair mechanisms. Linguistic knowledge, Gestalt principles of perception, and expectations based on situational and linguistic context are used to effectively fill in the inaudible masked speech portions. A positive compensation effect was previously observed only with young normal hearing people, but not with older hearing-impaired populations, leaving the question whether the lack of compensation was due to aging or due to age-related hearing problems. Older participants in the present study showed poorer intelligibility of degraded speech than the younger group, as expected from previous reports of aging effects. However, in conditions that induce top-down restoration, a robust compensation was observed. Speech perception by the older group was enhanced, and the enhancement effect was similar to that observed with the younger group. This effect was even stronger with slowed-down speech, which gives more time for cognitive processing. Based on previous research, the likely explanations for these observations are that older adults can overcome age-related cognitive deterioration by relying on linguistic skills and vocabulary that they have accumulated over their lifetime. Alternatively, or simultaneously, they may use different cerebral activation patterns or exert more mental effort. This positive finding on top-down restoration skills by the older individuals suggests that new cognitive training methods can teach older adults to effectively use compensatory mechanisms to cope with the complex listening environments of everyday life.
Multi-stream LSTM-HMM decoding and histogram equalization for noise robust keyword spotting.
Wöllmer, Martin; Marchi, Erik; Squartini, Stefano; Schuller, Björn
2011-09-01
Highly spontaneous, conversational, and potentially emotional and noisy speech is known to be a challenge for today's automatic speech recognition (ASR) systems, which highlights the need for advanced algorithms that improve speech features and models. Histogram Equalization is an efficient method to reduce the mismatch between clean and noisy conditions by normalizing all moments of the probability distribution of the feature vector components. In this article, we propose to combine histogram equalization and multi-condition training for robust keyword detection in noisy speech. To better cope with conversational speaking styles, we show how contextual information can be effectively exploited in a multi-stream ASR framework that dynamically models context-sensitive phoneme estimates generated by a long short-term memory neural network. The proposed techniques are evaluated on the SEMAINE database-a corpus containing emotionally colored conversations with a cognitive system for "Sensitive Artificial Listening".
Keidser, Gitte; Hartley, David; Carter, Lyndal
2008-12-01
To investigate the long-term benefit of multichannel wide dynamic range compression (WDRC) alone and in combination with directional microphones and noise reduction/speech enhancement for listeners with severe or profound hearing loss. At the conclusion of a research project, 39 participants with severe or profound hearing loss were fitted with WDRC in one program and WDRC with directional microphones and speech enhancement enabled in a 2nd program. More than 2 years after the 1st participants exited the project, a retrospective survey was conducted to determine the participants' use of, and satisfaction with, the 2 programs. From the 30 returned questionnaires, it seems that WDRC is used with a high degree of satisfaction in general everyday listening situations. The reported benefit from the addition of a directional microphone and speech enhancement for listening in noisy environments was lower and varied among the users. This variable was significantly correlated with how much the program was used. The less frequent and more varied use of the program with directional microphones and speech enhancement activated in combination suggests that these features may be best offered in a 2nd listening program for listeners with severe or profound hearing loss.
Characterizing Speech Intelligibility in Noise After Wide Dynamic Range Compression.
Rhebergen, Koenraad S; Maalderink, Thijs H; Dreschler, Wouter A
The effects of nonlinear signal processing on speech intelligibility in noise are difficult to evaluate. Often, the effects are examined by comparing speech intelligibility scores with and without processing measured at fixed signal to noise ratios (SNRs) or by comparing the adaptive measured speech reception thresholds corresponding to 50% intelligibility (SRT50) with and without processing. These outcome measures might not be optimal. Measuring at fixed SNRs can be affected by ceiling or floor effects, because the range of relevant SNRs is not know in advance. The SRT50 is less time consuming, has a fixed performance level (i.e., 50% correct), but the SRT50 could give a limited view, because we hypothesize that the effect of most nonlinear signal processing algorithms at the SRT50 cannot be generalized to other points of the psychometric function. In this article, we tested the value of estimating the entire psychometric function. We studied the effect of wide dynamic range compression (WDRC) on speech intelligibility in stationary, and interrupted speech-shaped noise in normal-hearing subjects, using a fast method-based local linear fitting approach and by two adaptive procedures. The measured performance differences for conditions with and without WDRC for the psychometric functions in stationary noise and interrupted speech-shaped noise show that the effects of WDRC on speech intelligibility are SNR dependent. We conclude that favorable and unfavorable effects of WDRC on speech intelligibility can be missed if the results are presented in terms of SRT50 values only.
Robust speaker's location detection in a vehicle environment using GMM models.
Hu, Jwu-Sheng; Cheng, Chieh-Cheng; Liu, Wei-Han
2006-04-01
Abstract-Human-computer interaction (HCI) using speech communication is becoming increasingly important, especially in driving where safety is the primary concern. Knowing the speaker's location (i.e., speaker localization) not only improves the enhancement results of a corrupted signal, but also provides assistance to speaker identification. Since conventional speech localization algorithms suffer from the uncertainties of environmental complexity and noise, as well as from the microphone mismatch problem, they are frequently not robust in practice. Without a high reliability, the acceptance of speech-based HCI would never be realized. This work presents a novel speaker's location detection method and demonstrates high accuracy within a vehicle cabinet using a single linear microphone array. The proposed approach utilize Gaussian mixture models (GMM) to model the distributions of the phase differences among the microphones caused by the complex characteristic of room acoustic and microphone mismatch. The model can be applied both in near-field and far-field situations in a noisy environment. The individual Gaussian component of a GMM represents some general location-dependent but content and speaker-independent phase difference distributions. Moreover, the scheme performs well not only in nonline-of-sight cases, but also when the speakers are aligned toward the microphone array but at difference distances from it. This strong performance can be achieved by exploiting the fact that the phase difference distributions at different locations are distinguishable in the environment of a car. The experimental results also show that the proposed method outperforms the conventional multiple signal classification method (MUSIC) technique at various SNRs.
Hu, Yi
2010-05-01
Recent research results show that combined electric and acoustic stimulation (EAS) significantly improves speech recognition in noise, and it is generally established that access to the improved F0 representation of target speech, along with the glimpse cues, provide the EAS benefits. Under noisy listening conditions, noise signals degrade these important cues by introducing undesired temporal-frequency components and corrupting harmonics structure. In this study, the potential of combining noise reduction and harmonics regeneration techniques was investigated to further improve speech intelligibility in noise by providing improved beneficial cues for EAS. Three hypotheses were tested: (1) noise reduction methods can improve speech intelligibility in noise for EAS; (2) harmonics regeneration after noise reduction can further improve speech intelligibility in noise for EAS; and (3) harmonics sideband constraints in frequency domain (or equivalently, amplitude modulation in temporal domain), even deterministic ones, can provide additional benefits. Test results demonstrate that combining noise reduction and harmonics regeneration can significantly improve speech recognition in noise for EAS, and it is also beneficial to preserve the harmonics sidebands under adverse listening conditions. This finding warrants further work into the development of algorithms that regenerate harmonics and the related sidebands for EAS processing under noisy conditions.
Speech rhythm alterations in Spanish-speaking individuals with Alzheimer's disease.
Martínez-Sánchez, Francisco; Meilán, Juan J G; Vera-Ferrandiz, Juan Antonio; Carro, Juan; Pujante-Valverde, Isabel M; Ivanova, Olga; Carcavilla, Nuria
2017-07-01
Rhythm is the speech property related to the temporal organization of sounds. Considerable evidence is now available for suggesting that dementia of Alzheimer's type is associated with impairments in speech rhythm. The aim of this study is to assess the use of an automatic computerized system for measuring speech rhythm characteristics in an oral reading task performed by 45 patients with Alzheimer's disease (AD) compared with those same characteristics among 82 healthy older adults without a diagnosis of dementia, and matched by age, sex and cultural background. Ranges of rhythmic-metric and clinical measurements were applied. The results show rhythmic differences between the groups, with higher variability of syllabic intervals in AD patients. Signal processing algorithms applied to oral reading recordings prove to be capable of differentiating between AD patients and older adults without dementia with an accuracy of 87% (specificity 81.7%, sensitivity 82.2%), based on the standard deviation of the duration of syllabic intervals. Experimental results show that the syllabic variability measurements extracted from the speech signal can be used to distinguish between older adults without a diagnosis of dementia and those with AD, and may be useful as a tool for the objective study and quantification of speech deficits in AD.
ERIC Educational Resources Information Center
Katongo, Emily Mwamba; Ndhlovu, Daniel
2015-01-01
This study sought to establish the role of music in speech intelligibility of learners with Post Lingual Hearing Impairment (PLHI) and strategies teachers used to enhance speech intelligibility in learners with PLHI in selected special units for the deaf in Lusaka district. The study used a descriptive research design. Qualitative and quantitative…
ERIC Educational Resources Information Center
Eichorn, Naomi; Marton, Klara; Schwartz, Richard G.; Melara, Robert D.; Pirutinsky, Steven
2016-01-01
Purpose: The present study examined whether engaging working memory in a secondary task benefits speech fluency. Effects of dual-task conditions on speech fluency, rate, and errors were examined with respect to predictions derived from three related theoretical accounts of disfluencies. Method: Nineteen adults who stutter and twenty adults who do…
Speech perception in individuals with auditory dys-synchrony.
Kumar, U A; Jayaram, M
2011-03-01
This study aimed to evaluate the effect of lengthening the transition duration of selected speech segments upon the perception of those segments in individuals with auditory dys-synchrony. Thirty individuals with auditory dys-synchrony participated in the study, along with 30 age-matched normal hearing listeners. Eight consonant-vowel syllables were used as auditory stimuli. Two experiments were conducted. Experiment one measured the 'just noticeable difference' time: the smallest prolongation of the speech sound transition duration which was noticeable by the subject. In experiment two, speech sounds were modified by lengthening the transition duration by multiples of the just noticeable difference time, and subjects' speech identification scores for the modified speech sounds were assessed. Subjects with auditory dys-synchrony demonstrated poor processing of temporal auditory information. Lengthening of speech sound transition duration improved these subjects' perception of both the placement and voicing features of the speech syllables used. These results suggest that innovative speech processing strategies which enhance temporal cues may benefit individuals with auditory dys-synchrony.
Visual speech information: a help or hindrance in perceptual processing of dysarthric speech.
Borrie, Stephanie A
2015-03-01
This study investigated the influence of visual speech information on perceptual processing of neurologically degraded speech. Fifty listeners identified spastic dysarthric speech under both audio (A) and audiovisual (AV) conditions. Condition comparisons revealed that the addition of visual speech information enhanced processing of the neurologically degraded input in terms of (a) acuity (percent phonemes correct) of vowels and consonants and (b) recognition (percent words correct) of predictive and nonpredictive phrases. Listeners exploited stress-based segmentation strategies more readily in AV conditions, suggesting that the perceptual benefit associated with adding visual speech information to the auditory signal-the AV advantage-has both segmental and suprasegmental origins. Results also revealed that the magnitude of the AV advantage can be predicted, to some degree, by the extent to which an individual utilizes syllabic stress cues to inform word recognition in AV conditions. Findings inform the development of a listener-specific model of speech perception that applies to processing of dysarthric speech in everyday communication contexts.
Fluid-acoustic interactions and their impact on pathological voiced speech
NASA Astrophysics Data System (ADS)
Erath, Byron D.; Zanartu, Matias; Peterson, Sean D.; Plesniak, Michael W.
2011-11-01
Voiced speech is produced by vibration of the vocal fold structures. Vocal fold dynamics arise from aerodynamic pressure loadings, tissue properties, and acoustic modulation of the driving pressures. Recent speech science advancements have produced a physiologically-realistic fluid flow solver (BLEAP) capable of prescribing asymmetric intraglottal flow attachment that can be easily assimilated into reduced order models of speech. The BLEAP flow solver is extended to incorporate acoustic loading and sound propagation in the vocal tract by implementing a wave reflection analog approach for sound propagation based on the governing BLEAP equations. This enhanced physiological description of the physics of voiced speech is implemented into a two-mass model of speech. The impact of fluid-acoustic interactions on vocal fold dynamics is elucidated for both normal and pathological speech through linear and nonlinear analysis techniques. Supported by NSF Grant CBET-1036280.
Anderson, Karen L; Goldstein, Howard
2004-04-01
Children typically learn in classroom environments that have background noise and reverberation that interfere with accurate speech perception. Amplification technology can enhance the speech perception of students who are hard of hearing. This study used a single-subject alternating treatments design to compare the speech recognition abilities of children who are, hard of hearing when they were using hearing aids with each of three frequency modulated (FM) or infrared devices. Eight 9-12-year-olds with mild to severe hearing loss repeated Hearing in Noise Test (HINT) sentence lists under controlled conditions in a typical kindergarten classroom with a background noise level of +10 dB signal-to-noise (S/N) ratio and 1.1 s reverberation time. Participants listened to HINT lists using hearing aids alone and hearing aids in combination with three types of S/N-enhancing devices that are currently used in mainstream classrooms: (a) FM systems linked to personal hearing aids, (b) infrared sound field systems with speakers placed throughout the classroom, and (c) desktop personal sound field FM systems. The infrared ceiling sound field system did not provide benefit beyond that provided by hearing aids alone. Desktop and personal FM systems in combination with personal hearing aids provided substantial improvements in speech recognition. This information can assist in making S/N-enhancing device decisions for students using hearing aids. In a reverberant and noisy classroom setting, classroom sound field devices are not beneficial to speech perception for students with hearing aids, whereas either personal FM or desktop sound field systems provide listening benefits.
Lip reading using neural networks
NASA Astrophysics Data System (ADS)
Kalbande, Dhananjay; Mishra, Akassh A.; Patil, Sanjivani; Nirgudkar, Sneha; Patel, Prashant
2011-10-01
Computerized lip reading, or speech reading, is concerned with the difficult task of converting a video signal of a speaking person to written text. It has several applications like teaching deaf and dumb to speak and communicate effectively with the other people, its crime fighting potential and invariance to acoustic environment. We convert the video of the subject speaking vowels into images and then images are further selected manually for processing. However, several factors like fast speech, bad pronunciation, and poor illumination, movement of face, moustaches and beards make lip reading difficult. Contour tracking methods and Template matching are used for the extraction of lips from the face. K Nearest Neighbor algorithm is then used to classify the 'speaking' images and the 'silent' images. The sequence of images is then transformed into segments of utterances. Feature vector is calculated on each frame for all the segments and is stored in the database with properly labeled class. Character recognition is performed using modified KNN algorithm which assigns more weight to nearer neighbors. This paper reports the recognition of vowels using KNN algorithms
On the recognition of emotional vocal expressions: motivations for a holistic approach.
Esposito, Anna; Esposito, Antonietta M
2012-10-01
Human beings seem to be able to recognize emotions from speech very well and information communication technology aims to implement machines and agents that can do the same. However, to be able to automatically recognize affective states from speech signals, it is necessary to solve two main technological problems. The former concerns the identification of effective and efficient processing algorithms capable of capturing emotional acoustic features from speech sentences. The latter focuses on finding computational models able to classify, with an approximation as good as human listeners, a given set of emotional states. This paper will survey these topics and provide some insights for a holistic approach to the automatic analysis, recognition and synthesis of affective states.
NASA Astrophysics Data System (ADS)
Balbin, Jessie R.; Padilla, Dionis A.; Fausto, Janette C.; Vergara, Ernesto M.; Garcia, Ramon G.; Delos Angeles, Bethsedea Joy S.; Dizon, Neil John A.; Mardo, Mark Kevin N.
2017-02-01
This research is about translating series of hand gesture to form a word and produce its equivalent sound on how it is read and said in Filipino accent using Support Vector Machine and Mel Frequency Cepstral Coefficient analysis. The concept is to detect Filipino speech input and translate the spoken words to their text form in Filipino. This study is trying to help the Filipino deaf community to impart their thoughts through the use of hand gestures and be able to communicate to people who do not know how to read hand gestures. This also helps literate deaf to simply read the spoken words relayed to them using the Filipino speech to text system.
Auditory-Perceptual Learning Improves Speech Motor Adaptation in Children
Shiller, Douglas M.; Rochon, Marie-Lyne
2015-01-01
Auditory feedback plays an important role in children’s speech development by providing the child with information about speech outcomes that is used to learn and fine-tune speech motor plans. The use of auditory feedback in speech motor learning has been extensively studied in adults by examining oral motor responses to manipulations of auditory feedback during speech production. Children are also capable of adapting speech motor patterns to perceived changes in auditory feedback, however it is not known whether their capacity for motor learning is limited by immature auditory-perceptual abilities. Here, the link between speech perceptual ability and the capacity for motor learning was explored in two groups of 5–7-year-old children who underwent a period of auditory perceptual training followed by tests of speech motor adaptation to altered auditory feedback. One group received perceptual training on a speech acoustic property relevant to the motor task while a control group received perceptual training on an irrelevant speech contrast. Learned perceptual improvements led to an enhancement in speech motor adaptation (proportional to the perceptual change) only for the experimental group. The results indicate that children’s ability to perceive relevant speech acoustic properties has a direct influence on their capacity for sensory-based speech motor adaptation. PMID:24842067
Speech identification in noise: Contribution of temporal, spectral, and visual speech cues.
Kim, Jeesun; Davis, Chris; Groot, Christopher
2009-12-01
This study investigated the degree to which two types of reduced auditory signals (cochlear implant simulations) and visual speech cues combined for speech identification. The auditory speech stimuli were filtered to have only amplitude envelope cues or both amplitude envelope and spectral cues and were presented with/without visual speech. In Experiment 1, IEEE sentences were presented in quiet and noise. For in-quiet presentation, speech identification was enhanced by the addition of both spectral and visual speech cues. Due to a ceiling effect, the degree to which these effects combined could not be determined. In noise, these facilitation effects were more marked and were additive. Experiment 2 examined consonant and vowel identification in the context of CVC or VCV syllables presented in noise. For consonants, both spectral and visual speech cues facilitated identification and these effects were additive. For vowels, the effect of combined cues was underadditive, with the effect of spectral cues reduced when presented with visual speech cues. Analysis indicated that without visual speech, spectral cues facilitated the transmission of place information and vowel height, whereas with visual speech, they facilitated lip rounding, with little impact on the transmission of place information.
A Speech Controlled Information-Retrieval System,
1983-01-01
instance, monitoring the speed of articulation continuously could lead to a faster time warping algorithm by restricting the amount of overlapping of...M E (1975) "LEX - a lexical analyser generator" CSTR 39, Bell Laboratories. ’.
Rapid tuning shifts in human auditory cortex enhance speech intelligibility
Holdgraf, Christopher R.; de Heer, Wendy; Pasley, Brian; Rieger, Jochem; Crone, Nathan; Lin, Jack J.; Knight, Robert T.; Theunissen, Frédéric E.
2016-01-01
Experience shapes our perception of the world on a moment-to-moment basis. This robust perceptual effect of experience parallels a change in the neural representation of stimulus features, though the nature of this representation and its plasticity are not well-understood. Spectrotemporal receptive field (STRF) mapping describes the neural response to acoustic features, and has been used to study contextual effects on auditory receptive fields in animal models. We performed a STRF plasticity analysis on electrophysiological data from recordings obtained directly from the human auditory cortex. Here, we report rapid, automatic plasticity of the spectrotemporal response of recorded neural ensembles, driven by previous experience with acoustic and linguistic information, and with a neurophysiological effect in the sub-second range. This plasticity reflects increased sensitivity to spectrotemporal features, enhancing the extraction of more speech-like features from a degraded stimulus and providing the physiological basis for the observed ‘perceptual enhancement' in understanding speech. PMID:27996965
Space station interior noise analysis program
NASA Technical Reports Server (NTRS)
Stusnick, E.; Burn, M.
1987-01-01
Documentation is provided for a microcomputer program which was developed to evaluate the effect of the vibroacoustic environment on speech communication inside a space station. The program, entitled Space Station Interior Noise Analysis Program (SSINAP), combines a Statistical Energy Analysis (SEA) prediction of sound and vibration levels within the space station with a speech intelligibility model based on the Modulation Transfer Function and the Speech Transmission Index (MTF/STI). The SEA model provides an effective analysis tool for predicting the acoustic environment based on proposed space station design. The MTF/STI model provides a method for evaluating speech communication in the relatively reverberant and potentially noisy environments that are likely to occur in space stations. The combinations of these two models provides a powerful analysis tool for optimizing the acoustic design of space stations from the point of view of speech communications. The mathematical algorithms used in SSINAP are presented to implement the SEA and MTF/STI models. An appendix provides an explanation of the operation of the program along with details of the program structure and code.
Direct speech quotations promote low relative-clause attachment in silent reading of English.
Yao, Bo; Scheepers, Christoph
2018-07-01
The implicit prosody hypothesis (Fodor, 1998, 2002) proposes that silent reading coincides with a default, implicit form of prosody to facilitate sentence processing. Recent research demonstrated that a more vivid form of implicit prosody is mentally simulated during silent reading of direct speech quotations (e.g., Mary said, "This dress is beautiful"), with neural and behavioural consequences (e.g., Yao, Belin, & Scheepers, 2011; Yao & Scheepers, 2011). Here, we explored the relation between 'default' and 'simulated' implicit prosody in the context of relative-clause (RC) attachment in English. Apart from confirming a general low RC-attachment preference in both production (Experiment 1) and comprehension (Experiments 2 and 3), we found that during written sentence completion (Experiment 1) or when reading silently (Experiment 2), the low RC-attachment preference was reliably enhanced when the critical sentences were embedded in direct speech quotations as compared to indirect speech or narrative sentences. However, when reading aloud (Experiment 3), direct speech did not enhance the general low RC-attachment preference. The results from Experiments 1 and 2 suggest a quantitative boost to implicit prosody (via auditory perceptual simulation) during silent production/comprehension of direct speech. By contrast, when reading aloud (Experiment 3), prosody becomes equally salient across conditions due to its explicit nature; indirect speech and narrative sentences thus become as susceptible to prosody-induced syntactic biases as direct speech. The present findings suggest a shared cognitive basis between default implicit prosody and simulated implicit prosody, providing a new platform for studying the effects of implicit prosody on sentence processing. Copyright © 2018 Elsevier B.V. All rights reserved.
The contribution of dynamic visual cues to audiovisual speech perception.
Jaekl, Philip; Pesquita, Ana; Alsius, Agnes; Munhall, Kevin; Soto-Faraco, Salvador
2015-08-01
Seeing a speaker's facial gestures can significantly improve speech comprehension, especially in noisy environments. However, the nature of the visual information from the speaker's facial movements that is relevant for this enhancement is still unclear. Like auditory speech signals, visual speech signals unfold over time and contain both dynamic configural information and luminance-defined local motion cues; two information sources that are thought to engage anatomically and functionally separate visual systems. Whereas, some past studies have highlighted the importance of local, luminance-defined motion cues in audiovisual speech perception, the contribution of dynamic configural information signalling changes in form over time has not yet been assessed. We therefore attempted to single out the contribution of dynamic configural information to audiovisual speech processing. To this aim, we measured word identification performance in noise using unimodal auditory stimuli, and with audiovisual stimuli. In the audiovisual condition, speaking faces were presented as point light displays achieved via motion capture of the original talker. Point light displays could be isoluminant, to minimise the contribution of effective luminance-defined local motion information, or with added luminance contrast, allowing the combined effect of dynamic configural cues and local motion cues. Audiovisual enhancement was found in both the isoluminant and contrast-based luminance conditions compared to an auditory-only condition, demonstrating, for the first time the specific contribution of dynamic configural cues to audiovisual speech improvement. These findings imply that globally processed changes in a speaker's facial shape contribute significantly towards the perception of articulatory gestures and the analysis of audiovisual speech. Copyright © 2015 Elsevier Ltd. All rights reserved.
A survey of the state-of-the-art and focused research in range systems, task 1
NASA Technical Reports Server (NTRS)
Omura, J. K.
1986-01-01
This final report presents the latest research activity in voice compression. We have designed a non-real time simulation system that is implemented around the IBM-PC where the IBM-PC is used as a speech work station for data acquisition and analysis of voice samples. A real-time implementation is also proposed. This real-time Voice Compression Board (VCB) is built around the Texas Instruments TMS-3220. The voice compression algorithm investigated here was described in an earlier report titled, Low Cost Voice Compression for Mobile Digital Radios, by the author. We will assume the reader is familiar with the voice compression algorithm discussed in this report. The VCB compresses speech waveforms at data rates ranging from 4.8 K bps to 16 K bps. This board interfaces to the IBM-PC 8-bit bus, and plugs into a single expansion slot on the mother board.
An empirical generative framework for computational modeling of language acquisition.
Waterfall, Heidi R; Sandbank, Ben; Onnis, Luca; Edelman, Shimon
2010-06-01
This paper reports progress in developing a computer model of language acquisition in the form of (1) a generative grammar that is (2) algorithmically learnable from realistic corpus data, (3) viable in its large-scale quantitative performance and (4) psychologically real. First, we describe new algorithmic methods for unsupervised learning of generative grammars from raw CHILDES data and give an account of the generative performance of the acquired grammars. Next, we summarize findings from recent longitudinal and experimental work that suggests how certain statistically prominent structural properties of child-directed speech may facilitate language acquisition. We then present a series of new analyses of CHILDES data indicating that the desired properties are indeed present in realistic child-directed speech corpora. Finally, we suggest how our computational results, behavioral findings, and corpus-based insights can be integrated into a next-generation model aimed at meeting the four requirements of our modeling framework.
Maas, Edwin; Mailend, Marja-Liisa
2012-10-01
The purpose of this article is to present an argument for the use of online reaction time (RT) methods to the study of apraxia of speech (AOS) and to review the existing small literature in this area and the contributions it has made to our fundamental understanding of speech planning (deficits) in AOS. Following a brief description of limitations of offline perceptual methods, we provide a narrative review of various types of RT paradigms from the (speech) motor programming and psycholinguistic literatures and their (thus far limited) application with AOS. On the basis of the review of the literature, we conclude that with careful consideration of potential challenges and caveats, RT approaches hold great promise to advance our understanding of AOS, in particular with respect to the speech planning processes that generate the speech signal before initiation. A deeper understanding of the nature and time course of speech planning and its disruptions in AOS may enhance diagnosis and treatment for AOS. Only a handful of published studies on apraxia of speech have used reaction time methods. However, these studies have provided deeper insight into speech planning impairments in AOS based on a variety of experimental paradigms.
Hidden Markov models in automatic speech recognition
NASA Astrophysics Data System (ADS)
Wrzoskowicz, Adam
1993-11-01
This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMM's designed to operate in a noisy environment. The author also describes a language for human-robot communications which was defined as a complex multilevel network from an HMM model of speech sounds geared towards Polish inflections. The author also added mandatory lexical and syntactic rules to the system for its communications vocabulary.
Implementation of a Localization System for Sensor Networks
2006-05-18
its N-point DFT is mathematically formulated as X[k] = N−1∑ n=0 x[n] W nkN , k = 0, 1, . . . , N − 1 (7.1) W knN = e −j(2π/N)kn (7.2) There are two...distributed ad-hoc wireless sensor networks. In Int. Conf. on Acoustics, Speech , and Signal Proc. (ICASSP), pages 2037 – 2040, Salt Lake City, UT. [18] J...Stability of recursive qrd-ls algorithms using finite- precision systolic array implementation. IEEE Trans. on Acoustics, Speech , and Signal Proc., 37(5
ERIC Educational Resources Information Center
Leaby, Margaret M.; Walsh, Irene P.
2008-01-01
The importance of learning about and applying clinical discourse analysis to enhance the talk in interaction in the speech-language pathology clinic is discussed. The benefits of analyzing clinical discourse to explicate therapy dynamics are described.
High visual resolution matters in audiovisual speech perception, but only for some.
Alsius, Agnès; Wayne, Rachel V; Paré, Martin; Munhall, Kevin G
2016-07-01
The basis for individual differences in the degree to which visual speech input enhances comprehension of acoustically degraded speech is largely unknown. Previous research indicates that fine facial detail is not critical for visual enhancement when auditory information is available; however, these studies did not examine individual differences in ability to make use of fine facial detail in relation to audiovisual speech perception ability. Here, we compare participants based on their ability to benefit from visual speech information in the presence of an auditory signal degraded with noise, modulating the resolution of the visual signal through low-pass spatial frequency filtering and monitoring gaze behavior. Participants who benefited most from the addition of visual information (high visual gain) were more adversely affected by the removal of high spatial frequency information, compared to participants with low visual gain, for materials with both poor and rich contextual cues (i.e., words and sentences, respectively). Differences as a function of gaze behavior between participants with the highest and lowest visual gains were observed only for words, with participants with the highest visual gain fixating longer on the mouth region. Our results indicate that the individual variance in audiovisual speech in noise performance can be accounted for, in part, by better use of fine facial detail information extracted from the visual signal and increased fixation on mouth regions for short stimuli. Thus, for some, audiovisual speech perception may suffer when the visual input (in addition to the auditory signal) is less than perfect.
Samson, F; Zeffiro, T A; Doyon, J; Benali, H; Mottron, L
2015-09-01
A continuum of phenotypes makes up the autism spectrum (AS). In particular, individuals show large differences in language acquisition, ranging from precocious speech to severe speech onset delay. However, the neurological origin of this heterogeneity remains unknown. Here, we sought to determine whether AS individuals differing in speech acquisition show different cortical responses to auditory stimulation and morphometric brain differences. Whole-brain activity following exposure to non-social sounds was investigated. Individuals in the AS were classified according to the presence or absence of Speech Onset Delay (AS-SOD and AS-NoSOD, respectively) and were compared with IQ-matched typically developing individuals (TYP). AS-NoSOD participants displayed greater task-related activity than TYP in the inferior frontal gyrus and peri-auditory middle and superior temporal gyri, which are associated with language processing. Conversely, the AS-SOD group only showed enhanced activity in the vicinity of the auditory cortex. We detected no differences in brain structure between groups. This is the first study to demonstrate the existence of differences in functional brain activity between AS individuals divided according to their pattern of speech development. These findings support the Trigger-threshold-target model and indicate that the occurrence of speech onset delay in AS individuals depends on the location of cortical functional reallocation, which favors perception in AS-SOD and language in AS-NoSOD. Copyright © 2015 The Authors. Published by Elsevier Ltd.. All rights reserved.
Two Stage Data Augmentation for Low Resourced Speech Recognition (Author’s Manuscript)
2016-09-12
speech recognition, deep neural networks, data augmentation 1. Introduction When training data is limited—whether it be audio or text—the obvious...Schwartz, and S. Tsakalidis, “Enhancing low resource keyword spotting with au- tomatically retrieved web documents,” in Interspeech, 2015, pp. 839–843. [2...and F. Seide, “Feature learning in deep neural networks - a study on speech recognition tasks,” in International Conference on Learning Representations
A dynamic multi-channel speech enhancement system for distributed microphones in a car environment
NASA Astrophysics Data System (ADS)
Matheja, Timo; Buck, Markus; Fingscheidt, Tim
2013-12-01
Supporting multiple active speakers in automotive hands-free or speech dialog applications is an interesting issue not least due to comfort reasons. Therefore, a multi-channel system for enhancement of speech signals captured by distributed distant microphones in a car environment is presented. Each of the potential speakers in the car has a dedicated directional microphone close to his position that captures the corresponding speech signal. The aim of the resulting overall system is twofold: On the one hand, a combination of an arbitrary pre-defined subset of speakers' signals can be performed, e.g., to create an output signal in a hands-free telephone conference call for a far-end communication partner. On the other hand, annoying cross-talk components from interfering sound sources occurring in multiple different mixed output signals are to be eliminated, motivated by the possibility of other hands-free applications being active in parallel. The system includes several signal processing stages. A dedicated signal processing block for interfering speaker cancellation attenuates the cross-talk components of undesired speech. Further signal enhancement comprises the reduction of residual cross-talk and background noise. Subsequently, a dynamic signal combination stage merges the processed single-microphone signals to obtain appropriate mixed signals at the system output that may be passed to applications such as telephony or a speech dialog system. Based on signal power ratios between the particular microphone signals, an appropriate speaker activity detection and therewith a robust control mechanism of the whole system is presented. The proposed system may be dynamically configured and has been evaluated for a car setup with four speakers sitting in the car cabin disturbed in various noise conditions.
Enhanced perception of pitch changes in speech and music in early blind adults.
Arnaud, Laureline; Gracco, Vincent; Ménard, Lucie
2018-06-12
It is well known that congenitally blind adults have enhanced auditory processing for some tasks. For instance, they show supra-normal capacity to perceive accelerated speech. However, only a few studies have investigated basic auditory processing in this population. In this study, we investigated if pitch processing enhancement in the blind is a domain-general or domain-specific phenomenon, and if pitch processing shares the same properties as in the sighted regarding how scores from different domains are associated. Fifteen congenitally blind adults and fifteen sighted adults participated in the study. We first created a set of personalized native and non-native vowel stimuli using an identification and rating task. Then, an adaptive discrimination paradigm was used to determine the frequency difference limen for pitch direction identification of speech (native and non-native vowels) and non-speech stimuli (musical instruments and pure tones). The results show that the blind participants had better discrimination thresholds than controls for native vowels, music stimuli, and pure tones. Whereas within the blind group, the discrimination thresholds were smaller for musical stimuli than speech stimuli, replicating previous findings in sighted participants, we did not find this effect in the current control group. Further analyses indicate that older sighted participants show higher thresholds for instrument sounds compared to speech sounds. This effect of age was not found in the blind group. Moreover, the scores across domains were not associated to the same extent in the blind as they were in the sighted. In conclusion, in addition to providing further evidence of compensatory auditory mechanisms in early blind individuals, our results point to differences in how auditory processing is modulated in this population. Copyright © 2018 Elsevier Ltd. All rights reserved.
Kittilstved, Tiffani; Reilly, Kevin J.; Harkrider, Ashley W.; Casenhiser, Devin; Thornton, David; Jenson, David E.; Hedinger, Tricia; Bowers, Andrew L.; Saltuklaroglu, Tim
2018-01-01
Objective: To determine whether changes in sensorimotor control resulting from speaking conditions that induce fluency in people who stutter (PWS) can be measured using electroencephalographic (EEG) mu rhythms in neurotypical speakers. Methods: Non-stuttering (NS) adults spoke in one control condition (solo speaking) and four experimental conditions (choral speech, delayed auditory feedback (DAF), prolonged speech and pseudostuttering). Independent component analysis (ICA) was used to identify sensorimotor μ components from EEG recordings. Time-frequency analyses measured μ-alpha (8–13 Hz) and μ-beta (15–25 Hz) event-related synchronization (ERS) and desynchronization (ERD) during each speech condition. Results: 19/24 participants contributed μ components. Relative to the control condition, the choral and DAF conditions elicited increases in μ-alpha ERD in the right hemisphere. In the pseudostuttering condition, increases in μ-beta ERD were observed in the left hemisphere. No differences were present between the prolonged speech and control conditions. Conclusions: Differences observed in the experimental conditions are thought to reflect sensorimotor control changes. Increases in right hemisphere μ-alpha ERD likely reflect increased reliance on auditory information, including auditory feedback, during the choral and DAF conditions. In the left hemisphere, increases in μ-beta ERD during pseudostuttering may have resulted from the different movement characteristics of this task compared with the solo speaking task. Relationships to findings in stuttering are discussed. Significance: Changes in sensorimotor control related feedforward and feedback control in fluency-enhancing speech manipulations can be measured using time-frequency decompositions of EEG μ rhythms in neurotypical speakers. This quiet, non-invasive, and temporally sensitive technique may be applied to learn more about normal sensorimotor control and fluency enhancement in PWS. PMID:29670516
Kumar, Veena; Croxson, Paula L; Simonyan, Kristina
2016-04-13
The laryngeal motor cortex (LMC) is essential for the production of learned vocal behaviors because bilateral damage to this area renders humans unable to speak but has no apparent effect on innate vocalizations such as human laughing and crying or monkey calls. Several hypotheses have been put forward attempting to explain the evolutionary changes from monkeys to humans that potentially led to enhanced LMC functionality for finer motor control of speech production. These views, however, remain limited to the position of the larynx area within the motor cortex, as well as its connections with the phonatory brainstem regions responsible for the direct control of laryngeal muscles. Using probabilistic diffusion tractography in healthy humans and rhesus monkeys, we show that, whereas the LMC structural network is largely comparable in both species, the LMC establishes nearly 7-fold stronger connectivity with the somatosensory and inferior parietal cortices in humans than in macaques. These findings suggest that important "hard-wired" components of the human LMC network controlling the laryngeal component of speech motor output evolved from an already existing, similar network in nonhuman primates. However, the evolution of enhanced LMC-parietal connections likely allowed for more complex synchrony of higher-order sensorimotor coordination, proprioceptive and tactile feedback, and modulation of learned voice for speech production. The role of the primary motor cortex in the formation of a comprehensive network controlling speech and language has been long underestimated and poorly studied. Here, we provide comparative and quantitative evidence for the significance of this region in the control of a highly learned and uniquely human behavior: speech production. From the viewpoint of structural network organization, we discuss potential evolutionary advances of enhanced temporoparietal cortical connections with the laryngeal motor cortex in humans compared with nonhuman primates that may have contributed to the development of finer vocal motor control necessary for speech production. Copyright © 2016 the authors 0270-6474/16/364170-12$15.00/0.
Davis, Matthew H.
2016-01-01
Successful perception depends on combining sensory input with prior knowledge. However, the underlying mechanism by which these two sources of information are combined is unknown. In speech perception, as in other domains, two functionally distinct coding schemes have been proposed for how expectations influence representation of sensory evidence. Traditional models suggest that expected features of the speech input are enhanced or sharpened via interactive activation (Sharpened Signals). Conversely, Predictive Coding suggests that expected features are suppressed so that unexpected features of the speech input (Prediction Errors) are processed further. The present work is aimed at distinguishing between these two accounts of how prior knowledge influences speech perception. By combining behavioural, univariate, and multivariate fMRI measures of how sensory detail and prior expectations influence speech perception with computational modelling, we provide evidence in favour of Prediction Error computations. Increased sensory detail and informative expectations have additive behavioural and univariate neural effects because they both improve the accuracy of word report and reduce the BOLD signal in lateral temporal lobe regions. However, sensory detail and informative expectations have interacting effects on speech representations shown by multivariate fMRI in the posterior superior temporal sulcus. When prior knowledge was absent, increased sensory detail enhanced the amount of speech information measured in superior temporal multivoxel patterns, but with informative expectations, increased sensory detail reduced the amount of measured information. Computational simulations of Sharpened Signals and Prediction Errors during speech perception could both explain these behavioural and univariate fMRI observations. However, the multivariate fMRI observations were uniquely simulated by a Prediction Error and not a Sharpened Signal model. The interaction between prior expectation and sensory detail provides evidence for a Predictive Coding account of speech perception. Our work establishes methods that can be used to distinguish representations of Prediction Error and Sharpened Signals in other perceptual domains. PMID:27846209
Automatic detection of articulation disorders in children with cleft lip and palate.
Maier, Andreas; Hönig, Florian; Bocklet, Tobias; Nöth, Elmar; Stelzle, Florian; Nkenke, Emeka; Schuster, Maria
2009-11-01
Speech of children with cleft lip and palate (CLP) is sometimes still disordered even after adequate surgical and nonsurgical therapies. Such speech shows complex articulation disorders, which are usually assessed perceptually, consuming time and manpower. Hence, there is a need for an easy to apply and reliable automatic method. To create a reference for an automatic system, speech data of 58 children with CLP were assessed perceptually by experienced speech therapists for characteristic phonetic disorders at the phoneme level. The first part of the article aims to detect such characteristics by a semiautomatic procedure and the second to evaluate a fully automatic, thus simple, procedure. The methods are based on a combination of speech processing algorithms. The semiautomatic method achieves moderate to good agreement (kappa approximately 0.6) for the detection of all phonetic disorders. On a speaker level, significant correlations between the perceptual evaluation and the automatic system of 0.89 are obtained. The fully automatic system yields a correlation on the speaker level of 0.81 to the perceptual evaluation. This correlation is in the range of the inter-rater correlation of the listeners. The automatic speech evaluation is able to detect phonetic disorders at an experts'level without any additional human postprocessing.
Studies in automatic speech recognition and its application in aerospace
NASA Astrophysics Data System (ADS)
Taylor, Michael Robinson
Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.
Is automatic speech-to-text transcription ready for use in psychological experiments?
Ziman, Kirsten; Heusser, Andrew C; Fitzpatrick, Paxton C; Field, Campbell E; Manning, Jeremy R
2018-04-23
Verbal responses are a convenient and naturalistic way for participants to provide data in psychological experiments (Salzinger, The Journal of General Psychology, 61(1),65-94:1959). However, audio recordings of verbal responses typically require additional processing, such as transcribing the recordings into text, as compared with other behavioral response modalities (e.g., typed responses, button presses, etc.). Further, the transcription process is often tedious and time-intensive, requiring human listeners to manually examine each moment of recorded speech. Here we evaluate the performance of a state-of-the-art speech recognition algorithm (Halpern et al., 2016) in transcribing audio data into text during a list-learning experiment. We compare transcripts made by human annotators to the computer-generated transcripts. Both sets of transcripts matched to a high degree and exhibited similar statistical properties, in terms of the participants' recall performance and recall dynamics that the transcripts captured. This proof-of-concept study suggests that speech-to-text engines could provide a cheap, reliable, and rapid means of automatically transcribing speech data in psychological experiments. Further, our findings open the door for verbal response experiments that scale to thousands of participants (e.g., administered online), as well as a new generation of experiments that decode speech on the fly and adapt experimental parameters based on participants' prior responses.
The Unsupervised Acquisition of a Lexicon from Continuous Speech.
1995-11-01
Com- munication, 2(1):57{89, 1982. [42] J. Ziv and A. Lempel . Compression of individual sequences by variable rate coding. IEEE Trans- actions on...parameters of the compression algorithm , in a never-ending attempt to identify and eliminate the predictable. They lead us to a class of grammars in...the rst 10 sentences of the test set, previously unseen by the algorithm . Vertical bars indicate word boundaries. 7.1 Text Compression and Language
Generalization of Perceptual Learning of Vocoded Speech
ERIC Educational Resources Information Center
Hervais-Adelman, Alexis G.; Davis, Matthew H.; Johnsrude, Ingrid S.; Taylor, Karen J.; Carlyon, Robert P.
2011-01-01
Recent work demonstrates that learning to understand noise-vocoded (NV) speech alters sublexical perceptual processes but is enhanced by the simultaneous provision of higher-level, phonological, but not lexical content (Hervais-Adelman, Davis, Johnsrude, & Carlyon, 2008), consistent with top-down learning (Davis, Johnsrude, Hervais-Adelman,…
Strategies for distant speech recognitionin reverberant environments
NASA Astrophysics Data System (ADS)
Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro
2015-12-01
Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.
Study of wavelet packet energy entropy for emotion classification in speech and glottal signals
NASA Astrophysics Data System (ADS)
He, Ling; Lech, Margaret; Zhang, Jing; Ren, Xiaomei; Deng, Lihua
2013-07-01
The automatic speech emotion recognition has important applications in human-machine communication. Majority of current research in this area is focused on finding optimal feature parameters. In recent studies, several glottal features were examined as potential cues for emotion differentiation. In this study, a new type of feature parameter is proposed, which calculates energy entropy on values within selected Wavelet Packet frequency bands. The modeling and classification tasks are conducted using the classical GMM algorithm. The experiments use two data sets: the Speech Under Simulated Emotion (SUSE) data set annotated with three different emotions (angry, neutral and soft) and Berlin Emotional Speech (BES) database annotated with seven different emotions (angry, bored, disgust, fear, happy, sad and neutral). The average classification accuracy achieved for the SUSE data (74%-76%) is significantly higher than the accuracy achieved for the BES data (51%-54%). In both cases, the accuracy was significantly higher than the respective random guessing levels (33% for SUSE and 14.3% for BES).
Fifty years of progress in acoustic phonetics
NASA Astrophysics Data System (ADS)
Stevens, Kenneth N.
2004-10-01
Three events that occurred 50 or 60 years ago shaped the study of acoustic phonetics, and in the following few decades these events influenced research and applications in speech disorders, speech development, speech synthesis, speech recognition, and other subareas in speech communication. These events were: (1) the source-filter theory of speech production (Chiba and Kajiyama; Fant); (2) the development of the sound spectrograph and its interpretation (Potter, Kopp, and Green; Joos); and (3) the birth of research that related distinctive features to acoustic patterns (Jakobson, Fant, and Halle). Following these events there has been systematic exploration of the articulatory, acoustic, and perceptual bases of phonological categories, and some quantification of the sources of variability in the transformation of this phonological representation of speech into its acoustic manifestations. This effort has been enhanced by studies of how children acquire language in spite of this variability and by research on speech disorders. Gaps in our knowledge of this inherent variability in speech have limited the directions of applications such as synthesis and recognition of speech, and have led to the implementation of data-driven techniques rather than theoretical principles. Some examples of advances in our knowledge, and limitations of this knowledge, are reviewed.
Acoustic Event Detection and Classification
NASA Astrophysics Data System (ADS)
Temko, Andrey; Nadeu, Climent; Macho, Dušan; Malkin, Robert; Zieger, Christian; Omologo, Maurizio
The human activity that takes place in meeting rooms or classrooms is reflected in a rich variety of acoustic events (AE), produced either by the human body or by objects handled by humans, so the determination of both the identity of sounds and their position in time may help to detect and describe that human activity. Indeed, speech is usually the most informative sound, but other kinds of AEs may also carry useful information, for example, clapping or laughing inside a speech, a strong yawn in the middle of a lecture, a chair moving or a door slam when the meeting has just started. Additionally, detection and classification of sounds other than speech may be useful to enhance the robustness of speech technologies like automatic speech recognition.
Digital signal processing algorithms for automatic voice recognition
NASA Technical Reports Server (NTRS)
Botros, Nazeih M.
1987-01-01
The current digital signal analysis algorithms are investigated that are implemented in automatic voice recognition algorithms. Automatic voice recognition means, the capability of a computer to recognize and interact with verbal commands. The digital signal is focused on, rather than the linguistic, analysis of speech signal. Several digital signal processing algorithms are available for voice recognition. Some of these algorithms are: Linear Predictive Coding (LPC), Short-time Fourier Analysis, and Cepstrum Analysis. Among these algorithms, the LPC is the most widely used. This algorithm has short execution time and do not require large memory storage. However, it has several limitations due to the assumptions used to develop it. The other 2 algorithms are frequency domain algorithms with not many assumptions, but they are not widely implemented or investigated. However, with the recent advances in the digital technology, namely signal processors, these 2 frequency domain algorithms may be investigated in order to implement them in voice recognition. This research is concerned with real time, microprocessor based recognition algorithms.
Advancements in robust algorithm formulation for speaker identification of whispered speech
NASA Astrophysics Data System (ADS)
Fan, Xing
Whispered speech is an alternative speech production mode from neutral speech, which is used by talkers intentionally in natural conversational scenarios to protect privacy and to avoid certain content from being overheard/made public. Due to the profound differences between whispered and neutral speech in production mechanism and the absence of whispered adaptation data, the performance of speaker identification systems trained with neutral speech degrades significantly. This dissertation therefore focuses on developing a robust closed-set speaker recognition system for whispered speech by using no or limited whispered adaptation data from non-target speakers. This dissertation proposes the concept of "High''/"Low'' performance whispered data for the purpose of speaker identification. A variety of acoustic properties are identified that contribute to the quality of whispered data. An acoustic analysis is also conducted to compare the phoneme/speaker dependency of the differences between whispered and neutral data in the feature domain. The observations from those acoustic analysis are new in this area and also serve as a guidance for developing robust speaker identification systems for whispered speech. This dissertation further proposes two systems for speaker identification of whispered speech. One system focuses on front-end processing. A two-dimensional feature space is proposed to search for "Low''-quality performance based whispered utterances and separate feature mapping functions are applied to vowels and consonants respectively in order to retain the speaker's information shared between whispered and neutral speech. The other system focuses on speech-mode-independent model training. The proposed method generates pseudo whispered features from neutral features by using the statistical information contained in a whispered Universal Background model (UBM) trained from extra collected whispered data from non-target speakers. Four modeling methods are proposed for the transformation estimation in order to generate the pseudo whispered features. Both of the above two systems demonstrate a significant improvement over the baseline system on the evaluation data. This dissertation has therefore contributed to providing a scientific understanding of the differences between whispered and neutral speech as well as improved front-end processing and modeling method for speaker identification of whispered speech. Such advancements will ultimately contribute to improve the robustness of speech processing systems.
Álvarez, Aitor; Sierra, Basilio; Arruti, Andoni; López-Gil, Juan-Miguel; Garay-Vitoria, Nestor
2015-01-01
In this paper, a new supervised classification paradigm, called classifier subset selection for stacked generalization (CSS stacking), is presented to deal with speech emotion recognition. The new approach consists of an improvement of a bi-level multi-classifier system known as stacking generalization by means of an integration of an estimation of distribution algorithm (EDA) in the first layer to select the optimal subset from the standard base classifiers. The good performance of the proposed new paradigm was demonstrated over different configurations and datasets. First, several CSS stacking classifiers were constructed on the RekEmozio dataset, using some specific standard base classifiers and a total of 123 spectral, quality and prosodic features computed using in-house feature extraction algorithms. These initial CSS stacking classifiers were compared to other multi-classifier systems and the employed standard classifiers built on the same set of speech features. Then, new CSS stacking classifiers were built on RekEmozio using a different set of both acoustic parameters (extended version of the Geneva Minimalistic Acoustic Parameter Set (eGeMAPS)) and standard classifiers and employing the best meta-classifier of the initial experiments. The performance of these two CSS stacking classifiers was evaluated and compared. Finally, the new paradigm was tested on the well-known Berlin Emotional Speech database. We compared the performance of single, standard stacking and CSS stacking systems using the same parametrization of the second phase. All of the classifications were performed at the categorical level, including the six primary emotions plus the neutral one. PMID:26712757
Li, Kan; Príncipe, José C.
2018-01-01
This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime. PMID:29666568
Li, Kan; Príncipe, José C
2018-01-01
This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime.
Evaluating Text-to-Speech Synthesizers
ERIC Educational Resources Information Center
Cardoso, Walcir; Smith, George; Fuentes, Cesar Garcia
2015-01-01
Text-To-Speech (TTS) synthesizers have piqued the interest of researchers for their potential to enhance the L2 acquisition of writing (Kirstein, 2006), vocabulary and reading (Proctor, Dalton, & Grisham, 2007) and pronunciation (Cardoso, Collins, & White, 2012; Soler-Urzua, 2011). Despite their proven effectiveness, there is a need for…
Saltuklaroglu, Tim; Kalinowski, Joseph; Robbins, Mary; Crawcour, Stephen; Bowers, Andrew
2009-01-01
Stuttering is prone to strike during speech initiation more so than at any other point in an utterance. The use of auditory feedback (AAF) has been found to produce robust decreases in the stuttering frequency by creating an electronic rendition of choral speech (i.e., speaking in unison). However, AAF requires users to self-initiate speech before it can go into effect and, therefore, it might not be as helpful as true choral speech during speech initiation. To examine how AAF and choral speech differentially enhance fluency during speech initiation and in subsequent portions of utterances. Ten participants who stuttered read passages without altered feedback (NAF), under four AAF conditions and under a true choral speech condition. Each condition was blocked into ten 10 s trials separated by 5 s intervals so each trial required 'cold' speech initiation. In the first analysis, comparisons of stuttering frequencies were made across conditions. A second, finer grain analysis involved examining stuttering frequencies on the initial syllable, the subsequent four syllables produced and the five syllables produced immediately after the midpoint of each trial. On average, AAF reduced stuttering by approximately 68% relative to the NAF condition. Stuttering frequencies on the initial syllables were considerably higher than on the other syllables analysed (0.45 and 0.34 for NAF and AAF conditions, respectively). After the first syllable was produced, stuttering frequencies dropped precipitously and remained stable. However, this drop in stuttering frequency was significantly greater (approximately 84%) in the AAF conditions than in the NAF condition (approximately 66%) with frequencies on the last nine syllables analysed averaging 0.15 and 0.05 for NAF and AAF conditions, respectively. In the true choral speech condition, stuttering was virtually (approximately 98%) eliminated across all utterances and all syllable positions. Altered auditory feedback effectively inhibits stuttering immediately after speech has been initiated. However, unlike a true choral signal, which is exogenously initiated and offers the most complete fluency enhancement, AAF requires speech to be initiated by the user and 'fed back' before it can directly inhibit stuttering. It is suggested that AAF can be a viable clinical option for those who stutter and should often be used in combination with therapeutic techniques, particularly those that aid speech initiation. The substantially higher rate of stuttering occurring on initiation supports a hypothesis that overt stuttering events help 'release' and 'inhibit' central stuttering blocks. This perspective is examined in the context of internal models and mirror neurons.
Moharir, Madhavi; Barnett, Noel; Taras, Jillian; Cole, Martha; Ford-Jones, E Lee; Levin, Leo
2014-01-01
Failure to recognize and intervene early in speech and language delays can lead to multifaceted and potentially severe consequences for early child development and later literacy skills. While routine evaluations of speech and language during well-child visits are recommended, there is no standardized (office) approach to facilitate this. Furthermore, extensive wait times for speech and language pathology consultation represent valuable lost time for the child and family. Using speech and language expertise, and paediatric collaboration, key content for an office-based tool was developed. early and accurate identification of speech and language delays as well as children at risk for literacy challenges; appropriate referral to speech and language services when required; and teaching and, thus, empowering parents to create rich and responsive language environments at home. Using this tool, in combination with the Canadian Paediatric Society's Read, Speak, Sing and Grow Literacy Initiative, physicians will be better positioned to offer practical strategies to caregivers to enhance children's speech and language capabilities. The tool represents a strategy to evaluate speech and language delays. It depicts age-specific linguistic/phonetic milestones and suggests interventions. The tool represents a practical interim treatment while the family is waiting for formal speech and language therapy consultation.
Vowel Space Characteristics of Speech Directed to Children With and Without Hearing Loss
Wieland, Elizabeth A.; Burnham, Evamarie B.; Kondaurova, Maria; Bergeson, Tonya R.
2015-01-01
Purpose This study examined vowel characteristics in adult-directed (AD) and infant-directed (ID) speech to children with hearing impairment who received cochlear implants or hearing aids compared with speech to children with normal hearing. Method Mothers' AD and ID speech to children with cochlear implants (Study 1, n = 20) or hearing aids (Study 2, n = 11) was compared with mothers' speech to controls matched on age and hearing experience. The first and second formants of vowels /i/, /ɑ/, and /u/ were measured, and vowel space area and dispersion were calculated. Results In both studies, vowel space was modified in ID compared with AD speech to children with and without hearing loss. Study 1 showed larger vowel space area and dispersion in ID compared with AD speech regardless of infant hearing status. The pattern of effects of ID and AD speech on vowel space characteristics in Study 2 was similar to that in Study 1, but depended partly on children's hearing status. Conclusion Given previously demonstrated associations between expanded vowel space in ID compared with AD speech and enhanced speech perception skills, this research supports a focus on vowel pronunciation in developing intervention strategies for improving speech-language skills in children with hearing impairment. PMID:25658071
A hardware experimental platform for neural circuits in the auditory cortex
NASA Astrophysics Data System (ADS)
Rodellar-Biarge, Victoria; García-Dominguez, Pablo; Ruiz-Rizaldos, Yago; Gómez-Vilda, Pedro
2011-05-01
Speech processing in the human brain is a very complex process far from being fully understood although much progress has been done recently. Neuromorphic Speech Processing is a new research orientation in bio-inspired systems approach to find solutions to automatic treatment of specific problems (recognition, synthesis, segmentation, diarization, etc) which can not be adequately solved using classical algorithms. In this paper a neuromorphic speech processing architecture is presented. The systematic bottom-up synthesis of layered structures reproduce the dynamic feature detection of speech related to plausible neural circuits which work as interpretation centres located in the Auditory Cortex. The elementary model is based on Hebbian neuron-like units. For the computation of the architecture a flexible framework is proposed in the environment of Matlab®/Simulink®/HDL, which allows building models in different description styles, complexity and implementation levels. It provides a flexible platform for experimenting on the influence of the number of neurons and interconnections, in the precision of the results and in performance evaluation. The experimentation with different architecture configurations may help both in better understanding how neural circuits may work in the brain as well as in how speech processing can benefit from this understanding.
Detecting Parkinson's disease from sustained phonation and speech signals.
Vaiciukynas, Evaldas; Verikas, Antanas; Gelzinis, Adas; Bacauskiene, Marija
2017-01-01
This study investigates signals from sustained phonation and text-dependent speech modalities for Parkinson's disease screening. Phonation corresponds to the vowel /a/ voicing task and speech to the pronunciation of a short sentence in Lithuanian language. Signals were recorded through two channels simultaneously, namely, acoustic cardioid (AC) and smart phone (SP) microphones. Additional modalities were obtained by splitting speech recording into voiced and unvoiced parts. Information in each modality is summarized by 18 well-known audio feature sets. Random forest (RF) is used as a machine learning algorithm, both for individual feature sets and for decision-level fusion. Detection performance is measured by the out-of-bag equal error rate (EER) and the cost of log-likelihood-ratio. Essentia audio feature set was the best using the AC speech modality and YAAFE audio feature set was the best using the SP unvoiced modality, achieving EER of 20.30% and 25.57%, respectively. Fusion of all feature sets and modalities resulted in EER of 19.27% for the AC and 23.00% for the SP channel. Non-linear projection of a RF-based proximity matrix into the 2D space enriched medical decision support by visualization.
Speech perception at the interface of neurobiology and linguistics.
Poeppel, David; Idsardi, William J; van Wassenhove, Virginie
2008-03-12
Speech perception consists of a set of computations that take continuously varying acoustic waveforms as input and generate discrete representations that make contact with the lexical representations stored in long-term memory as output. Because the perceptual objects that are recognized by the speech perception enter into subsequent linguistic computation, the format that is used for lexical representation and processing fundamentally constrains the speech perceptual processes. Consequently, theories of speech perception must, at some level, be tightly linked to theories of lexical representation. Minimally, speech perception must yield representations that smoothly and rapidly interface with stored lexical items. Adopting the perspective of Marr, we argue and provide neurobiological and psychophysical evidence for the following research programme. First, at the implementational level, speech perception is a multi-time resolution process, with perceptual analyses occurring concurrently on at least two time scales (approx. 20-80 ms, approx. 150-300 ms), commensurate with (sub)segmental and syllabic analyses, respectively. Second, at the algorithmic level, we suggest that perception proceeds on the basis of internal forward models, or uses an 'analysis-by-synthesis' approach. Third, at the computational level (in the sense of Marr), the theory of lexical representation that we adopt is principally informed by phonological research and assumes that words are represented in the mental lexicon in terms of sequences of discrete segments composed of distinctive features. One important goal of the research programme is to develop linking hypotheses between putative neurobiological primitives (e.g. temporal primitives) and those primitives derived from linguistic inquiry, to arrive ultimately at a biologically sensible and theoretically satisfying model of representation and computation in speech.
Monkeys and Humans Share a Common Computation for Face/Voice Integration
Chandrasekaran, Chandramouli; Lemus, Luis; Trubanova, Andrea; Gondan, Matthias; Ghazanfar, Asif A.
2011-01-01
Speech production involves the movement of the mouth and other regions of the face resulting in visual motion cues. These visual cues enhance intelligibility and detection of auditory speech. As such, face-to-face speech is fundamentally a multisensory phenomenon. If speech is fundamentally multisensory, it should be reflected in the evolution of vocal communication: similar behavioral effects should be observed in other primates. Old World monkeys share with humans vocal production biomechanics and communicate face-to-face with vocalizations. It is unknown, however, if they, too, combine faces and voices to enhance their perception of vocalizations. We show that they do: monkeys combine faces and voices in noisy environments to enhance their detection of vocalizations. Their behavior parallels that of humans performing an identical task. We explored what common computational mechanism(s) could explain the pattern of results we observed across species. Standard explanations or models such as the principle of inverse effectiveness and a “race” model failed to account for their behavior patterns. Conversely, a “superposition model”, positing the linear summation of activity patterns in response to visual and auditory components of vocalizations, served as a straightforward but powerful explanatory mechanism for the observed behaviors in both species. As such, it represents a putative homologous mechanism for integrating faces and voices across primates. PMID:21998576
D'Ausilio, A.; Maffongelli, L.; Bartoli, E.; Campanella, M.; Ferrari, E.; Berry, J.; Fadiga, L.
2014-01-01
The activation of listener's motor system during speech processing was first demonstrated by the enhancement of electromyographic tongue potentials as evoked by single-pulse transcranial magnetic stimulation (TMS) over tongue motor cortex. This technique is, however, technically challenging and enables only a rather coarse measurement of this motor mirroring. Here, we applied TMS to listeners’ tongue motor area in association with ultrasound tissue Doppler imaging to describe fine-grained tongue kinematic synergies evoked by passive listening to speech. Subjects listened to syllables requiring different patterns of dorso-ventral and antero-posterior movements (/ki/, /ko/, /ti/, /to/). Results show that passive listening to speech sounds evokes a pattern of motor synergies mirroring those occurring during speech production. Moreover, mirror motor synergies were more evident in those subjects showing good performances in discriminating speech in noise demonstrating a role of the speech-related mirror system in feed-forward processing the speaker's ongoing motor plan. PMID:24778384
Mechanisms Underlying Selective Neuronal Tracking of Attended Speech at a ‘Cocktail Party’
Zion Golumbic, Elana M.; Ding, Nai; Bickel, Stephan; Lakatos, Peter; Schevon, Catherine A.; McKhann, Guy M.; Goodman, Robert R.; Emerson, Ronald; Mehta, Ashesh D.; Simon, Jonathan Z.; Poeppel, David; Schroeder, Charles E.
2013-01-01
Summary The ability to focus on and understand one talker in a noisy social environment is a critical social-cognitive capacity, whose underlying neuronal mechanisms are unclear. We investigated the manner in which speech streams are represented in brain activity and the way that selective attention governs the brain’s representation of speech using a ‘Cocktail Party’ Paradigm, coupled with direct recordings from the cortical surface in surgical epilepsy patients. We find that brain activity dynamically tracks speech streams using both low frequency phase and high frequency amplitude fluctuations, and that optimal encoding likely combines the two. In and near low level auditory cortices, attention ‘modulates’ the representation by enhancing cortical tracking of attended speech streams, but ignored speech remains represented. In higher order regions, the representation appears to become more ‘selective,’ in that there is no detectable tracking of ignored speech. This selectivity itself seems to sharpen as a sentence unfolds. PMID:23473326
NASA Technical Reports Server (NTRS)
Vidulich, Michael A.; Bortolussi, Michael R.
1988-01-01
Among the new technologies that are expected to aid helicopter designers are speech controls. Proponents suggest that speech controls could reduce the potential for manual control overloads and improve time-sharing performance in environments that have heavy demands for manual control. This was tested in a simulation of an advanced single-pilot, scout/attack helicopter. Objective performance indicated that the speech controls were effective in decreasing the interference of discrete responses during moments of heavy flight control activity. However, subjective ratings indicated that the use of speech controls required extra effort to speak precisely and to attend to feedback. Although the operational reliability of speech controls must be improved, the present results indicate that reliable speech controls could enhance the time-sharing efficiency of helicopter pilots. Furthermore, the results demonstrated the importance of using multiple assessment techniques to completely assess a task. Neither the objective nor the subjective measures alone provided complete information. It was the contrast between the measures that was most informative.
Raghavan, Ramesh; Camarata, Stephen; White, Karl; Barbaresi, William; Parish, Susan; Krahn, Gloria
2018-05-17
The aim of the study was to provide an overview of population science as applied to speech and language disorders, illustrate data sources, and advance a research agenda on the epidemiology of these conditions. Computer-aided database searches were performed to identify key national surveys and other sources of data necessary to establish the incidence, prevalence, and course and outcome of speech and language disorders. This article also summarizes a research agenda that could enhance our understanding of the epidemiology of these disorders. Although the data yielded estimates of prevalence and incidence for speech and language disorders, existing sources of data are inadequate to establish reliable rates of incidence, prevalence, and outcomes for speech and language disorders at the population level. Greater support for inclusion of speech and language disorder-relevant questions is necessary in national health surveys to build the population science in the field.
Enhancing Computer-Based Lessons for Effective Speech Education.
ERIC Educational Resources Information Center
Hemphill, Michael R.; Standerfer, Christina C.
1987-01-01
Assesses the advantages of computer-based instruction on speech education. Concludes that, while it offers tremendous flexibility to the instructor--especially in dynamic lesson design, feedback, graphics, and artificial intelligence--there is no inherent advantage to the use of computer technology in the classroom, unless the student interacts…
NASA Technical Reports Server (NTRS)
An, S. H.; Yao, K.
1986-01-01
Lattice algorithm has been employed in numerous adaptive filtering applications such as speech analysis/synthesis, noise canceling, spectral analysis, and channel equalization. In this paper the application to adaptive-array processing is discussed. The advantages are fast convergence rate as well as computational accuracy independent of the noise and interference conditions. The results produced by this technique are compared to those obtained by the direct matrix inverse method.
Influence of musical training on understanding voiced and whispered speech in noise.
Ruggles, Dorea R; Freyman, Richard L; Oxenham, Andrew J
2014-01-01
This study tested the hypothesis that the previously reported advantage of musicians over non-musicians in understanding speech in noise arises from more efficient or robust coding of periodic voiced speech, particularly in fluctuating backgrounds. Speech intelligibility was measured in listeners with extensive musical training, and in those with very little musical training or experience, using normal (voiced) or whispered (unvoiced) grammatically correct nonsense sentences in noise that was spectrally shaped to match the long-term spectrum of the speech, and was either continuous or gated with a 16-Hz square wave. Performance was also measured in clinical speech-in-noise tests and in pitch discrimination. Musicians exhibited enhanced pitch discrimination, as expected. However, no systematic or statistically significant advantage for musicians over non-musicians was found in understanding either voiced or whispered sentences in either continuous or gated noise. Musicians also showed no statistically significant advantage in the clinical speech-in-noise tests. Overall, the results provide no evidence for a significant difference between young adult musicians and non-musicians in their ability to understand speech in noise.
A probabilistic union model with automatic order selection for noisy speech recognition.
Jancovic, P; Ming, J
2001-09-01
A critical issue in exploiting the potential of the sub-band-based approach to robust speech recognition is the method of combining the sub-band observations, for selecting the bands unaffected by noise. A new method for this purpose, i.e., the probabilistic union model, was recently introduced. This model has been shown to be capable of dealing with band-limited corruption, requiring no knowledge about the band position and statistical distribution of the noise. A parameter within the model, which we call its order, gives the best results when it equals the number of noisy bands. Since this information may not be available in practice, in this paper we introduce an automatic algorithm for selecting the order, based on the state duration pattern generated by the hidden Markov model (HMM). The algorithm has been tested on the TIDIGITS database corrupted by various types of additive band-limited noise with unknown noisy bands. The results have shown that the union model equipped with the new algorithm can achieve a recognition performance similar to that achieved when the number of noisy bands is known. The results show a very significant improvement over the traditional full-band model, without requiring prior information on either the position or the number of noisy bands. The principle of the algorithm for selecting the order based on state duration may also be applied to other sub-band combination methods.
Sound stream segregation: a neuromorphic approach to solve the “cocktail party problem” in real-time
Thakur, Chetan Singh; Wang, Runchun M.; Afshar, Saeed; Hamilton, Tara J.; Tapson, Jonathan C.; Shamma, Shihab A.; van Schaik, André
2015-01-01
The human auditory system has the ability to segregate complex auditory scenes into a foreground component and a background, allowing us to listen to specific speech sounds from a mixture of sounds. Selective attention plays a crucial role in this process, colloquially known as the “cocktail party effect.” It has not been possible to build a machine that can emulate this human ability in real-time. Here, we have developed a framework for the implementation of a neuromorphic sound segregation algorithm in a Field Programmable Gate Array (FPGA). This algorithm is based on the principles of temporal coherence and uses an attention signal to separate a target sound stream from background noise. Temporal coherence implies that auditory features belonging to the same sound source are coherently modulated and evoke highly correlated neural response patterns. The basis for this form of sound segregation is that responses from pairs of channels that are strongly positively correlated belong to the same stream, while channels that are uncorrelated or anti-correlated belong to different streams. In our framework, we have used a neuromorphic cochlea as a frontend sound analyser to extract spatial information of the sound input, which then passes through band pass filters that extract the sound envelope at various modulation rates. Further stages include feature extraction and mask generation, which is finally used to reconstruct the targeted sound. Using sample tonal and speech mixtures, we show that our FPGA architecture is able to segregate sound sources in real-time. The accuracy of segregation is indicated by the high signal-to-noise ratio (SNR) of the segregated stream (90, 77, and 55 dB for simple tone, complex tone, and speech, respectively) as compared to the SNR of the mixture waveform (0 dB). This system may be easily extended for the segregation of complex speech signals, and may thus find various applications in electronic devices such as for sound segregation and speech recognition. PMID:26388721
Thakur, Chetan Singh; Wang, Runchun M; Afshar, Saeed; Hamilton, Tara J; Tapson, Jonathan C; Shamma, Shihab A; van Schaik, André
2015-01-01
The human auditory system has the ability to segregate complex auditory scenes into a foreground component and a background, allowing us to listen to specific speech sounds from a mixture of sounds. Selective attention plays a crucial role in this process, colloquially known as the "cocktail party effect." It has not been possible to build a machine that can emulate this human ability in real-time. Here, we have developed a framework for the implementation of a neuromorphic sound segregation algorithm in a Field Programmable Gate Array (FPGA). This algorithm is based on the principles of temporal coherence and uses an attention signal to separate a target sound stream from background noise. Temporal coherence implies that auditory features belonging to the same sound source are coherently modulated and evoke highly correlated neural response patterns. The basis for this form of sound segregation is that responses from pairs of channels that are strongly positively correlated belong to the same stream, while channels that are uncorrelated or anti-correlated belong to different streams. In our framework, we have used a neuromorphic cochlea as a frontend sound analyser to extract spatial information of the sound input, which then passes through band pass filters that extract the sound envelope at various modulation rates. Further stages include feature extraction and mask generation, which is finally used to reconstruct the targeted sound. Using sample tonal and speech mixtures, we show that our FPGA architecture is able to segregate sound sources in real-time. The accuracy of segregation is indicated by the high signal-to-noise ratio (SNR) of the segregated stream (90, 77, and 55 dB for simple tone, complex tone, and speech, respectively) as compared to the SNR of the mixture waveform (0 dB). This system may be easily extended for the segregation of complex speech signals, and may thus find various applications in electronic devices such as for sound segregation and speech recognition.
Wang, Kun-Ching
2015-01-14
The classification of emotional speech is mostly considered in speech-related research on human-computer interaction (HCI). In this paper, the purpose is to present a novel feature extraction based on multi-resolutions texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis for characterization and classification of different emotions in a speech signal. The motivation is that we have to consider emotions have different intensity values in different frequency bands. In terms of human visual perceptual, the texture property on multi-resolution of emotional speech spectrogram should be a good feature set for emotion classification in speech. Furthermore, the multi-resolution analysis on texture can give a clearer discrimination between each emotion than uniform-resolution analysis on texture. In order to provide high accuracy of emotional discrimination especially in real-life, an acoustic activity detection (AAD) algorithm must be applied into the MRTII-based feature extraction. Considering the presence of many blended emotions in real life, in this paper make use of two corpora of naturally-occurring dialogs recorded in real-life call centers. Compared with the traditional Mel-scale Frequency Cepstral Coefficients (MFCC) and the state-of-the-art features, the MRTII features also can improve the correct classification rates of proposed systems among different language databases. Experimental results show that the proposed MRTII-based feature information inspired by human visual perception of the spectrogram image can provide significant classification for real-life emotional recognition in speech.
van den Tillaart-Haverkate, Maj; de Ronde-Brons, Inge; Dreschler, Wouter A; Houben, Rolph
2017-01-01
Single-microphone noise reduction leads to subjective benefit, but not to objective improvements in speech intelligibility. We investigated whether response times (RTs) provide an objective measure of the benefit of noise reduction and whether the effect of noise reduction is reflected in rated listening effort. Twelve normal-hearing participants listened to digit triplets that were either unprocessed or processed with one of two noise-reduction algorithms: an ideal binary mask (IBM) and a more realistic minimum mean square error estimator (MMSE). For each of these three processing conditions, we measured (a) speech intelligibility, (b) RTs on two different tasks (identification of the last digit and arithmetic summation of the first and last digit), and (c) subjective listening effort ratings. All measurements were performed at four signal-to-noise ratios (SNRs): -5, 0, +5, and +∞ dB. Speech intelligibility was high (>97% correct) for all conditions. A significant decrease in response time, relative to the unprocessed condition, was found for both IBM and MMSE for the arithmetic but not the identification task. Listening effort ratings were significantly lower for IBM than for MMSE and unprocessed speech in noise. We conclude that RT for an arithmetic task can provide an objective measure of the benefit of noise reduction. For young normal-hearing listeners, both ideal and realistic noise reduction can reduce RTs at SNRs where speech intelligibility is close to 100%. Ideal noise reduction can also reduce perceived listening effort.
The Downside of Greater Lexical Influences: Selectively Poorer Speech Perception in Noise
Xie, Zilong; Tessmer, Rachel; Chandrasekaran, Bharath
2017-01-01
Purpose Although lexical information influences phoneme perception, the extent to which reliance on lexical information enhances speech processing in challenging listening environments is unclear. We examined the extent to which individual differences in lexical influences on phonemic processing impact speech processing in maskers containing varying degrees of linguistic information (2-talker babble or pink noise). Method Twenty-nine monolingual English speakers were instructed to ignore the lexical status of spoken syllables (e.g., gift vs. kift) and to only categorize the initial phonemes (/g/ vs. /k/). The same participants then performed speech recognition tasks in the presence of 2-talker babble or pink noise in audio-only and audiovisual conditions. Results Individuals who demonstrated greater lexical influences on phonemic processing experienced greater speech processing difficulties in 2-talker babble than in pink noise. These selective difficulties were present across audio-only and audiovisual conditions. Conclusion Individuals with greater reliance on lexical processes during speech perception exhibit impaired speech recognition in listening conditions in which competing talkers introduce audible linguistic interferences. Future studies should examine the locus of lexical influences/interferences on phonemic processing and speech-in-speech processing. PMID:28586824
Two Dimensional Processing Of Speech And Ecg Signals Using The Wigner-Ville Distribution
NASA Astrophysics Data System (ADS)
Boashash, Boualem; Abeysekera, Saman S.
1986-12-01
The Wigner-Ville Distribution (WVD) has been shown to be a valuable tool for the analysis of non-stationary signals such as speech and Electrocardiogram (ECG) data. The one-dimensional real data are first transformed into a complex analytic signal using the Hilbert Transform and then a 2-dimensional image is formed using the Wigner-Ville Transform. For speech signals, a contour plot is determined and used as a basic feature. for a pattern recognition algorithm. This method is compared with the classical Short Time Fourier Transform (STFT) and is shown, to be able to recognize isolated words better in a noisy environment. The same method together with the concept of instantaneous frequency of the signal is applied to the analysis of ECG signals. This technique allows one to classify diseased heart-beat signals. Examples are shown.
Task-dependent modulation of the visual sensory thalamus assists visual-speech recognition.
Díaz, Begoña; Blank, Helen; von Kriegstein, Katharina
2018-05-14
The cerebral cortex modulates early sensory processing via feed-back connections to sensory pathway nuclei. The functions of this top-down modulation for human behavior are poorly understood. Here, we show that top-down modulation of the visual sensory thalamus (the lateral geniculate body, LGN) is involved in visual-speech recognition. In two independent functional magnetic resonance imaging (fMRI) studies, LGN response increased when participants processed fast-varying features of articulatory movements required for visual-speech recognition, as compared to temporally more stable features required for face identification with the same stimulus material. The LGN response during the visual-speech task correlated positively with the visual-speech recognition scores across participants. In addition, the task-dependent modulation was present for speech movements and did not occur for control conditions involving non-speech biological movements. In face-to-face communication, visual speech recognition is used to enhance or even enable understanding what is said. Speech recognition is commonly explained in frameworks focusing on cerebral cortex areas. Our findings suggest that task-dependent modulation at subcortical sensory stages has an important role for communication: Together with similar findings in the auditory modality the findings imply that task-dependent modulation of the sensory thalami is a general mechanism to optimize speech recognition. Copyright © 2018. Published by Elsevier Inc.
Detection and tracking of drones using advanced acoustic cameras
NASA Astrophysics Data System (ADS)
Busset, Joël.; Perrodin, Florian; Wellig, Peter; Ott, Beat; Heutschi, Kurt; Rühl, Torben; Nussbaumer, Thomas
2015-10-01
Recent events of drones flying over city centers, official buildings and nuclear installations stressed the growing threat of uncontrolled drone proliferation and the lack of real countermeasure. Indeed, detecting and tracking them can be difficult with traditional techniques. A system to acoustically detect and track small moving objects, such as drones or ground robots, using acoustic cameras is presented. The described sensor, is completely passive, and composed of a 120-element microphone array and a video camera. The acoustic imaging algorithm determines in real-time the sound power level coming from all directions, using the phase of the sound signals. A tracking algorithm is then able to follow the sound sources. Additionally, a beamforming algorithm selectively extracts the sound coming from each tracked sound source. This extracted sound signal can be used to identify sound signatures and determine the type of object. The described techniques can detect and track any object that produces noise (engines, propellers, tires, etc). It is a good complementary approach to more traditional techniques such as (i) optical and infrared cameras, for which the object may only represent few pixels and may be hidden by the blooming of a bright background, and (ii) radar or other echo-localization techniques, suffering from the weakness of the echo signal coming back to the sensor. The distance of detection depends on the type (frequency range) and volume of the noise emitted by the object, and on the background noise of the environment. Detection range and resilience to background noise were tested in both, laboratory environments and outdoor conditions. It was determined that drones can be tracked up to 160 to 250 meters, depending on their type. Speech extraction was also experimentally investigated: the speech signal of a person being 80 to 100 meters away can be captured with acceptable speech intelligibility.
Unobtrusive Multimodal Biometric Authentication: The HUMABIO Project Concept
NASA Astrophysics Data System (ADS)
Damousis, Ioannis G.; Tzovaras, Dimitrios; Bekiaris, Evangelos
2008-12-01
Human Monitoring and Authentication using Biodynamic Indicators and Behavioural Analysis (HUMABIO) (2007) is an EU Specific Targeted Research Project (STREP) where new types of biometrics are combined with state of the art sensorial technologies in order to enhance security in a wide spectrum of applications. The project aims to develop a modular, robust, multimodal biometrics security authentication and monitoring system which utilizes a biodynamic physiological profile, unique for each individual, and advancements of the state-of-the art in behavioural and other biometrics, such as face, speech, gait recognition, and seat-based anthropometrics. Several shortcomings in biometric authentication will be addressed in the course of HUMABIO which will provide the basis for improving existing sensors, develop new algorithms, and design applications, towards creating new, unobtrusive biometric authentication procedures in security sensitive, controlled environments. This paper presents the concept of this project, describes its unobtrusive authentication demonstrator, and reports some preliminary results.
Zhou, Guoxu; Yang, Zuyuan; Xie, Shengli; Yang, Jun-Mei
2011-04-01
Online blind source separation (BSS) is proposed to overcome the high computational cost problem, which limits the practical applications of traditional batch BSS algorithms. However, the existing online BSS methods are mainly used to separate independent or uncorrelated sources. Recently, nonnegative matrix factorization (NMF) shows great potential to separate the correlative sources, where some constraints are often imposed to overcome the non-uniqueness of the factorization. In this paper, an incremental NMF with volume constraint is derived and utilized for solving online BSS. The volume constraint to the mixing matrix enhances the identifiability of the sources, while the incremental learning mode reduces the computational cost. The proposed method takes advantage of the natural gradient based multiplication updating rule, and it performs especially well in the recovery of dependent sources. Simulations in BSS for dual-energy X-ray images, online encrypted speech signals, and high correlative face images show the validity of the proposed method.
Boucher, Victor J
2006-01-01
Language learning requires a capacity to recall novel series of speech sounds. Research shows that prosodic marks create grouping effects enhancing serial recall. However, any restriction on memory affecting the reproduction of prosody would limit the set of patterns that could be learned and subsequently used in speech. By implication, grouping effects of prosody would also be limited to reproducible patterns. This view of the role of prosody and the contribution of memory processes in the organization of prosodic patterns is examined by evaluating the correspondence between a reported tendency to restrict stress intervals in speech and size limits on stress-grouping effects. French speech is used where stress defines the endpoints of groups. In Experiment 1, 40 speakers recalled novel series of syllables containing stress-groups of varying size. Recall was not enhanced by groupings exceeding four syllables, which corresponded to a restriction on the reproducibility of stress-groups. In Experiment 2, the subjects produced given sentences containing phrases of differing length. The results show a strong tendency to insert stress within phrases that exceed four syllables. Since prosody can arise in the recall of syntactically unstructured lists, the results offer initial support for viewing memory processes as a factor of stress-rhythm organization.
NASA Astrophysics Data System (ADS)
Oung, Qi Wei; Nisha Basah, Shafriza; Muthusamy, Hariharan; Vijean, Vikneswaran; Lee, Hoileong
2018-03-01
Parkinson’s disease (PD) is one type of progressive neurodegenerative disease known as motor system syndrome, which is due to the death of dopamine-generating cells, a region of the human midbrain. PD normally affects people over 60 years of age, which at present has influenced a huge part of worldwide population. Lately, many researches have shown interest into the connection between PD and speech disorders. Researches have revealed that speech signals may be a suitable biomarker for distinguishing between people with Parkinson’s (PWP) from healthy subjects. Therefore, early diagnosis of PD through the speech signals can be considered for this aim. In this research, the speech data are acquired based on speech behaviour as the biomarker for differentiating PD severity levels (mild and moderate) from healthy subjects. Feature extraction algorithms applied are Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), and Weighted Linear Prediction Cepstral Coefficients (WLPCC). For classification, two types of classifiers are used: k-Nearest Neighbour (KNN) and Probabilistic Neural Network (PNN). The experimental results demonstrated that PNN classifier and KNN classifier achieve the best average classification performance of 92.63% and 88.56% respectively through 10-fold cross-validation measures. Favourably, the suggested techniques have the possibilities of becoming a new choice of promising tools for the PD detection with tremendous performance.
Desjardins, Jamie L
2016-01-01
Older listeners with hearing loss may exert more cognitive resources to maintain a level of listening performance similar to that of younger listeners with normal hearing. Unfortunately, this increase in cognitive load, which is often conceptualized as increased listening effort, may come at the cost of cognitive processing resources that might otherwise be available for other tasks. The purpose of this study was to evaluate the independent and combined effects of a hearing aid directional microphone and a noise reduction (NR) algorithm on reducing the listening effort older listeners with hearing loss expend on a speech-in-noise task. Participants were fitted with study worn commercially available behind-the-ear hearing aids. Listening effort on a sentence recognition in noise task was measured using an objective auditory-visual dual-task paradigm. The primary task required participants to repeat sentences presented in quiet and in a four-talker babble. The secondary task was a digital visual pursuit rotor-tracking test, for which participants were instructed to use a computer mouse to track a moving target around an ellipse that was displayed on a computer screen. Each of the two tasks was presented separately and concurrently at a fixed overall speech recognition performance level of 50% correct with and without the directional microphone and/or the NR algorithm activated in the hearing aids. In addition, participants reported how effortful it was to listen to the sentences in quiet and in background noise in the different hearing aid listening conditions. Fifteen older listeners with mild sloping to severe sensorineural hearing loss participated in this study. Listening effort in background noise was significantly reduced with the directional microphones activated in the hearing aids. However, there was no significant change in listening effort with the hearing aid NR algorithm compared to no noise processing. Correlation analysis between objective and self-reported ratings of listening effort showed no significant relation. Directional microphone processing effectively reduced the cognitive load of listening to speech in background noise. This is significant because it is likely that listeners with hearing impairment will frequently encounter noisy speech in their everyday communications. American Academy of Audiology.
Moharir, Madhavi; Barnett, Noel; Taras, Jillian; Cole, Martha; Ford-Jones, E Lee; Levin, Leo
2014-01-01
Failure to recognize and intervene early in speech and language delays can lead to multifaceted and potentially severe consequences for early child development and later literacy skills. While routine evaluations of speech and language during well-child visits are recommended, there is no standardized (office) approach to facilitate this. Furthermore, extensive wait times for speech and language pathology consultation represent valuable lost time for the child and family. Using speech and language expertise, and paediatric collaboration, key content for an office-based tool was developed. The tool aimed to help physicians achieve three main goals: early and accurate identification of speech and language delays as well as children at risk for literacy challenges; appropriate referral to speech and language services when required; and teaching and, thus, empowering parents to create rich and responsive language environments at home. Using this tool, in combination with the Canadian Paediatric Society’s Read, Speak, Sing and Grow Literacy Initiative, physicians will be better positioned to offer practical strategies to caregivers to enhance children’s speech and language capabilities. The tool represents a strategy to evaluate speech and language delays. It depicts age-specific linguistic/phonetic milestones and suggests interventions. The tool represents a practical interim treatment while the family is waiting for formal speech and language therapy consultation. PMID:24627648
Peeters, David; Snijders, Tineke M; Hagoort, Peter; Özyürek, Aslı
2017-01-27
In everyday communication speakers often refer in speech and/or gesture to objects in their immediate environment, thereby shifting their addressee's attention to an intended referent. The neurobiological infrastructure involved in the comprehension of such basic multimodal communicative acts remains unclear. In an event-related fMRI study, we presented participants with pictures of a speaker and two objects while they concurrently listened to her speech. In each picture, one of the objects was singled out, either through the speaker's index-finger pointing gesture or through a visual cue that made the object perceptually more salient in the absence of gesture. A mismatch (compared to a match) between speech and the object singled out by the speaker's pointing gesture led to enhanced activation in left IFG and bilateral pMTG, showing the importance of these areas in conceptual matching between speech and referent. Moreover, a match (compared to a mismatch) between speech and the object made salient through a visual cue led to enhanced activation in the mentalizing system, arguably reflecting an attempt to converge on a jointly attended referent in the absence of pointing. These findings shed new light on the neurobiological underpinnings of the core communicative process of comprehending a speaker's multimodal referential act and stress the power of pointing as an important natural device to link speech to objects. Copyright © 2016 Elsevier Ltd. All rights reserved.
ERIC Educational Resources Information Center
Bione, Tiago; Grimshaw, Jennica; Cardoso, Walcir
2016-01-01
As stated in Cardoso, Smith, and Garcia Fuentes (2015), second language researchers and practitioners have explored the pedagogical capabilities of Text-To-Speech synthesizers (TTS) for their potential to enhance the acquisition of writing (e.g. Kirstein, 2006), vocabulary and reading (e.g. Proctor, Dalton, & Grisham, 2007), and pronunciation…
Parmanto, Bambang; Saptono, Andi; Murthi, Raymond; Safos, Charlotte; Lathan, Corinna E
2008-11-01
A secure telemonitoring system was developed to transform CosmoBot system, a stand-alone speech-language therapy software, into a telerehabilitation system. The CosmoBot system is a motivating, computer-based play character designed to enhance children's communication skills and stimulate verbal interaction during the remediation of speech and language disorders. The CosmoBot system consists of the Mission Control human interface device and Cosmo's Play and Learn software featuring a robot character named Cosmo that targets educational goals for children aged 3-5 years. The secure telemonitoring infrastructure links a distant speech-language therapist and child/parents at home or school settings. The result is a telerehabilitation system that allows a speech-language therapist to monitor children's activities at home while providing feedback and therapy materials remotely. We have developed the means for telerehabilitation of communication skills that can be implemented in children's home settings. The architecture allows the therapist to remotely monitor the children after completion of the therapy session and to provide feedback for the following session.
NASA Astrophysics Data System (ADS)
Natarajan, Ajay; Hansen, John H. L.; Arehart, Kathryn Hoberg; Rossi-Katz, Jessica
2005-12-01
This study describes a new noise suppression scheme for hearing aid applications based on the auditory masking threshold (AMT) in conjunction with a modified generalized minimum mean square error estimator (GMMSE) for individual subjects with hearing loss. The representation of cochlear frequency resolution is achieved in terms of auditory filter equivalent rectangular bandwidths (ERBs). Estimation of AMT and spreading functions for masking are implemented in two ways: with normal auditory thresholds and normal auditory filter bandwidths (GMMSE-AMT[ERB]-NH) and with elevated thresholds and broader auditory filters characteristic of cochlear hearing loss (GMMSE-AMT[ERB]-HI). Evaluation is performed using speech corpora with objective quality measures (segmental SNR, Itakura-Saito), along with formal listener evaluations of speech quality rating and intelligibility. While no measurable changes in intelligibility occurred, evaluations showed quality improvement with both algorithm implementations. However, the customized formulation based on individual hearing losses was similar in performance to the formulation based on the normal auditory system.
Modeling the human mental lexicon with self-organizing feature maps
NASA Astrophysics Data System (ADS)
Wittenburg, Peter; Frauenfelder, Uli H.
1992-10-01
Recent efforts to model the remarkable ability of humans to recognize speech and words are described. Different techniques including the use of neural nets for representing phonological similarity between words in the lexicon with self organizing algorithms are discussed. Simulations using the standard Kohonen algorithm are presented to illustrate some problems confronted with this technique in modeling similarity relations of form in the human mental lexicon. Alternative approaches that can potentially deal with some of these limitations are sketched.
Contrast-marking prosodic emphasis in Williams syndrome: results of detailed phonetic analysis.
Ito, Kiwako; Martens, Marilee A
2017-01-01
Past reports on the speech production of individuals with Williams syndrome (WS) suggest that their prosody is anomalous and may lead to challenges in spoken communication. While existing prosodic assessments confirm that individuals with WS fail to use prosodic emphasis to express contrast, those reports typically lack detailed phonetic analysis of speech data. The present study examines the acoustic properties of speech prosody, aiming for the future development of targeted speech interventions. The study examines the three primary acoustic correlates of prosodic emphasis (duration, intensity, F0) and determines whether individuals with WS have difficulty in producing all or a particular set of the three prosodic cues. Speech produced by 12 individuals with WS and 12 chronological age (CA)-matched typically developing individuals were recorded. A sequential picture-naming task elicited production of target phrases in three contexts: (1) no contrast: gorilla with a racket → rabbit with a balloon; (2) contrast on the animal: fox with a balloon → rabbit with a balloon; and (3) contrast on the object: rabbit with a ball → rabbit with a balloon. The three acoustic correlates of prosodic prominence (duration, intensity and F0) were compared across the three referential contexts. The two groups exhibited striking similarities in their use of word duration and intensity for expressing contrast. Both groups showed the reduction and enhancement of final lengthening, and the enhancement and reduction of intensity difference for the animal contrast and for the object contrast conditions, respectively. The two groups differed in their use of F0: the CA group produced higher F0 for the animal than for the object regardless of the context, and this difference was enhanced when the animal noun was contrastive. In contrast, the WS group produced higher F0 for the object than for the animal when the object was contrastive. The present data contradict previous assessment results that report a lack of prosodic skills to mark contrast in individuals with WS. The methodological differences that may account for this variability are discussed. The present data suggest that individuals with WS produce appropriate prosodic cues to express contrast, although their use of pitch may be somewhat atypical. Additional data and future speech comprehension studies will determine whether pitch modulation can be targeted for speech intervention in individuals with WS. © 2016 Royal College of Speech and Language Therapists.
Articulatory speech synthesis and speech production modelling
NASA Astrophysics Data System (ADS)
Huang, Jun
This dissertation addresses the problem of speech synthesis and speech production modelling based on the fundamental principles of human speech production. Unlike the conventional source-filter model, which assumes the independence of the excitation and the acoustic filter, we treat the entire vocal apparatus as one system consisting of a fluid dynamic aspect and a mechanical part. We model the vocal tract by a three-dimensional moving geometry. We also model the sound propagation inside the vocal apparatus as a three-dimensional nonplane-wave propagation inside a viscous fluid described by Navier-Stokes equations. In our work, we first propose a combined minimum energy and minimum jerk criterion to estimate the dynamic vocal tract movements during speech production. Both theoretical error bound analysis and experimental results show that this method can achieve very close match at the target points and avoid the abrupt change in articulatory trajectory at the same time. Second, a mechanical vocal fold model is used to compute the excitation signal of the vocal tract. The advantage of this model is that it is closely coupled with the vocal tract system based on fundamental aerodynamics. As a result, we can obtain an excitation signal with much more detail than the conventional parametric vocal fold excitation model. Furthermore, strong evidence of source-tract interaction is observed. Finally, we propose a computational model of the fricative and stop types of sounds based on the physical principles of speech production. The advantage of this model is that it uses an exogenous process to model the additional nonsteady and nonlinear effects due to the flow mode, which are ignored by the conventional source- filter speech production model. A recursive algorithm is used to estimate the model parameters. Experimental results show that this model is able to synthesize good quality fricative and stop types of sounds. Based on our dissertation work, we carefully argue that the articulatory speech production model has the potential to flexibly synthesize natural-quality speech sounds and to provide a compact computational model for speech production that can be beneficial to a wide range of areas in speech signal processing.
Communication in a noisy environment: Perception of one's own voice and speech enhancement
NASA Astrophysics Data System (ADS)
Le Cocq, Cecile
Workers in noisy industrial environments are often confronted to communication problems. Lost of workers complain about not being able to communicate easily with their coworkers when they wear hearing protectors. In consequence, they tend to remove their protectors, which expose them to the risk of hearing loss. In fact this communication problem is a double one: first the hearing protectors modify one's own voice perception; second they interfere with understanding speech from others. This double problem is examined in this thesis. When wearing hearing protectors, the modification of one's own voice perception is partly due to the occlusion effect which is produced when an earplug is inserted in the car canal. This occlusion effect has two main consequences: first the physiological noises in low frequencies are better perceived, second the perception of one's own voice is modified. In order to have a better understanding of this phenomenon, the literature results are analyzed systematically, and a new method to quantify the occlusion effect is developed. Instead of stimulating the skull with a bone vibrator or asking the subject to speak as is usually done in the literature, it has been decided to excite the buccal cavity with an acoustic wave. The experiment has been designed in such a way that the acoustic wave which excites the buccal cavity does not excite the external car or the rest of the body directly. The measurement of the hearing threshold in open and occluded car has been used to quantify the subjective occlusion effect for an acoustic wave in the buccal cavity. These experimental results as well as those reported in the literature have lead to a better understanding of the occlusion effect and an evaluation of the role of each internal path from the acoustic source to the internal car. The speech intelligibility from others is altered by both the high sound levels of noisy industrial environments and the speech signal attenuation due to hearing protectors. A possible solution to this problem is to denoise the speech signal and transmit it under the hearing protector. Lots of denoising techniques are available and are often used for denoising speech in telecommunication. In the framework of this thesis, denoising by wavelet thresholding is considered. A first study on "classical" wavelet denoising technics is conducted in order to evaluate their performance in noisy industrial environments. The tested speech signals are altered by industrial noises according to a wide range of signal to noise ratios. The speech denoised signals are evaluated with four criteria. A large database is obtained and analyzed with a selection algorithm which has been designed for this purpose. This first study has lead to the identification of the influence from the different parameters of the wavelet denoising method on its quality and has identified the "classical" method which has given the best performances in terms of denoising quality. This first study has also generated ideas for designing a new thresholding rule suitable for speech wavelet denoising in an industrial noisy environment. In a second study, this new thresholding rule is presented and evaluated. Its performances are better than the "classical" method found in the first study when the signal to noise ratio from the speech signal is between --10 dB and 15 dB.
Wiggins, Ian M; Anderson, Carly A; Kitterick, Pádraig T; Hartley, Douglas E H
2016-09-01
Functional near-infrared spectroscopy (fNIRS) is a silent, non-invasive neuroimaging technique that is potentially well suited to auditory research. However, the reliability of auditory-evoked activation measured using fNIRS is largely unknown. The present study investigated the test-retest reliability of speech-evoked fNIRS responses in normally-hearing adults. Seventeen participants underwent fNIRS imaging in two sessions separated by three months. In a block design, participants were presented with auditory speech, visual speech (silent speechreading), and audiovisual speech conditions. Optode arrays were placed bilaterally over the temporal lobes, targeting auditory brain regions. A range of established metrics was used to quantify the reproducibility of cortical activation patterns, as well as the amplitude and time course of the haemodynamic response within predefined regions of interest. The use of a signal processing algorithm designed to reduce the influence of systemic physiological signals was found to be crucial to achieving reliable detection of significant activation at the group level. For auditory speech (with or without visual cues), reliability was good to excellent at the group level, but highly variable among individuals. Temporal-lobe activation in response to visual speech was less reliable, especially in the right hemisphere. Consistent with previous reports, fNIRS reliability was improved by averaging across a small number of channels overlying a cortical region of interest. Overall, the present results confirm that fNIRS can measure speech-evoked auditory responses in adults that are highly reliable at the group level, and indicate that signal processing to reduce physiological noise may substantially improve the reliability of fNIRS measurements. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
Arnold, Denis; Tomaschek, Fabian; Sering, Konstantin; Lopez, Florence; Baayen, R Harald
2017-01-01
Sound units play a pivotal role in cognitive models of auditory comprehension. The general consensus is that during perception listeners break down speech into auditory words and subsequently phones. Indeed, cognitive speech recognition is typically taken to be computationally intractable without phones. Here we present a computational model trained on 20 hours of conversational speech that recognizes word meanings within the range of human performance (model 25%, native speakers 20-44%), without making use of phone or word form representations. Our model also generates successfully predictions about the speed and accuracy of human auditory comprehension. At the heart of the model is a 'wide' yet sparse two-layer artificial neural network with some hundred thousand input units representing summaries of changes in acoustic frequency bands, and proxies for lexical meanings as output units. We believe that our model holds promise for resolving longstanding theoretical problems surrounding the notion of the phone in linguistic theory.
Martinelli, Eugenio; Mencattini, Arianna; Daprati, Elena; Di Natale, Corrado
2016-01-01
Humans can communicate their emotions by modulating facial expressions or the tone of their voice. Albeit numerous applications exist that enable machines to read facial emotions and recognize the content of verbal messages, methods for speech emotion recognition are still in their infancy. Yet, fast and reliable applications for emotion recognition are the obvious advancement of present 'intelligent personal assistants', and may have countless applications in diagnostics, rehabilitation and research. Taking inspiration from the dynamics of human group decision-making, we devised a novel speech emotion recognition system that applies, for the first time, a semi-supervised prediction model based on consensus. Three tests were carried out to compare this algorithm with traditional approaches. Labeling performances relative to a public database of spontaneous speeches are reported. The novel system appears to be fast, robust and less computationally demanding than traditional methods, allowing for easier implementation in portable voice-analyzers (as used in rehabilitation, research, industry, etc.) and for applications in the research domain (such as real-time pairing of stimuli to participants' emotional state, selective/differential data collection based on emotional content, etc.).
Audio-video feature correlation: faces and speech
NASA Astrophysics Data System (ADS)
Durand, Gwenael; Montacie, Claude; Caraty, Marie-Jose; Faudemay, Pascal
1999-08-01
This paper presents a study of the correlation of features automatically extracted from the audio stream and the video stream of audiovisual documents. In particular, we were interested in finding out whether speech analysis tools could be combined with face detection methods, and to what extend they should be combined. A generic audio signal partitioning algorithm as first used to detect Silence/Noise/Music/Speech segments in a full length movie. A generic object detection method was applied to the keyframes extracted from the movie in order to detect the presence or absence of faces. The correlation between the presence of a face in the keyframes and of the corresponding voice in the audio stream was studied. A third stream, which is the script of the movie, is warped on the speech channel in order to automatically label faces appearing in the keyframes with the name of the corresponding character. We naturally found that extracted audio and video features were related in many cases, and that significant benefits can be obtained from the joint use of audio and video analysis methods.
Drijvers, Linda; Özyürek, Asli; Jensen, Ole
2018-06-19
Previous work revealed that visual semantic information conveyed by gestures can enhance degraded speech comprehension, but the mechanisms underlying these integration processes under adverse listening conditions remain poorly understood. We used MEG to investigate how oscillatory dynamics support speech-gesture integration when integration load is manipulated by auditory (e.g., speech degradation) and visual semantic (e.g., gesture congruency) factors. Participants were presented with videos of an actress uttering an action verb in clear or degraded speech, accompanied by a matching (mixing gesture + "mixing") or mismatching (drinking gesture + "walking") gesture. In clear speech, alpha/beta power was more suppressed in the left inferior frontal gyrus and motor and visual cortices when integration load increased in response to mismatching versus matching gestures. In degraded speech, beta power was less suppressed over posterior STS and medial temporal lobe for mismatching compared with matching gestures, showing that integration load was lowest when speech was degraded and mismatching gestures could not be integrated and disambiguate the degraded signal. Our results thus provide novel insights on how low-frequency oscillatory modulations in different parts of the cortex support the semantic audiovisual integration of gestures in clear and degraded speech: When speech is clear, the left inferior frontal gyrus and motor and visual cortices engage because higher-level semantic information increases semantic integration load. When speech is degraded, posterior STS/middle temporal gyrus and medial temporal lobe are less engaged because integration load is lowest when visual semantic information does not aid lexical retrieval and speech and gestures cannot be integrated.
Robust Frequency Invariant Beamforming with Low Sidelobe for Speech Enhancement
NASA Astrophysics Data System (ADS)
Zhu, Yiting; Pan, Xiang
2018-01-01
Frequency invariant beamformers (FIBs) are widely used in speech enhancement and source localization. There are two traditional optimization methods for FIB design. The first one is convex optimization, which is simple but the frequency invariant characteristic of the beam pattern is poor with respect to frequency band of five octaves. The least squares (LS) approach using spatial response variation (SRV) constraint is another optimization method. Although, it can provide good frequency invariant property, it usually couldn’t be used in speech enhancement for its lack of weight norm constraint which is related to the robustness of a beamformer. In this paper, a robust wideband beamforming method with a constant beamwidth is proposed. The frequency invariant beam pattern is achieved by resolving an optimization problem of the SRV constraint to cover speech frequency band. With the control of sidelobe level, it is available for the frequency invariant beamformer (FIB) to prevent distortion of interference from the undesirable direction. The approach is completed in time-domain by placing tapped delay lines(TDL) and finite impulse response (FIR) filter at the output of each sensor which is more convenient than the Frost processor. By invoking the weight norm constraint, the robustness of the beamformer is further improved against random errors. Experiment results show that the proposed method has a constant beamwidth and almost the same white noise gain as traditional delay-and-sum (DAS) beamformer.
Ching, Teresa Yc; Zhang, Vicky W; Flynn, Christopher; Burns, Lauren; Button, Laura; Hou, Sanna; McGhie, Karen; Van Buynder, Patricia
2017-07-07
We investigated the factors influencing speech perception in babble for 5-year-old children with hearing loss who were using hearing aids (HAs) or cochlear implants (CIs). Speech reception thresholds (SRTs) for 50% correct identification were measured in two conditions - speech collocated with babble, and speech with spatially separated babble. The difference in SRTs between the two conditions give a measure of binaural unmasking, commonly known as spatial release from masking (SRM). Multiple linear regression analyses were conducted to examine the influence of a range of demographic factors on outcomes. Participants were 252 children enrolled in the Longitudinal Outcomes of Children with Hearing Impairment (LOCHI) study. Children using HAs or CIs required a better signal-to-noise ratio to achieve the same level of performance as their normal-hearing peers but demonstrated SRM of a similar magnitude. For children using HAs, speech perception was significantly influenced by cognitive and language abilities. For children using CIs, age at CI activation and language ability were significant predictors of speech perception outcomes. Speech perception in children with hearing loss can be enhanced by improving their language abilities. Early age at cochlear implantation was also associated with better outcomes.
Civier, Oren; Tasko, Stephen M.; Guenther, Frank H.
2010-01-01
This paper investigates the hypothesis that stuttering may result in part from impaired readout of feedforward control of speech, which forces persons who stutter (PWS) to produce speech with a motor strategy that is weighted too much toward auditory feedback control. Over-reliance on feedback control leads to production errors which, if they grow large enough, can cause the motor system to “reset” and repeat the current syllable. This hypothesis is investigated using computer simulations of a “neurally impaired” version of the DIVA model, a neural network model of speech acquisition and production. The model’s outputs are compared to published acoustic data from PWS’ fluent speech, and to combined acoustic and articulatory movement data collected from the dysfluent speech of one PWS. The simulations mimic the errors observed in the PWS subject’s speech, as well as the repairs of these errors. Additional simulations were able to account for enhancements of fluency gained by slowed/prolonged speech and masking noise. Together these results support the hypothesis that many dysfluencies in stuttering are due to a bias away from feedforward control and toward feedback control. PMID:20831971
Finke, Mareike; Büchner, Andreas; Ruigendijk, Esther; Meyer, Martin; Sandmann, Pascale
2016-07-01
There is a high degree of variability in speech intelligibility outcomes across cochlear-implant (CI) users. To better understand how auditory cognition affects speech intelligibility with the CI, we performed an electroencephalography study in which we examined the relationship between central auditory processing, cognitive abilities, and speech intelligibility. Postlingually deafened CI users (N=13) and matched normal-hearing (NH) listeners (N=13) performed an oddball task with words presented in different background conditions (quiet, stationary noise, modulated noise). Participants had to categorize words as living (targets) or non-living entities (standards). We also assessed participants' working memory (WM) capacity and verbal abilities. For the oddball task, we found lower hit rates and prolonged response times in CI users when compared with NH listeners. Noise-related prolongation of the N1 amplitude was found for all participants. Further, we observed group-specific modulation effects of event-related potentials (ERPs) as a function of background noise. While NH listeners showed stronger noise-related modulation of the N1 latency, CI users revealed enhanced modulation effects of the N2/N4 latency. In general, higher-order processing (N2/N4, P3) was prolonged in CI users in all background conditions when compared with NH listeners. Longer N2/N4 latency in CI users suggests that these individuals have difficulties to map acoustic-phonetic features to lexical representations. These difficulties seem to be increased for speech-in-noise conditions when compared with speech in quiet background. Correlation analyses showed that shorter ERP latencies were related to enhanced speech intelligibility (N1, N2/N4), better lexical fluency (N1), and lower ratings of listening effort (N2/N4) in CI users. In sum, our findings suggest that CI users and NH listeners differ with regards to both the sensory and the higher-order processing of speech in quiet as well as in noisy background conditions. Our results also revealed that verbal abilities are related to speech processing and speech intelligibility in CI users, confirming the view that auditory cognition plays an important role for CI outcome. We conclude that differences in auditory-cognitive processing contribute to the variability in speech performance outcomes observed in CI users. Copyright © 2016 Elsevier Ltd. All rights reserved.
Musical training sharpens and bonds ears and tongue to hear speech better.
Du, Yi; Zatorre, Robert J
2017-12-19
The idea that musical training improves speech perception in challenging listening environments is appealing and of clinical importance, yet the mechanisms of any such musician advantage are not well specified. Here, using functional magnetic resonance imaging (fMRI), we found that musicians outperformed nonmusicians in identifying syllables at varying signal-to-noise ratios (SNRs), which was associated with stronger activation of the left inferior frontal and right auditory regions in musicians compared with nonmusicians. Moreover, musicians showed greater specificity of phoneme representations in bilateral auditory and speech motor regions (e.g., premotor cortex) at higher SNRs and in the left speech motor regions at lower SNRs, as determined by multivoxel pattern analysis. Musical training also enhanced the intrahemispheric and interhemispheric functional connectivity between auditory and speech motor regions. Our findings suggest that improved speech in noise perception in musicians relies on stronger recruitment of, finer phonological representations in, and stronger functional connectivity between auditory and frontal speech motor cortices in both hemispheres, regions involved in bottom-up spectrotemporal analyses and top-down articulatory prediction and sensorimotor integration, respectively.
Musical training sharpens and bonds ears and tongue to hear speech better
Du, Yi; Zatorre, Robert J.
2017-01-01
The idea that musical training improves speech perception in challenging listening environments is appealing and of clinical importance, yet the mechanisms of any such musician advantage are not well specified. Here, using functional magnetic resonance imaging (fMRI), we found that musicians outperformed nonmusicians in identifying syllables at varying signal-to-noise ratios (SNRs), which was associated with stronger activation of the left inferior frontal and right auditory regions in musicians compared with nonmusicians. Moreover, musicians showed greater specificity of phoneme representations in bilateral auditory and speech motor regions (e.g., premotor cortex) at higher SNRs and in the left speech motor regions at lower SNRs, as determined by multivoxel pattern analysis. Musical training also enhanced the intrahemispheric and interhemispheric functional connectivity between auditory and speech motor regions. Our findings suggest that improved speech in noise perception in musicians relies on stronger recruitment of, finer phonological representations in, and stronger functional connectivity between auditory and frontal speech motor cortices in both hemispheres, regions involved in bottom-up spectrotemporal analyses and top-down articulatory prediction and sensorimotor integration, respectively. PMID:29203648
Wang, Kun-Ching
2015-01-01
The classification of emotional speech is mostly considered in speech-related research on human-computer interaction (HCI). In this paper, the purpose is to present a novel feature extraction based on multi-resolutions texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis for characterization and classification of different emotions in a speech signal. The motivation is that we have to consider emotions have different intensity values in different frequency bands. In terms of human visual perceptual, the texture property on multi-resolution of emotional speech spectrogram should be a good feature set for emotion classification in speech. Furthermore, the multi-resolution analysis on texture can give a clearer discrimination between each emotion than uniform-resolution analysis on texture. In order to provide high accuracy of emotional discrimination especially in real-life, an acoustic activity detection (AAD) algorithm must be applied into the MRTII-based feature extraction. Considering the presence of many blended emotions in real life, in this paper make use of two corpora of naturally-occurring dialogs recorded in real-life call centers. Compared with the traditional Mel-scale Frequency Cepstral Coefficients (MFCC) and the state-of-the-art features, the MRTII features also can improve the correct classification rates of proposed systems among different language databases. Experimental results show that the proposed MRTII-based feature information inspired by human visual perception of the spectrogram image can provide significant classification for real-life emotional recognition in speech. PMID:25594590
Acoustic change detection algorithm using an FM radio
NASA Astrophysics Data System (ADS)
Goldman, Geoffrey H.; Wolfe, Owen
2012-06-01
The U.S. Army is interested in developing low-cost, low-power, non-line-of-sight sensors for monitoring human activity. One modality that is often overlooked is active acoustics using sources of opportunity such as speech or music. Active acoustics can be used to detect human activity by generating acoustic images of an area at different times, then testing for changes among the imagery. A change detection algorithm was developed to detect physical changes in a building, such as a door changing positions or a large box being moved using acoustics sources of opportunity. The algorithm is based on cross correlating the acoustic signal measured from two microphones. The performance of the algorithm was shown using data generated with a hand-held FM radio as a sound source and two microphones. The algorithm could detect a door being opened in a hallway.
Callan, Daniel E.; Jones, Jeffery A.; Callan, Akiko
2014-01-01
Behavioral and neuroimaging studies have demonstrated that brain regions involved with speech production also support speech perception, especially under degraded conditions. The premotor cortex (PMC) has been shown to be active during both observation and execution of action (“Mirror System” properties), and may facilitate speech perception by mapping unimodal and multimodal sensory features onto articulatory speech gestures. For this functional magnetic resonance imaging (fMRI) study, participants identified vowels produced by a speaker in audio-visual (saw the speaker's articulating face and heard her voice), visual only (only saw the speaker's articulating face), and audio only (only heard the speaker's voice) conditions with varying audio signal-to-noise ratios in order to determine the regions of the PMC involved with multisensory and modality specific processing of visual speech gestures. The task was designed so that identification could be made with a high level of accuracy from visual only stimuli to control for task difficulty and differences in intelligibility. The results of the functional magnetic resonance imaging (fMRI) analysis for visual only and audio-visual conditions showed overlapping activity in inferior frontal gyrus and PMC. The left ventral inferior premotor cortex (PMvi) showed properties of multimodal (audio-visual) enhancement with a degraded auditory signal. The left inferior parietal lobule and right cerebellum also showed these properties. The left ventral superior and dorsal premotor cortex (PMvs/PMd) did not show this multisensory enhancement effect, but there was greater activity for the visual only over audio-visual conditions in these areas. The results suggest that the inferior regions of the ventral premotor cortex are involved with integrating multisensory information, whereas, more superior and dorsal regions of the PMC are involved with mapping unimodal (in this case visual) sensory features of the speech signal with articulatory speech gestures. PMID:24860526
Guidi, Andrea; Salvi, Sergio; Ottaviano, Manuel; Gentili, Claudio; Bertschy, Gilles; de Rossi, Danilo; Scilingo, Enzo Pasquale; Vanello, Nicola
2015-11-06
Bipolar disorder is one of the most common mood disorders characterized by large and invalidating mood swings. Several projects focus on the development of decision support systems that monitor and advise patients, as well as clinicians. Voice monitoring and speech signal analysis can be exploited to reach this goal. In this study, an Android application was designed for analyzing running speech using a smartphone device. The application can record audio samples and estimate speech fundamental frequency, F0, and its changes. F0-related features are estimated locally on the smartphone, with some advantages with respect to remote processing approaches in terms of privacy protection and reduced upload costs. The raw features can be sent to a central server and further processed. The quality of the audio recordings, algorithm reliability and performance of the overall system were evaluated in terms of voiced segment detection and features estimation. The results demonstrate that mean F0 from each voiced segment can be reliably estimated, thus describing prosodic features across the speech sample. Instead, features related to F0 variability within each voiced segment performed poorly. A case study performed on a bipolar patient is presented.
Oral reading fluency analysis in patients with Alzheimer disease and asymptomatic control subjects.
Martínez-Sánchez, F; Meilán, J J G; García-Sevilla, J; Carro, J; Arana, J M
2013-01-01
Many studies highlight that an impaired ability to communicate is one of the key clinical features of Alzheimer disease (AD). To study temporal organisation of speech in an oral reading task in patients with AD and in matched healthy controls using a semi-automatic method, and evaluate that method's ability to discriminate between the 2 groups. A test with an oral reading task was administered to 70 subjects, comprising 35 AD patients and 35 controls. Before speech samples were recorded, participants completed a battery of neuropsychological tests. There were no differences between groups with regard to age, sex, or educational level. All of the study variables showed impairment in the AD group. According to the results, AD patients' oral reading was marked by reduced speech and articulation rates, low effectiveness of phonation time, and increases in the number and proportion of pauses. Signal processing algorithms applied to reading fluency recordings were shown to be capable of differentiating between AD patients and controls with an accuracy of 80% (specificity 74.2%, sensitivity 77.1%) based on speech rate. Analysis of oral reading fluency may be useful as a tool for the objective study and quantification of speech deficits in AD. Copyright © 2012 Sociedad Española de Neurología. Published by Elsevier Espana. All rights reserved.
Guidi, Andrea; Salvi, Sergio; Ottaviano, Manuel; Gentili, Claudio; Bertschy, Gilles; de Rossi, Danilo; Scilingo, Enzo Pasquale; Vanello, Nicola
2015-01-01
Bipolar disorder is one of the most common mood disorders characterized by large and invalidating mood swings. Several projects focus on the development of decision support systems that monitor and advise patients, as well as clinicians. Voice monitoring and speech signal analysis can be exploited to reach this goal. In this study, an Android application was designed for analyzing running speech using a smartphone device. The application can record audio samples and estimate speech fundamental frequency, F0, and its changes. F0-related features are estimated locally on the smartphone, with some advantages with respect to remote processing approaches in terms of privacy protection and reduced upload costs. The raw features can be sent to a central server and further processed. The quality of the audio recordings, algorithm reliability and performance of the overall system were evaluated in terms of voiced segment detection and features estimation. The results demonstrate that mean F0 from each voiced segment can be reliably estimated, thus describing prosodic features across the speech sample. Instead, features related to F0 variability within each voiced segment performed poorly. A case study performed on a bipolar patient is presented. PMID:26561811
[Prosody, speech input and language acquisition].
Jungheim, M; Miller, S; Kühn, D; Ptok, M
2014-04-01
In order to acquire language, children require speech input. The prosody of the speech input plays an important role. In most cultures adults modify their code when communicating with children. Compared to normal speech this code differs especially with regard to prosody. For this review a selective literature search in PubMed and Scopus was performed. Prosodic characteristics are a key feature of spoken language. By analysing prosodic features, children gain knowledge about underlying grammatical structures. Child-directed speech (CDS) is modified in a way that meaningful sequences are highlighted acoustically so that important information can be extracted from the continuous speech flow more easily. CDS is said to enhance the representation of linguistic signs. Taking into consideration what has previously been described in the literature regarding the perception of suprasegmentals, CDS seems to be able to support language acquisition due to the correspondence of prosodic and syntactic units. However, no findings have been reported, stating that the linguistically reduced CDS could hinder first language acquisition.
How musical expertise shapes speech perception: evidence from auditory classification images.
Varnet, Léo; Wang, Tianyun; Peter, Chloe; Meunier, Fanny; Hoen, Michel
2015-09-24
It is now well established that extensive musical training percolates to higher levels of cognition, such as speech processing. However, the lack of a precise technique to investigate the specific listening strategy involved in speech comprehension has made it difficult to determine how musicians' higher performance in non-speech tasks contributes to their enhanced speech comprehension. The recently developed Auditory Classification Image approach reveals the precise time-frequency regions used by participants when performing phonemic categorizations in noise. Here we used this technique on 19 non-musicians and 19 professional musicians. We found that both groups used very similar listening strategies, but the musicians relied more heavily on the two main acoustic cues, at the first formant onset and at the onsets of the second and third formants onsets. Additionally, they responded more consistently to stimuli. These observations provide a direct visualization of auditory plasticity resulting from extensive musical training and shed light on the level of functional transfer between auditory processing and speech perception.
A Networking of Community-Based Speech Therapy: Borabue District, Maha Sarakham.
Pumnum, Tawitree; Kum-ud, Weawta; Prathanee, Benjamas
2015-08-01
Most children with cleft lip and palate have articulation problems because of compensatory articulation disorders from velopharyngeal insufficiency. Theoretically, children should receive speech therapy from a speech and language pathologist (SLP) 1-2 sessions per week. For developing countries, particularly Thailand, most of them cannot reach standard speech services because of limitation of speech services and SLP Networking of a Community-Based Speech Model might be an appropriate way to solve this problem. To study the effectiveness of a networking of Khon Kaen University (KKU) Community-Based Speech Model, Non Thong Tambon Health Promotion Hospital, Borabue, Maha Sarakham, in decreasing the number of articulation errors for children with CLP. Six children with cleft lip and palate (CLP) who lived in Borabue and the surrounding district, Maha Sarakham, and had medical records in Srinagarind Hospital. They were assessed for pre- and post-articulation errors and provided speech therapy by SLP via teaching on service for speech assistant (SA). Then, children with CLP received speech correction (SC) by SA based on assignment and caregivers practiced home program for a year. Networking of Non Thong Tambon Health Promotion Hospital, Borabue, Maha Sarakham significantly reduce the number of post-articulation errors for 3 children with CLP. There were factors affecting the results in treatment of other children as follows: delayed speech and language development, hypernaslaity, and consistency of SC at local hospital and home. A networking of KKU Community-Based Speech Model, Non Thong Tambon Health Promotion Hospital, Borabue, and Maha Sarakham was a good way to enhance speech therapy in Thailand or other developing countries, where have limitation of speech services or lack of professionals.
Park, Hyojin; Ince, Robin A A; Schyns, Philippe G; Thut, Gregor; Gross, Joachim
2015-06-15
Humans show a remarkable ability to understand continuous speech even under adverse listening conditions. This ability critically relies on dynamically updated predictions of incoming sensory information, but exactly how top-down predictions improve speech processing is still unclear. Brain oscillations are a likely mechanism for these top-down predictions [1, 2]. Quasi-rhythmic components in speech are known to entrain low-frequency oscillations in auditory areas [3, 4], and this entrainment increases with intelligibility [5]. We hypothesize that top-down signals from frontal brain areas causally modulate the phase of brain oscillations in auditory cortex. We use magnetoencephalography (MEG) to monitor brain oscillations in 22 participants during continuous speech perception. We characterize prominent spectral components of speech-brain coupling in auditory cortex and use causal connectivity analysis (transfer entropy) to identify the top-down signals driving this coupling more strongly during intelligible speech than during unintelligible speech. We report three main findings. First, frontal and motor cortices significantly modulate the phase of speech-coupled low-frequency oscillations in auditory cortex, and this effect depends on intelligibility of speech. Second, top-down signals are significantly stronger for left auditory cortex than for right auditory cortex. Third, speech-auditory cortex coupling is enhanced as a function of stronger top-down signals. Together, our results suggest that low-frequency brain oscillations play a role in implementing predictive top-down control during continuous speech perception and that top-down control is largely directed at left auditory cortex. This suggests a close relationship between (left-lateralized) speech production areas and the implementation of top-down control in continuous speech perception. Copyright © 2015 The Authors. Published by Elsevier Ltd.. All rights reserved.
Park, Hyojin; Ince, Robin A.A.; Schyns, Philippe G.; Thut, Gregor; Gross, Joachim
2015-01-01
Summary Humans show a remarkable ability to understand continuous speech even under adverse listening conditions. This ability critically relies on dynamically updated predictions of incoming sensory information, but exactly how top-down predictions improve speech processing is still unclear. Brain oscillations are a likely mechanism for these top-down predictions [1, 2]. Quasi-rhythmic components in speech are known to entrain low-frequency oscillations in auditory areas [3, 4], and this entrainment increases with intelligibility [5]. We hypothesize that top-down signals from frontal brain areas causally modulate the phase of brain oscillations in auditory cortex. We use magnetoencephalography (MEG) to monitor brain oscillations in 22 participants during continuous speech perception. We characterize prominent spectral components of speech-brain coupling in auditory cortex and use causal connectivity analysis (transfer entropy) to identify the top-down signals driving this coupling more strongly during intelligible speech than during unintelligible speech. We report three main findings. First, frontal and motor cortices significantly modulate the phase of speech-coupled low-frequency oscillations in auditory cortex, and this effect depends on intelligibility of speech. Second, top-down signals are significantly stronger for left auditory cortex than for right auditory cortex. Third, speech-auditory cortex coupling is enhanced as a function of stronger top-down signals. Together, our results suggest that low-frequency brain oscillations play a role in implementing predictive top-down control during continuous speech perception and that top-down control is largely directed at left auditory cortex. This suggests a close relationship between (left-lateralized) speech production areas and the implementation of top-down control in continuous speech perception. PMID:26028433
ERIC Educational Resources Information Center
Krishnan, Ananthanarayan; Gandour, Jackson T.; Smalt, Christopher J.; Bidelman, Gavin M.
2010-01-01
Experience-dependent enhancement of neural encoding of pitch in the auditory brainstem has been observed for only specific portions of native pitch contours exhibiting high rates of pitch acceleration, irrespective of speech or nonspeech contexts. This experiment allows us to determine whether this language-dependent advantage transfers to…
Interlingual Influence in Bilingual Speech: Cognate Status Effect in a Continuum of Bilingualism
ERIC Educational Resources Information Center
Amengual, Mark
2012-01-01
The present study investigates voice onset times (VOTs) to determine if cognates enhance the cross-language phonetic influences in the speech production of a range of Spanish-English bilinguals: Spanish heritage speakers, English heritage speakers, advanced L2 Spanish learners, and advanced L2 English learners. To answer this question, lexical…
Enhancing Listener Strategies Using a Payoff Matrix in Speech-on-speech Masking Experiments
2015-09-03
five of the eighteen listeners would have achieved the 1200 point goal, had it been in place for experiment 1. Seven out of the ten listeners in...listeners would revert back to reporting more masker words at negative TMRs in the ab- sence of specific instructions and incentives. The listeners could have
ERIC Educational Resources Information Center
Esch, Barbara E.; Carr, James E.; Grow, Laura L.
2009-01-01
Evidence to support stimulus-stimulus pairing (SSP) in speech acquisition is less than robust, calling into question the ability of SSP to reliably establish automatically reinforcing properties of speech and limiting the procedure's clinical utility for increasing vocalizations. We evaluated the effects of a modified SSP procedure on…
2017-03-01
the Center for Technology Enhanced Language Learning (CTELL), a research cell in the Department of Foreign Languages, United States Military Academy...models for automatic speech recognition (ASR), and to, thereby, investigate the utility of ASR in pedagogical technology . The corpus is a sample of...lexical resources, language technology 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT UU 18. NUMBER OF
User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning
ERIC Educational Resources Information Center
Ahn, Tae youn; Lee, Sangmin-Michelle
2016-01-01
With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…
ERIC Educational Resources Information Center
Mundy, Marie-Anne; Padilla Oviedo, Andres; Ramirez, Juan; Taylor, Nick; Flores, Itza
2014-01-01
One of the main goals of universities is to graduate students who are capable and competent in competing in the workforce. As presentational communication skills are critical in today's job market, Hispanic university students need to be trained to effectively develop and deliver presentational speeches. Web/technology enhanced training techniques…
An empirical investigation of sparse distributed memory using discrete speech recognition
NASA Technical Reports Server (NTRS)
Danforth, Douglas G.
1990-01-01
Presented here is a step by step analysis of how the basic Sparse Distributed Memory (SDM) model can be modified to enhance its generalization capabilities for classification tasks. Data is taken from speech generated by a single talker. Experiments are used to investigate the theory of associative memories and the question of generalization from specific instances.
Long-Term Effectiveness of the SpeechEasy Fluency-Enhancement Device
ERIC Educational Resources Information Center
Gallop, Ronald F.; Runyan, Charles M.
2012-01-01
The SpeechEasy has been found to be an effective device for reduction of stuttering frequency for many people who stutter (PWS); published studies typically have compared stuttering reduction at initial fitting of the device to results achieved up to one year later. This study examines long-term effectiveness by examining whether effects of the…
ERIC Educational Resources Information Center
Justice, Laura M.; Breit-Smith, Allison; Rogers, Margaret
2010-01-01
Purpose: This clinical forum was organized to provide a means for informing the research and clinical communities of one mechanism through which research capacity might be enhanced within the field of speech-language pathology. Specifically, forum authors describe the process of conducting secondary analyses of extant databases to answer questions…
ERIC Educational Resources Information Center
Nelson, Lori A.
2011-01-01
Speech-language pathology literature is limited in describing the clinical practicum process from the student perspective. Much of the supervision literature in this field focuses on quantitative research and/or the point of view of the supervisor. Understanding the student experience serves to enhance the quality of clinical supervision. Of…
ERIC Educational Resources Information Center
Carranza, Mario
2016-01-01
This paper addresses the process of transcribing and annotating spontaneous non-native speech with the aim of compiling a training corpus for the development of Computer Assisted Pronunciation Training (CAPT) applications, enhanced with Automatic Speech Recognition (ASR) technology. To better adapt ASR technology to CAPT tools, the recognition…
The Use of Spatialized Speech in Auditory Interfaces for Computer Users Who Are Visually Impaired
ERIC Educational Resources Information Center
Sodnik, Jaka; Jakus, Grega; Tomazic, Saso
2012-01-01
Introduction: This article reports on a study that explored the benefits and drawbacks of using spatially positioned synthesized speech in auditory interfaces for computer users who are visually impaired (that is, are blind or have low vision). The study was a practical application of such systems--an enhanced word processing application compared…
Using speech recognition to enhance the Tongue Drive System functionality in computer access.
Huo, Xueliang; Ghovanloo, Maysam
2011-01-01
Tongue Drive System (TDS) is a wireless tongue operated assistive technology (AT), which can enable people with severe physical disabilities to access computers and drive powered wheelchairs using their volitional tongue movements. TDS offers six discrete commands, simultaneously available to the users, for pointing and typing as a substitute for mouse and keyboard in computer access, respectively. To enhance the TDS performance in typing, we have added a microphone, an audio codec, and a wireless audio link to its readily available 3-axial magnetic sensor array, and combined it with a commercially available speech recognition software, the Dragon Naturally Speaking, which is regarded as one of the most efficient ways for text entry. Our preliminary evaluations indicate that the combined TDS and speech recognition technologies can provide end users with significantly higher performance than using each technology alone, particularly in completing tasks that require both pointing and text entry, such as web surfing.
Ylinen, Sari; Nora, Anni; Leminen, Alina; Hakala, Tero; Huotilainen, Minna; Shtyrov, Yury; Mäkelä, Jyrki P; Service, Elisabet
2015-06-01
Speech production, both overt and covert, down-regulates the activation of auditory cortex. This is thought to be due to forward prediction of the sensory consequences of speech, contributing to a feedback control mechanism for speech production. Critically, however, these regulatory effects should be specific to speech content to enable accurate speech monitoring. To determine the extent to which such forward prediction is content-specific, we recorded the brain's neuromagnetic responses to heard multisyllabic pseudowords during covert rehearsal in working memory, contrasted with a control task. The cortical auditory processing of target syllables was significantly suppressed during rehearsal compared with control, but only when they matched the rehearsed items. This critical specificity to speech content enables accurate speech monitoring by forward prediction, as proposed by current models of speech production. The one-to-one phonological motor-to-auditory mappings also appear to serve the maintenance of information in phonological working memory. Further findings of right-hemispheric suppression in the case of whole-item matches and left-hemispheric enhancement for last-syllable mismatches suggest that speech production is monitored by 2 auditory-motor circuits operating on different timescales: Finer grain in the left versus coarser grain in the right hemisphere. Taken together, our findings provide hemisphere-specific evidence of the interface between inner and heard speech. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Hashizume, Hiroshi; Taki, Yasuyuki; Sassa, Yuko; Thyreau, Benjamin; Asano, Michiko; Asano, Kohei; Takeuchi, Hikaru; Nouchi, Rui; Kotozaki, Yuka; Jeong, Hyeonjeong; Sugiura, Motoaki; Kawashima, Ryuta
2014-08-01
Older children are more successful at producing unfamiliar, non-native speech sounds than younger children during the initial stages of learning. To reveal the neuronal underpinning of the age-related increase in the accuracy of non-native speech production, we examined the developmental changes in activation involved in the production of novel speech sounds using functional magnetic resonance imaging. Healthy right-handed children (aged 6-18 years) were scanned while performing an overt repetition task and a perceptual task involving aurally presented non-native and native syllables. Productions of non-native speech sounds were recorded and evaluated by native speakers. The mouth regions in the bilateral primary sensorimotor areas were activated more significantly during the repetition task relative to the perceptual task. The hemodynamic response in the left inferior frontal gyrus pars opercularis (IFG pOp) specific to non-native speech sound production (defined by prior hypothesis) increased with age. Additionally, the accuracy of non-native speech sound production increased with age. These results provide the first evidence of developmental changes in the neural processes underlying the production of novel speech sounds. Our data further suggest that the recruitment of the left IFG pOp during the production of novel speech sounds was possibly enhanced due to the maturation of the neuronal circuits needed for speech motor planning. This, in turn, would lead to improvement in the ability to immediately imitate non-native speech. Copyright © 2014 Wiley Periodicals, Inc.
Prediction and constraint in audiovisual speech perception
Peelle, Jonathan E.; Sommers, Mitchell S.
2015-01-01
During face-to-face conversational speech listeners must efficiently process a rapid and complex stream of multisensory information. Visual speech can serve as a critical complement to auditory information because it provides cues to both the timing of the incoming acoustic signal (the amplitude envelope, influencing attention and perceptual sensitivity) and its content (place and manner of articulation, constraining lexical selection). Here we review behavioral and neurophysiological evidence regarding listeners' use of visual speech information. Multisensory integration of audiovisual speech cues improves recognition accuracy, particularly for speech in noise. Even when speech is intelligible based solely on auditory information, adding visual information may reduce the cognitive demands placed on listeners through increasing precision of prediction. Electrophysiological studies demonstrate oscillatory cortical entrainment to speech in auditory cortex is enhanced when visual speech is present, increasing sensitivity to important acoustic cues. Neuroimaging studies also suggest increased activity in auditory cortex when congruent visual information is available, but additionally emphasize the involvement of heteromodal regions of posterior superior temporal sulcus as playing a role in integrative processing. We interpret these findings in a framework of temporally-focused lexical competition in which visual speech information affects auditory processing to increase sensitivity to auditory information through an early integration mechanism, and a late integration stage that incorporates specific information about a speaker's articulators to constrain the number of possible candidates in a spoken utterance. Ultimately it is words compatible with both auditory and visual information that most strongly determine successful speech perception during everyday listening. Thus, audiovisual speech perception is accomplished through multiple stages of integration, supported by distinct neuroanatomical mechanisms. PMID:25890390
The effect of filtered speech feedback on the frequency of stuttering
NASA Astrophysics Data System (ADS)
Rami, Manish Krishnakant
2000-10-01
This study investigated the effects of filtered components of speech and whispered speech on the frequency of stuttering. It is known that choral speech, shadowing, and altered auditory feedback are the only conditions which induce fluency without any additional effort than normally required to speak on the part of people who stutter. All these conditions use speech as a second signal. This experiment examined the role of components of speech signal as delineated by the source- filter theory of speech production. Three filtered speech signals, a whispered speech signal, and a choral speech signal formed the stimuli. It was postulated that if the speech signal in whole was necessary for producing fluency in people who stutter, then all other conditions except choral speech should fail to produce fluency enhancement. If the glottal source alone was adequate in restoring fluency, then only the conditions of NAF and whispered speech should fail in promoting fluency. In the event that full filter characteristics are necessary for the fluency creating effects, then all conditions except the choral speech and whispered speech should fail to produce fluency. If any part of the filter characteristics is sufficient in yielding fluency, then only the NAF and the approximate glottal source should fail to demonstrate an increase in the amount of fluency. Twelve adults who stuttered read passages under the six conditions while receiving auditory feedback consisting of one of the six experimental conditions: (a)NAF; (b)approximate glottal source; (c)glottal source and first formant; (d)glottal source and first two formants; and (e)whispered speech. Frequencies of stuttering were obtained for each condition and submitted to descriptive and inferential statistical analysis. Statistically significant differences in means were found within the choral feedback conditions. Specifically, the choral speech, the source and first formant, source and the first two formants, and the whispered speech conditions all decreased the frequency of stuttering while the approximate glottal source did not. It is suggested that articulatory events, chiefly the encoded speech output of the vocal tract origin, afford effective cues and induces fluent speech in people who stutter.
Audiomotor Perceptual Training Enhances Speech Intelligibility in Background Noise.
Whitton, Jonathon P; Hancock, Kenneth E; Shannon, Jeffrey M; Polley, Daniel B
2017-11-06
Sensory and motor skills can be improved with training, but learning is often restricted to practice stimuli. As an exception, training on closed-loop (CL) sensorimotor interfaces, such as action video games and musical instruments, can impart a broad spectrum of perceptual benefits. Here we ask whether computerized CL auditory training can enhance speech understanding in levels of background noise that approximate a crowded restaurant. Elderly hearing-impaired subjects trained for 8 weeks on a CL game that, like a musical instrument, challenged them to monitor subtle deviations between predicted and actual auditory feedback as they moved their fingertip through a virtual soundscape. We performed our study as a randomized, double-blind, placebo-controlled trial by training other subjects in an auditory working-memory (WM) task. Subjects in both groups improved at their respective auditory tasks and reported comparable expectations for improved speech processing, thereby controlling for placebo effects. Whereas speech intelligibility was unchanged after WM training, subjects in the CL training group could correctly identify 25% more words in spoken sentences or digit sequences presented in high levels of background noise. Numerically, CL audiomotor training provided more than three times the benefit of our subjects' hearing aids for speech processing in noisy listening conditions. Gains in speech intelligibility could be predicted from gameplay accuracy and baseline inhibitory control. However, benefits did not persist in the absence of continuing practice. These studies employ stringent clinical standards to demonstrate that perceptual learning on a computerized audio game can transfer to "real-world" communication challenges. Copyright © 2017 Elsevier Ltd. All rights reserved.
Revisiting the "enigma" of musicians with dyslexia: Auditory sequencing and speech abilities.
Zuk, Jennifer; Bishop-Liebler, Paula; Ozernov-Palchik, Ola; Moore, Emma; Overy, Katie; Welch, Graham; Gaab, Nadine
2017-04-01
Previous research has suggested a link between musical training and auditory processing skills. Musicians have shown enhanced perception of auditory features critical to both music and speech, suggesting that this link extends beyond basic auditory processing. It remains unclear to what extent musicians who also have dyslexia show these specialized abilities, considering often-observed persistent deficits that coincide with reading impairments. The present study evaluated auditory sequencing and speech discrimination in 52 adults comprised of musicians with dyslexia, nonmusicians with dyslexia, and typical musicians. An auditory sequencing task measuring perceptual acuity for tone sequences of increasing length was administered. Furthermore, subjects were asked to discriminate synthesized syllable continua varying in acoustic components of speech necessary for intraphonemic discrimination, which included spectral (formant frequency) and temporal (voice onset time [VOT] and amplitude envelope) features. Results indicate that musicians with dyslexia did not significantly differ from typical musicians and performed better than nonmusicians with dyslexia for auditory sequencing as well as discrimination of spectral and VOT cues within syllable continua. However, typical musicians demonstrated superior performance relative to both groups with dyslexia for discrimination of syllables varying in amplitude information. These findings suggest a distinct profile of speech processing abilities in musicians with dyslexia, with specific weaknesses in discerning amplitude cues within speech. Because these difficulties seem to remain persistent in adults with dyslexia despite musical training, this study only partly supports the potential for musical training to enhance the auditory processing skills known to be crucial for literacy in individuals with dyslexia. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Nilsson, Jan-Erik; Lundh, Lars-Gunnar; Faghihi, Shahriar; Roth-Andersson, Gun
2011-12-01
According to cognitive models, negatively biased processing of the publicly observable self is an important aspect of social phobia; if this is true, effective methods for producing corrective feedback concerning the public self should be strived for. Video feedback is proven effective, but since one's voice represents another aspect of the self, audio feedback should produce equivalent results. This is the first study to assess the enhancement of audio feedback by cognitive preparation in a single-session randomized controlled experiment. Forty socially anxious participants were asked to give a speech, then to listen to and evaluate a taped recording of their performance. Half of the sample was given cognitive preparation prior to the audio feedback and the remainder received audio feedback only. Cognitive preparation involved asking participants to (1) predict in detail what they would hear on the audiotape, (2) form an image of themselves giving the speech and (3) listen to the audio recording as though they were listening to a stranger. To assess generalization effects all participants were asked to give a second speech. Audio feedback with cognitive preparation was shown to produce less negative ratings after the first speech, and effects generalized to the evaluation of the second speech. More positive speech evaluations were associated with corresponding reductions of state anxiety. Social anxiety as indexed by the Implicit Association Test was reduced in participants given cognitive preparation. Small sample size; analogue study. Audio feedback with cognitive preparation may be utilized as a treatment intervention for social phobia. Copyright © 2011 Elsevier Ltd. All rights reserved.
Comparison of different speech tasks among adults who stutter and adults who do not stutter
Ritto, Ana Paula; Costa, Julia Biancalana; Juste, Fabiola Staróbole; de Andrade, Claudia Regina Furquim
2016-01-01
OBJECTIVES: In this study, we compared the performance of both fluent speakers and people who stutter in three different speaking situations: monologue speech, oral reading and choral reading. This study follows the assumption that the neuromotor control of speech can be influenced by external auditory stimuli in both speakers who stutter and speakers who do not stutter. METHOD: Seventeen adults who stutter and seventeen adults who do not stutter were assessed in three speaking tasks: monologue, oral reading (solo reading aloud) and choral reading (reading in unison with the evaluator). Speech fluency and rate were measured for each task. RESULTS: The participants who stuttered had a lower frequency of stuttering during choral reading than during monologue and oral reading. CONCLUSIONS: According to the dual premotor system model, choral speech enhanced fluency by providing external cues for the timing of each syllable compensating for deficient internal cues. PMID:27074176
Sleep-Driven Computations in Speech Processing
Frost, Rebecca L. A.; Monaghan, Padraic
2017-01-01
Acquiring language requires segmenting speech into individual words, and abstracting over those words to discover grammatical structure. However, these tasks can be conflicting—on the one hand requiring memorisation of precise sequences that occur in speech, and on the other requiring a flexible reconstruction of these sequences to determine the grammar. Here, we examine whether speech segmentation and generalisation of grammar can occur simultaneously—with the conflicting requirements for these tasks being over-come by sleep-related consolidation. After exposure to an artificial language comprising words containing non-adjacent dependencies, participants underwent periods of consolidation involving either sleep or wake. Participants who slept before testing demonstrated a sustained boost to word learning and a short-term improvement to grammatical generalisation of the non-adjacencies, with improvements after sleep outweighing gains seen after an equal period of wake. Thus, we propose that sleep may facilitate processing for these conflicting tasks in language acquisition, but with enhanced benefits for speech segmentation. PMID:28056104
Sleep-Driven Computations in Speech Processing.
Frost, Rebecca L A; Monaghan, Padraic
2017-01-01
Acquiring language requires segmenting speech into individual words, and abstracting over those words to discover grammatical structure. However, these tasks can be conflicting-on the one hand requiring memorisation of precise sequences that occur in speech, and on the other requiring a flexible reconstruction of these sequences to determine the grammar. Here, we examine whether speech segmentation and generalisation of grammar can occur simultaneously-with the conflicting requirements for these tasks being over-come by sleep-related consolidation. After exposure to an artificial language comprising words containing non-adjacent dependencies, participants underwent periods of consolidation involving either sleep or wake. Participants who slept before testing demonstrated a sustained boost to word learning and a short-term improvement to grammatical generalisation of the non-adjacencies, with improvements after sleep outweighing gains seen after an equal period of wake. Thus, we propose that sleep may facilitate processing for these conflicting tasks in language acquisition, but with enhanced benefits for speech segmentation.
A Binaural Grouping Model for Predicting Speech Intelligibility in Multitalker Environments
Colburn, H. Steven
2016-01-01
Spatially separating speech maskers from target speech often leads to a large intelligibility improvement. Modeling this phenomenon has long been of interest to binaural-hearing researchers for uncovering brain mechanisms and for improving signal-processing algorithms in hearing-assistive devices. Much of the previous binaural modeling work focused on the unmasking enabled by binaural cues at the periphery, and little quantitative modeling has been directed toward the grouping or source-separation benefits of binaural processing. In this article, we propose a binaural model that focuses on grouping, specifically on the selection of time-frequency units that are dominated by signals from the direction of the target. The proposed model uses Equalization-Cancellation (EC) processing with a binary decision rule to estimate a time-frequency binary mask. EC processing is carried out to cancel the target signal and the energy change between the EC input and output is used as a feature that reflects target dominance in each time-frequency unit. The processing in the proposed model requires little computational resources and is straightforward to implement. In combination with the Coherence-based Speech Intelligibility Index, the model is applied to predict the speech intelligibility data measured by Marrone et al. The predicted speech reception threshold matches the pattern of the measured data well, even though the predicted intelligibility improvements relative to the colocated condition are larger than some of the measured data, which may reflect the lack of internal noise in this initial version of the model. PMID:27698261
A Binaural Grouping Model for Predicting Speech Intelligibility in Multitalker Environments.
Mi, Jing; Colburn, H Steven
2016-10-03
Spatially separating speech maskers from target speech often leads to a large intelligibility improvement. Modeling this phenomenon has long been of interest to binaural-hearing researchers for uncovering brain mechanisms and for improving signal-processing algorithms in hearing-assistive devices. Much of the previous binaural modeling work focused on the unmasking enabled by binaural cues at the periphery, and little quantitative modeling has been directed toward the grouping or source-separation benefits of binaural processing. In this article, we propose a binaural model that focuses on grouping, specifically on the selection of time-frequency units that are dominated by signals from the direction of the target. The proposed model uses Equalization-Cancellation (EC) processing with a binary decision rule to estimate a time-frequency binary mask. EC processing is carried out to cancel the target signal and the energy change between the EC input and output is used as a feature that reflects target dominance in each time-frequency unit. The processing in the proposed model requires little computational resources and is straightforward to implement. In combination with the Coherence-based Speech Intelligibility Index, the model is applied to predict the speech intelligibility data measured by Marrone et al. The predicted speech reception threshold matches the pattern of the measured data well, even though the predicted intelligibility improvements relative to the colocated condition are larger than some of the measured data, which may reflect the lack of internal noise in this initial version of the model. © The Author(s) 2016.
Intimate insight: MDMA changes how people talk about significant others
Baggott, Matthew J.; Kirkpatrick, Matthew G.; Bedi, Gillinder; de Wit, Harriet
2015-01-01
Rationale ±3,4-methylenedioxymethamphetamine (MDMA) is widely believed to increase sociability. The drug alters speech production and fluency, and may influence speech content. Here, we investigated the effect of MDMA on speech content, which may reveal how this drug affects social interactions. Method 35 healthy volunteers with prior MDMA experience completed this two-session, within-subjects, double-blind study during which they received 1.5 mg/kg oral MDMA and placebo. Participants completed a 5-min standardized talking task during which they discussed a close personal relationship (e.g., a friend or family member) with a research assistant. The conversations were analyzed for selected content categories (e.g., words pertaining to affect, social interaction, and cognition), using both a standard dictionary method (Pennebaker’s Linguistic Inquiry and Word Count: LIWC) and a machine learning method using random forest classifiers. Results Both analytic methods revealed that MDMA altered speech content relative to placebo. Using LIWC scores, the drug increased use of social and sexual words, consistent with reports that MDMA increases willingness to disclose. Using the machine learning algorithm, we found that MDMA increased use of social words and words relating to both positive and negative emotions. Conclusions These findings are consistent with reports that MDMA acutely alters speech content, specifically increasing emotional and social content during a brief semistructured dyadic interaction. Studying effects of psychoactive drugs on speech content may offer new insights into drug effects on mental states, and on emotional and psychosocial interaction. PMID:25922420
Intimate insight: MDMA changes how people talk about significant others.
Baggott, Matthew J; Kirkpatrick, Matthew G; Bedi, Gillinder; de Wit, Harriet
2015-06-01
±3,4-methylenedioxymethamphetamine (MDMA) is widely believed to increase sociability. The drug alters speech production and fluency, and may influence speech content. Here, we investigated the effect of MDMA on speech content, which may reveal how this drug affects social interactions. Thirty-five healthy volunteers with prior MDMA experience completed this two-session, within-subjects, double-blind study during which they received 1.5 mg/kg oral MDMA and placebo. Participants completed a five-minute standardized talking task during which they discussed a close personal relationship (e.g. a friend or family member) with a research assistant. The conversations were analyzed for selected content categories (e.g. words pertaining to affect, social interaction, and cognition), using both a standard dictionary method (Pennebaker's Linguistic Inquiry and Word Count: LIWC) and a machine learning method using random forest classifiers. Both analytic methods revealed that MDMA altered speech content relative to placebo. Using LIWC scores, the drug increased use of social and sexual words, consistent with reports that MDMA increases willingness to disclose. Using the machine learning algorithm, we found that MDMA increased use of social words and words relating to both positive and negative emotions. These findings are consistent with reports that MDMA acutely alters speech content, specifically increasing emotional and social content during a brief semistructured dyadic interaction. Studying effects of psychoactive drugs on speech content may offer new insights into drug effects on mental states, and on emotional and psychosocial interaction. © The Author(s) 2015.
Mirror neurons as a model for the science and treatment of stuttering.
Snyder, Gregory J; Waddell, Dwight E; Blanchet, Paul
2016-01-06
Persistent developmental stuttering is generally considered a speech disorder and affects ∼1% of the global population. While mainstream treatments continue to rely on unreliable behavioral speech motor targets, an emerging research perspective utilizes the mirror neuron system hypothesis as a neural substrate in the science and treatment of stuttering. The purpose of this exploratory study is to test the viability of the mirror neuron system hypothesis in the fluency enhancement of those who stutter. Participants were asked to speak while they were producing self-generated manual gestures, producing and visually perceiving self-generated manual gestures, and visually perceiving manual gestures, relative to a nonmanual gesture control speaking condition. Data reveal that all experimental speaking conditions enhanced fluent speech in all research participants, and the simultaneous perception and production of manual gesturing trended toward greater efficacious fluency enhancement. Coupled with existing research, we interpret these data as suggestive of fluency enhancement through subcortical involvement within multiple levels of an action understanding mirror neuron network. In addition, incidental findings report that stuttering moments were observed to simultaneously occur both orally and manually. Consequently, these data suggest that stuttering behaviors are compensatory, distal manifestations over multiple expressive modalities to an underlying centralized genetic neural substrate of the disorder.
Fu, Szu-Wei; Li, Pei-Chun; Lai, Ying-Hui; Yang, Cheng-Chien; Hsieh, Li-Chun; Tsao, Yu
2017-11-01
Objective: This paper focuses on machine learning based voice conversion (VC) techniques for improving the speech intelligibility of surgical patients who have had parts of their articulators removed. Because of the removal of parts of the articulator, a patient's speech may be distorted and difficult to understand. To overcome this problem, VC methods can be applied to convert the distorted speech such that it is clear and more intelligible. To design an effective VC method, two key points must be considered: 1) the amount of training data may be limited (because speaking for a long time is usually difficult for postoperative patients); 2) rapid conversion is desirable (for better communication). Methods: We propose a novel joint dictionary learning based non-negative matrix factorization (JD-NMF) algorithm. Compared to conventional VC techniques, JD-NMF can perform VC efficiently and effectively with only a small amount of training data. Results: The experimental results demonstrate that the proposed JD-NMF method not only achieves notably higher short-time objective intelligibility (STOI) scores (a standardized objective intelligibility evaluation metric) than those obtained using the original unconverted speech but is also significantly more efficient and effective than a conventional exemplar-based NMF VC method. Conclusion: The proposed JD-NMF method may outperform the state-of-the-art exemplar-based NMF VC method in terms of STOI scores under the desired scenario. Significance: We confirmed the advantages of the proposed joint training criterion for the NMF-based VC. Moreover, we verified that the proposed JD-NMF can effectively improve the speech intelligibility scores of oral surgery patients. Objective: This paper focuses on machine learning based voice conversion (VC) techniques for improving the speech intelligibility of surgical patients who have had parts of their articulators removed. Because of the removal of parts of the articulator, a patient's speech may be distorted and difficult to understand. To overcome this problem, VC methods can be applied to convert the distorted speech such that it is clear and more intelligible. To design an effective VC method, two key points must be considered: 1) the amount of training data may be limited (because speaking for a long time is usually difficult for postoperative patients); 2) rapid conversion is desirable (for better communication). Methods: We propose a novel joint dictionary learning based non-negative matrix factorization (JD-NMF) algorithm. Compared to conventional VC techniques, JD-NMF can perform VC efficiently and effectively with only a small amount of training data. Results: The experimental results demonstrate that the proposed JD-NMF method not only achieves notably higher short-time objective intelligibility (STOI) scores (a standardized objective intelligibility evaluation metric) than those obtained using the original unconverted speech but is also significantly more efficient and effective than a conventional exemplar-based NMF VC method. Conclusion: The proposed JD-NMF method may outperform the state-of-the-art exemplar-based NMF VC method in terms of STOI scores under the desired scenario. Significance: We confirmed the advantages of the proposed joint training criterion for the NMF-based VC. Moreover, we verified that the proposed JD-NMF can effectively improve the speech intelligibility scores of oral surgery patients.
Remote listening and passive acoustic detection in a 3-D environment
NASA Astrophysics Data System (ADS)
Barnhill, Colin
Teleconferencing environments are a necessity in business, education and personal communication. They allow for the communication of information to remote locations without the need for travel and the necessary time and expense required for that travel. Visual information can be communicated using cameras and monitors. The advantage of visual communication is that an image can capture multiple objects and convey them, using a monitor, to a large group of people regardless of the receiver's location. This is not the case for audio. Currently, most experimental teleconferencing systems' audio is based on stereo recording and reproduction techniques. The problem with this solution is that it is only effective for one or two receivers. To accurately capture a sound environment consisting of multiple sources and to recreate that for a group of people is an unsolved problem. This work will focus on new methods of multiple source 3-D environment sound capture and applications using these captured environments. Using spherical microphone arrays, it is now possible to capture a true 3-D environment A spherical harmonic transform on the array's surface allows us to determine the basis functions (spherical harmonics) for all spherical wave solutions (up to a fixed order). This spherical harmonic decomposition (SHD) allows us to not only look at the time and frequency characteristics of an audio signal but also the spatial characteristics of an audio signal. In this way, a spherical harmonic transform is analogous to a Fourier transform in that a Fourier transform transforms a signal into the frequency domain and a spherical harmonic transform transforms a signal into the spatial domain. The SHD also decouples the input signals from the microphone locations. Using the SHD of a soundfield, new algorithms are available for remote listening, acoustic detection, and signal enhancement The new algorithms presented in this paper show distinct advantages over previous detection and listening algorithms especially for multiple speech sources and room environments. The algorithms use high order (spherical harmonic) beamforming and power signal characteristics for source localization and signal enhancement These methods are applied to remote listening, surveillance, and teleconferencing.
ERIC Educational Resources Information Center
Beltyukova, Svetlana A.; Stone, Gregory M.; Ellis, Lee W.
2008-01-01
Purpose: Speech intelligibility research typically relies on traditional evidence of reliability and validity. This investigation used Rasch analysis to enhance understanding of the functioning and meaning of scores obtained with 2 commonly used procedures: word identification (WI) and magnitude estimation scaling (MES). Method: Narrative samples…
ERIC Educational Resources Information Center
Marvel, Cherie L.; Desmond, John E.
2012-01-01
The ability to store and manipulate online information may be enhanced by an inner speech mechanism that draws upon motor brain regions. Neural correlates of this mechanism were examined using event-related functional magnetic resonance imaging (fMRI). Sixteen participants completed two conditions of a verbal working memory task. In both…
ERIC Educational Resources Information Center
Verdon, Sarah; Wong, Sandie; McLeod, Sharynne
2016-01-01
Collaboration with families and communities has been identified as one of six overarching principles to speech and language therapists' (SLTs') engagement in culturally competent practice (Verdon et al., 2015a). The aim of this study was to describe SLTs' collaboration with families and communities when engaging in practice to support the speech,…
Pitch and Time Processing in Speech and Tones: The Effects of Musical Training and Attention
ERIC Educational Resources Information Center
Sares, Anastasia G.; Foster, Nicholas E. V.; Allen, Kachina; Hyde, Krista L.
2018-01-01
Purpose: Musical training is often linked to enhanced auditory discrimination, but the relative roles of pitch and time in music and speech are unclear. Moreover, it is unclear whether pitch and time processing are correlated across individuals and how they may be affected by attention. This study aimed to examine pitch and time processing in…
ERIC Educational Resources Information Center
Shadiev, Rustam; Huang, Yueh-Min; Hwang, Jan-Pan
2017-01-01
In this study, the effectiveness of the application of speech-to-text recognition (STR) technology on enhancing learning and concentration in a calm state of mind, hereafter referred to as meditation (An intentional and self-regulated focusing of attention in order to relax and calm the mind), was investigated. This effectiveness was further…
Narayanan, Shrikanth
2009-01-01
We describe a method for unsupervised region segmentation of an image using its spatial frequency domain representation. The algorithm was designed to process large sequences of real-time magnetic resonance (MR) images containing the 2-D midsagittal view of a human vocal tract airway. The segmentation algorithm uses an anatomically informed object model, whose fit to the observed image data is hierarchically optimized using a gradient descent procedure. The goal of the algorithm is to automatically extract the time-varying vocal tract outline and the position of the articulators to facilitate the study of the shaping of the vocal tract during speech production. PMID:19244005
Early speech perception in Mandarin-speaking children at one-year post cochlear implantation.
Chen, Yuan; Wong, Lena L N; Zhu, Shufeng; Xi, Xin
2016-01-01
The aim in this study was to examine early speech perception outcomes in Mandarin-speaking children during the first year of cochlear implant (CI) use. A hierarchical early speech perception battery was administered to 80 children before and 3, 6, and 12 months after implantation. Demographic information was obtained to evaluate its relationship with these outcomes. Regardless of dialect exposure and whether a hearing aid was trialed before implantation, implant recipients were able to attain similar pre-lingual auditory skills after 12 months of CI use. Children speaking Mandarin developed early Mandarin speech perception faster than those with greater exposure to other Chinese dialects. In addition, children with better pre-implant hearing levels and younger age at implantation attained significantly better speech perception scores after 12 months of CI use. Better pre-implant hearing levels and higher maternal education level were also associated with a significantly steeper growth in early speech perception ability. Mandarin-speaking children with CIs are able to attain early speech perception results comparable to those of their English-speaking counterparts. In addition, consistent single language input via CI probably enhances early speech perception development at least during the first-year of CI use. Copyright © 2015 Elsevier Ltd. All rights reserved.
Fuller, Christina; Free, Rolien; Maat, Bert; Başkent, Deniz
2012-08-01
In normal-hearing listeners, musical background has been observed to change the sound representation in the auditory system and produce enhanced performance in some speech perception tests. Based on these observations, it has been hypothesized that musical background can influence sound and speech perception, and as an extension also the quality of life, by cochlear-implant users. To test this hypothesis, this study explored musical background [using the Dutch Musical Background Questionnaire (DMBQ)], and self-perceived sound and speech perception and quality of life [using the Nijmegen Cochlear Implant Questionnaire (NCIQ) and the Speech Spatial and Qualities of Hearing Scale (SSQ)] in 98 postlingually deafened adult cochlear-implant recipients. In addition to self-perceived measures, speech perception scores (percentage of phonemes recognized in words presented in quiet) were obtained from patient records. The self-perceived hearing performance was associated with the objective speech perception. Forty-one respondents (44% of 94 respondents) indicated some form of formal musical training. Fifteen respondents (18% of 83 respondents) judged themselves as having musical training, experience, and knowledge. No association was observed between musical background (quantified by DMBQ), and self-perceived hearing-related performance or quality of life (quantified by NCIQ and SSQ), or speech perception in quiet.
Lim, Hayoung A; Draper, Ellary
2011-01-01
This study compared a common form of Applied Behavior Analysis Verbal Behavior (ABA VB) approach and music incorporated with ABA VB method as part of developmental speech-language training in the speech production of children with Autism Spectrum Disorders (ASD). This study explored how the perception of musical patterns incorporated in ABA VB operants impacted the production of speech in children with ASD. Participants were 22 children with ASD, age range 3 to 5 years, who were verbal or pre verbal with presence of immediate echolalia. They were randomly assigned a set of target words for each of the 3 training conditions: (a) music incorporated ABA VB, (b) speech (ABA VB) and (c) no-training. Results showed both music and speech trainings were effective for production of the four ABA verbal operants; however, the difference between music and speech training was not statistically different. Results also indicated that music incorporated ABA VB training was most effective in echoic production, and speech training was most effective in tact production. Music can be incorporated into the ABA VB training method, and musical stimuli can be used as successfully as ABA VB speech training to enhance the functional verbal production in children with ASD.
Auditory models for speech analysis
NASA Astrophysics Data System (ADS)
Maybury, Mark T.
This paper reviews the psychophysical basis for auditory models and discusses their application to automatic speech recognition. First an overview of the human auditory system is presented, followed by a review of current knowledge gleaned from neurological and psychoacoustic experimentation. Next, a general framework describes established peripheral auditory models which are based on well-understood properties of the peripheral auditory system. This is followed by a discussion of current enhancements to that models to include nonlinearities and synchrony information as well as other higher auditory functions. Finally, the initial performance of auditory models in the task of speech recognition is examined and additional applications are mentioned.
Speech disorders of Parkinsonism: a review.
Critchley, E M
1981-01-01
Study of the speech disorders of Parkinsonism provides a paradigm of the integration of phonation, articulation and language in the production of speech. The initial defect in the untreated patient is a failure to control respiration for the purpose of speech and there follows a forward progression of articulatory symptoms involving larynx, pharynx, tongue and finally lips. There is evidence that the integration of speech production is organised asymmetrically at thalamic level. Experimental or therapeutic lesions in the region of the inferior medial portion of ventro-lateral thalamus may influence the initiation, respiratory control, rate and prosody of speech. Higher language functions may also be involved in thalamic integration: different forms of anomia are reported with pulvinar and ventrolateral thalamic lesions and transient aphasia may follow stereotaxis. The results of treatment with levodopa indicates that neurotransmitter substances enhance the clarity, volume and persistence of phonation and the latency and smoothness of articulation. The improvement of speech performance is not necessarily in phase with locomotor changes. The dose-related dyskinetic effects of levodopa, which appear to have a physiological basis in observations previously made in post-encephalitic Parkinsonism, not only influence the prosody of speech with near-mutism, hesitancy and dysfluency but may affect work-finding ability and in instances of excitement (erethism) even involve the association of long-term memory with speech. In future, neurologists will need to examine more closely the role of neurotransmitters in speech production and formulation. PMID:7031185
Schreitmüller, Stefan; Frenken, Miriam; Bentz, Lüder; Ortmann, Magdalene; Walger, Martin; Meister, Hartmut
Watching a talker's mouth is beneficial for speech reception (SR) in many communication settings, especially in noise and when hearing is impaired. Measures for audiovisual (AV) SR can be valuable in the framework of diagnosing or treating hearing disorders. This study addresses the lack of standardized methods in many languages for assessing lipreading, AV gain, and integration. A new method is validated that supplements a German speech audiometric test with visualizations of the synthetic articulation of an avatar that was used, for it is feasible to lip-sync auditory speech in a highly standardized way. Three hypotheses were formed according to the literature on AV SR that used live or filmed talkers. It was tested whether respective effects could be reproduced with synthetic articulation: (1) cochlear implant (CI) users have a higher visual-only SR than normal-hearing (NH) individuals, and younger individuals obtain higher lipreading scores than older persons. (2) Both CI and NH gain from presenting AV over unimodal (auditory or visual) sentences in noise. (3) Both CI and NH listeners efficiently integrate complementary auditory and visual speech features. In a controlled, cross-sectional study with 14 experienced CI users (mean age 47.4) and 14 NH individuals (mean age 46.3, similar broad age distribution), lipreading, AV gain, and integration of a German matrix sentence test were assessed. Visual speech stimuli were synthesized by the articulation of the Talking Head system "MASSY" (Modular Audiovisual Speech Synthesizer), which displayed standardized articulation with respect to the visibility of German phones. In line with the hypotheses and previous literature, CI users had a higher mean visual-only SR than NH individuals (CI, 38%; NH, 12%; p < 0.001). Age was correlated with lipreading such that within each group, younger individuals obtained higher visual-only scores than older persons (rCI = -0.54; p = 0.046; rNH = -0.78; p < 0.001). Both CI and NH benefitted by AV over unimodal speech as indexed by calculations of the measures visual enhancement and auditory enhancement (each p < 0.001). Both groups efficiently integrated complementary auditory and visual speech features as indexed by calculations of the measure integration enhancement (each p < 0.005). Given the good agreement between results from literature and the outcome of supplementing an existing validated auditory test with synthetic visual cues, the introduced method can be considered an interesting candidate for clinical and scientific applications to assess measures important for AV SR in a standardized manner. This could be beneficial for optimizing the diagnosis and treatment of individual listening and communication disorders, such as cochlear implantation.
Sensory-Cognitive Interaction in the Neural Encoding of Speech in Noise: A Review
Anderson, Samira; Kraus, Nina
2011-01-01
Background Speech-in-noise (SIN) perception is one of the most complex tasks faced by listeners on a daily basis. Although listening in noise presents challenges for all listeners, background noise inordinately affects speech perception in older adults and in children with learning disabilities. Hearing thresholds are an important factor in SIN perception, but they are not the only factor. For successful comprehension, the listener must perceive and attend to relevant speech features, such as the pitch, timing, and timbre of the target speaker’s voice. Here, we review recent studies linking SIN and brainstem processing of speech sounds. Purpose To review recent work that has examined the ability of the auditory brainstem response to complex sounds (cABR), which reflects the nervous system’s transcription of pitch, timing, and timbre, to be used as an objective neural index for hearing-in-noise abilities. Study Sample We examined speech-evoked brainstem responses in a variety of populations, including children who are typically developing, children with language-based learning impairment, young adults, older adults, and auditory experts (i.e., musicians). Data Collection and Analysis In a number of studies, we recorded brainstem responses in quiet and babble noise conditions to the speech syllable /da/ in all age groups, as well as in a variable condition in children in which /da/ was presented in the context of seven other speech sounds. We also measured speech-in-noise perception using the Hearing-in-Noise Test (HINT) and the Quick Speech-in-Noise Test (QuickSIN). Results Children and adults with poor SIN perception have deficits in the subcortical spectrotemporal representation of speech, including low-frequency spectral magnitudes and the timing of transient response peaks. Furthermore, auditory expertise, as engendered by musical training, provides both behavioral and neural advantages for processing speech in noise. Conclusions These results have implications for future assessment and management strategies for young and old populations whose primary complaint is difficulty hearing in background noise. The cABR provides a clinically applicable metric for objective assessment of individuals with SIN deficits, for determination of the biologic nature of disorders affecting SIN perception, for evaluation of appropriate hearing aid algorithms, and for monitoring the efficacy of auditory remediation and training. PMID:21241645
Suppression of competing speech through entrainment of cortical oscillations
D'Zmura, Michael; Srinivasan, Ramesh
2013-01-01
People are highly skilled at attending to one speaker in the presence of competitors, but the neural mechanisms supporting this remain unclear. Recent studies have argued that the auditory system enhances the gain of a speech stream relative to competitors by entraining (or “phase-locking”) to the rhythmic structure in its acoustic envelope, thus ensuring that syllables arrive during periods of high neuronal excitability. We hypothesized that such a mechanism could also suppress a competing speech stream by ensuring that syllables arrive during periods of low neuronal excitability. To test this, we analyzed high-density EEG recorded from human adults while they attended to one of two competing, naturalistic speech streams. By calculating the cross-correlation between the EEG channels and the speech envelopes, we found evidence of entrainment to the attended speech's acoustic envelope as well as weaker yet significant entrainment to the unattended speech's envelope. An independent component analysis (ICA) decomposition of the data revealed sources in the posterior temporal cortices that displayed robust correlations to both the attended and unattended envelopes. Critically, in these components the signs of the correlations when attended were opposite those when unattended, consistent with the hypothesized entrainment-based suppressive mechanism. PMID:23515789
Lip Movement Exaggerations During Infant-Directed Speech
Green, Jordan R.; Nip, Ignatius S. B.; Wilson, Erin M.; Mefferd, Antje S.; Yunusova, Yana
2011-01-01
Purpose Although a growing body of literature has indentified the positive effects of visual speech on speech and language learning, oral movements of infant-directed speech (IDS) have rarely been studied. This investigation used 3-dimensional motion capture technology to describe how mothers modify their lip movements when talking to their infants. Method Lip movements were recorded from 25 mothers as they spoke to their infants and other adults. Lip shapes were analyzed for differences across speaking conditions. The maximum fundamental frequency, duration, acoustic intensity, and first and second formant frequency of each vowel also were measured. Results Lip movements were significantly larger during IDS than during adult-directed speech, although the exaggerations were vowel specific. All of the vowels produced during IDS were characterized by an elevated vocal pitch and a slowed speaking rate when compared with vowels produced during adult-directed speech. Conclusion The pattern of lip-shape exaggerations did not provide support for the hypothesis that mothers produce exemplar visual models of vowels during IDS. Future work is required to determine whether the observed increases in vertical lip aperture engender visual and acoustic enhancements that facilitate the early learning of speech. PMID:20699342
Acoustical conditions for speech communication in active elementary school classrooms
NASA Astrophysics Data System (ADS)
Sato, Hiroshi; Bradley, John
2005-04-01
Detailed acoustical measurements were made in 34 active elementary school classrooms with typical rectangular room shape in schools near Ottawa, Canada. There was an average of 21 students in classrooms. The measurements were made to obtain accurate indications of the acoustical quality of conditions for speech communication during actual teaching activities. Mean speech and noise levels were determined from the distribution of recorded sound levels and the average speech-to-noise ratio was 11 dBA. Measured mid-frequency reverberation times (RT) during the same occupied conditions varied from 0.3 to 0.6 s, and were a little less than for the unoccupied rooms. RT values were not related to noise levels. Octave band speech and noise levels, useful-to-detrimental ratios, and Speech Transmission Index values were also determined. Key results included: (1) The average vocal effort of teachers corresponded to louder than Pearsons Raised voice level; (2) teachers increase their voice level to overcome ambient noise; (3) effective speech levels can be enhanced by up to 5 dB by early reflection energy; and (4) student activity is seen to be the dominant noise source, increasing average noise levels by up to 10 dBA during teaching activities. [Work supported by CLLRnet.
Li, Juanhua; Wu, Chao; Zheng, Yingjun; Li, Ruikeng; Li, Xuanzi; She, Shenglin; Wu, Haibo; Peng, Hongjun; Ning, Yuping; Li, Liang
2017-09-17
The superior temporal gyrus (STG) is involved in speech recognition against informational masking under cocktail-party-listening conditions. Compared to healthy listeners, people with schizophrenia perform worse in speech recognition under informational speech-on-speech masking conditions. It is not clear whether the schizophrenia-related vulnerability to informational masking is associated with certain changes in FC of the STG with some critical brain regions. Using sparse-sampling fMRI design, this study investigated the differences between people with schizophrenia and healthy controls in FC of the STG for target-speech listening against informational speech-on-speech masking, when a listening condition with either perceived spatial separation (PSS, with a spatial release of informational masking) or perceived spatial co-location (PSC, without the spatial release) between target speech and masking speech was introduced. The results showed that in healthy participants, but not participants with schizophrenia, the contrast of either the PSS or PSC condition against the masker-only condition induced an enhancement of functional connectivity (FC) of the STG with the left superior parietal lobule and the right precuneus. Compared to healthy participants, participants with schizophrenia showed declined FC of the STG with the bilateral precuneus, right SPL, and right supplementary motor area. Thus, FC of the STG with the parietal areas is normally involved in speech listening against informational masking under either the PSS or PSC conditions, and declined FC of the STG in people with schizophrenia with the parietal areas may be associated with the increased vulnerability to informational masking. Copyright © 2017 IBRO. Published by Elsevier Ltd. All rights reserved.
Rowbotham, Samantha; Wardy, April J; Lloyd, Donna M; Wearden, Alison; Holler, Judith
2014-01-01
Effective pain communication is essential if adequate treatment and support are to be provided. Pain communication is often multimodal, with sufferers utilising speech, nonverbal behaviours (such as facial expressions), and co-speech gestures (bodily movements, primarily of the hands and arms that accompany speech and can convey semantic information) to communicate their experience. Research suggests that the production of nonverbal pain behaviours is positively associated with pain intensity, but it is not known whether this is also the case for speech and co-speech gestures. The present study explored whether increased pain intensity is associated with greater speech and gesture production during face-to-face communication about acute, experimental pain. Participants (N = 26) were exposed to experimentally elicited pressure pain to the fingernail bed at high and low intensities and took part in video-recorded semi-structured interviews. Despite rating more intense pain as more difficult to communicate (t(25) = 2.21, p = .037), participants produced significantly longer verbal pain descriptions and more co-speech gestures in the high intensity pain condition (Words: t(25) = 3.57, p = .001; Gestures: t(25) = 3.66, p = .001). This suggests that spoken and gestural communication about pain is enhanced when pain is more intense. Thus, in addition to conveying detailed semantic information about pain, speech and co-speech gestures may provide a cue to pain intensity, with implications for the treatment and support received by pain sufferers. Future work should consider whether these findings are applicable within the context of clinical interactions about pain.
Meeuws, Matthias; Pascoal, David; Bermejo, Iñigo; Artaso, Miguel; De Ceulaer, Geert; Govaerts, Paul J
2017-07-01
The software application FOX ('Fitting to Outcome eXpert') is an intelligent agent to assist in the programing of cochlear implant (CI) processors. The current version utilizes a mixture of deterministic and probabilistic logic which is able to improve over time through a learning effect. This study aimed at assessing whether this learning capacity yields measurable improvements in speech understanding. A retrospective study was performed on 25 consecutive CI recipients with a median CI use experience of 10 years who came for their annual CI follow-up fitting session. All subjects were assessed by means of speech audiometry with open set monosyllables at 40, 55, 70, and 85 dB SPL in quiet with their home MAP. Other psychoacoustic tests were executed depending on the audiologist's clinical judgment. The home MAP and the corresponding test results were entered into FOX. If FOX suggested to make MAP changes, they were implemented and another speech audiometry was performed with the new MAP. FOX suggested MAP changes in 21 subjects (84%). The within-subject comparison showed a significant median improvement of 10, 3, 1, and 7% at 40, 55, 70, and 85 dB SPL, respectively. All but two subjects showed an instantaneous improvement in their mean speech audiometric score. Persons with long-term CI use, who received a FOX-assisted CI fitting at least 6 months ago, display improved speech understanding after MAP modifications, as recommended by the current version of FOX. This can be explained only by intrinsic improvements in FOX's algorithms, as they have resulted from learning. This learning is an inherent feature of artificial intelligence and it may yield measurable benefit in speech understanding even in long-term CI recipients.
Tóth, László; Hoffmann, Ildikó; Gosztolya, Gábor; Vincze, Veronika; Szatlóczki, Gréta; Bánréti, Zoltán; Pákáski, Magdolna; Kálmán, János
2018-01-01
Background: Even today the reliable diagnosis of the prodromal stages of Alzheimer’s disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive de-cline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method which is based on the analysis of spontaneous speech production during performing a memory task. In the future, this can form the basis of an Internet-based interactive screening software for the recognition of MCI. Methods: Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. The provoked spontaneous speech by asking the patients to recall the content of 2 short black and white films (one direct, one delayed), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech sig-nals, first manually (using the Praat software), and then automatically, with an automatic speech recogni-tion (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. Results: The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process – that is, using the ASR-based features in combination with machine learning - was able to separate the two classes with an F1-score of 78.8%. Conclusion: The temporal analysis of spontaneous speech can be exploited in implementing a new, auto-matic detection-based tool for screening MCI for the community. PMID:29165085
Toth, Laszlo; Hoffmann, Ildiko; Gosztolya, Gabor; Vincze, Veronika; Szatloczki, Greta; Banreti, Zoltan; Pakaski, Magdolna; Kalman, Janos
2018-01-01
Even today the reliable diagnosis of the prodromal stages of Alzheimer's disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method which is based on the analysis of spontaneous speech production during performing a memory task. In the future, this can form the basis of an Internet-based interactive screening software for the recognition of MCI. Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. The provoked spontaneous speech by asking the patients to recall the content of 2 short black and white films (one direct, one delayed), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process - that is, using the ASR-based features in combination with machine learning - was able to separate the two classes with an F1-score of 78.8%. The temporal analysis of spontaneous speech can be exploited in implementing a new, automatic detection-based tool for screening MCI for the community. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
Prediction and constraint in audiovisual speech perception.
Peelle, Jonathan E; Sommers, Mitchell S
2015-07-01
During face-to-face conversational speech listeners must efficiently process a rapid and complex stream of multisensory information. Visual speech can serve as a critical complement to auditory information because it provides cues to both the timing of the incoming acoustic signal (the amplitude envelope, influencing attention and perceptual sensitivity) and its content (place and manner of articulation, constraining lexical selection). Here we review behavioral and neurophysiological evidence regarding listeners' use of visual speech information. Multisensory integration of audiovisual speech cues improves recognition accuracy, particularly for speech in noise. Even when speech is intelligible based solely on auditory information, adding visual information may reduce the cognitive demands placed on listeners through increasing the precision of prediction. Electrophysiological studies demonstrate that oscillatory cortical entrainment to speech in auditory cortex is enhanced when visual speech is present, increasing sensitivity to important acoustic cues. Neuroimaging studies also suggest increased activity in auditory cortex when congruent visual information is available, but additionally emphasize the involvement of heteromodal regions of posterior superior temporal sulcus as playing a role in integrative processing. We interpret these findings in a framework of temporally-focused lexical competition in which visual speech information affects auditory processing to increase sensitivity to acoustic information through an early integration mechanism, and a late integration stage that incorporates specific information about a speaker's articulators to constrain the number of possible candidates in a spoken utterance. Ultimately it is words compatible with both auditory and visual information that most strongly determine successful speech perception during everyday listening. Thus, audiovisual speech perception is accomplished through multiple stages of integration, supported by distinct neuroanatomical mechanisms. Copyright © 2015 Elsevier Ltd. All rights reserved.
Why do children pay more attention to grammatical morphemes at the ends of sentences?
Sundara, Megha
2018-05-01
Children pay more attention to the beginnings and ends of sentences rather than the middle. In natural speech, ends of sentences are prosodically and segmentally enhanced; they are also privileged by sensory and recall advantages. We contrasted whether acoustic enhancement or sensory and recall-related advantages are necessary and sufficient for the salience of grammatical morphemes at the ends of sentences. We measured 22-month-olds' listening times to grammatical and ungrammatical sentences with third person singular -s. Crucially, by cross-splicing the speech stimuli, acoustic enhancement and sensory and recall advantages were fully crossed. Only children presented with the verb in sentence-final position, a position with sensory and recall advantages, distinguished between the grammatical and ungrammatical sentences. Thus, sensory and recall advantages alone were necessary and sufficient to make grammatical morphemes at ends of sentences salient. These general processing constraints privilege ends of sentences over middles, regardless of the acoustic enhancement.
Novel medical image enhancement algorithms
NASA Astrophysics Data System (ADS)
Agaian, Sos; McClendon, Stephen A.
2010-01-01
In this paper, we present two novel medical image enhancement algorithms. The first, a global image enhancement algorithm, utilizes an alpha-trimmed mean filter as its backbone to sharpen images. The second algorithm uses a cascaded unsharp masking technique to separate the high frequency components of an image in order for them to be enhanced using a modified adaptive contrast enhancement algorithm. Experimental results from enhancing electron microscopy, radiological, CT scan and MRI scan images, using the MATLAB environment, are then compared to the original images as well as other enhancement methods, such as histogram equalization and two forms of adaptive contrast enhancement. An image processing scheme for electron microscopy images of Purkinje cells will also be implemented and utilized as a comparison tool to evaluate the performance of our algorithm.
A Robust Approach For Acoustic Noise Suppression In Speech Using ANFIS
NASA Astrophysics Data System (ADS)
Martinek, Radek; Kelnar, Michal; Vanus, Jan; Bilik, Petr; Zidek, Jan
2015-11-01
The authors of this article deals with the implementation of a combination of techniques of the fuzzy system and artificial intelligence in the application area of non-linear noise and interference suppression. This structure used is called an Adaptive Neuro Fuzzy Inference System (ANFIS). This system finds practical use mainly in audio telephone (mobile) communication in a noisy environment (transport, production halls, sports matches, etc). Experimental methods based on the two-input adaptive noise cancellation concept was clearly outlined. Within the experiments carried out, the authors created, based on the ANFIS structure, a comprehensive system for adaptive suppression of unwanted background interference that occurs in audio communication and degrades the audio signal. The system designed has been tested on real voice signals. This article presents the investigation and comparison amongst three distinct approaches to noise cancellation in speech; they are LMS (least mean squares) and RLS (recursive least squares) adaptive filtering and ANFIS. A careful review of literatures indicated the importance of non-linear adaptive algorithms over linear ones in noise cancellation. It was concluded that the ANFIS approach had the overall best performance as it efficiently cancelled noise even in highly noise-degraded speech. Results were drawn from the successful experimentation, subjective-based tests were used to analyse their comparative performance while objective tests were used to validate them. Implementation of algorithms was experimentally carried out in Matlab to justify the claims and determine their relative performances.
Multiple benefits of personal FM system use by children with auditory processing disorder (APD).
Johnston, Kristin N; John, Andrew B; Kreisman, Nicole V; Hall, James W; Crandell, Carl C
2009-01-01
Children with auditory processing disorders (APD) were fitted with Phonak EduLink FM devices for home and classroom use. Baseline measures of the children with APD, prior to FM use, documented significantly lower speech-perception scores, evidence of decreased academic performance, and psychosocial problems in comparison to an age- and gender-matched control group. Repeated measures during the school year demonstrated speech-perception improvement in noisy classroom environments as well as significant academic and psychosocial benefits. Compared with the control group, the children with APD showed greater speech-perception advantage with FM technology. Notably, after prolonged FM use, even unaided (no FM device) speech-perception performance was improved in the children with APD, suggesting the possibility of fundamentally enhanced auditory system function.
Hollien, Harry; Huntley Bahr, Ruth; Harnsberger, James D
2014-03-01
The following article provides a general review of an area that can be referred to as Forensic Voice. Its goals will be outlined and that discussion will be followed by a description of its major elements. Considered are (1) the processing and analysis of spoken utterances, (2) distorted speech, (3) enhancement of speech intelligibility (re: surveillance and other recordings), (4) transcripts, (5) authentication of recordings, (6) speaker identification, and (7) the detection of deception, intoxication, and emotions in speech. Stress in speech and the psychological stress evaluation systems (that some individuals attempt to use as lie detectors) also will be considered. Points of entry will be suggested for individuals with the kinds of backgrounds possessed by professionals already working in the voice area. Copyright © 2014 The Voice Foundation. Published by Mosby, Inc. All rights reserved.
Bai, Mingsian R; Hsieh, Ping-Ju; Hur, Kur-Nan
2009-02-01
The performance of the minimum mean-square error noise reduction (MMSE-NR) algorithm in conjunction with time-recursive averaging (TRA) for noise estimation is found to be very sensitive to the choice of two recursion parameters. To address this problem in a more systematic manner, this paper proposes an optimization method to efficiently search the optimal parameters of the MMSE-TRA-NR algorithms. The objective function is based on a regression model, whereas the optimization process is carried out with the simulated annealing algorithm that is well suited for problems with many local optima. Another NR algorithm proposed in the paper employs linear prediction coding as a preprocessor for extracting the correlated portion of human speech. Objective and subjective tests were undertaken to compare the optimized MMSE-TRA-NR algorithm with several conventional NR algorithms. The results of subjective tests were processed by using analysis of variance to justify the statistic significance. A post hoc test, Tukey's Honestly Significant Difference, was conducted to further assess the pairwise difference between the NR algorithms.
Audiovisual integration in children listening to spectrally degraded speech.
Maidment, David W; Kang, Hi Jee; Stewart, Hannah J; Amitay, Sygal
2015-02-01
The study explored whether visual information improves speech identification in typically developing children with normal hearing when the auditory signal is spectrally degraded. Children (n=69) and adults (n=15) were presented with noise-vocoded sentences from the Children's Co-ordinate Response Measure (Rosen, 2011) in auditory-only or audiovisual conditions. The number of bands was adaptively varied to modulate the degradation of the auditory signal, with the number of bands required for approximately 79% correct identification calculated as the threshold. The youngest children (4- to 5-year-olds) did not benefit from accompanying visual information, in comparison to 6- to 11-year-old children and adults. Audiovisual gain also increased with age in the child sample. The current data suggest that children younger than 6 years of age do not fully utilize visual speech cues to enhance speech perception when the auditory signal is degraded. This evidence not only has implications for understanding the development of speech perception skills in children with normal hearing but may also inform the development of new treatment and intervention strategies that aim to remediate speech perception difficulties in pediatric cochlear implant users.
Bidelman, Gavin M; Dexter, Lauren
2015-04-01
We examined a consistent deficit observed in bilinguals: poorer speech-in-noise (SIN) comprehension for their nonnative language. We recorded neuroelectric mismatch potentials in mono- and bi-lingual listeners in response to contrastive speech sounds in noise. Behaviorally, late bilinguals required ∼10dB more favorable signal-to-noise ratios to match monolinguals' SIN abilities. Source analysis of cortical activity demonstrated monotonic increase in response latency with noise in superior temporal gyrus (STG) for both groups, suggesting parallel degradation of speech representations in auditory cortex. Contrastively, we found differential speech encoding between groups within inferior frontal gyrus (IFG)-adjacent to Broca's area-where noise delays observed in nonnative listeners were offset in monolinguals. Notably, brain-behavior correspondences double dissociated between language groups: STG activation predicted bilinguals' SIN, whereas IFG activation predicted monolinguals' performance. We infer higher-order brain areas act compensatorily to enhance impoverished sensory representations but only when degraded speech recruits linguistic brain mechanisms downstream from initial auditory-sensory inputs. Copyright © 2015 Elsevier Inc. All rights reserved.
Scalable Parallel Algorithms for Multidimensional Digital Signal Processing
1991-12-31
Proceedings, San Diego CL., August 1989, pp. 132-146. 53 [13] A. L. Gorin, L. Auslander, and A. Silberger . Balanced computation of 2D trans- forms on a tree...Speech, Signal Processing. ASSP-34, Oct. 1986,pp. 1301-1309. [24] A. Norton and A. Silberger . Parallelization and performance analysis of the Cooley-Tukey
Dysphonia Detected by Pattern Recognition of Spectral Composition.
ERIC Educational Resources Information Center
Leinonen, Lea; And Others
1992-01-01
This study analyzed production of a long vowel sound within Finnish words by normal or dysphonic voices, using the Self-Organizing Map, the artificial neural network algorithm of T. Kohonen which produces two-dimensional representations of speech. The method was found to be both sensitive and specific in the detection of dysphonia. (Author/JDD)
Martinelli, Eugenio; Mencattini, Arianna; Di Natale, Corrado
2016-01-01
Humans can communicate their emotions by modulating facial expressions or the tone of their voice. Albeit numerous applications exist that enable machines to read facial emotions and recognize the content of verbal messages, methods for speech emotion recognition are still in their infancy. Yet, fast and reliable applications for emotion recognition are the obvious advancement of present ‘intelligent personal assistants’, and may have countless applications in diagnostics, rehabilitation and research. Taking inspiration from the dynamics of human group decision-making, we devised a novel speech emotion recognition system that applies, for the first time, a semi-supervised prediction model based on consensus. Three tests were carried out to compare this algorithm with traditional approaches. Labeling performances relative to a public database of spontaneous speeches are reported. The novel system appears to be fast, robust and less computationally demanding than traditional methods, allowing for easier implementation in portable voice-analyzers (as used in rehabilitation, research, industry, etc.) and for applications in the research domain (such as real-time pairing of stimuli to participants’ emotional state, selective/differential data collection based on emotional content, etc.). PMID:27563724
Exploring super-Gaussianity toward robust information-theoretical time delay estimation.
Petsatodis, Theodoros; Talantzis, Fotios; Boukis, Christos; Tan, Zheng-Hua; Prasad, Ramjee
2013-03-01
Time delay estimation (TDE) is a fundamental component of speaker localization and tracking algorithms. Most of the existing systems are based on the generalized cross-correlation method assuming gaussianity of the source. It has been shown that the distribution of speech, captured with far-field microphones, is highly varying, depending on the noise and reverberation conditions. Thus the performance of TDE is expected to fluctuate depending on the underlying assumption for the speech distribution, being also subject to multi-path reflections and competitive background noise. This paper investigates the effect upon TDE when modeling the source signal with different speech-based distributions. An information theoretical TDE method indirectly encapsulating higher order statistics (HOS) formed the basis of this work. The underlying assumption of Gaussian distributed source has been replaced by that of generalized Gaussian distribution that allows evaluating the problem under a larger set of speech-shaped distributions, ranging from Gaussian to Laplacian and Gamma. Closed forms of the univariate and multivariate entropy expressions of the generalized Gaussian distribution are derived to evaluate the TDE. The results indicate that TDE based on the specific criterion is independent of the underlying assumption for the distribution of the source, for the same covariance matrix.
Lancia, Leonardo; Fuchs, Susanne; Tiede, Mark
2014-06-01
The aim of this article was to introduce an important tool, cross-recurrence analysis, to speech production applications by showing how it can be adapted to evaluate the similarity of multivariate patterns of articulatory motion. The method differs from classical applications of cross-recurrence analysis because no phase space reconstruction is conducted, and a cleaning algorithm removes the artifacts from the recurrence plot. The main features of the proposed approach are robustness to nonstationarity and efficient separation of amplitude variability from temporal variability. The authors tested these claims by applying their method to synthetic stimuli whose variability had been carefully controlled. The proposed method was also demonstrated in a practical application: It was used to investigate the role of biomechanical constraints in articulatory reorganization as a consequence of speeded repetition of CVCV utterances containing a labial and a coronal consonant. Overall, the proposed approach provided more reliable results than other methods, particularly in the presence of high variability. The proposed method is a useful and appropriate tool for quantifying similarity and dissimilarity in patterns of speech articulator movement, especially in such research areas as speech errors and pathologies, where unpredictable divergent behavior is expected.
NASA Astrophysics Data System (ADS)
Wu, Bo; Yang, Minglei; Li, Kehuang; Huang, Zhen; Siniscalchi, Sabato Marco; Wang, Tong; Lee, Chin-Hui
2017-12-01
A reverberation-time-aware deep-neural-network (DNN)-based multi-channel speech dereverberation framework is proposed to handle a wide range of reverberation times (RT60s). There are three key steps in designing a robust system. First, to accomplish simultaneous speech dereverberation and beamforming, we propose a framework, namely DNNSpatial, by selectively concatenating log-power spectral (LPS) input features of reverberant speech from multiple microphones in an array and map them into the expected output LPS features of anechoic reference speech based on a single deep neural network (DNN). Next, the temporal auto-correlation function of received signals at different RT60s is investigated to show that RT60-dependent temporal-spatial contexts in feature selection are needed in the DNNSpatial training stage in order to optimize the system performance in diverse reverberant environments. Finally, the RT60 is estimated to select the proper temporal and spatial contexts before feeding the log-power spectrum features to the trained DNNs for speech dereverberation. The experimental evidence gathered in this study indicates that the proposed framework outperforms the state-of-the-art signal processing dereverberation algorithm weighted prediction error (WPE) and conventional DNNSpatial systems without taking the reverberation time into account, even for extremely weak and severe reverberant conditions. The proposed technique generalizes well to unseen room size, array geometry and loudspeaker position, and is robust to reverberation time estimation error.
Behavioral Signal Processing: Deriving Human Behavioral Informatics From Speech and Language
Narayanan, Shrikanth; Georgiou, Panayiotis G.
2013-01-01
The expression and experience of human behavior are complex and multimodal and characterized by individual and contextual heterogeneity and variability. Speech and spoken language communication cues offer an important means for measuring and modeling human behavior. Observational research and practice across a variety of domains from commerce to healthcare rely on speech- and language-based informatics for crucial assessment and diagnostic information and for planning and tracking response to an intervention. In this paper, we describe some of the opportunities as well as emerging methodologies and applications of human behavioral signal processing (BSP) technology and algorithms for quantitatively understanding and modeling typical, atypical, and distressed human behavior with a specific focus on speech- and language-based communicative, affective, and social behavior. We describe the three important BSP components of acquiring behavioral data in an ecologically valid manner across laboratory to real-world settings, extracting and analyzing behavioral cues from measured data, and developing models offering predictive and decision-making support. We highlight both the foundational speech and language processing building blocks as well as the novel processing and modeling opportunities. Using examples drawn from specific real-world applications ranging from literacy assessment and autism diagnostics to psychotherapy for addiction and marital well being, we illustrate behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion. PMID:24039277
A Construction System for CALL Materials from TV News with Captions
NASA Astrophysics Data System (ADS)
Kobayashi, Satoshi; Tanaka, Takashi; Mori, Kazumasa; Nakagawa, Seiichi
Many language learning materials have been published. In language learning, although repetition training is obviously necessary, it is difficult to maintain the learner's interest/motivation using existing learning materials, because those materials are limited in their scope and contents. In addition, we doubt whether the speech sounds used in most materials are natural in various situations. Nowadays, some TV news programs (CNN, ABC, PBS, NHK, etc.) have closed/open captions corresponding to the announcer's speech. We have developed a system that makes Computer Assisted Language Learning (CALL) materials for both English learning by Japanese and Japanese learning by foreign students from such captioned newscasts. This system computes the synchronization between captions and speech by using HMMs and a forced alignment algorithm. Materials made by the system have following functions: full/partial text caption display, repetition listening, consulting an electronic dictionary, display of the user's/announcer's sound waveform and pitch contour, and automatic construction of a dictation test. Materials have following advantages: materials present polite and natural speech, various and timely topics. Furthermore, the materials have the following possibility: automatic creation of listening/understanding tests, and storage/retrieval of the many materials. In this paper, firstly, we present the organization of the system. Then, we describe results of questionnaires on trial use of the materials. As the result, we got enough accuracy on the synchronization between captions and speech. Speaking totally, we encouraged to research this system.
Henkin, Yael; Kishon-Rabin, Liat; Tatin-Schneider, Simona; Urbach, Doron; Hildesheimer, Minka; Kileny, Paul R
2004-12-01
The current preliminary report describes the utilization of low-resolution electromagnetic tomography (LORETA) in a small group of highly performing children using the Nucleus 22 cochlear implant (CI) and in normal-hearing (NH) adults. LORETA current density estimations were performed on an averaged target P3 component that was elicited by non-speech and speech oddball discrimination tasks. The results indicated that, when stimulated with tones, patients with right implants and NH adults (regardless of stimulated ear) showed enhanced activation in the right temporal lobe, whereas patients with left implants showed enhanced activation in the left temporal lobe. When stimulated with speech, patients with right implants showed bilateral activation of the temporal and frontal lobes, whereas patients with left implants showed only left temporal lobe activation. NH adults (regardless of stimulated ear) showed enhanced bilateral activation of the temporal and parietal lobes. The differences in activation patterns between patients with CI and NH subjects may be attributed to the long-term exposure to degraded input conditions which may have resulted in reorganization in terms of functional specialization. The difference between patients with right versus left implants, however, is intriguing and requires further investigation.
Enhanced Sensitivity to Subphonemic Segments in Dyslexia: A New Instance of Allophonic Perception
Serniclaes, Willy; Seck, M’ballo
2018-01-01
Although dyslexia can be individuated in many different ways, it has only three discernable sources: a visual deficit that affects the perception of letters, a phonological deficit that affects the perception of speech sounds, and an audio-visual deficit that disturbs the association of letters with speech sounds. However, the very nature of each of these core deficits remains debatable. The phonological deficit in dyslexia, which is generally attributed to a deficit of phonological awareness, might result from a specific mode of speech perception characterized by the use of allophonic (i.e., subphonemic) units. Here we will summarize the available evidence and present new data in support of the “allophonic theory” of dyslexia. Previous studies have shown that the dyslexia deficit in the categorical perception of phonemic features (e.g., the voicing contrast between /t/ and /d/) is due to the enhanced sensitivity to allophonic features (e.g., the difference between two variants of /d/). Another consequence of allophonic perception is that it should also give rise to an enhanced sensitivity to allophonic segments, such as those that take place within a consonant cluster. This latter prediction is validated by the data presented in this paper. PMID:29587419
Novel magnet-retained prosthetic system for facial reconstruction.
Ahmed, Mostafa M; Piper, James M; Hansen, Nancy A; Sutton, Alan J; Schmalbach, Cecelia E
2014-01-01
Traumatic facial defects negatively impact speech, mastication, deglutition, dental hygiene, and psychosocial well-being. Reconstruction must address restoration of function and aesthetics to provide quality of life. This report describes soft-tissue reconstruction using a novel magnet-retained facial prosthesis without osseointegrated abutments, performed in a patient after traumatic loss of the entire left lower part of the face, including lips, commissure, and mentum. This reconstructive technique successfully addressed the cosmetic defect while also restoring function with respect to speech and oral nutrition. For this reason, magnet-retained facial prosthesis should be added to free tissue transfer and regional flaps as a reasonable option in the reconstructive algorithm for complex soft-tissue defects of the lower face.
Post interaural neural net-based vowel recognition
NASA Astrophysics Data System (ADS)
Jouny, Ismail I.
2001-10-01
Interaural head related transfer functions are used to process speech signatures prior to neural net based recognition. Data representing the head related transfer function of a dummy has been collected at MIT and made available on the Internet. This data is used to pre-process vowel signatures to mimic the effects of human ear on speech perception. Signatures representing various vowels of the English language are then presented to a multi-layer perceptron trained using the back propagation algorithm for recognition purposes. The focus in this paper is to assess the effects of human interaural system on vowel recognition performance particularly when using a classification system that mimics the human brain such as a neural net.
Speech summer camp for treating articulation disorders in cleft palate patients.
Pamplona, Carmen; Ysunza, Antonio; Patiño, Carmeluza; Ramírez, Elena; Drucker, Mónica; Mazón, Juán J
2005-03-01
Compensatory articulation disorder (CAD) severely affects speech intelligibility of cleft palate children. CAD must be treated with speech therapy. Children can manage articulation better when they use language in event contexts such as every day routines. The purpose of this paper is to study and compare two modalities of speech intervention in cleft palate children with associated CAD. The first modality is a conventional approach providing speech therapy in 1-h sessions, twice a week. The second modality is a speech summer camp in which children received therapy 4h per day, 5 days a week for a period of 3 weeks. We were aimed to determine if a speech summer camp could significantly enhance articulation in CP children with CAD. Forty-five children with repaired cleft palates who exhibited CAD were studied. A matched control group of 45 children with repaired cleft palate who also exhibited CAD were identified. The patients included in the first group attended a speech summer camp for 3 weeks. The matched control subjects included in the second group received speech therapy aimed to correct CAD twice per-week in 1-h sessions. At the onset of either the summer camp or the speech therapy period, the severity of CAD was evenly distributed with non-significant differences across both groups of patients (p > 0.05). After the summer camp (3 weeks) or 12 months of speech therapy sessions at a frequency of twice per-week, both groups of patients showed a significant decrease in the severity of their CAD (p < 0.05). However, when the distribution of the severity of CAD was compared at the end of the summer camp or the speech therapy period, non-significant differences were found between both groups of patients (p > 0.05). A speech summer camp is a valid and efficient method for providing speech therapy in cleft palate children with compensatory articulation disorder.
Asad, Areej Nimer; Purdy, Suzanne C; Ballard, Elaine; Fairgray, Liz; Bowen, Caroline
2018-04-27
In this descriptive study, phonological processes were examined in the speech of children aged 5;0-7;6 (years; months) with mild to profound hearing loss using hearing aids (HAs) and cochlear implants (CIs), in comparison to their peers. A second aim was to compare phonological processes of HA and CI users. Children with hearing loss (CWHL, N = 25) were compared to children with normal hearing (CWNH, N = 30) with similar age, gender, linguistic, and socioeconomic backgrounds. Speech samples obtained from a list of 88 words, derived from three standardized speech tests, were analyzed using the CASALA (Computer Aided Speech and Language Analysis) program to evaluate participants' phonological systems, based on lax (a process appeared at least twice in the speech of at least two children) and strict (a process appeared at least five times in the speech of at least two children) counting criteria. Developmental phonological processes were eliminated in the speech of younger and older CWNH while eleven developmental phonological processes persisted in the speech of both age groups of CWHL. CWHL showed a similar trend of age of elimination to CWNH, but at a slower rate. Children with HAs and CIs produced similar phonological processes. Final consonant deletion, weak syllable deletion, backing, and glottal replacement were present in the speech of HA users, affecting their overall speech intelligibility. Developmental and non-developmental phonological processes persist in the speech of children with mild to profound hearing loss compared to their peers with typical hearing. The findings indicate that it is important for clinicians to consider phonological assessment in pre-school CWHL and the use of evidence-based speech therapy in order to reduce non-developmental and non-age-appropriate developmental processes, thereby enhancing their speech intelligibility. Copyright © 2018 Elsevier Inc. All rights reserved.
Agnew, Zarinah; Nagarajan, Srikantan; Houde, John; Ivry, Richard B.
2017-01-01
The cerebellum has been hypothesized to form a crucial part of the speech motor control network. Evidence for this comes from patients with cerebellar damage, who exhibit a variety of speech deficits, as well as imaging studies showing cerebellar activation during speech production in healthy individuals. To date, the precise role of the cerebellum in speech motor control remains unclear, as it has been implicated in both anticipatory (feedforward) and reactive (feedback) control. Here, we assess both anticipatory and reactive aspects of speech motor control, comparing the performance of patients with cerebellar degeneration and matched controls. Experiment 1 tested feedforward control by examining speech adaptation across trials in response to a consistent perturbation of auditory feedback. Experiment 2 tested feedback control, examining online corrections in response to inconsistent perturbations of auditory feedback. Both male and female patients and controls were tested. The patients were impaired in adapting their feedforward control system relative to controls, exhibiting an attenuated anticipatory response to the perturbation. In contrast, the patients produced even larger compensatory responses than controls, suggesting an increased reliance on sensory feedback to guide speech articulation in this population. Together, these results suggest that the cerebellum is crucial for maintaining accurate feedforward control of speech, but relatively uninvolved in feedback control. SIGNIFICANCE STATEMENT Speech motor control is a complex activity that is thought to rely on both predictive, feedforward control as well as reactive, feedback control. While the cerebellum has been shown to be part of the speech motor control network, its functional contribution to feedback and feedforward control remains controversial. Here, we use real-time auditory perturbations of speech to show that patients with cerebellar degeneration are impaired in adapting feedforward control of speech but retain the ability to make online feedback corrections; indeed, the patients show an increased sensitivity to feedback. These results indicate that the cerebellum forms a crucial part of the feedforward control system for speech but is not essential for online, feedback control. PMID:28842410
Brain dynamics that correlate with effects of learning on auditory distance perception.
Wisniewski, Matthew G; Mercado, Eduardo; Church, Barbara A; Gramann, Klaus; Makeig, Scott
2014-01-01
Accuracy in auditory distance perception can improve with practice and varies for sounds differing in familiarity. Here, listeners were trained to judge the distances of English, Bengali, and backwards speech sources pre-recorded at near (2-m) and far (30-m) distances. Listeners' accuracy was tested before and after training. Improvements from pre-test to post-test were greater for forward speech, demonstrating a learning advantage for forward speech sounds. Independent component (IC) processes identified in electroencephalographic (EEG) data collected during pre- and post-testing revealed three clusters of ICs across subjects with stimulus-locked spectral perturbations related to learning and accuracy. One cluster exhibited a transient stimulus-locked increase in 4-8 Hz power (theta event-related synchronization; ERS) that was smaller after training and largest for backwards speech. For a left temporal cluster, 8-12 Hz decreases in power (alpha event-related desynchronization; ERD) were greatest for English speech and less prominent after training. In contrast, a cluster of IC processes centered at or near anterior portions of the medial frontal cortex showed learning-related enhancement of sustained increases in 10-16 Hz power (upper-alpha/low-beta ERS). The degree of this enhancement was positively correlated with the degree of behavioral improvements. Results suggest that neural dynamics in non-auditory cortical areas support distance judgments. Further, frontal cortical networks associated with attentional and/or working memory processes appear to play a role in perceptual learning for source distance.
Image contrast enhancement using adjacent-blocks-based modification for local histogram equalization
NASA Astrophysics Data System (ADS)
Wang, Yang; Pan, Zhibin
2017-11-01
Infrared images usually have some non-ideal characteristics such as weak target-to-background contrast and strong noise. Because of these characteristics, it is necessary to apply the contrast enhancement algorithm to improve the visual quality of infrared images. Histogram equalization (HE) algorithm is a widely used contrast enhancement algorithm due to its effectiveness and simple implementation. But a drawback of HE algorithm is that the local contrast of an image cannot be equally enhanced. Local histogram equalization algorithms are proved to be the effective techniques for local image contrast enhancement. However, over-enhancement of noise and artifacts can be easily found in the local histogram equalization enhanced images. In this paper, a new contrast enhancement technique based on local histogram equalization algorithm is proposed to overcome the drawbacks mentioned above. The input images are segmented into three kinds of overlapped sub-blocks using the gradients of them. To overcome the over-enhancement effect, the histograms of these sub-blocks are then modified by adjacent sub-blocks. We pay more attention to improve the contrast of detail information while the brightness of the flat region in these sub-blocks is well preserved. It will be shown that the proposed algorithm outperforms other related algorithms by enhancing the local contrast without introducing over-enhancement effects and additional noise.