NASA Astrophysics Data System (ADS)
Gao, Pei-pei; Liu, Feng
2016-10-01
With the development of information technology and artificial intelligence, speech synthesis plays a significant role in human-computer interaction. However, a main problem of current speech synthesis techniques is a lack of naturalness and expressiveness, so that synthetic output does not yet approach the standard of natural speech. Another problem is that human-computer interaction based on speech synthesis is too monotonous to realize a mechanism of subjective user drive. This thesis reviews the historical development of speech synthesis and summarizes the general processing pipeline, pointing out that the prosody generation module is an important part of the process. On this basis, using the rules of eye activity during reading to control and drive prosody generation is introduced as a new human-computer interaction method that enriches the forms of synthesis. The present state of speech synthesis technology is reviewed in detail, and, on the premise of eye-gaze data extraction with the eye movement signal driving synthesis in real time, a speech synthesis method is proposed that can express the real speech rhythm of the speaker. That is, while the reader silently reads the corpus, reading information such as the gaze duration per prosodic unit is captured, and a hierarchical prosodic duration model is established to determine the duration parameters of the synthesized speech. Finally, analysis verifies the feasibility of the above method.
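A minimal sketch of the gaze-driven duration idea, assuming per-prosodic-unit gaze durations have already been extracted by an eye tracker; the rescaling rule, the floor value, and all names are illustrative assumptions rather than the thesis's model:

```python
import numpy as np

def duration_targets(gaze_ms, base_ms, floor_ms=40.0):
    """Map per-prosodic-unit gaze durations (ms) to synthesis duration
    targets by rescaling a baseline duration model to the reader's rhythm.
    (Hypothetical rule: scale each unit by its relative gaze time.)"""
    gaze = np.asarray(gaze_ms, dtype=float)
    base = np.asarray(base_ms, dtype=float)
    scale = gaze / gaze.mean()              # relative dwell time per unit
    return np.maximum(base * scale, floor_ms)

# Example: three prosodic units; a longer gaze yields a longer synthesized unit.
print(duration_targets([180, 420, 260], [200, 300, 250]))
```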
ERIC Educational Resources Information Center
Tierney, Joseph; Mack, Molly
1987-01-01
Stimuli used in research on the perception of the speech signal have often been obtained from simple filtering and distortion of the speech waveform, sometimes accompanied by noise. However, for more complex stimulus generation, the parameters of speech can be manipulated, after analysis and before synthesis, using various types of algorithms to…
Speech Synthesis Applied to Language Teaching.
ERIC Educational Resources Information Center
Sherwood, Bruce
1981-01-01
The experimental addition of speech output to computer-based Esperanto lessons using speech synthesized from text is described. Because of Esperanto's phonetic spelling and simple rhythm, it is particularly easy to describe the mechanisms of Esperanto synthesis. Attention is directed to how the text-to-speech conversion is performed and the ways…
A Program Structure for Event-Based Speech Synthesis by Rules within a Flexible Segmental Framework.
ERIC Educational Resources Information Center
Hill, David R.
1978-01-01
A program structure based on recently developed techniques for operating system simulation has the required flexibility for use as a speech synthesis algorithm research framework. This program makes synthesis possible with less rigid time and frequency-component structure than simpler schemes. It also meets real-time operation and memory-size…
Speech transformations based on a sinusoidal representation
NASA Astrophysics Data System (ADS)
Quatieri, T. E.; McAulay, R. J.
1986-05-01
A new speech analysis/synthesis technique is presented which provides the basis for a general class of speech transformations including time-scale modification, frequency scaling, and pitch modification. These modifications can be performed with a time-varying change, permitting continuous adjustment of a speaker's fundamental frequency and rate of articulation. The method is based on a sinusoidal representation of the speech production mechanism that has been shown to produce synthetic speech that preserves the waveform shape and is essentially perceptually indistinguishable from the original. Although the analysis/synthesis system was originally designed for single-speaker signals, it is equally capable of recovering and modifying nonspeech signals such as music, multiple speakers, marine biological sounds, and speakers in the presence of interference such as noise and musical backgrounds.
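A toy illustration of the sinusoidal representation and of time-scale modification with it (not the authors' exact system): speech is modeled as a sum of sinusoids with slowly varying amplitudes and frequencies, and slowing speech down amounts to stretching the parameter tracks while integrating the original instantaneous frequencies, so pitch is unchanged. Sample rate, frame period, and linear interpolation of the tracks are assumed details:

```python
import numpy as np

def synth(amps, freqs, rate, fs=8000, frame_s=0.01):
    """Synthesize from per-frame sinusoid tracks (n_frames x n_sines).
    rate > 1 stretches duration; pitch is unchanged because instantaneous
    frequencies are preserved while their time axis is stretched."""
    n = int(len(amps) * frame_s * rate * fs)      # output length in samples
    t = np.linspace(0.0, 1.0, n)                  # normalized output time
    src = np.linspace(0.0, 1.0, len(amps))        # normalized frame positions
    out = np.zeros(n)
    for k in range(amps.shape[1]):
        a = np.interp(t, src, amps[:, k])         # stretched amplitude track
        f = np.interp(t, src, freqs[:, k])        # stretched frequency track
        phase = 2 * np.pi * np.cumsum(f) / fs     # integrate instantaneous freq.
        out += a * np.cos(phase)
    return out

# Two-partial toy "vowel", slowed to 1.5x duration at the same pitch.
amps = np.ones((50, 2)) * [1.0, 0.5]
freqs = np.ones((50, 2)) * [120.0, 240.0]
y = synth(amps, freqs, rate=1.5)
```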
Accurate visible speech synthesis based on concatenating variable length motion capture data.
Ma, Jiyong; Cole, Ron; Pellom, Bryan; Ward, Wayne; Wise, Barbara
2006-01-01
We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long-distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated, and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is described. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective and objective evaluations were conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
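A schematic of the unit-selection search such systems rely on: dynamic programming over candidate units per target slot, trading a target cost (match to the requested context) against a join cost (smoothness between consecutive units). Both cost functions and the toy candidates below are placeholders, not the paper's actual costs:

```python
import numpy as np

def select_units(candidates, target_cost, join_cost):
    """Viterbi search over per-slot candidate unit lists:
    cost[t][j] = target_cost(j) + min_i(cost[t-1][i] + join_cost(i, j))."""
    T = len(candidates)
    cost = [np.array([target_cost(u, t) for u in candidates[t]]) for t in range(T)]
    back = [None] * T
    for t in range(1, T):
        trans = np.array([[join_cost(a, b) for a in candidates[t - 1]]
                          for b in candidates[t]])      # shape (n_t, n_{t-1})
        tot = trans + cost[t - 1][None, :]
        back[t] = tot.argmin(axis=1)
        cost[t] = cost[t] + tot.min(axis=1)
    path = [int(np.argmin(cost[-1]))]                   # best final unit
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))             # follow backpointers
    return path[::-1]

# Toy usage: 3 slots, 2 candidate "units" each, placeholder costs.
cands = [[np.array([0.0]), np.array([1.0])] for _ in range(3)]
best = select_units(cands,
                    target_cost=lambda u, t: float(abs(u[0] - t % 2)),
                    join_cost=lambda a, b: float(abs(a[0] - b[0])))
```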
Expressive facial animation synthesis by learning speech coarticulation and expression spaces.
Deng, Zhigang; Neumann, Ulrich; Lewis, J P; Kim, Tae-Yong; Bulut, Murtaza; Narayanan, Shrikanth
2006-01-01
Synthesizing expressive facial animation is a very challenging topic within the graphics community. In this paper, we present an expressive facial animation synthesis system enabled by automated learning from facial motion capture data. Accurate 3D motions of the markers on the face of a human subject are captured while he/she recites a predesigned corpus, with specific spoken and visual expressions. We present a novel motion capture mining technique that "learns" speech coarticulation models for diphones and triphones from the recorded data. A Phoneme-Independent Expression Eigenspace (PIEES) that encloses the dynamic expression signals is constructed by motion signal processing (phoneme-based time-warping and subtraction) and Principal Component Analysis (PCA) reduction. New expressive facial animations are synthesized as follows: First, the learned coarticulation models are concatenated to synthesize neutral visual speech according to novel speech input, then a texture-synthesis-based approach is used to generate a novel dynamic expression signal from the PIEES model, and finally the synthesized expression signal is blended with the synthesized neutral visual speech to create the final expressive facial animation. Our experiments demonstrate that the system can effectively synthesize realistic expressive facial animation.
Department of Cybernetic Acoustics
NASA Astrophysics Data System (ADS)
The development of the theory, instrumentation, and applications of methods and systems for the measurement, analysis, processing, and synthesis of acoustic signals within the audio frequency range is discussed, particularly of the speech signal and of the vibro-acoustic signals emitted by technical and industrial equipment treated as noise and vibration sources. The research work, both theoretical and experimental, aims at applications in various branches of science and medicine, such as: acoustical diagnostics and phoniatric rehabilitation of pathological and postoperative states of the speech organ; bilateral "man-machine" speech communication based on the analysis, recognition, and synthesis of the speech signal; and vibro-acoustical diagnostics and continuous monitoring of the state of machines, technical equipment, and technological processes.
Fifty years of progress in acoustic phonetics
NASA Astrophysics Data System (ADS)
Stevens, Kenneth N.
2004-10-01
Three events that occurred 50 or 60 years ago shaped the study of acoustic phonetics, and in the following few decades these events influenced research and applications in speech disorders, speech development, speech synthesis, speech recognition, and other subareas in speech communication. These events were: (1) the source-filter theory of speech production (Chiba and Kajiyama; Fant); (2) the development of the sound spectrograph and its interpretation (Potter, Kopp, and Green; Joos); and (3) the birth of research that related distinctive features to acoustic patterns (Jakobson, Fant, and Halle). Following these events there has been systematic exploration of the articulatory, acoustic, and perceptual bases of phonological categories, and some quantification of the sources of variability in the transformation of this phonological representation of speech into its acoustic manifestations. This effort has been enhanced by studies of how children acquire language in spite of this variability and by research on speech disorders. Gaps in our knowledge of this inherent variability in speech have limited the directions of applications such as synthesis and recognition of speech, and have led to the implementation of data-driven techniques rather than theoretical principles. Some examples of advances in our knowledge, and limitations of this knowledge, are reviewed.
NASA Technical Reports Server (NTRS)
Walklet, T.
1981-01-01
The feasibility of a miniature versatile portable speech prosthesis (VPSP) was analyzed, and information on its potential users and on other similar devices was collected. The VPSP is a computer-based speech synthesis system that fits on a wheelchair. The objective was to provide sufficient information to decide whether there is valuable technology to contribute to the miniaturization of the VPSP. The needs of potential users are identified, and the development status of technologies similar or related to those used in the VPSP is evaluated. The purpose was to produce a device that provides communication assistance in educational, vocational, and social situations to speech-impaired individuals. The VPSP is expected to be a valuable aid for persons who are also motor impaired, which explains the placement of the system on a wheelchair.
Neural network based speech synthesizer: A preliminary report
NASA Technical Reports Server (NTRS)
Villarreal, James A.; Mcintire, Gary
1987-01-01
A neural net based speech synthesis project is discussed. The novelty is that the reproduced speech was extracted from actual voice recordings. In essence, the neural network learns the timing, pitch fluctuations, connectivity between individual sounds, and speaking habits unique to that individual person. The parallel distributed processing network used for this project is the generalized backward propagation network which has been modified to also learn sequences of actions or states given in a particular plan.
Infants’ brain responses to speech suggest Analysis by Synthesis
Kuhl, Patricia K.; Ramírez, Rey R.; Bosseler, Alexis; Lin, Jo-Fu Lotus; Imada, Toshiaki
2014-01-01
Historic theories of speech perception (Motor Theory and Analysis by Synthesis) invoked listeners’ knowledge of speech production to explain speech perception. Neuroimaging data show that adult listeners activate motor brain areas during speech perception. In two experiments using magnetoencephalography (MEG), we investigated motor brain activation, as well as auditory brain activation, during discrimination of native and nonnative syllables in infants at two ages that straddle the developmental transition from language-universal to language-specific speech perception. Adults are also tested in Exp. 1. MEG data revealed that 7-mo-old infants activate auditory (superior temporal) as well as motor brain areas (Broca’s area, cerebellum) in response to speech, and equivalently for native and nonnative syllables. However, in 11- and 12-mo-old infants, native speech activates auditory brain areas to a greater degree than nonnative, whereas nonnative speech activates motor brain areas to a greater degree than native speech. This double dissociation in 11- to 12-mo-old infants matches the pattern of results obtained in adult listeners. Our infant data are consistent with Analysis by Synthesis: auditory analysis of speech is coupled with synthesis of the motor plans necessary to produce the speech signal. The findings have implications for: (i) perception-action theories of speech perception, (ii) the impact of “motherese” on early language learning, and (iii) the “social-gating” hypothesis and humans’ development of social understanding. PMID:25024207
Visually Impaired Persons' Comprehension of Text Presented with Speech Synthesis.
ERIC Educational Resources Information Center
Hjelmquist, E.; And Others
1992-01-01
This study of 48 individuals with visual impairments (16 middle-aged with experience in synthetic speech, 16 middle-aged inexperienced, and 16 older inexperienced) found that speech synthesis, compared to natural speech, generally yielded lower results with respect to memory and understanding of texts. Experience had no effect on performance.…
NASA Astrophysics Data System (ADS)
Jelinek, H. J.
1986-01-01
This is the Final Report of Electronic Design Associates on its Phase I SBIR project. The purpose of the project was to develop a method for correcting helium speech, as experienced in diver-to-surface communication. The goal of the Phase I study was to design, prototype, and evaluate a real-time helium speech corrector system based upon digital signal processing techniques. The general approach was to develop hardware (an IBM PC board) to digitize helium speech and software (a LAMBDA-computer-based simulation) to translate the speech. As planned in the study proposal, this initial prototype may now be used to assess the expected performance of a self-contained real-time system that uses an identical algorithm. The Final Report details the work carried out to produce the prototype system, in four major tasks: a signal processing scheme for converting helium speech to normal-sounding speech was devised; the scheme was simulated on a general-purpose (LAMBDA) computer, with actual helium speech supplied to the simulation and converted speech generated; an IBM PC-based 14-bit data input/output board was designed and built; and a bibliography of references on speech processing was compiled.
Alternative Speech Communication System for Persons with Severe Speech Disorders
NASA Astrophysics Data System (ADS)
Selouani, Sid-Ahmed; Sidi Yakoub, Mohammed; O'Shaughnessy, Douglas
2009-12-01
Assistive speech-enabled systems are proposed to help both French- and English-speaking persons with various speech disorders. The proposed assistive systems use automatic speech recognition (ASR) and speech synthesis in order to enhance the quality of communication. These systems aim at improving the intelligibility of pathological speech, making it as natural as possible and close to the original voice of the speaker. The resynthesized utterances use new basic units, a new concatenation algorithm, and a grafting technique to correct poorly pronounced phonemes. The ASR responses are uttered by the new speech synthesis system in order to convey an intelligible message to listeners. Experiments involving four American speakers with severe dysarthria and two Acadian French speakers with sound substitution disorders (SSDs) are carried out to demonstrate the efficiency of the proposed methods. Improvements in the Perceptual Evaluation of Speech Quality (PESQ) value of 5% and of more than 20% are achieved by the speech synthesis systems that deal with SSDs and dysarthria, respectively.
Articulatory speech synthesis and speech production modelling
NASA Astrophysics Data System (ADS)
Huang, Jun
This dissertation addresses the problem of speech synthesis and speech production modelling based on the fundamental principles of human speech production. Unlike the conventional source-filter model, which assumes the independence of the excitation and the acoustic filter, we treat the entire vocal apparatus as one system consisting of a fluid dynamic aspect and a mechanical part. We model the vocal tract by a three-dimensional moving geometry. We also model the sound propagation inside the vocal apparatus as a three-dimensional nonplane-wave propagation inside a viscous fluid described by Navier-Stokes equations. In our work, we first propose a combined minimum energy and minimum jerk criterion to estimate the dynamic vocal tract movements during speech production. Both theoretical error bound analysis and experimental results show that this method can achieve a very close match at the target points and at the same time avoid abrupt changes in the articulatory trajectory. Second, a mechanical vocal fold model is used to compute the excitation signal of the vocal tract. The advantage of this model is that it is closely coupled with the vocal tract system based on fundamental aerodynamics. As a result, we can obtain an excitation signal with much more detail than the conventional parametric vocal fold excitation model. Furthermore, strong evidence of source-tract interaction is observed. Finally, we propose a computational model of the fricative and stop types of sounds based on the physical principles of speech production. The advantage of this model is that it uses an exogenous process to model the additional nonsteady and nonlinear effects due to the flow mode, which are ignored by the conventional source-filter speech production model. A recursive algorithm is used to estimate the model parameters. Experimental results show that this model is able to synthesize good quality fricative and stop types of sounds. Based on our dissertation work, we carefully argue that the articulatory speech production model has the potential to flexibly synthesize natural-quality speech sounds and to provide a compact computational model for speech production that can be beneficial to a wide range of areas in speech signal processing.
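The combined minimum-energy and minimum-jerk criterion suggests articulator trajectories that are maximally smooth while still hitting their targets. A common closed-form building block is the quintic minimum-jerk segment; chaining such segments with rest conditions at each target (a simplifying assumption, since the dissertation optimizes a combined criterion over the whole trajectory) gives the flavor of the approach:

```python
import numpy as np

def min_jerk_segment(x0, x1, n):
    """Quintic minimum-jerk interpolation from x0 to x1 over n samples,
    with zero velocity and acceleration at both endpoints."""
    s = np.linspace(0.0, 1.0, n)
    h = 10 * s**3 - 15 * s**4 + 6 * s**5      # minimum-jerk time profile
    return x0 + (x1 - x0) * h

def trajectory(targets, n_per_seg=100):
    """Chain minimum-jerk segments through successive articulatory targets."""
    segs = [min_jerk_segment(a, b, n_per_seg)
            for a, b in zip(targets[:-1], targets[1:])]
    return np.concatenate(segs)

# Hypothetical tongue-body height targets for three successive phones.
traj = trajectory(np.array([0.0, 1.2, 0.4]))
```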
Analysis and synthesis of laughter
NASA Astrophysics Data System (ADS)
Sundaram, Shiva; Narayanan, Shrikanth
2004-10-01
There is much enthusiasm in the text-to-speech community for the synthesis of emotional and natural speech. One idea being proposed is to include emotion-dependent paralinguistic cues during synthesis to convey emotions effectively. This requires modeling and synthesis techniques for the various cues of different emotions. Motivated by this, a technique to synthesize human laughter is proposed. Laughter is a complex mechanism of expression and has high variability in terms of types and usage in human-human communication. People have their own characteristic way of laughing. Laughter can be seen as a controlled/uncontrolled physiological process of a person resulting from an initial excitation in context. A parametric model based on damped simple harmonic motion is developed here to effectively capture these diversities and also maintain the individual's characteristics. Limited laughter/speech data from actual humans and ease of synthesis are the constraints imposed on the accuracy of the model. Analysis techniques are also developed to determine the parameters of the model for a given individual or laughter type. Finally, the effectiveness of the model in capturing individual characteristics and naturalness compared to real human laughter has been analyzed. Through this, the factors involved in individual human laughter and their importance can be better understood.
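The damped-simple-harmonic-motion idea can be caricatured in a few lines: each laugh call gets an amplitude envelope that is a damped oscillation over a voiced source, and a bout is a train of such calls with slightly declining pitch. Every constant below is invented for illustration, not taken from the paper's model:

```python
import numpy as np

def laugh_bout(fs=16000, n_calls=4, f0=240.0, call_s=0.15, gap_s=0.10, zeta=14.0):
    """Toy laughter: each call's amplitude envelope is a damped oscillation
    exp(-zeta*t)*sin(pi*t/call_s) over a sawtooth voice source; successive
    calls drop slightly in pitch ("ha-ha-ha-ha")."""
    out = []
    for k in range(n_calls):
        t = np.arange(int(call_s * fs)) / fs
        env = np.exp(-zeta * t) * np.sin(np.pi * t / call_s)  # damped rise-fall
        f = f0 * 0.95**k                                      # declining pitch
        src = 2 * ((f * t) % 1.0) - 1.0                       # sawtooth source
        out.append(env * src)
        out.append(np.zeros(int(gap_s * fs)))                 # inter-call pause
    return np.concatenate(out)

y = laugh_bout()   # write to a WAV file or play back to inspect
```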
Key considerations in designing a speech brain-computer interface.
Bocquelet, Florent; Hueber, Thomas; Girin, Laurent; Chabardès, Stéphan; Yvert, Blaise
2016-11-01
Restoring communication in case of aphasia is a key challenge for neurotechnologies. To this end, brain-computer strategies can be envisioned to allow artificial speech synthesis from the continuous decoding of neural signals underlying speech imagination. Such speech brain-computer interfaces do not exist yet and their design should consider three key choices that need to be made: the choice of appropriate brain regions to record neural activity from, the choice of an appropriate recording technique, and the choice of a neural decoding scheme in association with an appropriate speech synthesis method. These key considerations are discussed here in light of (1) the current understanding of the functional neuroanatomy of cortical areas underlying overt and covert speech production, (2) the available literature making use of a variety of brain recording techniques to better characterize and address the challenge of decoding cortical speech signals, and (3) the different speech synthesis approaches that can be considered depending on the level of speech representation (phonetic, acoustic or articulatory) envisioned to be decoded at the core of a speech BCI paradigm.
Synthesis of Speaker Facial Movement to Match Selected Speech Sequences
NASA Technical Reports Server (NTRS)
Scott, K. C.; Kagels, D. S.; Watson, S. H.; Rom, H.; Wright, J. R.; Lee, M.; Hussey, K. J.
1994-01-01
A system is described which allows for the synthesis of a video sequence of a realistic-appearing talking human head. A phonic based approach is used to describe facial motion; image processing rather than physical modeling techniques are used to create video frames.
Optimal pattern synthesis for speech recognition based on principal component analysis
NASA Astrophysics Data System (ADS)
Korsun, O. N.; Poliyev, A. V.
2018-02-01
An algorithm for building an optimal pattern for automatic speech recognition, which increases the probability of correct recognition, is developed and presented in this work. The optimal pattern formation is based on the decomposition of an initial pattern into principal components, which makes it possible to reduce the dimension of the multi-parameter optimization problem. At the next step, training samples are introduced and optimal estimates of the principal-component decomposition coefficients are obtained by a numerical parameter optimization algorithm. Finally, we consider experimental results that show the improvement in speech recognition introduced by the proposed optimization algorithm.
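A rough sketch of the scheme described above: the pattern is decomposed into a few principal components, and only the low-dimensional coefficients are tuned numerically against training samples, which is what shrinks the multi-parameter optimization. The least-squares objective and gradient step below stand in for the paper's recognition-probability criterion:

```python
import numpy as np

def pca_basis(X, k):
    """Mean and top-k principal directions of training patterns X (n x dim)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def optimal_pattern(X, k=3, steps=200, lr=0.1):
    """Tune k principal-component coefficients so the reconstructed pattern
    is close to all training samples of the class (a least-squares proxy
    for maximizing the probability of correct recognition)."""
    mu, V = pca_basis(X, k)
    c = np.zeros(k)
    for _ in range(steps):
        recon = mu + c @ V
        grad = 2 * (recon - X).mean(axis=0) @ V.T   # d(mean sq. error)/dc
        c -= lr * grad
    return mu + c @ V

X = np.random.default_rng(0).normal(size=(40, 64))  # stand-in feature vectors
pattern = optimal_pattern(X)
```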
Library Automation Design for Visually Impaired People
ERIC Educational Resources Information Center
Yurtay, Nilufer; Bicil, Yucel; Celebi, Sait; Cit, Guluzar; Dural, Deniz
2011-01-01
Speech synthesis is a technology used in many different areas in computer science. This technology can bring a solution to reading activity of visually impaired people due to its text to speech conversion. Based on this problem, in this study, a system is designed needed for a visually impaired person to make use of all the library facilities in…
Implementation of Three Text to Speech Systems for Kurdish Language
NASA Astrophysics Data System (ADS)
Bahrampour, Anvar; Barkhoda, Wafa; Azami, Bahram Zahir
Nowadays, the concatenative method is used in most modern TTS systems to produce artificial speech. The most important challenge in this method is choosing an appropriate unit for creating the database. This unit must guarantee smooth, high-quality speech, and creating a database for it must be reasonable and inexpensive. For example, syllables, phonemes, allophones, and diphones are appropriate units for general-purpose systems. In this paper, we implement three synthesis systems for the Kurdish language, based on the syllable, the allophone, and the diphone, and compare their quality using subjective testing.
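A minimal illustration of the concatenation step common to all three unit choices: units pulled from the recorded database are joined with a short linear crossfade to smooth the joins. Real diphone systems also adjust pitch and duration at the joins (e.g., with PSOLA-style processing), which is omitted here; the database lookup in the comment is hypothetical:

```python
import numpy as np

def concatenate_units(units, fs=16000, xfade_ms=10.0):
    """Join recorded unit waveforms with a linear crossfade at each boundary
    (units must be longer than the crossfade window)."""
    n = int(fs * xfade_ms / 1000)
    out = units[0]
    for u in units[1:]:
        fade = np.linspace(0.0, 1.0, n)
        overlap = out[-n:] * (1 - fade) + u[:n] * fade   # smooth the join
        out = np.concatenate([out[:-n], overlap, u[n:]])
    return out

# e.g. units = [db["s-a"], db["a-l"], ...] for a diphone database `db`
# (hypothetical lookup keyed by the diphones covering the input text)
```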
ERIC Educational Resources Information Center
Flowers, Arthur; Crandell, Edwin W.
Three auditory perceptual processes (resistance to distortion, selective listening in the form of auditory dedifferentiation, and binaural synthesis) were evaluated by five assessment techniques: (1) low pass filtered speech, (2) accelerated speech, (3) competing messages, (4) accelerated plus competing messages, and (5) binaural synthesis.…
Multilevel Analysis in Analyzing Speech Data
ERIC Educational Resources Information Center
Guddattu, Vasudeva; Krishna, Y.
2011-01-01
The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…
Design and performance of an analysis-by-synthesis class of predictive speech coders
NASA Technical Reports Server (NTRS)
Rose, Richard C.; Barnwell, Thomas P., III
1990-01-01
The performance of a broad class of analysis-by-synthesis linear predictive speech coders is quantified experimentally. The class of coders includes a number of well-known techniques as well as a very large number of speech coders which have not been named or studied. A general formulation for deriving the parametric representation used in all of the coders in the class is presented. A new coder, named the self-excited vocoder, is discussed because of its good performance with low complexity, and because of the insight this coder gives to analysis-by-synthesis coders in general. The results of a study comparing the performances of different members of this class are presented. The study takes the form of a series of formal subjective and objective speech quality tests performed on selected coders. The results of this study lead to some interesting and important observations concerning the controlling parameters for analysis-by-synthesis speech coders.
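A skeletal analysis-by-synthesis loop of the kind shared by this class of coders: each candidate excitation is passed through the LPC synthesis filter and the candidate minimizing the reconstruction error is kept. Real coders minimize a perceptually weighted error and use structured codebooks; the toy coefficients and random codebook below are invented for illustration:

```python
import numpy as np

def synth_filter(exc, a):
    """All-pole LPC synthesis filter 1/A(z): y[n] = exc[n] - sum_k a[k]*y[n-k]."""
    y = np.zeros(len(exc))
    for n in range(len(exc)):
        acc = exc[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y

def abs_search(target, a, codebook):
    """Analysis-by-synthesis: synthesize every candidate excitation and keep
    the one minimizing the (unweighted) waveform error against the target."""
    errs = [np.sum((target - synth_filter(c, a)) ** 2) for c in codebook]
    best = int(np.argmin(errs))
    return best, codebook[best]

rng = np.random.default_rng(1)
a = np.array([-1.2, 0.8])                    # toy stable LPC coefficients
codebook = rng.normal(size=(16, 80))         # 16 stochastic excitation vectors
target = synth_filter(codebook[3], a)        # a frame we can match exactly
idx, exc = abs_search(target, a, codebook)   # finds idx == 3
```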
Implicit prosody mining based on the human eye image capture technology
NASA Astrophysics Data System (ADS)
Gao, Pei-pei; Liu, Feng
2013-08-01
Eye-tracking technology has become one of the main methods of analyzing recognition issues in human-computer interaction, and human eye image capture is the key problem in eye tracking. Based on further research, a new human-computer interaction method is introduced to enrich the forms of synthetic speech. We propose a method of Implicit Prosody mining based on human eye image capture technology: parameters are extracted from images of the eyes during reading to control and drive prosody generation in speech synthesis, establishing a prosodic model with high simulation accuracy. The duration model is a key issue for prosody generation. For the duration model, this paper puts forward a new idea for obtaining the gaze duration of the eyes during reading based on eye image capture, and for synchronously controlling this duration and the pronunciation duration in speech synthesis. The movement of the eyes during reading is a comprehensive multi-factor interactive process involving gaze, saccades, and regressions. Therefore, how to extract the appropriate information from eye images must be considered, and the gaze regularities of the eyes must be obtained as references for modeling. Based on an analysis of three current eye-movement control models and of the characteristics of Implicit Prosody reading, the relative independence between the text speech-processing system and the eye-movement control system is discussed. It is shown that, under the same text-familiarity condition, the gaze duration during reading and the internal voice pronunciation duration are synchronous. An eye gaze duration model based on the hierarchical prosodic structure of Chinese is presented to replace previous methods of machine learning and probability forecasting, to obtain readers' real internal reading rhythm, and to synthesize voice with personalized rhythm. This research enriches the forms of human-computer interaction and has practical significance and application prospects for assisted speech interaction for the disabled. Experiments show that Implicit Prosody mining based on human eye image capture technology gives the synthesized speech more flexible expression.
Speech Analysis and Synthesis Based on Pitch-Synchronous Segmentation of the Speech Waveform.
1994-11-09
Nasality distinguishes /n/ from /d/, /m/ from /b/, etc.; sustention distinguishes /f/ from /p/, /b/ from /v/, etc. … Step 1: The first incoming pitch waveform becomes the first template and is stored in memory. Step 2: The amplitude spectrum of the next incoming pitch waveform …
A hardware experimental platform for neural circuits in the auditory cortex
NASA Astrophysics Data System (ADS)
Rodellar-Biarge, Victoria; García-Dominguez, Pablo; Ruiz-Rizaldos, Yago; Gómez-Vilda, Pedro
2011-05-01
Speech processing in the human brain is a very complex process that is far from fully understood, although much progress has been made recently. Neuromorphic speech processing is a new research orientation in the bio-inspired systems approach, seeking solutions to the automatic treatment of specific problems (recognition, synthesis, segmentation, diarization, etc.) which cannot be adequately solved using classical algorithms. In this paper a neuromorphic speech processing architecture is presented. The systematic bottom-up synthesis of layered structures reproduces the dynamic feature detection of speech by plausible neural circuits that work as interpretation centres located in the auditory cortex. The elementary model is based on Hebbian neuron-like units. For the computation of the architecture, a flexible framework is proposed in the Matlab®/Simulink®/HDL environment, which allows building models at different description styles, complexity, and implementation levels. It provides a flexible platform for experimenting on how the number of neurons and interconnections influences the precision of the results and the evaluated performance. Experimentation with different architecture configurations may help both in better understanding how neural circuits may work in the brain and in seeing how speech processing can benefit from that understanding.
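The "Hebbian neuron-like units" mentioned above follow the classic rule that a connection strengthens with correlated pre- and post-synaptic activity. A minimal sketch, assuming Oja's normalized variant (a standard stabilization choice, not necessarily the paper's):

```python
import numpy as np

def oja_update(w, x, lr=0.01):
    """Hebbian learning with Oja's normalization: the weight vector grows
    along directions where input x and output y = w.x are correlated,
    without diverging."""
    y = w @ x
    return w + lr * y * (x - y * w)

rng = np.random.default_rng(0)
w = rng.normal(size=8)
w /= np.linalg.norm(w)
for _ in range(2000):                 # stream of feature frames
    x = rng.normal(size=8)
    x[0] += 2.0 * rng.normal()        # extra variance on dimension 0
    w = oja_update(w, x)
# w converges toward the input's dominant principal direction
```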
THE COMPREHENSION OF RAPID SPEECH BY THE BLIND, PART III.
ERIC Educational Resources Information Center
FOULKE, EMERSON
A review of the research on the comprehension of rapid speech by the blind identifies five methods of speech compression: speech changing, electromechanical sampling, computer sampling, speech synthesis, and frequency dividing with the harmonic compressor. The speech changing and electromechanical sampling methods and the necessary apparatus have…
Real-Time Control of an Articulatory-Based Speech Synthesizer for Brain Computer Interfaces
Bocquelet, Florent; Hueber, Thomas; Girin, Laurent; Savariaux, Christophe; Yvert, Blaise
2016-01-01
Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real-time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real-time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real-time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the position of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained, as assessed by perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open the way to future speech BCI applications using such an articulatory-based speech synthesizer. PMID:27880768
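A skeletal stand-in for the articulatory-to-acoustic mapping described above: a small feed-forward network regressing acoustic parameter frames from EMA sensor coordinates. The layer sizes, tanh activation, and plain SGD are assumptions for illustration; the paper's actual DNN architecture is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 18, 64, 25   # e.g. 6 EMA sensors (x,y,z) -> 25 acoustic params
W1 = rng.normal(0, 0.1, (d_in, d_h)); b1 = np.zeros(d_h)
W2 = rng.normal(0, 0.1, (d_h, d_out)); b2 = np.zeros(d_out)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

def train_step(X, Y, lr=1e-2):
    """One SGD step on squared error between predicted and target frames."""
    global W1, b1, W2, b2
    H, P = forward(X)
    G = 2 * (P - Y) / len(X)                 # dLoss/dP
    gW2, gb2 = H.T @ G, G.sum(axis=0)
    GH = (G @ W2.T) * (1 - H**2)             # backprop through tanh
    gW1, gb1 = X.T @ GH, GH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

X = rng.normal(size=(256, d_in))             # stand-in EMA frames
Y = rng.normal(size=(256, d_out))            # stand-in acoustic targets
for _ in range(100):
    train_step(X, Y)
```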
Research in speech communication.
Flanagan, J
1995-10-24
Advances in digital speech processing are now supporting application and deployment of a variety of speech technologies for human/machine communication. In fact, new businesses are rapidly forming about these technologies. But these capabilities are of little use unless society can afford them. Happily, explosive advances in microelectronics over the past two decades have assured affordable access to this sophistication as well as to the underlying computing technology. The research challenges in speech processing remain in the traditionally identified areas of recognition, synthesis, and coding. These three areas have typically been addressed individually, often with significant isolation among the efforts. But they are all facets of the same fundamental issue--how to represent and quantify the information in the speech signal. This implies deeper understanding of the physics of speech production, the constraints that the conventions of language impose, and the mechanism for information processing in the auditory system. In ongoing research, therefore, we seek more accurate models of speech generation, better computational formulations of language, and realistic perceptual guides for speech processing--along with ways to coalesce the fundamental issues of recognition, synthesis, and coding. Successful solution will yield the long-sought dictation machine, high-quality synthesis from text, and the ultimate in low bit-rate transmission of speech. It will also open the door to language-translating telephony, where the synthetic foreign translation can be in the voice of the originating talker.
Implementation of the Intelligent Voice System for Kazakh
NASA Astrophysics Data System (ADS)
Yessenbayev, Zh; Saparkhojayev, N.; Tibeyev, T.
2014-04-01
Modern speech technologies are highly advanced and widely used in day-to-day applications. However, this mostly concerns the languages of well-developed countries, such as English, German, Japanese, and Russian. For Kazakh, the situation is less prominent, and research in this field is only starting to evolve. In this research- and application-oriented project, we introduce an intelligent voice system for the fast deployment of call centers and information desks supporting Kazakh speech. The demand for such a system is obvious if the country's large size and small population are considered: landline and cell phones are the only means of communication for distant villages and suburbs. The system features Kazakh speech recognition and synthesis modules as well as a web GUI for efficient dialog management. For speech recognition we use the CMU Sphinx engine, and for speech synthesis MaryTTS. The web GUI is implemented in Java, enabling operators to quickly create and manage dialogs in a user-friendly graphical environment. Call routines are handled by Asterisk PBX and JBoss Application Server. The system supports such technologies and protocols as VoIP, VoiceXML, FastAGI, Java Speech API, and J2EE. For the speech recognition experiments we compiled and used the first Kazakh speech corpus, with utterances from 169 native speakers. The performance of the speech recognizer is 4.1% WER on an isolated-word recognition task and 6.9% WER on a clean continuous speech recognition task. The speech synthesis experiments include the training of male and female voices.
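The WER figures quoted above are the standard edit-distance measure: the minimum number of word substitutions, deletions, and insertions needed to turn the recognizer output into the reference, divided by the reference length. A minimal implementation (the example words are arbitrary):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                  # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                  # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(r)

print(wer("one two three four", "one two four"))   # 1 deletion / 4 words = 0.25
```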
Voice-processing technologies--their application in telecommunications.
Wilpon, J G
1995-01-01
As the telecommunications industry evolves over the next decade to provide the products and services that people will desire, several key technologies will become commonplace. Two of these, automatic speech recognition and text-to-speech synthesis, will provide users with more freedom on when, where, and how they access information. While these technologies are currently in their infancy, their capabilities are rapidly increasing and their deployment in today's telephone network is expanding. The economic impact of just one application, the automation of operator services, is well over $100 million per year. Yet there still are many technical challenges that must be resolved before these technologies can be deployed ubiquitously in products and services throughout the worldwide telephone network. These challenges include: (i) High level of accuracy. The technology must be perceived by the user as highly accurate, robust, and reliable. (ii) Easy to use. Speech is only one of several possible input/output modalities for conveying information between a human and a machine, much like a computer terminal or Touch-Tone pad on a telephone. It is not the final product. Therefore, speech technologies must be hidden from the user. That is, the burden of using the technology must be on the technology itself. (iii) Quick prototyping and development of new products and services. The technology must support the creation of new products and services based on speech in an efficient and timely fashion. In this paper I present a vision of the voice-processing industry with a focus on the areas with the broadest base of user penetration: speech recognition, text-to-speech synthesis, natural language processing, and speaker recognition technologies. The current and future applications of these technologies in the telecommunications industry will be examined in terms of their strengths, limitations, and the degree to which user needs have been or have yet to be met. Although noteworthy gains have been made in areas with potentially small user bases and in the more mature speech-coding technologies, these subjects are outside the scope of this paper. PMID:7479815
Processing of speech signals for physical and sensory disabilities.
Levitt, H
1995-01-01
Assistive technology involving voice communication is used primarily by people who are deaf, hard of hearing, or who have speech and/or language disabilities. It is also used to a lesser extent by people with visual or motor disabilities. A very wide range of devices has been developed for people with hearing loss. These devices can be categorized not only by the modality of stimulation [i.e., auditory, visual, tactile, or direct electrical stimulation of the auditory nerve (auditory-neural)] but also in terms of the degree of speech processing that is used. At least four such categories can be distinguished: assistive devices (a) that are not designed specifically for speech, (b) that take the average characteristics of speech into account, (c) that process articulatory or phonetic characteristics of speech, and (d) that embody some degree of automatic speech recognition. Assistive devices for people with speech and/or language disabilities typically involve some form of speech synthesis or symbol generation for severe forms of language disability. Speech synthesis is also used in text-to-speech systems for sightless persons. Other applications of assistive technology involving voice communication include voice control of wheelchairs and other devices for people with mobility disabilities. PMID:7479816
Application of speech recognition and synthesis in the general aviation cockpit
NASA Technical Reports Server (NTRS)
North, R. A.; Mountford, S. J.; Bergeron, H.
1984-01-01
Interactive speech recognition/synthesis technology is assessed as a method for the alleviation of single-pilot IFR flight workloads. Attention was given during this series of evaluations to the conditions typical of general aviation twin-engine aircraft cockpits, covering several commonly encountered IFR flight condition scenarios. The most beneficial speech command tasks are noted to be in the data retrieval domain, which would allow the pilot access to uplinked data, checklists, and performance charts. Data entry tasks also appear to benefit from this technology.
Preliminary Analysis of Automatic Speech Recognition and Synthesis Technology.
1983-05-01
… INDUSTRIAL/MILITARY SPEECH SYNTHESIS PRODUCTS … The SC-01 Speech Synthesizer contains 64 different phonemes which are accessed by a 6-bit code. … the proper sequential combinations of those … connected speech input with widely differing emotional states, diverse accents, and substantial nonperiodic background noise input. As noted previously…
Speech processing using maximum likelihood continuity mapping
Hogden, John E.
2000-01-01
Speech processing is obtained that, given a probabilistic mapping between static speech sounds and pseudo-articulator positions, allows sequences of speech sounds to be mapped to smooth sequences of pseudo-articulator positions. In addition, a method for learning a probabilistic mapping between static speech sounds and pseudo-articulator position is described. The method for learning the mapping between static speech sounds and pseudo-articulator position uses a set of training data composed only of speech sounds. The said speech processing can be applied to various speech analysis tasks, including speech recognition, speaker recognition, speech coding, speech synthesis, and voice mimicry.
NASA Technical Reports Server (NTRS)
Mcaulay, Robert J.; Quatieri, Thomas F.
1988-01-01
It has been shown that an analysis/synthesis system based on a sinusoidal representation of speech leads to synthetic speech that is essentially perceptually indistinguishable from the original. Strategies for coding the amplitudes, frequencies and phases of the sine waves have been developed that have led to a multirate coder operating at rates from 2400 to 9600 bps. The encoded speech is highly intelligible at all rates with a uniformly improving quality as the data rate is increased. A real-time fixed-point implementation has been developed using two ADSP2100 DSP chips. The methods used for coding and quantizing the sine-wave parameters for operation at the various frame rates are described.
Use of Computer Speech Technologies To Enhance Learning.
ERIC Educational Resources Information Center
Ferrell, Joe
1999-01-01
Discusses the design of an innovative learning system that uses new technologies for the man-machine interface, incorporating a combination of Automatic Speech Recognition (ASR) and Text To Speech (TTS) synthesis. Highlights include using speech technologies to mimic the attributes of the ideal tutor and design features. (AEF)
NASA Technical Reports Server (NTRS)
Wolf, Jared J.
1977-01-01
The following research was discussed: (1) speech signal processing; (2) automatic speech recognition; (3) continuous speech understanding; (4) speaker recognition; (5) speech compression; (6) subjective and objective evaluation of speech communication systems; (7) measurement of the intelligibility and quality of speech when degraded by noise or other masking stimuli; (8) speech synthesis; (9) instructional aids for second-language learning and for training of the deaf; and (10) investigation of speech correlates of psychological stress. Experimental psychology, control systems, and human factors engineering, which are often relevant to the proper design and operation of speech systems, are also described.
ERIC Educational Resources Information Center
Kjaersgaard, Poul Soren, Ed.
2002-01-01
Papers from the conference in this volume include the following: "Towards Corpus Annotation Standards--The MATE Workbench" (Laila Dybkjaer and Niels Ole Bernsen); "Danish Text-to-Speech Synthesis Based on Stored Acoustic Segments" (Charles Hoequist); "Toward a Method for the Automated Design of Semantic…
Vector/Matrix Quantization for Narrow-Bandwidth Digital Speech Compression.
1982-09-01
1. … Prediction of the Speech Wave, JASA, Vol. 50, pp. 637-655, April 1971. 2. F. Itakura and S. Saito, Analysis Synthesis Telephony Based Upon the Maximum…
Military and Government Applications of Human-Machine Communication by Voice
NASA Astrophysics Data System (ADS)
Weinstein, Clifford J.
1995-10-01
This paper describes a range of opportunities for military and government applications of human-machine communication by voice, based on visits and contacts with numerous user organizations in the United States. The applications include some that appear to be feasible by careful integration of current state-of-the-art technology and others that will require a varying mix of advances in speech technology and in integration of the technology into applications environments. Applications that are described include (1) speech recognition and synthesis for mobile command and control; (2) speech processing for a portable multifunction soldier's computer; (3) speech- and language-based technology for naval combat team tactical training; (4) speech technology for command and control on a carrier flight deck; (5) control of auxiliary systems, and alert and warning generation, in fighter aircraft and helicopters; and (6) voice check-in, report entry, and communication for law enforcement agents or special forces. A phased approach for transfer of the technology into applications is advocated, where integration of applications systems is pursued in parallel with advanced research to meet future needs.
Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
Holzrichter, J.F.; Ng, L.C.
1998-03-17
The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced, speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function for each time frame. The formation of feature vectors defining all acoustic speech units over well-defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching. 35 figs.
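The per-frame deconvolution step can be pictured in the frequency domain: given the EM-derived excitation and the recorded acoustic output for one pitch period, the vocal-tract transfer function is approximately their spectral ratio. A sketch, where the regularization constant and feature choice are assumptions rather than details from the patent:

```python
import numpy as np

def transfer_function(speech_frame, excitation_frame, eps=1e-6):
    """Per-frame vocal-tract transfer function H ~= S/E, where E is the
    EM-sensor-derived excitation spectrum and S the recorded acoustic
    output spectrum; eps regularizes near-zero bins."""
    S = np.fft.rfft(speech_frame)
    E = np.fft.rfft(excitation_frame)
    return S * np.conj(E) / (np.abs(E) ** 2 + eps)

def feature_vector(speech_frame, excitation_frame, n_keep=32):
    """One possible per-pitch-period feature: log-magnitude of H."""
    H = transfer_function(speech_frame, excitation_frame)
    return np.log(np.abs(H[:n_keep]) + 1e-12)
```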
ERIC Educational Resources Information Center
Higgins, Eleanor L.; Raskind, Marshall H.
1997-01-01
Thirty-seven college students with learning disabilities were given a reading comprehension task under the following conditions: (1) using an optical character recognition/speech synthesis system; (2) having the text read aloud by a human reader; or (3) reading silently without assistance. Findings indicated that the greater the disability, the…
Greene, Beth G; Logan, John S; Pisoni, David B
1986-03-01
We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered.
Scientific bases of human-machine communication by voice.
Schafer, R W
1995-01-01
The scientific bases for human-machine communication by voice are in the fields of psychology, linguistics, acoustics, signal processing, computer science, and integrated circuit technology. The purpose of this paper is to highlight the basic scientific and technological issues in human-machine communication by voice and to point out areas of future research opportunity. The discussion is organized around the following major issues in implementing human-machine voice communication systems: (i) hardware/software implementation of the system, (ii) speech synthesis for voice output, (iii) speech recognition and understanding for voice input, and (iv) usability factors related to how humans interact with machines. PMID:7479802
Research on Speech Perception. Progress Report No. 13.
ERIC Educational Resources Information Center
Pisoni, David B.; And Others
Summarizing research activities in 1987, this is the thirteenth annual report of research on speech perception, analysis, synthesis, and recognition conducted in the Speech Research Laboratory of the Department of Psychology at Indiana University. The report includes extended manuscripts, short reports, progress reports, and information on…
1988-05-01
[Garbled extraction of report text and figures; recoverable fragments: Figure 2, "Original limited-capacity channel model (From Broadbent, 1958)"; Figure 3, "Experimental…"; "…unlimited variety of human voices for digital recording sources. Synthesis by Analysis: analysis-synthesis methods electronically model the human voice."]
English Intonation and Computerized Speech Synthesis. Technical Report No. 287.
ERIC Educational Resources Information Center
Levine, Arvin
This work treats some of the important issues encountered in an attempt to synthesize natural sounding English speech from arbitrary written text. Details of the systems that interact in producing speech are described. The principal systems dealt with are phonology (intonation), phonetics, syntax, semantics, and text-view (discourse). Technical…
ERIC Educational Resources Information Center
Haskins Labs., New Haven, CT.
This report is one of a regular series on the status and progress of studies on the nature of speech, instrumentation for its investigation, and practical applications. The ten papers treat the following topics: speech synthesis as a tool for the study of speech production; the study of articulatory organization; phonetic perception; cardiac…
Soft context clustering for F0 modeling in HMM-based speech synthesis
NASA Astrophysics Data System (ADS)
Khorram, Soheil; Sameti, Hossein; King, Simon
2015-12-01
This paper proposes the use of a new binary decision tree, which we call a soft decision tree, to improve generalization performance compared to the conventional `hard' decision tree method that is used to cluster context-dependent model parameters in statistical parametric speech synthesis. We apply the method to improve the modeling of fundamental frequency, which is an important factor in synthesizing natural-sounding high-quality speech. Conventionally, hard decision tree-clustered hidden Markov models (HMMs) are used, in which each model parameter is assigned to a single leaf node. However, this `divide-and-conquer' approach leads to data sparsity, with the consequence that it suffers from poor generalization, meaning that it is unable to accurately predict parameters for models of unseen contexts: the hard decision tree is a weak function approximator. To alleviate this, we propose the soft decision tree, which is a binary decision tree with soft decisions at the internal nodes. In this soft clustering method, internal nodes select both their children with certain membership degrees; therefore, each node can be viewed as a fuzzy set with a context-dependent membership function. The soft decision tree improves model generalization and provides a superior function approximator because it is able to assign each context to several overlapped leaves. In order to use such a soft decision tree to predict the parameters of the HMM output probability distribution, we derive the smoothest (maximum entropy) distribution which captures all partial first-order moments and a global second-order moment of the training samples. Employing such a soft decision tree architecture with maximum entropy distributions, a novel speech synthesis system is trained using maximum likelihood (ML) parameter re-estimation and synthesis is achieved via maximum output probability parameter generation. In addition, a soft decision tree construction algorithm optimizing a log-likelihood measure is developed. Both subjective and objective evaluations were conducted and indicate a considerable improvement over the conventional method.
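To make the soft-clustering idea concrete, here is a small Python sketch (not the authors' implementation; the sigmoid gating function, tree shape, and all parameter values are invented for illustration). A context vector is routed down a binary tree whose internal nodes give both children a membership degree, so every leaf, and hence every leaf's F0 model, receives a share of the context instead of a hard assignment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class SoftNode:
    """Internal node: routes a context vector to both children with soft weights."""
    def __init__(self, w, b, left, right):
        self.w, self.b = np.asarray(w, float), float(b)  # gating hyperplane (made up)
        self.left, self.right = left, right              # children: SoftNode or leaf id

def leaf_weights(node, x, weight, out):
    """Accumulate the membership mass that context x deposits on each leaf."""
    if not isinstance(node, SoftNode):                   # reached a leaf
        out[node] = out.get(node, 0.0) + weight
        return
    m = sigmoid(node.w @ x + node.b)                     # membership of the left child
    leaf_weights(node.left, x, weight * m, out)
    leaf_weights(node.right, x, weight * (1.0 - m), out)

# Toy tree over 3 leaves (ids 0, 1, 2); contexts are 2-d feature vectors.
tree = SoftNode([1.0, -0.5], 0.1,
                SoftNode([0.3, 0.8], -0.2, 0, 1),
                2)
weights = {}
leaf_weights(tree, np.array([0.4, 1.2]), 1.0, weights)
print(weights)  # leaf weights sum to 1; F0 parameters would mix the leaf models
```

A hard decision tree corresponds to the limit where every sigmoid saturates to 0 or 1, which is exactly the data-sparsity regime the paper argues against.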
Military and government applications of human-machine communication by voice.
Weinstein, C J
1995-01-01
This paper describes a range of opportunities for military and government applications of human-machine communication by voice, based on visits and contacts with numerous user organizations in the United States. The applications include some that appear to be feasible by careful integration of current state-of-the-art technology and others that will require a varying mix of advances in speech technology and in integration of the technology into applications environments. Applications that are described include (1) speech recognition and synthesis for mobile command and control; (2) speech processing for a portable multifunction soldier's computer; (3) speech- and language-based technology for naval combat team tactical training; (4) speech technology for command and control on a carrier flight deck; (5) control of auxiliary systems, and alert and warning generation, in fighter aircraft and helicopters; and (6) voice check-in, report entry, and communication for law enforcement agents or special forces. A phased approach for transfer of the technology into applications is advocated, where integration of applications systems is pursued in parallel with advanced research to meet future needs. PMID:7479718
Coding strategies for cochlear implants under adverse environments
NASA Astrophysics Data System (ADS)
Tahmina, Qudsia
Cochlear implants are electronic prosthetic devices that restore partial hearing in patients with severe to profound hearing loss. Although most coding strategies have significantly improved the perception of speech in quiet listening conditions, there remain limitations on speech perception in adverse environments such as background noise, reverberation, and band-limited channels. We propose strategies that improve the intelligibility of speech transmitted over telephone networks, of reverberated speech, and of speech in the presence of background noise. For telephone-processed speech, we propose to examine the effects of adding low-frequency and high-frequency information to the band-limited telephone speech. Four listening conditions were designed to simulate the receiving frequency characteristics of telephone handsets. Results indicated improvement in cochlear implant and bimodal listening when telephone speech was augmented with high-frequency information, and this study therefore provides support for the design of algorithms to extend the bandwidth towards higher frequencies. The results also indicated added benefit from hearing aids for bimodal listeners in all four types of listening conditions. Speech understanding in acoustically reverberant environments is always a difficult task for hearing-impaired listeners. Reverberated sound consists of the direct sound, early reflections, and late reflections; late reflections are known to be detrimental to speech intelligibility. In this study, we propose a reverberation suppression strategy based on spectral subtraction to suppress the reverberant energies from late reflections. Results from listening tests for two reverberant conditions (RT60 = 0.3 s and 1.0 s) indicated significant improvement when stimuli were processed with the spectral subtraction strategy. The proposed strategy operates with little to no prior information about the signal and the room characteristics and can therefore potentially be implemented in real-time CI speech processors. For speech in background noise, we propose a mechanism underlying the contribution of harmonics to the benefit of electroacoustic stimulation in cochlear implants. The proposed strategy is based on harmonic modeling and uses a synthesis-driven approach to synthesize the harmonics in voiced segments of speech. Based on objective measures, results indicated improvement in speech quality. This study warrants further work into the development of algorithms to regenerate harmonics of voiced segments in the presence of noise.
GREENE, BETH G.; LOGAN, JOHN S.; PISONI, DAVID B.
2012-01-01
We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered. PMID:23225916
Research on Speech Perception. Progress Report No. 8, January 1982-December 1982.
ERIC Educational Resources Information Center
Pisoni, David B.; And Others
Summarizing research activities from January 1982 to December 1982, this is the eighth annual report of research on speech perception, analysis and synthesis conducted in the Speech Research Laboratory of the Department of Psychology at Indiana University. The report includes extended manuscripts, short reports, progress reports, and information…
Designing a Humane Multimedia Interface for the Visually Impaired.
ERIC Educational Resources Information Center
Ghaoui, Claude; Mann, M.; Ng, Eng Huat
2001-01-01
Promotes the provision of interfaces that allow users to access most of the functionality of existing graphical user interfaces (GUI) using speech. Uses the design of a speech control tool that incorporates speech recognition and synthesis into existing packaged software such as Teletext, the Internet, or a word processor. (Contains 22…
Research on Speech Perception. Progress Report No. 9, January 1983-December 1983.
ERIC Educational Resources Information Center
Pisoni, David B.; And Others
Summarizing research activities from January 1983 to December 1983, this is the ninth annual report of research on speech perception, analysis and synthesis conducted in the Speech Research Laboratory of the Department of Psychology at Indiana University. The report includes extended manuscripts, short reports, and progress reports. The report…
Assistive Software Tools for Secondary-Level Students with Literacy Difficulties
ERIC Educational Resources Information Center
Lange, Alissa A.; McPhillips, Martin; Mulhern, Gerry; Wylie, Judith
2006-01-01
The present study assessed the compensatory effectiveness of four assistive software tools (speech synthesis, spellchecker, homophone tool, and dictionary) on literacy. Secondary-level students (N = 93) with reading difficulties completed computer-based tests of literacy skills. Training on their respective software followed for those assigned to…
NASA Astrophysics Data System (ADS)
Lightstone, P. C.; Davidson, W. M.
1982-04-01
The military detection assessment laboratory houses an experimental field system which assesses different alarm indicators such as fence disturbance sensors, MILES cables, and microwave Racons. A speech synthesis board was purchased that could be interfaced, by means of a computer, to an alarm logger, making verbal acknowledgement of alarms possible. Different products and different types of voice synthesis were analyzed before a linear predictive coding device produced by Telesensory Speech Systems of Palo Alto, California, was chosen. This device, called the Speech 1000 Board, has a dedicated 8085 processor. A multiplexer card was designed, and the Sp 1000 was interfaced through the card to a Texas Instruments TMS 990/100M microcomputer. It was also necessary to design the software with the capability of recognizing and flagging an alarm on any 1 of 32 possible lines. The experimental field system was then packaged with a dc power supply, LED indicators, speakers, and switches, and deployed in the field, where it performed reliably.
Wołk, Agnieszka; Glinkowski, Wojciech
2017-01-01
People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they are dependent on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition are dependent on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer. PMID:29230254
Wołk, Krzysztof; Wołk, Agnieszka; Glinkowski, Wojciech
2017-01-01
People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they are dependent on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition are dependent on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer.
Development of a good-quality speech coder for transmission over noisy channels at 2.4 kb/s
NASA Astrophysics Data System (ADS)
Viswanathan, V. R.; Berouti, M.; Higgins, A.; Russell, W.
1982-03-01
This report describes the development, study, and experimental results of a 2.4 kb/s speech coder called the harmonic deviations (HDV) vocoder, which transmits good-quality speech over noisy channels with bit-error rates of up to 1%. The HDV coder is based on the linear predictive coding (LPC) vocoder, and it transmits additional information over and above the data transmitted by the LPC vocoder, in the form of deviations between the speech spectrum and the LPC all-pole model spectrum at a selected set of frequencies. At the receiver, the spectral deviations are used to generate the excitation signal for the all-pole synthesis filter. The report describes and compares several methods for extracting the spectral deviations from the speech signal and for encoding them. To limit the bit rate of the HDV coder to 2.4 kb/s, the report discusses several methods, including orthogonal transformation and minimum-mean-square-error scalar quantization of log area ratios, two-stage vector-scalar quantization, and variable-frame-rate transmission. The report also presents the results of speech-quality optimization of the HDV coder at 2.4 kb/s.
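As a rough illustration of "deviations between the speech spectrum and the LPC all-pole model spectrum" (a sketch with arbitrary analysis settings and a synthetic frame, not the HDV coder itself), the following Python fits an LPC envelope to one frame and reads off the dB deviations at a few selected frequency bins, which is the kind of side information such a coder would quantize and transmit.

```python
import numpy as np

def lpc(x, order):
    """LPC coefficients via the autocorrelation method (Levinson-Durbin)."""
    r = np.correlate(x, x, "full")[len(x) - 1: len(x) + order]
    a = np.zeros(order + 1); a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * np.concatenate((a[i - 1:0:-1], [1.0]))
        err *= 1.0 - k * k
    return a, np.sqrt(err)

# One frame of a crude synthetic voiced-like signal (stand-in for real speech).
fs, N, n_fft = 8000, 256, 512
t = np.arange(N) / fs
frame = np.sin(2 * np.pi * 500 * t) * np.exp(-8 * t) + 0.01 * np.random.randn(N)

a, gain = lpc(frame * np.hamming(N), order=10)
S = np.abs(np.fft.rfft(frame * np.hamming(N), n_fft))   # speech magnitude spectrum
env = gain / np.abs(np.fft.rfft(a, n_fft))              # LPC all-pole envelope
bins = [8, 16, 32, 48, 64]                              # selected frequencies (arbitrary)
deviations = 20 * np.log10((S[bins] + 1e-12) / (env[bins] + 1e-12))
print(deviations)  # dB deviations at the selected bins
```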
ERIC Educational Resources Information Center
White, Brian
2004-01-01
This paper presents a generally applicable method for characterizing subjects' hypothesis-testing behaviour based on a synthesis that extends on previous work. Beginning with a transcript of subjects' speech and videotape of their actions, a Reasoning Map is created that depicts the flow of their hypotheses, tests, predictions, results, and…
Phrase-level speech simulation with an airway modulation model of speech production
Story, Brad H.
2012-01-01
Artificial talkers and speech synthesis systems have long been used as a means of understanding both speech production and speech perception. The development of an airway modulation model is described that simulates the time-varying changes of the glottis and vocal tract, as well as acoustic wave propagation, during speech production. The result is a type of artificial talker that can be used to study various aspects of how sound is generated by humans and how that sound is perceived by a listener. The primary components of the model are introduced, and simulations of words and phrases are demonstrated. PMID:23503742
1990-05-01
[Garbled extraction of report text and a filter-bank figure; recoverable fragments: "…speech produced by these systems. Finally, perhaps the greatest recent impetus in advancing digital…"; "…in the area of speech and speaker recognition…"; "…linear up to 1000 Hz and logarithmic beyond 1000 Hz…"; Figure 2 (bandpass filter / lowpass filter diagram).]
Influences on infant speech processing: toward a new synthesis.
Werker, J F; Tees, R C
1999-01-01
To comprehend and produce language, we must be able to recognize the sound patterns of our language and the rules for how these sounds "map on" to meaning. Human infants are born with a remarkable array of perceptual sensitivities that allow them to detect the basic properties that are common to the world's languages. During the first year of life, these sensitivities undergo modification reflecting an exquisite tuning to just that phonological information that is needed to map sound to meaning in the native language. We review this transition from language-general to language-specific perceptual sensitivity that occurs during the first year of life and consider whether the changes propel the child into word learning. To account for the broad-based initial sensitivities and subsequent reorganizations, we offer an integrated transactional framework based on the notion of a specialized perceptual-motor system that has evolved to serve human speech, but which functions in concert with other developing abilities. In so doing, we highlight the links between infant speech perception, babbling, and word learning.
Military applications of automatic speech recognition and future requirements
NASA Technical Reports Server (NTRS)
Beek, Bruno; Cupples, Edward J.
1977-01-01
An updated summary of the state-of-the-art of automatic speech recognition and its relevance to military applications is provided. A number of potential systems for military applications are under development. These include: (1) digital narrowband communication systems; (2) automatic speech verification; (3) on-line cartographic processing unit; (4) word recognition for militarized tactical data system; and (5) voice recognition and synthesis for aircraft cockpit.
Voice Quality Modelling for Expressive Speech Synthesis
Socoró, Joan Claudi
2014-01-01
This paper presents the perceptual experiments that were carried out in order to validate the methodology of transforming expressive speech styles using voice quality (VoQ) parameters modelling, along with the well-known prosody (F0, duration, and energy), from a neutral style into a number of expressive ones. The main goal was to validate the usefulness of VoQ in the enhancement of expressive synthetic speech in terms of speech quality and style identification. A harmonic plus noise model (HNM) was used to modify VoQ and prosodic parameters that were extracted from an expressive speech corpus. Perception test results indicated the improvement of obtained expressive speech styles using VoQ modelling along with prosodic characteristics. PMID:24587738
NASA Astrophysics Data System (ADS)
Seresangtakul, Pusadee; Takara, Tomio
In this paper, the distinctive tones of Thai in running speech are studied. We present rules to synthesize F0 contours of Thai tones in running speech using the generative model of F0 contours. With this method, the pitch contours of Thai polysyllabic words, both disyllabic and trisyllabic, were analyzed, and coarticulation effects of Thai tones in running speech were found. Based on the analysis of the polysyllabic words using this model, rules are derived and applied to synthesize Thai polysyllabic tone sequences. We performed listening tests to evaluate the intelligibility of the rules for Thai tone generation. The average intelligibility scores were 98.8% and 96.6% for disyllabic and trisyllabic words, respectively. From these results, the tone-generation rules were shown to be effective. Furthermore, we constructed connecting rules to synthesize suprasegmental F0 contours using the trisyllable training rules' parameters: the parameters of the first, the third, and the second syllables were assigned to the initial, the final, and the remaining syllables in a sentence, respectively. Even with such simple rules, the synthesized phrases and sentences were completely identified in listening tests. The mean opinion score (MOS) was 3.50, while the original and analysis/synthesis samples scored 4.82 and 3.59, respectively.
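The "generative model of F0 contours" in this line of work is commonly Fujisaki's command-response model. The Python sketch below uses that formulation with invented command timings and amplitudes (not the paper's trained parameters): phrase and tone commands pass through smooth impulse and step responses and are superposed on a baseline in the log-F0 domain, which is how per-syllable tone shapes and their coarticulation can be composed.

```python
import numpy as np

def Gp(t, alpha=3.0):
    """Phrase-command impulse response."""
    t = np.maximum(t, 0.0)
    return alpha**2 * t * np.exp(-alpha * t) * (t > 0)

def Ga(t, beta=20.0, ceil=0.9):
    """Tone/accent-command step response, clipped at a ceiling."""
    t = np.maximum(t, 0.0)
    return np.minimum(1.0 - (1.0 + beta * t) * np.exp(-beta * t), ceil) * (t > 0)

fs = 100                                # contour sampling rate (Hz)
t = np.arange(0, 2.0, 1.0 / fs)
lnF0 = np.log(120.0) + 0.5 * Gp(t)      # base F0 = 120 Hz plus one phrase command
# Per-syllable tone commands: (onset, offset, amplitude), all made-up values.
for on, off, amp in [(0.2, 0.5, 0.3), (0.8, 1.1, -0.2), (1.4, 1.7, 0.4)]:
    lnF0 += amp * (Ga(t - on) - Ga(t - off))
F0 = np.exp(lnF0)                       # contour in Hz, ready to drive a synthesizer
print(round(F0.min(), 1), round(F0.max(), 1))
```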
A Decade of Stage Fright Research (1960-1969): A Synthesis.
ERIC Educational Resources Information Center
Munger, Daniel I.
The synthesis of stage fright research by Clevenger (1959) has been widely accepted and used by writers in the speech field. Since 1959, additional research has appeared which warrants an updating of Clevenger's synthesis. Such a followup synthesis is the purpose of this paper. It assumes familiarity with the first article. Measuring stage fright…
Heimbauer, Lisa A; Beran, Michael J; Owren, Michael J
2011-07-26
A long-standing debate concerns whether humans are specialized for speech perception, which some researchers argue is demonstrated by the ability to understand synthetic speech with significantly reduced acoustic cues to phonetic content. We tested a chimpanzee (Pan troglodytes) that recognizes 128 spoken words, asking whether she could understand such speech. Three experiments presented 48 individual words, with the animal selecting a corresponding visuographic symbol from among four alternatives. Experiment 1 tested spectrally reduced, noise-vocoded (NV) synthesis, originally developed to simulate input received by human cochlear-implant users. Experiment 2 tested "impossibly unspeechlike" sine-wave (SW) synthesis, which reduces speech to just three moving tones. Although receiving only intermittent and noncontingent reward, the chimpanzee performed well above chance level, including when hearing synthetic versions for the first time. Recognition of SW words was least accurate but improved in experiment 3 when natural words in the same session were rewarded. The chimpanzee was more accurate with NV than SW versions, as were 32 human participants hearing these items. The chimpanzee's ability to spontaneously recognize acoustically reduced synthetic words suggests that experience rather than specialization is critical for speech-perception capabilities that some have suggested are uniquely human. Copyright © 2011 Elsevier Ltd. All rights reserved.
ERIC Educational Resources Information Center
Raskind, Marshall
1993-01-01
This article describes assistive technologies for persons with learning disabilities, including word processing, spell checking, proofreading programs, outlining/"brainstorming" programs, abbreviation expanders, speech recognition, speech synthesis/screen review, optical character recognition systems, personal data managers, free-form databases,…
Computational neuroanatomy of speech production.
Hickok, Gregory
2012-01-05
Speech production has been studied predominantly from within two traditions, psycholinguistics and motor control. These traditions have rarely interacted, and the resulting chasm between these approaches seems to reflect a level of analysis difference: whereas motor control is concerned with lower-level articulatory control, psycholinguistics focuses on higher-level linguistic processing. However, closer examination of both approaches reveals a substantial convergence of ideas. The goal of this article is to integrate psycholinguistic and motor control approaches to speech production. The result of this synthesis is a neuroanatomically grounded, hierarchical state feedback control model of speech production.
Evans, Samuel; Davis, Matthew H.
2015-01-01
How humans extract the identity of speech sounds from highly variable acoustic signals remains unclear. Here, we use searchlight representational similarity analysis (RSA) to localize and characterize neural representations of syllables at different levels of the hierarchically organized temporo-frontal pathways for speech perception. We asked participants to listen to spoken syllables that differed considerably in their surface acoustic form by changing speaker and degrading surface acoustics using noise-vocoding and sine wave synthesis while we recorded neural responses with functional magnetic resonance imaging. We found evidence for a graded hierarchy of abstraction across the brain. At the peak of the hierarchy, neural representations in somatomotor cortex encoded syllable identity but not surface acoustic form; at the base of the hierarchy, primary auditory cortex showed the reverse. In contrast, bilateral temporal cortex exhibited an intermediate response, encoding both syllable identity and the surface acoustic form of speech. Regions of somatomotor cortex associated with encoding syllable identity in perception were also engaged when producing the same syllables in a separate session. These findings are consistent with a hierarchical account of how variable acoustic signals are transformed into abstract representations of the identity of speech sounds. PMID:26157026
Speech transformation system (spectrum and/or excitation) without pitch extraction
NASA Astrophysics Data System (ADS)
Seneff, S.
1980-07-01
A speech analysis/synthesis system was developed which is capable of independent manipulation of the fundamental frequency and spectral envelope of a speech waveform. The system deconvolved the original speech with the spectral envelope estimate to obtain a model for the excitation; explicit pitch extraction was not required, and as a consequence the transformed speech was more natural sounding than would be the case if the excitation were modeled as a sequence of pulses. It is shown that the system has applications in the areas of voice modification, baseband-excited vocoders, time-scale modification, and frequency compression as an aid to the partially deaf.
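One standard way to get this kind of pitch-free deconvolution is to estimate the spectral envelope by cepstral smoothing and divide it out of each frame's spectrum; what remains is the excitation model. The sketch below is an illustration under assumed analysis settings (FFT size, lifter length, and the toy frame are all arbitrary), not the report's exact system.

```python
import numpy as np

def spectral_envelope(frame, n_fft=512, n_ceps=30):
    """Cepstrally smoothed magnitude envelope (one common estimate)."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)), n_fft)
    log_mag = np.log(np.abs(spec) + 1e-12)
    ceps = np.fft.irfft(log_mag, n_fft)          # real cepstrum of the frame
    lift = np.zeros(n_fft)
    lift[:n_ceps] = 1.0
    lift[-(n_ceps - 1):] = 1.0                   # keep low quefrencies only
    smooth = np.fft.rfft(ceps * lift, n_fft).real
    return np.exp(smooth)                        # linear-magnitude envelope

fs = 8000
t = np.arange(512) / fs
frame = np.sign(np.sin(2 * np.pi * 100 * t)) * np.exp(-3 * t)  # crude voiced-like frame
env = spectral_envelope(frame)
X = np.fft.rfft(frame * np.hanning(len(frame)), 512)
excitation = X / env   # flattened residual spectrum: the "excitation model"
# Independent manipulation: warp env along frequency to move formants, or
# rescale the excitation spectrum to shift F0, then multiply back and inverse-FFT.
```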
Stylistic gait synthesis based on hidden Markov models
NASA Astrophysics Data System (ADS)
Tilmanne, Joëlle; Moinet, Alexis; Dutoit, Thierry
2012-12-01
In this work we present an expressive gait synthesis system based on hidden Markov models (HMMs), following and modifying a procedure originally developed for speaking style adaptation, in speech synthesis. A large database of neutral motion capture walk sequences was used to train an HMM of average walk. The model was then used for automatic adaptation to a particular style of walk using only a small amount of training data from the target style. The open source toolkit that we adapted for motion modeling also enabled us to take into account the dynamics of the data and to model accurately the duration of each HMM state. We also address the assessment issue and propose a procedure for qualitative user evaluation of the synthesized sequences. Our tests show that the style of these sequences can easily be recognized and look natural to the evaluators.
ERIC Educational Resources Information Center
Jones, Joan Simon
This review and synthesis of the literature on correctional vocational education includes historical documents, recent surveys and reports, journal articles, dissertations, and speeches and presentations which were located by computer-assisted and manual searches of these data bases: Abstracts of Instructional and Research Materials in Vocational…
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aimthikul, Y.
This thesis reviews the essential aspects of speech synthesis and distinguishes between the two prevailing techniques: compressed digital speech and phonemic synthesis. It then presents the hardware details of the five speech modules evaluated. FORTRAN programs were written to facilitate message creation and retrieval with four of the modules driven by a PDP-11 minicomputer. The fifth module was driven directly by a computer terminal. The compressed digital speech modules (T.I. 990/306, T.S.I. Series 3D and N.S. Digitalker) each contain a limited vocabulary produced by the manufacturers while both the phonemic synthesizers made by Votrax permit an almost unlimited set of sounds and words. A text-to-phoneme rules program was adapted for the PDP-11 (running under the RSX-11M operating system) to drive the Votrax Speech Pac module. However, the Votrax Type'N Talk unit has its own built-in translator. Comparison of these modules revealed that the compressed digital speech modules were superior in pronouncing words on an individual basis but lacked the inflection capability that permitted the phonemic synthesizers to generate more coherent phrases. These findings were necessarily highly subjective and dependent on the specific words and phrases studied. In addition, the rapid introduction of new modules by manufacturers will necessitate new comparisons. However, the results of this research verified that all of the modules studied do possess reasonable quality of speech that is suitable for man-machine applications. Furthermore, the development tools are now in place to permit the addition of computer speech output in such applications.
Parameterized Facial Expression Synthesis Based on MPEG-4
NASA Astrophysics Data System (ADS)
Raouzaiou, Amaryllis; Tsapatsoulis, Nicolas; Karpouzis, Kostas; Kollias, Stefanos
2002-12-01
In the framework of MPEG-4, one can include applications where virtual agents, utilizing both textual and multisensory data, including facial expressions and nonverbal speech, help systems become accustomed to the actual feelings of the user. Applications of this technology are expected in educational environments, virtual collaborative workplaces, communities, and interactive entertainment. Facial animation has gained much interest within the MPEG-4 framework, with implementation details being an open research area (Tekalp, 1999). In this paper, we describe a method for enriching human computer interaction, focusing on analysis and synthesis of primary and intermediate facial expressions (Ekman and Friesen (1978)). To achieve this goal, we utilize facial animation parameters (FAPs) to model primary expressions and describe a rule-based technique for handling intermediate ones. A relation between FAPs and the activation parameter proposed in classical psychological studies is established, leading to parameterized facial expression analysis and synthesis notions, compatible with the MPEG-4 standard.
ERIC Educational Resources Information Center
Murphy, Harry; Higgins, Eleanor
This final report describes the activities and accomplishments of a 3-year study on the compensatory effectiveness of three assistive technologies, optical character recognition, speech synthesis, and speech recognition, on postsecondary students (N=140) with learning disabilities. These technologies were investigated relative to: (1) immediate…
Style and Content in the Rhetoric of Early Afro-American Feminists.
ERIC Educational Resources Information Center
Campbell, Karlyn Kohrs
1986-01-01
Analyzes selected speeches by feminists active in the early Afro-American protest, revealing differences in their rhetoric and that of White feminists of the period. Argues that a simultaneous analysis and synthesis is necessary to understand these differences. Illustrates speeches by Sojourner Truth, Ida B. Wells, and Mary Church Terrell. (JD)
Children's perception of their synthetically corrected speech production.
Strömbergsson, Sofia; Wengelin, Asa; House, David
2014-06-01
We explore children's perception of their own speech - in its online form, in its recorded form, and in synthetically modified forms. Children with phonological disorder (PD) and children with typical speech and language development (TD) performed tasks of evaluating accuracy of the different types of speech stimuli, either immediately after having produced the utterance or after a delay. In addition, they performed a task designed to assess their ability to detect synthetic modification. Both groups showed high performance in tasks involving evaluation of other children's speech, whereas in tasks of evaluating one's own speech, the children with PD were less accurate than their TD peers. The children with PD were less sensitive to misproductions in immediate conjunction with their production of an utterance, and more accurate after a delay. Within-category modification often passed undetected, indicating a satisfactory quality of the generated speech. Potential clinical benefits of using corrective re-synthesis are discussed.
Evans, Samuel; Davis, Matthew H
2015-12-01
How humans extract the identity of speech sounds from highly variable acoustic signals remains unclear. Here, we use searchlight representational similarity analysis (RSA) to localize and characterize neural representations of syllables at different levels of the hierarchically organized temporo-frontal pathways for speech perception. We asked participants to listen to spoken syllables that differed considerably in their surface acoustic form by changing speaker and degrading surface acoustics using noise-vocoding and sine wave synthesis while we recorded neural responses with functional magnetic resonance imaging. We found evidence for a graded hierarchy of abstraction across the brain. At the peak of the hierarchy, neural representations in somatomotor cortex encoded syllable identity but not surface acoustic form; at the base of the hierarchy, primary auditory cortex showed the reverse. In contrast, bilateral temporal cortex exhibited an intermediate response, encoding both syllable identity and the surface acoustic form of speech. Regions of somatomotor cortex associated with encoding syllable identity in perception were also engaged when producing the same syllables in a separate session. These findings are consistent with a hierarchical account of how variable acoustic signals are transformed into abstract representations of the identity of speech sounds. © The Author 2015. Published by Oxford University Press.
(abstract) Synthesis of Speaker Facial Movements to Match Selected Speech Sequences
NASA Technical Reports Server (NTRS)
Scott, Kenneth C.
1994-01-01
We are developing a system for synthesizing image sequences that simulate the facial motion of a speaker. To perform this synthesis, we are pursuing two major areas of effort. We are developing the necessary computer graphics technology to synthesize a realistic image sequence of a person speaking selected speech sequences. Next, we are developing a model that expresses the relation between spoken phonemes and face/mouth shape. A subject is videotaped speaking an arbitrary text that contains expressions of the full list of desired database phonemes. The subject is videotaped from the front speaking normally, recording both audio and video detail simultaneously. Using the audio track, we identify the specific video frames on the tape relating to each spoken phoneme. From this range we digitize the video frame which represents the extreme of mouth motion/shape. Thus, we construct a database of images of face/mouth shape related to spoken phonemes. A selected audio speech sequence is recorded which is the basis for synthesizing a matching video sequence; the speaker need not be the same as the one used for constructing the database. The audio sequence is analyzed to determine the spoken phoneme sequence and the relative timing of the enunciation of those phonemes. Synthesizing an image sequence corresponding to the spoken phoneme sequence is accomplished using a graphics technique known as morphing. Image sequence keyframes necessary for this processing are based on the spoken phoneme sequence and timing. We have been successful in synthesizing the facial motion of a native English speaker for a small set of arbitrary speech segments. Our future work will focus on advancement of the face shape/phoneme model and independent control of facial features.
Advances in natural language processing.
Hirschberg, Julia; Manning, Christopher D
2015-07-17
Natural language processing employs computational techniques for the purpose of learning, understanding, and producing human language content. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis. Today's researchers refine and make use of such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services. We describe successes and challenges in this rapidly advancing area. Copyright © 2015, American Association for the Advancement of Science.
What does voice-processing technology support today?
Nakatsu, R; Suzuki, Y
1995-01-01
This paper describes the state of the art in applications of voice-processing technologies. In the first part, technologies concerning the implementation of speech recognition and synthesis algorithms are described. Hardware technologies such as microprocessors and DSPs (digital signal processors) are discussed. The software development environment, which is a key technology in developing applications software, ranging from DSP software to support software, also is described. In the second part, the state of the art of algorithms from the standpoint of applications is discussed. Several issues concerning evaluation of speech recognition/synthesis algorithms are covered, as well as issues concerning the robustness of algorithms in adverse conditions. PMID:7479720
NASA Astrophysics Data System (ADS)
Morishima, Shigeo; Nakamura, Satoshi
2004-12-01
We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it to the translated speech. This system also introduces both a face synthesis technique that can generate any viseme lip shape and a face tracking technique that can estimate the original position and rotation of a speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the speech organ's image with the synthesized one, which is made by a 3D wire-frame model that is adaptable to any speaker. Our approach provides translated image synthesis with an extremely small database. The tracking motion of the face from a video image is performed by template matching. In this system, the translation and rotation of the face are detected by using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model by using our GUI tool. By combining these techniques and the translated voice synthesis technique, an automatic multimodal translation can be achieved that is suitable for video mail or automatic dubbing systems into other languages.
A video, text, and speech-driven realistic 3-d virtual head for human-machine interface.
Yu, Jun; Wang, Zeng-Fu
2015-05-01
A multiple inputs-driven realistic facial animation system based on a 3-D virtual head for human-machine interface is proposed. The system can be driven independently by video, text, and speech, and thus can interact with humans through diverse interfaces. The combination of a parameterized model and a muscular model is used to obtain a tradeoff between computational efficiency and high realism of 3-D facial animation. The online appearance model is used to track 3-D facial motion from video in the framework of particle filtering, and multiple measurements, i.e., the pixel color values of the input image and the Gabor wavelet coefficients of the illumination ratio image, are fused to reduce the influence of lighting and person dependence on the construction of the online appearance model. The tri-phone model is used to reduce the computational consumption of visual co-articulation in speech-synchronized viseme synthesis without sacrificing performance. Objective and subjective experiments show that the system is suitable for human-machine interaction.
Advancements in text-to-speech technology and implications for AAC applications
NASA Astrophysics Data System (ADS)
Syrdal, Ann K.
2003-10-01
Intelligibility was the initial focus in text-to-speech (TTS) research, since it is clearly a necessary condition for the application of the technology. Sufficiently high intelligibility (approximating human speech) has been achieved in the last decade by the better formant-based and concatenative TTS systems. This led to commercially available TTS systems for highly motivated users, particularly the blind and vocally impaired. Some unnatural qualities of TTS were exploited by these users, such as very fast speaking rates and altered pitch ranges for flagging relevant information. Recently, the focus in TTS research has turned to improving naturalness, so that synthetic speech sounds more human and less robotic. Unit selection approaches to concatenative synthesis have dramatically improved TTS quality, although at the cost of larger and more complex systems. This advancement in naturalness has made TTS technology more acceptable to the general public. The vocally impaired appreciate a more natural voice with which to represent themselves when communicating with others. Unit selection TTS does not achieve such high speaking rates as the earlier TTS systems, however, which is a disadvantage to some AAC device users. An important new research emphasis is to improve and increase the range of emotional expressiveness of TTS.
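Unit selection, mentioned above, is essentially a shortest-path search that balances a target cost (how well a candidate unit matches the desired specification) against a concatenation cost (how smoothly adjacent units join). The toy Python below runs that Viterbi search over random stand-in feature vectors; the Euclidean costs and the weights wt and wc are invented for illustration.

```python
import numpy as np

def select_units(targets, candidates, wt=1.0, wc=0.7):
    """Viterbi search over candidate units, minimizing target + join costs.
    targets: [T, d] desired per-slot features; candidates: list of [Ni, d] arrays."""
    T = len(targets)
    cost = [wt * np.linalg.norm(candidates[0] - targets[0], axis=1)]
    back = []
    for t in range(1, T):
        tgt = wt * np.linalg.norm(candidates[t] - targets[t], axis=1)      # target cost
        join = wc * np.linalg.norm(candidates[t][:, None, :] -
                                   candidates[t - 1][None, :, :], axis=2)  # join cost
        total = cost[-1][None, :] + join       # [N_t, N_{t-1}] cumulative cost
        back.append(np.argmin(total, axis=1))  # best predecessor per current unit
        cost.append(tgt + np.min(total, axis=1))
    path = [int(np.argmin(cost[-1]))]
    for bp in reversed(back):                  # backtrack the cheapest path
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Toy example: 3 target slots, a few candidate units each (features are hypothetical).
rng = np.random.default_rng(0)
targets = rng.normal(size=(3, 4))
cands = [rng.normal(size=(n, 4)) for n in (5, 4, 6)]
print(select_units(targets, cands))  # index of the chosen unit in each slot
```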
James, Deborah M
2011-08-12
The Bercow review found a high level of public dissatisfaction with speech and language services for children. Children with speech, language, and communication needs (SLCN) often have chronic complex conditions that require provision from health, education, and community services. Speech and language therapists are a small group of Allied Health Professionals with a specialist skill-set that equips them to work with children with SLCN. They work within and across the diverse range of public service providers. The aim of this review was to explore the applicability of Normalisation Process Theory (NPT) to the case of speech and language therapy. A review of qualitative research on a successfully embedded speech and language therapy intervention was undertaken to test the applicability of NPT. The review focused on two of the collective action elements of NPT (relational integration and interaction workability) using all previously published qualitative data from both parents and practitioners' perspectives on the intervention. The synthesis of the data based on the Normalisation Process Model (NPM) uncovered strengths in the interpersonal processes between the practitioners and parents, and weaknesses in how the accountability of the intervention is distributed in the health system. The analysis based on the NPM uncovered interpersonal processes between the practitioners and parents that were likely to have given rise to successful implementation of the intervention. In previous qualitative research on this intervention where the Medical Research Council's guidance on developing a design for a complex intervention had been used as a framework, the interpersonal work within the intervention had emerged as a barrier to implementation of the intervention. It is suggested that the design of services for children and families needs to extend beyond the consideration of benefits and barriers to embrace the social processes that appear to afford success in embedding innovation in healthcare.
Noise suppression methods for robust speech processing
NASA Astrophysics Data System (ADS)
Boll, S. F.; Ravindra, H.; Randall, G.; Armantrout, R.; Power, R.
1980-05-01
Robust speech processing in practical operating environments requires effective environmental and processor noise suppression. This report describes the technical findings and accomplishments during this reporting period for the research program funded to develop real-time, compressed speech analysis/synthesis algorithms whose performance is invariant under signal contamination. Fulfillment of this requirement is necessary to ensure reliable, secure, compressed speech transmission within realistic military command and control environments. Overall contributions resulting from this research program include an understanding of how environmental noise degrades narrowband coded speech, development of appropriate real-time noise suppression algorithms, and development of speech parameter identification methods that consider signal contamination as a fundamental element in the estimation process. This report describes the current research and results in the areas of noise suppression using dual-input adaptive noise cancellation and short-time Fourier transform algorithms, articulation rate change techniques, and a description of an experiment which demonstrated that the spectral subtraction noise suppression algorithm can improve the intelligibility of 2400 bps, LPC-10 coded helicopter speech by 10.6 points.
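The spectral subtraction algorithm referred to here is the classic single-channel approach usually attributed to Boll. A minimal version is sketched below (the parameters and the synthetic demo signal are assumptions, and the report's dual-input adaptive canceller is not shown): the noise magnitude spectrum is estimated from known speech-free frames and subtracted from each frame's magnitude, keeping the noisy phase.

```python
import numpy as np

def spectral_subtraction(noisy, frame=256, hop=128, noise_frames=10,
                         oversub=2.0, floor=0.02):
    """Basic magnitude spectral subtraction; the first `noise_frames`
    frames are assumed to contain noise only."""
    win = np.hanning(frame)
    noise_mag = np.mean([np.abs(np.fft.rfft(noisy[i*hop:i*hop+frame] * win, frame))
                         for i in range(noise_frames)], axis=0)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame, hop):
        X = np.fft.rfft(noisy[start:start+frame] * win, frame)
        mag = np.abs(X) - oversub * noise_mag        # over-subtraction
        mag = np.maximum(mag, floor * noise_mag)     # spectral floor vs musical noise
        # Synthesis window applied again; exact OLA normalization omitted for brevity.
        out[start:start+frame] += np.fft.irfft(mag * np.exp(1j*np.angle(X)), frame) * win
    return out

# Tiny demo: silence then a tone, plus white noise (first 10 frames are noise only).
fs = 8000
t = np.arange(2 * fs) / fs
sig = np.concatenate([np.zeros(10 * 128 + 128), 0.5 * np.sin(2 * np.pi * 440 * t)])
noisy = sig + 0.1 * np.random.randn(len(sig))
clean = spectral_subtraction(noisy)
```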
Applications of Hilbert Spectral Analysis for Speech and Sound Signals
NASA Technical Reports Server (NTRS)
Huang, Norden E.
2003-01-01
A new method for analyzing nonlinear and nonstationary data has been developed, and its natural applications are to speech and sound signals. The key part of the method is the Empirical Mode Decomposition method, with which any complicated data set can be decomposed into a finite and often small number of Intrinsic Mode Functions (IMF). An IMF is defined as any function having the same number of zero-crossings and extrema, and also having symmetric envelopes defined by the local maxima and minima, respectively. An IMF also admits a well-behaved Hilbert transform. This decomposition method is adaptive and therefore highly efficient. Since the decomposition is based on the local characteristic time scale of the data, it is applicable to nonlinear and nonstationary processes. With the Hilbert transform, the Intrinsic Mode Functions yield instantaneous frequencies as functions of time, which give sharp identifications of embedded structures. This method can be used to process all acoustic signals. Specifically, it can process speech signals for speech synthesis, speaker identification and verification, speech recognition, and sound signal enhancement and filtering. Additionally, the acoustical signals from machinery are essentially the way the machines are talking to us. Therefore, the acoustical signals from machines, whether transmitted as sound through air or as vibration on the machines, can tell us the operating conditions of the machines. Thus, we can use the acoustic signal to diagnose the problems of machines.
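The analysis step that follows the decomposition can be illustrated directly: given one oscillatory mode (here a synthetic chirp stands in for an IMF, since the EMD sifting loop itself is omitted), the Hilbert transform yields instantaneous amplitude and frequency as functions of time.

```python
import numpy as np
from scipy.signal import hilbert

fs = 1000.0
t = np.arange(0, 1.0, 1 / fs)
# A chirp as a stand-in for a single IMF (one oscillatory mode).
x = np.cos(2 * np.pi * (50 * t + 40 * t**2))   # frequency sweeps 50 -> 130 Hz

z = hilbert(x)                                  # analytic signal x + j*H{x}
amp = np.abs(z)                                 # instantaneous amplitude envelope
phase = np.unwrap(np.angle(z))
inst_freq = np.diff(phase) / (2 * np.pi) * fs   # instantaneous frequency in Hz

print(inst_freq[10], inst_freq[-10])            # ~50 Hz at the start, ~130 Hz at the end
```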
Open source OCR framework using mobile devices
NASA Astrophysics Data System (ADS)
Zhou, Steven Zhiying; Gilani, Syed Omer; Winkler, Stefan
2008-02-01
Mobile phones have evolved from passive one-to-one communication devices to powerful handheld computing devices. Today most new mobile phones are capable of capturing images, recording video, browsing the internet, and much more. Exciting new social applications are emerging on the mobile landscape, like business card readers, sign detectors, and translators. These applications help people quickly gather information in digital format and interpret it without the need to carry laptops or tablet PCs. However, with all these advancements, we find very little open source software available for mobile phones. For instance, there are currently many open source OCR engines for the desktop platform but, to our knowledge, none are available on the mobile platform. Keeping this in perspective, we propose a complete text detection and recognition system with speech synthesis ability, using existing desktop technology. In this work we developed a complete OCR framework with subsystems from the open source desktop community. This includes a popular open source OCR engine named Tesseract for text detection and recognition and the Flite speech synthesis module for adding text-to-speech ability.
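On the desktop, a Tesseract-plus-Flite pipeline of this kind can be sketched in a few lines of Python. The pytesseract wrapper and the flite command-line flags shown are assumptions about the installed tools (flag names can vary by build), so treat this as illustrative rather than as the authors' mobile implementation.

```python
import subprocess
from PIL import Image
import pytesseract  # Python wrapper for the Tesseract OCR engine (assumed installed)

def image_to_speech(image_path, wav_path="out.wav"):
    """OCR an image with Tesseract, then synthesize the text with Flite.
    Assumes a `flite` binary on PATH; '-t' reads text and '-o' names the
    output wav in common builds (check `flite --help` locally)."""
    text = pytesseract.image_to_string(Image.open(image_path))
    if text.strip():
        subprocess.run(["flite", "-t", text, "-o", wav_path], check=True)
    return text

# Usage (hypothetical file name):
# print(image_to_speech("business_card.png"))
```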
Scaling and universality in the human voice.
Luque, Jordi; Luque, Bartolo; Lacasa, Lucas
2015-04-06
Speech is a distinctive complex feature of human capabilities. In order to understand the physics underlying speech production, in this work, we empirically analyse the statistics of large human speech datasets spanning several languages. We first show that during speech, the energy is unevenly released and power-law distributed, reporting a universal robust Gutenberg-Richter-like law in speech. We further show that such 'earthquakes in speech' show temporal correlations, as the interevent statistics are again power-law distributed. As this feature takes place in the intraphoneme range, we conjecture that the process responsible for this complex phenomenon is not cognitive, but resides in the physiological (mechanical) mechanisms of speech production. Moreover, we show that these waiting time distributions are scale invariant under a renormalization group transformation, suggesting that the process of speech generation is indeed operating close to a critical point. These results are put in contrast with current paradigms in speech processing, which point towards low-dimensional deterministic chaos as the origin of nonlinear traits in speech fluctuations. As these latter fluctuations are indeed the aspects that humanize synthetic speech, these findings may have an impact on future speech synthesis technologies. Results are robust and independent of the communication language or the number of speakers, pointing towards a universal pattern and yet another hint of complexity in human speech. © 2015 The Author(s) Published by the Royal Society. All rights reserved.
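A simplified version of this analysis can be reproduced on any recording: define energy-release events from short-time frame energies and fit a power-law exponent by maximum likelihood. The Python below uses the standard continuous-MLE estimator of Clauset et al.; the frame sizes, the xmin threshold, and the synthetic heavy-tailed stand-in signal are arbitrary choices, and a real test would use speech corpora.

```python
import numpy as np

def energy_events(x, frame=256, hop=128):
    """Short-time energy per frame; each frame's energy is one 'event'."""
    return np.array([np.sum(x[i:i + frame]**2)
                     for i in range(0, len(x) - frame, hop)])

def powerlaw_mle(samples, xmin):
    """Continuous power-law exponent by maximum likelihood (Clauset et al., 2009)."""
    s = samples[samples >= xmin]
    return 1.0 + len(s) / np.sum(np.log(s / xmin))

# Synthetic heavy-tailed stand-in for a recording (not real speech data).
rng = np.random.default_rng(1)
x = rng.standard_normal(200_000) * np.repeat(rng.pareto(1.5, 400) + 0.1, 500)
E = energy_events(x)
print("fitted exponent:", powerlaw_mle(E, xmin=np.quantile(E, 0.9)))
```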
Analysis and enhancement of country singing
NASA Astrophysics Data System (ADS)
Lee, Matthew; Smith, Mark J. T.
2003-04-01
The study of human singing has focused extensively on the analysis of voice characteristics. At the same time, a substantial body of work has been under study aimed at modeling and synthesizing the human voice. The work on which we report brings together some key analysis and synthesis principles to create a new model for digitally improving the perceived quality of an average singing voice. The model presented employs an analysis-by-synthesis overlap-add (ABS-OLA) sinusoidal model, which in the past has been used for the analysis and synthesis of speech, in combination with a spectral model of the vocal tract. The ABS-OLA sinusoidal model for speech has been shown to be a flexible, accurate, and computationally efficient representation capable of producing a natural-sounding singing voice [E. B. George and M. J. T. Smith, Trans. Speech Audio Processing 5, 389-406 (1997)]. A spectral model infused in the ABS-OLA uses Generalized Gaussian functions to provide a simple framework which enables the precise modification of spectral characteristics while maintaining the quality and naturalness of the original voice. Furthermore, it is shown that the parameters of the new ABS-OLA can accommodate pitch corrections and vocal quality enhancements while preserving naturalness and singer identity. Examples of enhanced country singing will be presented.
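Underlying ABS-OLA is the sinusoidal representation itself: model a frame as a sum of sinusoids taken from its strongest spectral peaks, then resynthesize. The sketch below shows only that core step, with arbitrary analysis settings and without the analysis-by-synthesis refinement or the overlap-add of successive frames; peak amplitudes are coarse because no bin interpolation is done.

```python
import numpy as np

def sine_model_frame(frame, fs, n_fft=2048, n_peaks=10):
    """Pick the strongest spectral peaks (freq, amp, phase) of one frame.
    Phases are referenced to the frame centre via zero-phase buffering."""
    N = len(frame); half = N // 2
    wframe = frame * np.hanning(N)
    buf = np.zeros(n_fft)
    buf[:N - half] = wframe[half:]     # rotate so the frame centre sits at index 0
    buf[-half:] = wframe[:half]
    X = np.fft.rfft(buf)
    mag = np.abs(X)
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]]
    peaks = sorted(peaks, key=lambda k: mag[k], reverse=True)[:n_peaks]
    return [(k * fs / n_fft, 2 * mag[k] / np.hanning(N).sum(), np.angle(X[k]))
            for k in peaks]

def synth_frame(params, N, fs):
    """Resynthesize the frame as a sum of sinusoids (time axis centred)."""
    t = (np.arange(N) - N // 2) / fs
    return sum(a * np.cos(2 * np.pi * f * t + ph) for f, a, ph in params)

fs, N = 8000, 512
m = np.arange(N) / fs
frame = 0.6 * np.sin(2 * np.pi * 220 * m) + 0.3 * np.sin(2 * np.pi * 440 * m)
approx = synth_frame(sine_model_frame(frame, fs), N, fs)
err = frame - approx
print("SNR (dB):", 10 * np.log10(np.sum(frame**2) / np.sum(err**2)))
```

Pitch and timbre modifications then amount to editing the (frequency, amplitude, phase) triplets before resynthesis, which is where the Gaussian spectral-envelope model described above would act.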
Visual speech discrimination and identification of natural and synthetic consonant stimuli
Files, Benjamin T.; Tjan, Bosco S.; Jiang, Jintao; Bernstein, Lynne E.
2015-01-01
From phonetic features to connected discourse, every level of psycholinguistic structure including prosody can be perceived through viewing the talking face. Yet a longstanding notion in the literature is that visual speech perceptual categories comprise groups of phonemes (referred to as visemes), such as /p, b, m/ and /f, v/, whose internal structure is not informative to the visual speech perceiver. This conclusion has not to our knowledge been evaluated using a psychophysical discrimination paradigm. We hypothesized that perceivers can discriminate the phonemes within typical viseme groups, and that discrimination measured with d-prime (d’) and response latency is related to visual stimulus dissimilarities between consonant segments. In Experiment 1, participants performed speeded discrimination for pairs of consonant-vowel spoken nonsense syllables that were predicted to be same, near, or far in their perceptual distances, and that were presented as natural or synthesized video. Near pairs were within-viseme consonants. Natural within-viseme stimulus pairs were discriminated significantly above chance (except for /k/-/h/). Sensitivity (d’) increased and response times decreased with distance. Discrimination and identification were superior with natural stimuli, which comprised more phonetic information. We suggest that the notion of the viseme as a unitary perceptual category is incorrect. Experiment 2 probed the perceptual basis for visual speech discrimination by inverting the stimuli. Overall reductions in d’ with inverted stimuli but a persistent pattern of larger d’ for far than for near stimulus pairs are interpreted as evidence that visual speech is represented by both its motion and configural attributes. The methods and results of this investigation open up avenues for understanding the neural and perceptual bases for visual and audiovisual speech perception and for development of practical applications such as visual lipreading/speechreading speech synthesis. PMID:26217249
Morris, Meg E; Erickson, Shane; Serry, Tanya A
2016-01-01
Background: Although mobile apps are readily available for speech sound disorders (SSD), their validity has not been systematically evaluated. This evidence-based appraisal will critically review and synthesize current evidence on available therapy apps for use by children with SSD. Objective: The main aims are to (1) identify the types of apps currently available for Android and iOS mobile phones and tablets, and (2) to critique their design features and content using a structured quality appraisal tool. Methods: This protocol paper presents and justifies the methods used for a systematic review of mobile apps that provide intervention for use by children with SSD. The primary outcomes of interest are (1) engagement, (2) functionality, (3) aesthetics, (4) information quality, (5) subjective quality, and (6) perceived impact. Quality will be assessed by 2 certified practicing speech-language pathologists using a structured quality appraisal tool. Two app stores will be searched from the 2 largest operating platforms, Android and iOS. Systematic methods of knowledge synthesis shall include searching the app stores using a defined procedure, data extraction, and quality analysis. Results: This search strategy shall enable us to determine how many SSD apps are available for Android and for iOS compatible mobile phones and tablets. It shall also identify the regions of the world responsible for the apps' development, the content and the quality of offerings. Recommendations will be made for speech-language pathologists seeking to use mobile apps in their clinical practice. Conclusions: This protocol provides a structured process for locating apps and appraising the quality, as the basis for evaluating their use in speech pathology for children in English-speaking nations. PMID:27899341
Sample-based engine noise synthesis using an enhanced pitch-synchronous overlap-and-add method.
Jagla, Jan; Maillard, Julien; Martin, Nadine
2012-11-01
An algorithm for the real-time synthesis of internal combustion engine noise is presented. Through the analysis of a recorded engine noise signal of continuously varying engine speed, a dataset of sound samples is extracted, allowing the real-time synthesis of the noise induced by arbitrary evolutions of engine speed. The sound samples are extracted from a recording spanning the entire engine speed range. Each sample is delimited so as to contain the sound emitted during one cycle of the engine plus the overlap necessary to ensure smooth transitions during the synthesis. The proposed approach, an extension of the PSOLA method introduced for speech processing, takes advantage of the specific periodicity of engine noise signals to locate the extraction instants of the sound samples. During the synthesis stage, the sound samples corresponding to the target engine speed evolution are concatenated with an overlap-and-add algorithm. It is shown that this method produces high-quality audio reproduction with a low computational load. It is therefore well suited for real-time applications.
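The core of the synthesis stage is the overlap-and-add concatenation of one-cycle samples. Below is a minimal numpy sketch of that step; the Hann crossfade and the fixed overlap length are simplifying assumptions rather than the authors' exact windowing.

```python
# Sketch: concatenating one-cycle sound samples with overlap-and-add,
# in the spirit of the PSOLA-style method described above. Cycle
# boundaries and the Hann crossfade length are simplifying assumptions.
import numpy as np

def overlap_add(samples, overlap):
    """Concatenate 1-D arrays, crossfading `overlap` points between them."""
    fade = np.hanning(2 * overlap)
    fade_in, fade_out = fade[:overlap], fade[overlap:]
    out = samples[0].astype(float).copy()
    for s in samples[1:]:
        s = s.astype(float)
        out[-overlap:] = out[-overlap:] * fade_out + s[:overlap] * fade_in
        out = np.concatenate([out, s[overlap:]])
    return out

# Example: three fake engine cycles at slightly different periods.
cycles = [np.sin(2 * np.pi * f * np.arange(0, 0.02, 1 / 44100))
          for f in (100.0, 105.0, 110.0)]
signal = overlap_add(cycles, overlap=64)
```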
Research on Spoken Dialogue Systems
NASA Technical Reports Server (NTRS)
Aist, Gregory; Hieronymus, James; Dowding, John; Hockey, Beth Ann; Rayner, Manny; Chatzichrisafis, Nikos; Farrell, Kim; Renders, Jean-Michel
2010-01-01
Research in the field of spoken dialogue systems has been performed with the goal of making such systems more robust and easier to use in demanding situations. The term "spoken dialogue systems" signifies unified software systems containing speech recognition, speech synthesis, dialogue management, and ancillary components that enable human users to communicate, using natural spoken language or nearly natural prescribed spoken language, with other software systems that provide information and/or services.
Prediction of acoustic feature parameters using myoelectric signals.
Lee, Ki-Seung
2010-07-01
It is well-known that a clear relationship exists between human voices and myoelectric signals (MESs) from the area of the speaker's mouth. In this study, we utilized this information to implement a speech synthesis scheme in which MES alone was used to predict the parameters characterizing the vocal-tract transfer function of specific speech signals. Several feature parameters derived from MES were investigated to find the optimal feature for maximization of the mutual information between the acoustic and the MES features. After the optimal feature was determined, an estimation rule for the acoustic parameters was proposed, based on a minimum mean square error (MMSE) criterion. In a preliminary study, 60 isolated words were used for both objective and subjective evaluations. The results showed that the average Euclidean distance between the original and predicted acoustic parameters was reduced by about 30% compared with the average Euclidean distance of the original parameters. The intelligibility of the synthesized speech signals using the predicted features was also evaluated. A word-level identification ratio of 65.5% and a syllable-level identification ratio of 73% were obtained through a listening test.
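The MMSE mapping from MES features to acoustic parameters can be illustrated, under a jointly Gaussian assumption, by the classical linear MMSE estimator ŷ = μ_y + C_yx C_xx⁻¹ (x − μ_x). The sketch below uses invented dimensions and synthetic data; it is not the paper's feature set or estimator details.

```python
# Sketch: a linear MMSE estimate of acoustic parameters y from MES
# features x, assuming jointly Gaussian variables:
#   y_hat = mu_y + C_yx C_xx^{-1} (x - mu_x)
# Dimensions and training data are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))       # MES features (500 frames, 6-dim)
Y = X @ rng.normal(size=(6, 12)) + 0.1 * rng.normal(size=(500, 12))  # acoustic params

mu_x, mu_y = X.mean(0), Y.mean(0)
Xc, Yc = X - mu_x, Y - mu_y
C_xx = Xc.T @ Xc / len(X)
C_yx = Yc.T @ Xc / len(X)
W = C_yx @ np.linalg.inv(C_xx)      # MMSE regression matrix

def estimate(x):
    return mu_y + W @ (x - mu_x)

print(estimate(X[0]).shape)         # (12,) predicted acoustic vector
```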
Statistical Analysis of Spectral Properties and Prosodic Parameters of Emotional Speech
NASA Astrophysics Data System (ADS)
Přibil, J.; Přibilová, A.
2009-01-01
The paper addresses the reflection of microintonation and spectral properties in male and female acted emotional speech. The microintonation component of speech melody is analyzed with regard to its spectral and statistical parameters. According to psychological research on emotional speech, different emotions are accompanied by different spectral noise. We control its amount by spectral flatness, according to which high-frequency noise is mixed into voiced frames during cepstral speech synthesis. Our experiments are aimed at statistical analysis of cepstral coefficient values and ranges of spectral flatness in three emotions (joy, sadness, anger), and a neutral state for comparison. Calculated histograms of spectral flatness distribution are visually compared and modelled by a Gamma probability distribution. Histograms of cepstral coefficient distribution are evaluated and compared using skewness and kurtosis. The statistical results show good agreement between male and female voices for all emotional states portrayed by several Czech and Slovak professional actors.
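Spectral flatness, the quantity used here to control the amount of mixed-in high-frequency noise, has a standard definition as the ratio of the geometric to the arithmetic mean of the power spectrum. A minimal sketch follows; the window and frame length are arbitrary choices.

```python
# Sketch: spectral flatness of a frame, the measure used above to
# control how much high-frequency noise is mixed in during synthesis.
import numpy as np

def spectral_flatness(frame):
    """Geometric mean over arithmetic mean of the power spectrum."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    power = power[power > 0]            # guard the log
    return np.exp(np.mean(np.log(power))) / np.mean(power)

# Example: noise yields a much higher flatness value than a pure tone.
rng = np.random.default_rng(1)
print(spectral_flatness(rng.normal(size=1024)))                  # high
print(spectral_flatness(np.sin(2 * np.pi * 0.1 * np.arange(1024))))  # near 0
```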
Understanding of Android-Based Robotic and Game Structure
NASA Astrophysics Data System (ADS)
Phongtraychack, A.; Syryamkin, V.
2018-05-01
The development of an android with a lifelike appearance and behavior has been a long-standing goal in robotics, and smartphone-based robotics offers a new and exciting approach for research and education. Recent years have brought progress in many of the technologies needed to create such androids; examples include the autonomous Erica android system, which is capable of conversational interaction and incorporates speech synthesis technologies. The behavior of an Android-based robot can run on the phone itself while the robot performs a task outdoors. In this paper, we present an overview of the platform of Android-based robotic and game structure for research and education.
Remote voice training: A case study on space shuttle applications, appendix C
NASA Technical Reports Server (NTRS)
Mollakarimi, Cindy; Hamid, Tamin
1990-01-01
The Tile Automation System includes applications of automation and robotics technology to all aspects of the Shuttle tile processing and inspection system. An integrated set of rapid prototyping testbeds was developed which include speech recognition and synthesis, laser imaging systems, distributed Ada programming environments, distributed relational data base architectures, distributed computer network architectures, multi-media workbenches, and human factors considerations. Remote voice training in the Tile Automation System is discussed. The user is prompted over a headset by synthesized speech for the training sequences. The voice recognition units and the voice output units are remote from the user and are connected by Ethernet to the main computer system. A supervisory channel is used to monitor the training sequences. Discussions include the training approaches as well as the human factors problems and solutions for this system utilizing remote training techniques.
NWR (National Weather Service) voice synthesis project, phase 1
NASA Astrophysics Data System (ADS)
Sampson, G. W.
1986-01-01
The purpose of the NOAA Weather Radio (NWR) Voice Synthesis Project is to provide a demonstration of current voice synthesis technology. Phase 1 of this project is presented, providing complete automation of an hourly surface aviation observation for broadcast over NWR. After examining the products currently available on the market, it was determined that fully synthetic voice technology does not offer the high-quality speech required for broadcast over the NWR. The system presented therefore uses phrase-concatenation technology to achieve a very high-quality, versatile voice synthesis system.
Simultaneous F0-F1 modifications of Arabic for the improvement of natural-sounding
NASA Astrophysics Data System (ADS)
Ykhlef, F.; Bensebti, M.
2013-03-01
Pitch (F0) modification is one of the most important problems in the area of speech synthesis. Several techniques have been developed in the literature to achieve this goal. The main restrictions of these techniques concern the modification range and the synthesised speech quality, intelligibility and naturalness. The control of formants in a spoken language can significantly improve the naturalness of the synthesised speech. This improvement depends mainly on the control of the first formant (F1). Inspired by this observation, this article proposes a new approach that modifies both F0 and F1 of Arabic voiced sounds in order to improve the naturalness of the pitch-shifted speech. The developed strategy takes a parallel processing approach, in which the analysis segments are decomposed into sub-bands in the wavelet domain, modified in the desired sub-band by using a resampling technique, and reconstructed without affecting the remaining sub-bands. Pitch marking and voicing detection are performed in the frequency decomposition step based on the comparison of the multi-level approximation and detail signals. The performance of the proposed technique is evaluated by listening tests and compared to the pitch synchronous overlap and add (PSOLA) technique at the third approximation level. Experimental results have shown that manipulating F0 in conjunction with F1 in the wavelet domain yields more natural-sounding synthesised speech than the classical pitch modification technique. This improvement was particularly relevant for large pitch modifications.
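As a rough illustration of the sub-band manipulation described above, the sketch below decomposes a frame with a discrete wavelet transform, resamples only the lowest (approximation) band, and reconstructs. The wavelet choice, decomposition level, and zero-pad/trim step are crude assumptions for illustration, not the authors' algorithm.

```python
# Sketch of the idea above: decompose into wavelet sub-bands, resample
# only the low band (which carries F0/F1 energy), and reconstruct.
# Wavelet, level, and the pad/trim step are crude assumptions.
import numpy as np
import pywt
from scipy.signal import resample

def shift_low_band(x, factor, wavelet="db4", level=3):
    coeffs = pywt.wavedec(x, wavelet, level=level)
    approx = coeffs[0]
    shifted = resample(approx, int(round(len(approx) / factor)))
    # Pad or trim so reconstruction keeps the original frame length.
    fixed = np.zeros_like(approx)
    n = min(len(approx), len(shifted))
    fixed[:n] = shifted[:n]
    coeffs[0] = fixed
    return pywt.waverec(coeffs, wavelet)[: len(x)]

frame = np.sin(2 * np.pi * 120 * np.arange(0, 0.03, 1 / 16000))
raised = shift_low_band(frame, factor=1.2)   # raise pitch by ~20%
```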
Seelbach, C
1995-01-01
The Colloquium on Human-Machine Communication by Voice highlighted the global technical community's focus on the problems and promise of voice-processing technology, particularly, speech recognition and speech synthesis. Clearly, there are many areas in both the research and development of these technologies that can be advanced significantly. However, it is also true that there are many applications of these technologies that are capable of commercialization now. Early successful commercialization of new technology is vital to ensure continuing interest in its development. This paper addresses efforts to commercialize speech technologies in two markets: telecommunications and aids for the handicapped. PMID:7479814
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
These proceedings of the 1981 annual meeting of the Committee on Hearing, Bioacoustics, and Biomechanics cover topics of emerging research in several areas of interest to the Committee. Topics covered include: hair cell function; transduction process of hair cells; speech synthesis; machine recognition of words; neuromagnetic analysis of sensory systems; tinnitus; tactile communication of speech; and biodynamic research at the Air Force Aerospace Medical Research Laboratory.
Structured codebook design in CELP
NASA Technical Reports Server (NTRS)
Leblanc, W. P.; Mahmoud, S. A.
1990-01-01
Code-Excited Linear Prediction (CELP) is a popular analysis-by-synthesis technique for quantizing speech at bit rates from 4 to 6 kbps. Codebook design techniques to date have been largely based either on random (often Gaussian) codebooks, or on known binary or ternary codes which efficiently map the space of (assumed white) excitation codevectors. It has been shown that by introducing symmetries into the codebook, good complexity reduction can be realized with only a marginal decrease in performance. Codebook design algorithms are considered for a wide range of structured codebooks.
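For readers unfamiliar with the analysis-by-synthesis search at the heart of CELP, here is a minimal sketch: each codevector excites the synthesis filter, and the index minimizing the gain-scaled reconstruction error is chosen. The toy first-order filter and random codebook are placeholder assumptions.

```python
# Sketch: analysis-by-synthesis codebook search. Each codevector excites
# the synthesis filter; the index minimizing the gain-scaled error wins.
import numpy as np
from scipy.signal import lfilter

def search_codebook(target, codebook, lpc):
    best_index, best_err = -1, np.inf
    for i, c in enumerate(codebook):
        synth = lfilter([1.0], lpc, c)                    # zero-state response
        gain = synth @ target / (synth @ synth + 1e-12)   # optimal gain
        err = np.sum((target - gain * synth) ** 2)
        if err < best_err:
            best_index, best_err = i, err
    return best_index

rng = np.random.default_rng(2)
codebook = rng.normal(size=(64, 40))           # 64 codevectors of 40 samples
lpc = np.array([1.0, -0.9])                    # toy 1st-order synthesis filter
target = lfilter([1.0], lpc, codebook[17])     # pretend target speech segment
print(search_codebook(target, codebook, lpc))  # -> 17
```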
Study of acoustic correlates associate with emotional speech
NASA Astrophysics Data System (ADS)
Yildirim, Serdar; Lee, Sungbok; Lee, Chul Min; Bulut, Murtaza; Busso, Carlos; Kazemzadeh, Ebrahim; Narayanan, Shrikanth
2004-10-01
This study investigates the acoustic characteristics of four different emotions expressed in speech. The aim is to obtain detailed acoustic knowledge of how a speech signal is modulated by changes from a neutral to a certain emotional state. Such knowledge is necessary for automatic emotion recognition and classification and for emotional speech synthesis. Speech data obtained from two semi-professional actresses are analyzed and compared. Each subject produced 211 sentences with four different emotions: neutral, sad, angry, and happy. We analyze changes in temporal and acoustic parameters, such as the magnitude and variability of segmental duration, fundamental frequency, and the first three formant frequencies, as a function of emotion. Acoustic differences among the emotions are also explored with mutual information computation, multidimensional scaling, and acoustic likelihood comparison with normal speech. Results indicate that speech associated with anger and happiness is characterized by longer duration, shorter interword silence, and higher pitch and rms energy with wider ranges. Sadness is distinguished from the other emotions by lower rms energy and longer interword silence. Interestingly, the differences in formant pattern between [happiness/anger] and [neutral/sadness] are better reflected in back vowels such as /a/ (as in 'father') than in front vowels. Detailed results on intra- and interspeaker variability will be reported.
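The parameters compared above (duration, rms energy, F0) are straightforward to extract per utterance; below is a brief sketch with a crude autocorrelation F0 estimator. The frame length and pitch search range are illustrative assumptions.

```python
# Sketch: per-utterance acoustic statistics of the kind compared above.
import numpy as np

def rms_energy(x):
    return np.sqrt(np.mean(x ** 2))

def f0_autocorr(frame, fs, fmin=75.0, fmax=400.0):
    """Crude F0 estimate: autocorrelation peak within the pitch range."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

fs = 16000
t = np.arange(0, 0.04, 1 / fs)
frame = np.sin(2 * np.pi * 220 * t)                # a 220 Hz vowel frame
print(rms_energy(frame), f0_autocorr(frame, fs))   # ~0.707, ~220 Hz
```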
Toward dynamic magnetic resonance imaging of the vocal tract during speech production.
Ventura, Sandra M Rua; Freitas, Diamantino Rui S; Tavares, João Manuel R S
2011-07-01
The most recent and significant magnetic resonance imaging (MRI) improvements allow for the visualization of the vocal tract during speech production, which has proven to be a powerful tool in dynamic speech research. However, a synchronization technique with enhanced temporal resolution is still required. The study design was cross-sectional in nature. Throughout this work, a technique for the dynamic study of the vocal tract with MRI, using the heart's signal to synchronize and trigger the imaging-acquisition process, is presented and described. The technique in question is then used in the measurement of four speech articulatory parameters to assess three different syllables (articulatory gestures) of the European Portuguese language. The acquired MR images are automatically reconstructed so as to result in a variable sequence of images (slices) of different vocal tract shapes in articulatory positions associated with Portuguese speech sounds. The knowledge obtained as a result of the proposed technique represents a direct contribution to the improvement of speech synthesis algorithms, thereby allowing for novel perceptions in coarticulation studies, in addition to providing further efficient clinical guidelines in the pursuit of more proficient speech rehabilitation processes. Copyright © 2011 The Voice Foundation. Published by Mosby, Inc. All rights reserved.
Intelligent acoustic data fusion technique for information security analysis
NASA Astrophysics Data System (ADS)
Jiang, Ying; Tang, Yize; Lu, Wenda; Wang, Zhongfeng; Wang, Zepeng; Zhang, Luming
2017-08-01
Tone is an essential component of word formation in all tonal languages, and it plays an important role in the transmission of information in speech communication. The study of tone characteristics can therefore be applied to the security analysis of acoustic signals, for example by means of language identification. In speech processing, fundamental frequency (F0) is often viewed by speech synthesis researchers as representing tone. However, regular F0 values may lead to low naturalness in synthesized speech. Moreover, F0 and tone are not linguistically equivalent; F0 is just one representation of a tone. Therefore, the electroglottography (EGG) signal is collected for a deeper study of tone characteristics. In this paper, focusing on the Northern Kam language, which has nine tonal contours and five level tone types, we first collected EGG and speech signals from six native male speakers of the Northern Kam language, and then obtained the clustering distributions of the tone curves. After summarizing the main characteristics of the tones of Northern Kam, we analyzed the relationship between EGG and speech signal parameters, and laid the foundation for further security analysis of acoustic signals.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bouchard, Kristofer E.; Conant, David F.; Anumanchipalli, Gopala K.; Dichter, Benjamin; Chaisanguanthum, Kris S.; Johnson, Keith; Chang, Edward F.
2016-03-28
A complete neurobiological understanding of speech motor control requires determination of the relationship between simultaneously recorded neural activity and the kinematics of the lips, jaw, tongue, and larynx. Many speech articulators are internal to the vocal tract, and therefore simultaneously tracking the kinematics of all articulators is nontrivial—especially in the context of human electrophysiology recordings. Here, we describe a noninvasive, multi-modal imaging system to monitor vocal tract kinematics, demonstrate this system in six speakers during production of nine American English vowels, and provide new analysis of such data. Classification and regression analysis revealed considerable variability in the articulator-to-acoustic relationship across speakers. Non-negative matrix factorization extracted basis sets capturing vocal tract shapes allowing for higher vowel classification accuracy than traditional methods. Statistical speech synthesis generated speech from vocal tract measurements, and we demonstrate perceptual identification. We demonstrate the capacity to predict lip kinematics from ventral sensorimotor cortical activity. These results demonstrate a multi-modal system to non-invasively monitor articulator kinematics during speech production, describe novel analytic methods for relating kinematic data to speech acoustics, and provide the first decoding of speech kinematics from electrocorticography. These advances will be critical for understanding the cortical basis of speech production and the creation of vocal prosthetics. PMID:27019106
JNDs of interaural time delay (ITD) of selected frequency bands in speech and music signals
NASA Astrophysics Data System (ADS)
Aliphas, Avner; Colburn, H. Steven; Ghitza, Oded
2002-05-01
JNDs of interaural time delay (ITD) of selected frequency bands in the presence of other frequency bands have been reported for noiseband stimuli [Zurek (1985); Trahiotis and Bernstein (1990)]. Similar measurements will be reported for speech and music signals. When stimuli are synthesized with bandpass/band-stop operations, performance with complex stimuli is similar to that with noisebands (JNDs of tens or hundreds of microseconds); however, the resulting waveforms, when viewed through a model of the auditory periphery, show distortions (irregularities in phase and level) at the boundaries of the target band of frequencies. An alternate synthesis method based upon group-delay filtering operations does not show these distortions and is being used for the current measurements. Preliminary measurements indicate that when music stimuli are created using the new techniques, JNDs of ITD increase significantly compared to previous studies, with values on the order of milliseconds.
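The naive way to impose an ITD on a single frequency band, which produces the boundary distortions noted above, is a linear phase shift applied only to that band in the FFT domain. A minimal sketch; the band edges and delay are illustrative values.

```python
# Sketch: imposing an interaural time delay (ITD) on one frequency band
# only, leaving other bands diotic, via a linear phase shift in the FFT
# domain. This is the simple bandpass/band-stop style approach, not the
# group-delay filtering method the study adopts.
import numpy as np

def delay_band(x, fs, f_lo, f_hi, itd_s):
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1 / fs)
    band = (f >= f_lo) & (f < f_hi)
    X[band] *= np.exp(-2j * np.pi * f[band] * itd_s)   # pure time delay
    return np.fft.irfft(X, n=len(x))

fs = 44100
x = np.random.default_rng(3).normal(size=fs // 2)      # broadband noise
left = x
right = delay_band(x, fs, f_lo=500.0, f_hi=1000.0, itd_s=300e-6)  # 300 us
```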
Fluid Dynamics of Human Phonation and Speech
NASA Astrophysics Data System (ADS)
Mittal, Rajat; Erath, Byron D.; Plesniak, Michael W.
2013-01-01
This article presents a review of the fluid dynamics, flow-structure interactions, and acoustics associated with human phonation and speech. Our voice is produced through the process of phonation in the larynx, and an improved understanding of the underlying physics of this process is essential to advancing the treatment of voice disorders. Insights into the physics of phonation and speech can also contribute to improved vocal training and the development of new speech compression and synthesis schemes. This article introduces the key biomechanical features of the laryngeal physiology, reviews the basic principles of voice production, and summarizes the progress made over the past half-century in understanding the flow physics of phonation and speech. Laryngeal pathologies, which significantly enhance the complexity of phonatory dynamics, are discussed. After a thorough examination of the state of the art in computational modeling and experimental investigations of phonatory biomechanics, we present a synopsis of the pacing issues in this arena and an outlook for research in this fascinating subject.
Monson, Brian B; Lotto, Andrew J; Story, Brad H
2012-09-01
The human singing and speech spectrum includes energy above 5 kHz. To begin an in-depth exploration of this high-frequency energy (HFE), a database of anechoic high-fidelity recordings of singers and talkers was created and analyzed. Third-octave band analysis from the long-term average spectra showed that production level (soft vs normal vs loud), production mode (singing vs speech), and phoneme (for voiceless fricatives) all significantly affected HFE characteristics. Specifically, increased production level caused an increase in absolute HFE level, but a decrease in relative HFE level. Singing exhibited higher levels of HFE than speech in the soft and normal conditions, but not in the loud condition. Third-octave band levels distinguished phoneme class of voiceless fricatives. Female HFE levels were significantly greater than male levels only above 11 kHz. This information is pertinent to various areas of acoustics, including vocal tract modeling, voice synthesis, augmentative hearing technology (hearing aids and cochlear implants), and training/therapy for singing and speech. PMID:22978902
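The third-octave band analysis used here reduces a long-term average spectrum to band levels on the standard base-2 third-octave series. A rough sketch follows; the window choice and band limits are simplifications.

```python
# Sketch: third-octave band levels from a long-term average spectrum,
# the analysis used above to quantify high-frequency energy (HFE).
import numpy as np

def third_octave_levels(x, fs, f_ref=1000.0):
    spec = np.abs(np.fft.rfft(x)) ** 2
    f = np.fft.rfftfreq(len(x), 1 / fs)
    levels = {}
    for k in range(-15, 14):                   # ~31 Hz up to Nyquist
        fc = f_ref * 2.0 ** (k / 3.0)
        lo, hi = fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)
        if hi > fs / 2:
            break
        band_power = spec[(f >= lo) & (f < hi)].sum()
        levels[round(fc)] = 10 * np.log10(band_power + 1e-20)
    return levels

fs = 44100
noise = np.random.default_rng(4).normal(size=fs)
hfe = {fc: lv for fc, lv in third_octave_levels(noise, fs).items() if fc > 5000}
```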
Voice input/output capabilities at Perception Technology Corporation
NASA Technical Reports Server (NTRS)
Ferber, Leon A.
1977-01-01
Condensed resumes of key company personnel at the Perception Technology Corporation are presented. The staff possesses expertise in speech recognition, speech synthesis, speaker authentication, and language identification. The capabilities of the company's hardware and software engineers are also included.
Artificial intelligence, expert systems, computer vision, and natural language processing
NASA Technical Reports Server (NTRS)
Gevarter, W. B.
1984-01-01
An overview of artificial intelligence (AI), its core ingredients, and its applications is presented. The knowledge representation, logic, problem solving approaches, languages, and computers pertaining to AI are examined, and the state of the art in AI is reviewed. The use of AI in expert systems, computer vision, natural language processing, speech recognition and understanding, speech synthesis, problem solving, and planning is examined. Basic AI topics, including automation, search-oriented problem solving, knowledge representation, and computational logic, are discussed.
Kim, Yoon-Chul; Narayanan, Shrikanth S; Nayak, Krishna S
2011-05-01
In speech production research using real-time magnetic resonance imaging (MRI), the analysis of articulatory dynamics is performed retrospectively. A flexible selection of temporal resolution is highly desirable because of natural variations in speech rate and variations in the speed of different articulators. The purpose of the study is to demonstrate a first application of golden-ratio spiral temporal view order to real-time speech MRI and investigate its performance by comparison with conventional bit-reversed temporal view order. Golden-ratio view order proved to be more effective at capturing the dynamics of rapid tongue tip motion. A method for automated blockwise selection of temporal resolution is presented that enables the synthesis of a single video from multiple temporal resolution videos and potentially facilitates subsequent vocal tract shape analysis. Copyright © 2010 Wiley-Liss, Inc.
Vibrato in Singing Voice: The Link between Source-Filter and Sinusoidal Models
NASA Astrophysics Data System (ADS)
Arroabarren, Ixone; Carlosena, Alfonso
2004-12-01
The application of inverse filtering techniques for high-quality singing voice analysis/synthesis is discussed. In the context of source-filter models, inverse filtering provides a noninvasive method to extract the voice source, and thus to study voice quality. Although this approach is widely used in speech synthesis, this is not the case in singing voice. Several studies have proved that inverse filtering techniques fail in the case of singing voice, the reasons being unclear. In order to shed light on this problem, we will consider here an additional feature of singing voice, not present in speech: the vibrato. Vibrato has been traditionally studied by sinusoidal modeling. As an alternative, we will introduce here a novel noninteractive source filter model that incorporates the mechanisms of vibrato generation. This model will also allow the comparison of the results produced by inverse filtering techniques and by sinusoidal modeling, as they apply to singing voice and not to speech. In this way, the limitations of these conventional techniques, described in previous literature, will be explained. Both synthetic signals and singer recordings are used to validate and compare the techniques presented in the paper.
On the recognition of emotional vocal expressions: motivations for a holistic approach.
Esposito, Anna; Esposito, Antonietta M
2012-10-01
Human beings seem to be able to recognize emotions from speech very well and information communication technology aims to implement machines and agents that can do the same. However, to be able to automatically recognize affective states from speech signals, it is necessary to solve two main technological problems. The former concerns the identification of effective and efficient processing algorithms capable of capturing emotional acoustic features from speech sentences. The latter focuses on finding computational models able to classify, with an approximation as good as human listeners, a given set of emotional states. This paper will survey these topics and provide some insights for a holistic approach to the automatic analysis, recognition and synthesis of affective states.
Field-testing the new DECtalk PC system for medical applications
NASA Technical Reports Server (NTRS)
Grams, R. R.; Smillov, A.; Li, B.
1992-01-01
Synthesized human speech has now reached a new level of performance. With the introduction of DEC's new DECtalk PC, the small-system developer will have a very powerful tool for creative design. It has been our privilege to be involved in the beta-testing of this new device and to add a medical dictionary covering a wide range of medical terminology. With speech synthesis handled at the board level and the medical dictionary in place, it is now possible to provide full digital speech output for all medical files and terms. The application of these tools will cover a wide range of options for the future and allow a new dimension in dealing with the complex user interfaces encountered in medical practice.
Telecommunicating in Tomorrow's World.
ERIC Educational Resources Information Center
Harkins, Judy
1983-01-01
Examples of new electronic devices used by deaf persons include electronic mail capabilities, teleprompters that can caption television live, and speech synthesis equipment. Consumers can establish and use assistive device centers to become familiar with the latest in technology. (CL)
NASA Astrophysics Data System (ADS)
Rimland, Jeff; Ballora, Mark
2014-05-01
The field of sonification, which uses auditory presentation of data to replace or augment visualization techniques, is gaining popularity and acceptance for analysis of "big data" and for assisting analysts who are unable to utilize traditional visual approaches due to either: 1) visual overload caused by existing displays; 2) concurrent need to perform critical visually intensive tasks (e.g. operating a vehicle or performing a medical procedure); or 3) visual impairment due to either temporary environmental factors (e.g. dense smoke) or biological causes. Sonification tools typically map data values to sound attributes such as pitch, volume, and localization to enable them to be interpreted via human listening. In more complex problems, the challenge is in creating multi-dimensional sonifications that are both compelling and listenable, and that have enough discrete features that can be modulated in ways that allow meaningful discrimination by a listener. We propose a solution to this problem that incorporates Complex Event Processing (CEP) with speech synthesis. Some of the more promising sonifications to date use speech synthesis, which is an "instrument" that is amenable to extended listening, and can also provide a great deal of subtle nuance. These vocal nuances, which can represent a nearly limitless number of expressive meanings (via a combination of pitch, inflection, volume, and other acoustic factors), are the basis of our daily communications, and thus have the potential to engage the innate human understanding of these sounds. Additionally, recent advances in CEP have facilitated the extraction of multi-level hierarchies of information, which is necessary to bridge the gap between raw data and this type of vocal synthesis. We therefore propose that CEP-enabled sonifications based on the sound of human utterances could be considered the next logical step in human-centric "big data" compression and transmission.
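As a concrete illustration of the basic sonification mapping described above, the sketch below maps data values to the pitch and volume of short tones. Simple sine tones stand in for the vocal synthesis the authors propose, and all ranges are arbitrary.

```python
# Sketch: the basic sonification mapping described above, with data
# values driving pitch and volume. Sine tones stand in for the
# CEP-driven vocal synthesis the authors propose.
import numpy as np

def sonify(values, fs=22050, dur=0.15, f_lo=200.0, f_hi=800.0):
    v = np.asarray(values, float)
    v = (v - v.min()) / (np.ptp(v) + 1e-12)       # normalize to [0, 1]
    t = np.arange(0, dur, 1 / fs)
    tones = [np.sin(2 * np.pi * (f_lo + x * (f_hi - f_lo)) * t)
             * (0.2 + 0.8 * x)                    # louder for larger values
             for x in v]
    return np.concatenate(tones)

audio = sonify([3.0, 7.5, 1.2, 9.9, 4.4])         # five data points as tones
```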
Measuring Speech Comprehensibility in Students with Down Syndrome
Woynaroski, Tiffany; Camarata, Stephen
2016-01-01
Purpose: There is an ongoing need to develop assessments of spontaneous speech that focus on whether the child's utterances are comprehensible to listeners. This study sought to identify the attributes of a stable ratings-based measure of speech comprehensibility, which enabled examining the criterion-related validity of an orthography-based measure of the comprehensibility of conversational speech in students with Down syndrome. Method: Participants were 10 elementary school students with Down syndrome and 4 unfamiliar adult raters. Averaged across-observer Likert ratings of speech comprehensibility were called a ratings-based measure of speech comprehensibility. The proportion of utterance attempts fully glossed constituted an orthography-based measure of speech comprehensibility. Results: Averaging across 4 raters on four 5-min segments produced a reliable (G = .83) ratings-based measure of speech comprehensibility. The ratings-based measure was strongly (r > .80) correlated with the orthography-based measure for both the same and different conversational samples. Conclusion: Reliable and valid measures of speech comprehensibility are achievable with the resources available to many researchers and some clinicians. PMID:27299989
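Both measures reduce to simple arithmetic once the ratings and glosses are in hand; a toy sketch with invented numbers:

```python
# Sketch: the two comprehensibility measures compared above. Ratings are
# averaged over raters and segments; the orthography-based measure is
# the proportion of utterance attempts fully glossed. Data are invented.
import numpy as np

ratings = np.array([                 # 4 raters x 4 five-minute segments
    [3, 4, 3, 4],
    [4, 4, 3, 5],
    [3, 3, 4, 4],
    [4, 5, 4, 4],
])
ratings_based = ratings.mean()       # averaged Likert rating

glossed, attempts = 42, 60
orthography_based = glossed / attempts

print(ratings_based, orthography_based)
```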
Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review
NASA Astrophysics Data System (ADS)
Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH
2017-09-01
This paper reviews state-of-the-art automatic speech recognition (ASR) based approaches for the speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from a speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR-based solutions are increasingly being researched for speech and language therapy. ASR is a technology that transcribes human speech into text by matching it against the system's library. This is particularly useful in speech rehabilitation therapies, as it provides accurate, real-time evaluation of speech input from an individual with a speech disorder. ASR-based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback on their mistakes. However, the accuracy of ASR depends on many factors, such as phoneme recognition, speech continuity, and speaker and environmental differences, as well as the depth of our knowledge of human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.
Edison, Erica E; Brosnan, Margaret E; Aziz, Khalid; Brosnan, John T
2013-09-28
Creatine is essential for normal neural development; children with inborn errors of creatine synthesis or transport exhibit neurological symptoms such as mental retardation, speech delay and epilepsy. Creatine accretion may occur through dietary intake or de novo creatine synthesis. The objective of the present study was to determine how much creatine an infant must synthesise de novo. We have calculated how much creatine an infant needs to account for urinary creatinine excretion (creatine's breakdown product) and new muscle lay-down. To measure an infant's dietary creatine intake, we measured creatine in mother's milk and in various commercially available infant formulas. Knowing the amount of milk/formula ingested, we calculated the amount of creatine ingested. We have found that a breast-fed infant receives about 9 % of the creatine needed in the diet and that infants fed cows' milk-based formula receive up to 36 % of the creatine needed. However, infants fed a soya-based infant formula receive negligible dietary creatine and must rely solely on de novo creatine synthesis. This is the first time that it has been shown that neonatal creatine accretion is largely due to de novo synthesis and not through dietary intake of creatine. This has important implications both for infants suffering from creatine deficiency syndromes and for neonatal amino acid metabolism.
Low-dimensional recurrent neural network-based Kalman filter for speech enhancement.
Xia, Youshen; Wang, Jun
2015-07-01
This paper proposes a new recurrent neural network-based Kalman filter for speech enhancement, based on a noise-constrained least squares estimate. The parameters of the speech signal, modeled as an autoregressive process, are first estimated by using the proposed recurrent neural network, and the speech signal is then recovered by Kalman filtering. The proposed recurrent neural network is globally asymptotically stable at the noise-constrained estimate. Because the noise-constrained estimate has a robust performance against non-Gaussian noise, the proposed recurrent neural network-based speech enhancement algorithm can minimize the estimation error of the Kalman filter parameters in non-Gaussian noise. Furthermore, having a low-dimensional model feature, the proposed neural network-based speech enhancement algorithm is much faster than two existing recurrent neural network-based speech enhancement algorithms. Simulation results show that the proposed recurrent neural network-based speech enhancement algorithm can produce good performance with fast computation and noise reduction. Copyright © 2015 Elsevier Ltd. All rights reserved.
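The second stage, Kalman filtering of the noisy samples under an AR signal model, is standard; below is a minimal sketch in which the AR coefficients are assumed already estimated (fixed here, rather than produced by the proposed network) and the noise variances are illustrative.

```python
# Sketch: Kalman filtering of a noisy speech frame under an AR(p) signal
# model, the second stage of the method above. AR coefficients are
# assumed known; q/r are process/observation noise variances.
import numpy as np

def kalman_ar(y, a, q, r):
    p = len(a)
    F = np.zeros((p, p)); F[0, :] = a; F[1:, :-1] = np.eye(p - 1)
    H = np.zeros((1, p)); H[0, 0] = 1.0
    x, P = np.zeros((p, 1)), np.eye(p)
    out = []
    for yt in y:
        x, P = F @ x, F @ P @ F.T                 # predict
        P[0, 0] += q                              # process noise on s_n
        S = H @ P @ H.T + r
        K = P @ H.T / S                           # Kalman gain
        x = x + K * (yt - H @ x)                  # update
        P = (np.eye(p) - K @ H) @ P
        out.append(x[0, 0])
    return np.array(out)

rng = np.random.default_rng(5)
clean = np.zeros(200)
for n in range(2, 200):                           # AR(2) "speech"
    clean[n] = 1.5 * clean[n-1] - 0.7 * clean[n-2] + 0.1 * rng.normal()
noisy = clean + 0.3 * rng.normal(size=200)
enhanced = kalman_ar(noisy, a=[1.5, -0.7], q=0.01, r=0.09)
```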
NASA Astrophysics Data System (ADS)
Athaudage, Chandranath R. N.; Bradley, Alan B.; Lech, Margaret
2003-12-01
A dynamic programming-based optimization strategy for a temporal decomposition (TD) model of speech and its application to low-rate speech coding in storage and broadcasting is presented. In previous work with the spectral stability-based event localizing (SBEL) TD algorithm, the event localization was performed based on a spectral stability criterion. Although this approach gave reasonably good results, there was no assurance on the optimality of the event locations. In the present work, we have optimized the event localizing task using a dynamic programming-based optimization strategy. Simulation results show that an improved TD model accuracy can be achieved. A methodology of incorporating the optimized TD algorithm within the standard MELP speech coder for the efficient compression of speech spectral information is also presented. The performance evaluation results revealed that the proposed speech coding scheme achieves 50%-60% compression of speech spectral information with negligible degradation in the decoded speech quality.
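A compact sketch of dynamic-programming event localization in the spirit described above: events are frames, and the cost of a segment is the error of linearly interpolating the parameter track between its two event frames. The toy data and cost function are assumptions, not the paper's exact formulation.

```python
# Sketch: DP event localization for temporal decomposition. Events are
# frames; segment cost = error of linear interpolation between events.
import numpy as np

def segment_cost(Y, i, j):
    w = np.linspace(0, 1, j - i + 1)[:, None]
    interp = (1 - w) * Y[i] + w * Y[j]
    return np.sum((Y[i:j + 1] - interp) ** 2)

def locate_events(Y, k):
    """Pick k+1 event frames (first and last fixed) minimizing total cost."""
    T = len(Y)
    D = np.full((k + 1, T), np.inf)
    back = np.zeros((k + 1, T), int)
    D[0, 0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, T):
            costs = [D[m - 1, i] + segment_cost(Y, i, j)
                     for i in range(m - 1, j)]
            best = int(np.argmin(costs))
            D[m, j], back[m, j] = costs[best], best + m - 1
    events, j = [T - 1], T - 1
    for m in range(k, 0, -1):
        j = back[m, j]; events.append(j)
    return events[::-1]

Y = np.random.default_rng(6).normal(size=(40, 10)).cumsum(0)  # LSF-like tracks
print(locate_events(Y, k=5))
```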
Speech perception at the interface of neurobiology and linguistics.
Poeppel, David; Idsardi, William J; van Wassenhove, Virginie
2008-03-12
Speech perception consists of a set of computations that take continuously varying acoustic waveforms as input and generate discrete representations that make contact with the lexical representations stored in long-term memory as output. Because the perceptual objects that are recognized by speech perception enter into subsequent linguistic computation, the format that is used for lexical representation and processing fundamentally constrains the speech perceptual processes. Consequently, theories of speech perception must, at some level, be tightly linked to theories of lexical representation. Minimally, speech perception must yield representations that smoothly and rapidly interface with stored lexical items. Adopting the perspective of Marr, we argue and provide neurobiological and psychophysical evidence for the following research programme. First, at the implementational level, speech perception is a multi-time resolution process, with perceptual analyses occurring concurrently on at least two time scales (approx. 20-80 ms, approx. 150-300 ms), commensurate with (sub)segmental and syllabic analyses, respectively. Second, at the algorithmic level, we suggest that perception proceeds on the basis of internal forward models, or uses an 'analysis-by-synthesis' approach. Third, at the computational level (in the sense of Marr), the theory of lexical representation that we adopt is principally informed by phonological research and assumes that words are represented in the mental lexicon in terms of sequences of discrete segments composed of distinctive features. One important goal of the research programme is to develop linking hypotheses between putative neurobiological primitives (e.g. temporal primitives) and those primitives derived from linguistic inquiry, to arrive ultimately at a biologically sensible and theoretically satisfying model of representation and computation in speech.
Speech perception in noise with a harmonic complex excited vocoder.
Churchill, Tyler H; Kan, Alan; Goupell, Matthew J; Ihlefeld, Antje; Litovsky, Ruth Y
2014-04-01
A cochlear implant (CI) presents band-pass-filtered acoustic envelope information by modulating current pulse train levels. Similarly, a vocoder presents envelope information by modulating an acoustic carrier. By studying how normal hearing (NH) listeners are able to understand degraded speech signals with a vocoder, the parameters that best simulate electric hearing and factors that might contribute to the NH-CI performance difference may be better understood. A vocoder with harmonic complex carriers (fundamental frequency, f0 = 100 Hz) was used to study the effect of carrier phase dispersion on speech envelopes and intelligibility. The starting phases of the harmonic components were randomly dispersed to varying degrees prior to carrier filtering and modulation. NH listeners were tested on recognition of a closed set of vocoded words in background noise. Two sets of synthesis filters simulated different amounts of current spread in CIs. Results showed that the speech vocoded with carriers whose starting phases were maximally dispersed was the most intelligible. Superior speech understanding may have been a result of the flattening of the dispersed-phase carrier's intrinsic temporal envelopes produced by the large number of interacting components in the high-frequency channels. Cross-correlogram analyses of auditory nerve model simulations confirmed that randomly dispersing the carrier's component starting phases resulted in better neural envelope representation. However, neural metrics extracted from these analyses were not found to accurately predict speech recognition scores for all vocoded speech conditions. It is possible that central speech understanding mechanisms are insensitive to the envelope-fine structure dichotomy exploited by vocoders.
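One channel of such a vocoder can be sketched as a harmonic complex carrier (f0 = 100 Hz) whose component starting phases are randomly dispersed, modulated by the band's envelope. The filter design and the fake envelope below are simplifying assumptions.

```python
# Sketch: one channel of the harmonic-complex vocoder above: a 100 Hz
# harmonic carrier with randomly dispersed component phases, modulated
# by the band's speech envelope.
import numpy as np

def harmonic_carrier(fs, dur, f0=100.0, f_lo=1000.0, f_hi=2000.0, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(0, dur, 1 / fs)
    carrier = np.zeros_like(t)
    for k in range(int(f_lo // f0), int(f_hi // f0) + 1):
        phase = rng.uniform(0, 2 * np.pi)       # maximal phase dispersion
        carrier += np.sin(2 * np.pi * k * f0 * t + phase)
    return carrier / np.max(np.abs(carrier))

fs, dur = 16000, 0.5
envelope = 0.5 * (1 + np.sin(2 * np.pi * 4 * np.arange(0, dur, 1 / fs)))
channel = envelope * harmonic_carrier(fs, dur)  # fake band envelope x carrier
```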
NASA Astrophysics Data System (ADS)
Kayasith, Prakasith; Theeramunkong, Thanaruk
It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. With consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce consistent speech signals for a certain word and distinguishable speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate of an individual dysarthric speaker's speech before the actual, exhaustive implementation of an automatic speech recognition system for that speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard methods, called the articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.
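The paper's exact formula for Ψ is not reproduced here, but one plausible formulation of a clarity-style index combining the two named factors is the ratio of between-word distinction to within-word consistency in some feature space. An illustrative sketch, in which the features, distance, and ratio form are all assumptions:

```python
# Sketch: an illustrative clarity-style index combining consistency (low
# feature distance among repetitions of a word) and distinction (high
# distance between different words). Not the paper's exact Psi.
import numpy as np

def clarity_index(features):
    """features: dict word -> array of shape (n_repetitions, dim)."""
    intra = np.mean([np.linalg.norm(reps[i] - reps[j])
                     for reps in features.values()
                     for i in range(len(reps))
                     for j in range(i + 1, len(reps))])
    words = list(features)
    inter = np.mean([np.linalg.norm(features[u].mean(0) - features[v].mean(0))
                     for u in words for v in words if u != v])
    return inter / (intra + 1e-12)             # higher = clearer speech

rng = np.random.default_rng(7)
feats = {w: rng.normal(loc=i, size=(5, 13))
         for i, w in enumerate(["ba", "da", "ga"])}
print(clarity_index(feats))
```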
Rumbach, Anna F; Rose, Tanya A; Cheah, Mynn
2018-01-29
To explore Australian speech-language pathologists' use of non-speech oral motor exercises, and rationales for using/not using non-speech oral motor exercises in clinical practice. A total of 124 speech-language pathologists practising in Australia, working with paediatric and/or adult clients with speech sound difficulties, completed an online survey. The majority of speech-language pathologists reported that they did not use non-speech oral motor exercises when working with paediatric or adult clients with speech sound difficulties. However, more than half of the speech-language pathologists working with adult clients who have dysarthria reported using non-speech oral motor exercises with this population. The most frequently reported rationale for using non-speech oral motor exercises in speech sound difficulty management was to improve awareness/placement of articulators. The majority of speech-language pathologists agreed there is no clear clinical or research evidence base to support non-speech oral motor exercise use with clients who have speech sound difficulties. This study provides an overview of Australian speech-language pathologists' reported use and perceptions of non-speech oral motor exercises' applicability and efficacy in treating paediatric and adult clients who have speech sound difficulties. The research findings provide speech-language pathologists with insight into how and why non-speech oral motor exercises are currently used, and add to the knowledge base regarding Australian speech-language pathology practice of non-speech oral motor exercises in the treatment of speech sound difficulties. Implications for Rehabilitation: Non-speech oral motor exercises refer to oral motor activities which do not involve speech, but involve the manipulation or stimulation of oral structures including the lips, tongue, jaw, and soft palate. Non-speech oral motor exercises are intended to improve the function (e.g., movement, strength) of oral structures. The majority of speech-language pathologists agreed there is no clear clinical or research evidence base to support non-speech oral motor exercise use with clients who have speech sound disorders. Non-speech oral motor exercise use was most frequently reported in the treatment of dysarthria. Non-speech oral motor exercise use when targeting speech sound disorders is not widely endorsed in the literature.
[Non-speech oral motor treatment efficacy for children with developmental speech sound disorders].
Ygual-Fernandez, A; Cervera-Merida, J F
2016-01-01
In the treatment of speech disorders by means of speech therapy, two antagonistic methodological approaches are applied: non-verbal ones, based on oral motor exercises (OME), and verbal ones, which are based on speech processing tasks with syllables, phonemes and words. In Spain, OME programmes are called 'programas de praxias', and are widely used and valued by speech therapists. The aim was to review the studies conducted on the effectiveness of OME-based treatments applied to children with speech disorders and the theoretical arguments that could justify, or not, their usefulness. Over the last few decades, evidence has been gathered about the lack of efficacy of this approach to treat developmental speech disorders and pronunciation problems in populations without any neurological alteration of motor functioning. The American Speech-Language-Hearing Association has advised against its use, taking into account the principles of evidence-based practice. The knowledge gathered to date on motor control shows that the pattern of mobility and its corresponding organisation in the brain are different in speech and in other non-verbal functions linked to nutrition and breathing. Neither the studies on their effectiveness nor the arguments based on motor control studies recommend the use of OME-based programmes for the treatment of pronunciation problems in children with developmental language disorders.
Lee, J D; Caven, B; Haake, S; Brown, T L
2001-01-01
As computer applications for cars emerge, a speech-based interface offers an appealing alternative to the visually demanding direct manipulation interface. However, speech-based systems may pose cognitive demands that could undermine driving safety. This study used a car-following task to evaluate how a speech-based e-mail system affects drivers' response to the periodic braking of a lead vehicle. The study included 24 drivers between the ages of 18 and 24 years. A baseline condition with no e-mail system was compared with a simple and a complex e-mail system in both simple and complex driving environments. The results show a 30% (310 ms) increase in reaction time when the speech-based system is used. Subjective workload ratings and probe questions also indicate that speech-based interaction introduces a significant cognitive load, which was highest for the complex e-mail system. These data show that a speech-based interface is not a panacea that eliminates the potential distraction of in-vehicle computers. Actual or potential applications of this research include design of in-vehicle information systems and evaluation of their contributions to driver distraction.
Hologlyphics: volumetric image synthesis performance system
NASA Astrophysics Data System (ADS)
Funk, Walter
2008-02-01
This paper describes a novel volumetric image synthesis system and artistic technique, which generate moving volumetric images in real-time, integrated with music. The system, called the Hologlyphic Funkalizer, is performance based, wherein the images and sound are controlled by a live performer, for the purposes of entertaining a live audience and creating a performance art form unique to volumetric and autostereoscopic images. While currently configured for a specific parallax barrier display, the Hologlyphic Funkalizer's architecture is completely adaptable to various volumetric and autostereoscopic display technologies. Sound is distributed through a multi-channel audio system; currently a quadraphonic speaker setup is implemented. The system controls volumetric image synthesis, production of music and spatial sound via acoustic analysis and human gestural control, using a dedicated control panel, motion sensors, and multiple musical keyboards. Music can be produced by external acoustic instruments, pre-recorded sounds or custom audio synthesis integrated with the volumetric image synthesis. Aspects of the sound can control the evolution of images and vice versa. Sounds can be associated and interact with images; for example, voice synthesis can be combined with an animated volumetric mouth, where nuances of generated speech modulate the mouth's expressiveness. Different images can be sent to up to 4 separate displays. The system applies many novel volumetric special effects, and extends several film and video special effects into the volumetric realm. Extensive and varied content has been developed and shown to live audiences by a live performer. Real-world applications will be explored, with feedback on the human factors.
Applications of Microcomputers in the Education of the Physically Disabled Child.
ERIC Educational Resources Information Center
Foulds, Richard A.
1982-01-01
Microcomputers can serve as expressive communication tools for severely physically disabled persons. Features such as single input devices, direct selection aids, and speech synthesis capabilities can be extremely useful. The trend toward portable battery-operated computers will make the technology even more accessible. (CL)
Konnen Computer das Sprachproblem losen (Can Computers Solve the Language Problem)?
ERIC Educational Resources Information Center
Zeilinger, Michael
1972-01-01
Various computer applications in linguistics, primarily speech synthesis and machine translation, are reviewed. Although the computer proves useful for statistics, dictionary building and programmed instruction, the promulgation of a world auxiliary language is considered a more human and practical solution to the international communication…
ERIC Educational Resources Information Center
Edwards, Paul
The paper examines the applications of microcomputers to recreation programing for blind persons. The accessibility of microcomputers to this population is discussed, and the advantages as well as disadvantages of speech synthesis equipment are noted. Information is presented on the modification of hardware for Radio Shack and Apple computers.…
Evidence-Based Systematic Review: Effects of Nonspeech Oral Motor Exercises on Speech
ERIC Educational Resources Information Center
McCauley, Rebecca J.; Strand, Edythe; Lof, Gregory L.; Schooling, Tracy; Frymark, Tobi
2009-01-01
Purpose: The purpose of this systematic review was to examine the current evidence for the use of oral motor exercises (OMEs) on speech (i.e., speech physiology, speech production, and functional speech outcomes) as a means of supporting further research and clinicians' use of evidence-based practice. Method: The peer-reviewed literature from 1960…
A Networking of Community-Based Speech Therapy: Borabue District, Maha Sarakham.
Pumnum, Tawitree; Kum-ud, Weawta; Prathanee, Benjamas
2015-08-01
Most children with cleft lip and palate have articulation problems because of compensatory articulation disorders arising from velopharyngeal insufficiency. Theoretically, children should receive speech therapy from a speech and language pathologist (SLP) for 1-2 sessions per week. In developing countries, particularly Thailand, most of them cannot reach standard speech services because of the limited availability of speech services and SLPs. Networking of a Community-Based Speech Model might be an appropriate way to solve this problem. The aim was to study the effectiveness of a networking of the Khon Kaen University (KKU) Community-Based Speech Model, Non Thong Tambon Health Promotion Hospital, Borabue, Maha Sarakham, in decreasing the number of articulation errors in children with CLP. Six children with cleft lip and palate (CLP) who lived in Borabue and the surrounding district, Maha Sarakham, and had medical records in Srinagarind Hospital participated. They were assessed for pre- and post-treatment articulation errors and were provided speech therapy by an SLP via on-service teaching for a speech assistant (SA). The children with CLP then received speech correction (SC) from the SA based on assignment, and caregivers practiced a home program for a year. The networking of Non Thong Tambon Health Promotion Hospital, Borabue, Maha Sarakham significantly reduced the number of post-treatment articulation errors for 3 children with CLP. Factors affecting the results in the treatment of the other children were as follows: delayed speech and language development, hypernasality, and the consistency of SC at the local hospital and at home. A networking of the KKU Community-Based Speech Model, Non Thong Tambon Health Promotion Hospital, Borabue, Maha Sarakham was a good way to enhance speech therapy in Thailand and other developing countries where speech services are limited or professionals are lacking.
Vector adaptive predictive coder for speech and audio
NASA Technical Reports Server (NTRS)
Chen, Juin-Hwey (Inventor); Gersho, Allen (Inventor)
1990-01-01
A real-time vector adaptive predictive coder which approximates each vector of K speech samples by using each of M fixed vectors in a first codebook to excite a time-varying synthesis filter and picking the vector that minimizes distortion. Predictive analysis for each frame determines parameters used for computing, from vectors in the first codebook, zero-state response vectors that are stored at the same address (index) in a second codebook. Encoding of input speech vectors s_n is then carried out using the second codebook. When the vector that minimizes distortion is found, its index is transmitted to a decoder which has a codebook identical to the first codebook of the encoder. There the index is used to read out a vector that is used to synthesize an output speech vector s_n. The parameters used in the encoder are quantized, for example by using a table, and the indices are transmitted to the decoder, where they are decoded to specify the transfer characteristics of the filters used in producing the vector s_n from the receiver codebook vector selected by the transmitted index.
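The codebook search step described above reduces, in its simplest form, to picking the index whose candidate (zero-state response) vector is closest to the target in squared error. A minimal numpy sketch of that search follows; the array names and sizes are illustrative, not taken from the patent.

```python
import numpy as np

def best_codevector(target, candidates):
    """Return the codebook index whose candidate vector minimizes
    squared-error distortion against the target speech vector.
    `candidates` is an (M, K) array: M vectors of K samples each."""
    distortions = np.sum((candidates - target) ** 2, axis=1)
    return int(np.argmin(distortions))

# Toy usage: M = 64 candidate vectors, K = 8 samples per vector.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((64, 8))   # stands in for the second codebook
s_n = rng.standard_normal(8)              # one input speech vector
index = best_codevector(s_n, codebook)    # this index would be transmitted
```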
ERIC Educational Resources Information Center
Oliveira, Carla; Lousada, Marisa; Jesus, Luis M. T.
2015-01-01
Children with speech sound disorders (SSD) represent a large number of speech and language therapists' caseloads. The intervention with children who have SSD can involve different therapy approaches, and these may be articulatory or phonologically based. Some international studies reveal a widespread application of articulatory based approaches in…
Gauvin, Hanna S; De Baene, Wouter; Brass, Marcel; Hartsuiker, Robert J
2016-02-01
To minimize the number of errors in speech, and thereby facilitate communication, speech is monitored before articulation. It is, however, unclear at which level during speech production monitoring takes place, and what mechanisms are used to detect and correct errors. The present study investigated whether internal verbal monitoring takes place through the speech perception system, as proposed by perception-based theories of speech monitoring, or whether mechanisms independent of perception are applied, as proposed by production-based theories of speech monitoring. With the use of fMRI during a tongue twister task we observed that error detection in internal speech during noise-masked overt speech production and error detection in speech perception both recruit the same neural network, which includes pre-supplementary motor area (pre-SMA), dorsal anterior cingulate cortex (dACC), anterior insula (AI), and inferior frontal gyrus (IFG). Although production and perception recruit similar areas, as proposed by perception-based accounts, we did not find activation in superior temporal areas (which are typically associated with speech perception) during internal speech monitoring in speech production as hypothesized by these accounts. On the contrary, results are highly compatible with a domain general approach to speech monitoring, by which internal speech monitoring takes place through detection of conflict between response options, which is subsequently resolved by a domain general executive center (e.g., the ACC).
Lai, Ying-Hui; Chen, Fei; Wang, Syu-Siang; Lu, Xugang; Tsao, Yu; Lee, Chin-Hui
2017-07-01
In a cochlear implant (CI) speech processor, noise reduction (NR) is a critical component for enabling CI users to attain improved speech perception under noisy conditions. Identifying an effective NR approach has long been a key topic in CI research. Recently, a deep denoising autoencoder (DDAE) based NR approach was proposed and shown to be effective in restoring clean speech from noisy observations. It was also shown that DDAE could provide better performance than several existing NR methods in standardized objective evaluations. Following this success with normal speech, this paper further investigated the performance of DDAE-based NR to improve the intelligibility of envelope-based vocoded speech, which simulates speech signal processing in existing CI devices. We compared the performance of speech intelligibility between DDAE-based NR and conventional single-microphone NR approaches using the noise vocoder simulation. The results of both objective evaluations and listening test showed that, under the conditions of nonstationary noise distortion, DDAE-based NR yielded higher intelligibility scores than conventional NR approaches. This study confirmed that DDAE-based NR could potentially be integrated into a CI processor to provide more benefits to CI users under noisy conditions.
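As a rough illustration of the DDAE idea (not the authors' exact architecture), the sketch below maps noisy log-spectral frames to clean ones with a small feed-forward network trained on mean squared error; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 257 FFT bins per frame.
ddae = nn.Sequential(
    nn.Linear(257, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 257),              # reconstruct clean log-magnitudes
)
optimizer = torch.optim.Adam(ddae.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(noisy_frames, clean_frames):
    """One gradient step toward restoring clean speech from noisy input."""
    optimizer.zero_grad()
    loss = loss_fn(ddae(noisy_frames), clean_frames)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch of 32 paired noisy/clean frames.
print(train_step(torch.randn(32, 257), torch.randn(32, 257)))
```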
ERIC Educational Resources Information Center
Adank, Patti
2012-01-01
The role of speech production mechanisms in difficult speech comprehension is the subject of on-going debate in speech science. Two Activation Likelihood Estimation (ALE) analyses were conducted on neuroimaging studies investigating difficult speech comprehension or speech production. Meta-analysis 1 included 10 studies contrasting comprehension…
Nawaz, Tabassam; Mehmood, Zahid; Rashid, Muhammad; Habib, Hafiz Adnan
2018-01-01
Recent research on speech segregation and music fingerprinting has led to improvements in speech segregation and music identification algorithms. Speech and music segregation generally involves the identification of music followed by speech segregation. However, music segregation becomes a challenging task in the presence of noise. This paper proposes a novel method of speech segregation for unlabelled stationary noisy audio signals using the deep belief network (DBN) model. The proposed method successfully segregates a music signal from noisy audio streams. A recurrent neural network (RNN)-based hidden layer segregation model is applied to remove stationary noise. Dictionary-based Fisher algorithms are employed for speech classification. The proposed method is tested on three datasets (TIMIT, MIR-1K, and MusicBrainz), and the results indicate the robustness of the proposed method for speech segregation. The qualitative and quantitative analyses carried out on the three datasets demonstrate the efficiency of the proposed method compared to state-of-the-art speech segregation and classification-based methods. PMID:29558485
The Speech multi features fusion perceptual hash algorithm based on tensor decomposition
NASA Astrophysics Data System (ADS)
Huang, Y. B.; Fan, M. H.; Zhang, Q. Y.
2018-03-01
With constant progress in modern speech communication technologies, speech data are prone to attack by noise or malicious tampering. In order to give the speech perceptual hash algorithm strong robustness and high efficiency, this paper proposes a speech perceptual hash algorithm based on tensor decomposition and multiple features. The algorithm applies wavelet packet decomposition to obtain the speech components and extracts the LPCC, LSP, and ISP features of each component to constitute a speech feature tensor. Speech authentication is done by generating hash values through quantification of the feature matrix against its mid-value. Experimental results show that the proposed algorithm is robust to content-preserving operations compared with similar algorithms and is able to resist attack by common background noise. The algorithm is also computationally efficient, able to meet the real-time requirements of speech communication and complete speech authentication quickly.
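The mid-value quantification step can be pictured as binarizing a feature matrix against its median and comparing hashes by normalized Hamming distance. The sketch below is a simplified stand-in for the paper's tensor-decomposition pipeline, not a reimplementation of it.

```python
import numpy as np

def perceptual_hash(feature_matrix):
    """Binarize a speech feature matrix against its median ("mid-value")
    to produce a hash bit vector."""
    return (feature_matrix > np.median(feature_matrix)).astype(np.uint8).ravel()

def bit_error_rate(h1, h2):
    """Normalized Hamming distance used to authenticate content:
    small for content-preserving operations, large for tampering."""
    return float(np.mean(h1 != h2))
```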
Hansen, J H; Nandkumar, S
1995-01-01
The formulation of reliable signal processing algorithms for speech coding and synthesis requires the selection of a prior criterion of performance. Though coding efficiency (bits/second) or computational requirements can be used, a final performance measure must always include speech quality. In this paper, three objective speech quality measures are considered with respect to quality assessment for American English, noisy American English, and noise-free versions of seven languages. The purpose is to determine whether objective quality measures can be used to quantify changes in quality for a given voice coding method, with a known subjective performance level, as background noise or language conditions are changed. The speech coding algorithm chosen is regular-pulse excitation with long-term prediction (RPE-LTP), which has been chosen as the standard voice compression algorithm for the European Digital Mobile Radio system. Three areas are considered for objective quality assessment: (i) vocoder performance for American English in a noise-free environment, (ii) speech quality variation for three additive background noise sources, and (iii) noise-free performance for seven languages, which include English, Japanese, Finnish, German, Hindi, Spanish, and French. It is suggested that although existing objective quality measures will never replace subjective testing, they can be a useful means of assessing changes in performance, identifying areas for improvement in algorithm design, and augmenting subjective quality tests for voice coding/compression algorithms in noise-free, noisy, and/or non-English applications.
Texting while driving: is speech-based text entry less risky than handheld text entry?
He, J; Chaparro, A; Nguyen, B; Burge, R J; Crandall, J; Chaparro, B; Ni, R; Cao, S
2014-11-01
Research indicates that using a cell phone to talk or text while maneuvering a vehicle impairs driving performance. However, few published studies directly compare the distracting effects of texting using a hands-free (i.e., speech-based interface) versus handheld cell phone, which is an important issue for legislation, automotive interface design and driving safety training. This study compared the effect of speech-based versus handheld text entries on simulated driving performance by asking participants to perform a car following task while controlling the duration of a secondary text-entry task. Results showed that both speech-based and handheld text entries impaired driving performance relative to the drive-only condition by causing more variation in speed and lane position. Handheld text entry also increased the brake response time and increased variation in headway distance. Text entry using a speech-based cell phone was less detrimental to driving performance than handheld text entry. Nevertheless, the speech-based text entry task still significantly impaired driving compared to the drive-only condition. These results suggest that speech-based text entry disrupts driving, but reduces the level of performance interference compared to text entry with a handheld device. In addition, the difference in the distraction effect caused by speech-based and handheld text entry is not simply due to the difference in task duration.
Automated Discovery of Speech Act Categories in Educational Games
ERIC Educational Resources Information Center
Rus, Vasile; Moldovan, Cristian; Niraula, Nobal; Graesser, Arthur C.
2012-01-01
In this paper we address the important task of automated discovery of speech act categories in dialogue-based, multi-party educational games. Speech acts are important in dialogue-based educational systems because they help infer the student speaker's intentions (the task of speech act classification) which in turn is crucial to providing adequate…
The Suitability of Cloud-Based Speech Recognition Engines for Language Learning
ERIC Educational Resources Information Center
Daniels, Paul; Iwago, Koji
2017-01-01
As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with CALL software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines--Apple's…
Speech Correction for Children with Cleft Lip and Palate by Networking of Community-Based Care.
Hanchanlert, Yotsak; Pramakhatay, Worawat; Pradubwong, Suteera; Prathanee, Benjamas
2015-08-01
Prevalence of cleft lip and palate (CLP) is high in Northeast Thailand. Most children with CLP face many problems, particularly compensatory articulation disorders (CAD) that persist beyond surgery, while speech services and the number of speech and language pathologists (SLPs) are limited. To determine the effectiveness of networking of the Khon Kaen University (KKU) Community-Based Speech Therapy Model: Kosumphisai Hospital, Kosumphisai District and Maha Sarakham Hospital, Mueang District, Maha Sarakham Province, in reducing the number of articulation errors for children with CLP. Eleven children with CLP were recruited in three 1-year projects of the KKU Community-Based Speech Therapy Model. Articulation tests were formally assessed by qualified speech and language pathologists (SLPs) for baseline and post-treatment outcomes. Teaching on services for speech assistants (SAs) was conducted by SLPs. Assigned speech correction (SC) was performed by SAs at home and at local hospitals. Caregivers also gave SC at home 3-4 days a week. Networking of the Community-Based Speech Therapy Model significantly reduced the number of articulation errors for children with CLP at both word and sentence levels (mean difference = 6.91, 95% confidence interval = 4.15-9.67; mean difference = 5.36, 95% confidence interval = 2.99-7.73, respectively). Networking by Kosumphisai and Maha Sarakham of the KKU Community-Based Speech Therapy Model was a valid and efficient method for providing speech services for children with cleft palate and could be extended to any area in Thailand and other developing countries that have similar contexts.
The Digichaint Interactive Game as a Virtual Learning Environment for Irish
ERIC Educational Resources Information Center
Ní Chiaráin, Neasa; Ní Chasaide, Ailbhe
2016-01-01
Although Text-To-Speech (TTS) synthesis has been little used in Computer-Assisted Language Learning (CALL), it is ripe for deployment, particularly for minority and endangered languages, where learners have little access to native speaker models and where few genuinely interactive and engaging teaching/learning materials are available. These…
NASA Astrophysics Data System (ADS)
Dat, Tran Huy; Takeda, Kazuya; Itakura, Fumitada
We present a multichannel speech enhancement method based on MAP speech spectral magnitude estimation using a generalized gamma model of the speech prior distribution, where the model parameters are adapted from actual noisy speech in a frame-by-frame manner. The utilization of a more general prior distribution with online adaptive estimation is shown to be effective for speech spectral estimation in noisy environments. Furthermore, the multichannel information, in terms of cross-channel statistics, is shown to be useful for better adapting the prior distribution parameters to the actual observation, resulting in better performance of the speech enhancement algorithm. We tested the proposed algorithm on an in-car speech database and obtained significant improvements in speech recognition performance, particularly under non-stationary noise conditions such as music, air-conditioning, and open windows.
Evaluation of the importance of time-frequency contributions to speech intelligibility in noise
Yu, Chengzhu; Wójcicki, Kamil K.; Loizou, Philipos C.; Hansen, John H. L.; Johnson, Michael T.
2014-01-01
Recent studies on binary masking techniques make the assumption that each time-frequency (T-F) unit contributes an equal amount to the overall intelligibility of speech. The present study demonstrated that the importance of each T-F unit to speech intelligibility varies in accordance with speech content. Specifically, T-F units are categorized into two classes, speech-present T-F units and speech-absent T-F units. Results indicate that the importance of each speech-present T-F unit to speech intelligibility is highly related to the loudness of its target component, while the importance of each speech-absent T-F unit varies according to the loudness of its masker component. Two types of mask errors are also considered, which include miss and false alarm errors. Consistent with previous work, false alarm errors are shown to be more harmful to speech intelligibility than miss errors when the mixture signal-to-noise ratio (SNR) is below 0 dB. However, the relative importance between the two types of error is conditioned on the SNR level of the input speech signal. Based on these observations, a mask-based objective measure, the loudness weighted hit-false, is proposed for predicting speech intelligibility. The proposed objective measure shows significantly higher correlation with intelligibility compared to two existing mask-based objective measures. PMID:24815280
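In mask-based terms, a miss removes a speech-present T-F unit and a false alarm retains a speech-absent one. A minimal numpy sketch of counting both error types against the ideal binary mask follows; it omits the loudness weighting the article proposes.

```python
import numpy as np

def mask_error_counts(speech_power, noise_power, estimated_mask, lc_db=0.0):
    """Count miss and false-alarm errors of a boolean estimated mask
    relative to the ideal binary mask (speech-present where local SNR
    exceeds the criterion lc_db)."""
    local_snr = 10.0 * np.log10(speech_power / noise_power)
    ibm = local_snr > lc_db
    miss = int(np.sum(ibm & ~estimated_mask))          # speech wrongly removed
    false_alarm = int(np.sum(~ibm & estimated_mask))   # noise wrongly retained
    return miss, false_alarm
```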
Women's Speech/Men's Speech: Does Forensic Training Make a Difference?
ERIC Educational Resources Information Center
Larson, Suzanne; Vreeland, Amy L.
A study of cross examination speeches of males and females was conducted to determine gender differences in intercollegiate debate. The theory base for gender differences in speech is closely tied to the analysis of dyadic conversation. It is based on the belief that women are less forceful and dominant in cross examination, and will exhibit…
Application of artificial intelligence principles to the analysis of "crazy" speech.
Garfield, D A; Rapp, C
1994-04-01
Artificial intelligence computer simulation methods can be used to investigate psychotic or "crazy" speech. Here, symbolic reasoning algorithms establish semantic networks that schematize speech. These semantic networks consist of two main structures: case frames and object taxonomies. Node-based reasoning rules apply to object taxonomies and pathway-based reasoning rules apply to case frames. Normal listeners may recognize speech as "crazy talk" based on violations of node- and pathway-based reasoning rules. In this article, three separate segments of schizophrenic speech illustrate violations of these rules. This artificial intelligence approach is compared and contrasted with other neurolinguistic approaches and is discussed as a conceptual link between neurobiological and psychodynamic understandings of psychopathology.
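A toy illustration of the two structures, under invented vocabulary: an object taxonomy as parent links, and a node-based rule requiring a case-frame filler to descend from an expected class.

```python
# Toy object taxonomy: each node points to its parent class.
taxonomy = {"dog": "animal", "animal": "entity",
            "idea": "abstraction", "abstraction": "entity"}

def is_a(node, ancestor):
    """Node-based reasoning: walk the taxonomy upward."""
    while node is not None:
        if node == ancestor:
            return True
        node = taxonomy.get(node)
    return False

# A case frame for "eat" whose agent slot expects something animate.
frame = {"verb": "eat", "agent": "idea"}
if not is_a(frame["agent"], "animal"):
    print("node-based rule violation: inanimate agent for 'eat'")
```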
Cheung, Gladys; Trembath, David; Arciuli, Joanne; Togher, Leanne
2013-08-01
Although researchers have examined barriers to implementing evidence-based practice (EBP) at the level of the individual, little is known about the effects workplaces have on speech-language pathologists' implementation of EBP. The aim of this study was to examine the impact of workplace factors on the use of EBP amongst speech-language pathologists who work with children with Autism Spectrum Disorder (ASD). This study sought to (a) explore views about EBP amongst speech-language pathologists who work with children with ASD, (b) identify workplace factors which, in the participants' opinions, acted as barriers or enablers to their provision of evidence-based speech-language pathology services, and (c) examine whether or not speech-language pathologists' responses to workplace factors differed based on the type of workplace or their years of experience. A total of 105 speech-language pathologists from across Australia completed an anonymous online questionnaire. The results indicate that, although the majority of speech-language pathologists agreed that EBP is necessary, they experienced barriers to their implementation of EBP including workplace culture and support, lack of time, cost of EBP, and the availability and accessibility of EBP resources. The barriers reported by speech-language pathologists were similar, regardless of their workplace (private practice vs organization) and years of experience.
Long short-term memory for speaker generalization in supervised speech separation
Chen, Jitong; Wang, DeLiang
2017-01-01
Speech separation can be formulated as learning to estimate a time-frequency mask from acoustic features extracted from noisy speech. For supervised speech separation, generalization to unseen noises and unseen speakers is a critical issue. Although deep neural networks (DNNs) have been successful in noise-independent speech separation, DNNs are limited in modeling a large number of speakers. To improve speaker generalization, a separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for temporal dynamics of speech. Systematic evaluation shows that the proposed model substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility. Analyzing LSTM internal representations reveals that LSTM captures long-term speech contexts. It is also found that the LSTM model is more advantageous for low-latency speech separation: even without future frames, it performs better than the DNN model with future frames. The proposed model represents an effective approach for speaker- and noise-independent speech separation. PMID:28679261
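A minimal PyTorch sketch of such a mask estimator is shown below; the feature and mask dimensions, layer count, and hidden size are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """LSTM that maps acoustic feature sequences to a T-F mask."""
    def __init__(self, n_features=64, n_hidden=256, n_bins=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, num_layers=2,
                            batch_first=True)
        self.out = nn.Linear(n_hidden, n_bins)

    def forward(self, x):                  # x: (batch, time, features)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h))  # mask values in [0, 1]

mask = MaskEstimator()(torch.randn(1, 100, 64))  # -> (1, 100, 64)
```

Because the LSTM here runs strictly left to right, the model needs no future frames, which is what makes it attractive for low-latency separation.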
Speech enhancement based on modified phase-opponency detectors
NASA Astrophysics Data System (ADS)
Deshmukh, Om D.; Espy-Wilson, Carol Y.
2005-09-01
A speech enhancement algorithm based on a neural model was presented by Deshmukh et al. [149th meeting of the Acoustical Society of America, 2005]. The algorithm consists of a bank of Modified Phase Opponency (MPO) filter pairs tuned to different center frequencies. This algorithm is able to enhance salient spectral features in speech signals even at low signal-to-noise ratios. However, the algorithm introduces musical noise and sometimes misses a spectral peak that is close in frequency to a stronger spectral peak. Refinement in the design of the MPO filters was recently made that takes advantage of the falling spectrum of the speech signal in sonorant regions. The modified set of filters leads to better separation of the noise and speech signals, and more accurate enhancement of spectral peaks. The improvements also lead to a significant reduction in musical noise. Continuity algorithms based on the properties of speech signals are used to further reduce the musical noise effect. The efficiency of the proposed method in enhancing the speech signal when the level of the background noise is fluctuating will be demonstrated. The performance of the improved speech enhancement method will be compared with various spectral subtraction-based methods. [Work supported by NSF BCS0236707.]
Development of a speech autocuer
NASA Astrophysics Data System (ADS)
Bedles, R. L.; Kizakvich, P. N.; Lawson, D. T.; McCartney, M. L.
1980-12-01
A wearable, visually based prosthesis for the deaf based upon the proven method for removing lipreading ambiguity known as cued speech was fabricated and tested. Both software and hardware developments are described, including a microcomputer, display, and speech preprocessor.
Development of a speech autocuer
NASA Technical Reports Server (NTRS)
Bedles, R. L.; Kizakvich, P. N.; Lawson, D. T.; Mccartney, M. L.
1980-01-01
A wearable, visually based prosthesis for the deaf based upon the proven method for removing lipreading ambiguity known as cued speech was fabricated and tested. Both software and hardware developments are described, including a microcomputer, display, and speech preprocessor.
Job Stress of School-Based Speech-Language Pathologists
ERIC Educational Resources Information Center
Harris, Stephanie Ferney; Prater, Mary Anne; Dyches, Tina Taylor; Heath, Melissa Allen
2009-01-01
Stress and burnout contribute significantly to the shortages of school-based speech-language pathologists (SLPs). At the request of the Utah State Office of Education, the researchers measured the stress levels of 97 school-based SLPs using the "Speech-Language Pathologist Stress Inventory." Results indicated that participants' emotional-fatigue…
Shin, Young Hoon; Seo, Jiwon
2016-01-01
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing. PMID:27801867
Significance of parametric spectral ratio methods in detection and recognition of whispered speech
NASA Astrophysics Data System (ADS)
Mathur, Arpit; Reddy, Shankar M.; Hegde, Rajesh M.
2012-12-01
In this article the significance of a new parametric spectral ratio method that can be used to detect whispered speech segments within normally phonated speech is described. Adaptation methods based on maximum likelihood linear regression (MLLR) are then used to realize a mismatched train-test style speech recognition system. The proposed parametric spectral ratio method computes a ratio spectrum of the linear prediction (LP) and minimum variance distortionless response (MVDR) methods. The smoothed ratio spectrum is then used to detect whispered segments of speech within neutral speech segments effectively. The proposed LP-MVDR ratio method exhibits robustness at different SNRs, as indicated by the whisper diarization experiments conducted on the CHAINS and the cell phone whispered speech corpora. The proposed method also performs reasonably better than conventional methods for whisper detection. In order to integrate the proposed whisper detection method into a conventional speech recognition engine with minimal changes, adaptation methods based on MLLR are used herein. The hidden Markov models corresponding to neutral-mode speech are adapted to the whispered-mode speech data in the whispered regions detected by the proposed ratio method. The performance of this method is first evaluated on whispered speech data from the CHAINS corpus. The second set of experiments is conducted on the cell phone corpus of whispered speech. This corpus was collected using a setup that is used commercially for handling public transactions. The proposed whispered speech recognition system exhibits reasonably better performance than several conventional methods. The results indicate the possibility of a whispered speech recognition system for cell phone based transactions.
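A numpy sketch of the core ratio feature follows: the LP envelope over the MVDR envelope, both derived from one frame's autocorrelation. The model order and frame handling are assumptions, and the article's smoothing and detection thresholds are omitted.

```python
import numpy as np
from scipy.linalg import toeplitz, solve_toeplitz

def lp_mvdr_ratio(frame, p=12, n_freq=256):
    """Ratio of the LP spectrum to the MVDR spectrum for one frame."""
    r = np.correlate(frame, frame, "full")[len(frame) - 1:][:p + 1]
    a = solve_toeplitz((r[:p], r[:p]), r[1:p + 1])    # LP coefficients
    err = r[0] - np.dot(a, r[1:p + 1])                # prediction error power
    w = np.linspace(0, np.pi, n_freq)
    E = np.exp(-1j * np.outer(w, np.arange(p + 1)))   # steering vectors
    p_lp = err / np.abs(1.0 - E[:, 1:] @ a) ** 2      # LP envelope
    R_inv = np.linalg.inv(toeplitz(r))                # (p+1) x (p+1)
    p_mvdr = 1.0 / np.real(np.einsum("fi,ij,fj->f", E.conj(), R_inv, E))
    return p_lp / p_mvdr
```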
Delphi, Maryam; Lotfi, M-Yones; Moossavi, Abdollah; Bakhshi, Enayatollah; Banimostafa, Maryam
2017-09-01
Previous studies have shown that interaural-time-difference (ITD) training can improve localization ability. Surprisingly little is, however, known about localization training vis-à-vis speech perception in noise based on the interaural time difference in the envelope (ITD ENV). We sought to investigate the reliability of an ITD ENV-based training program for speech-in-noise perception among elderly individuals with normal hearing and a speech-in-noise perception disorder. The present interventional study was performed during 2016. Sixteen elderly men between 55 and 65 years of age with a clinical diagnosis of normal hearing up to 2000 Hz and a speech-in-noise perception disorder participated in this study. The localization training program was based on changes in ITD ENV. In order to evaluate the reliability of the training program, we performed speech-in-noise tests before the training program, immediately afterward, and at 2 months' follow-up. The reliability of the training program was analyzed using the Friedman test and the SPSS software. Statistically significant differences were found in the mean scores of speech-in-noise perception between the 3 time points (P=0.001). The results also indicated no difference in the mean scores of speech-in-noise perception between immediately after the training program and the 2 months' follow-up (P=0.212). The present study showed the reliability of an ITD ENV-based localization training program in elderly individuals with a speech-in-noise perception disorder.
ERIC Educational Resources Information Center
Higgins, Eleanor L.; Raskind, Marshall H.
2005-01-01
The study investigated the compensatory effectiveness of the Quicktionary Reading Pen II (the Reading Pen), a portable device with miniaturized optical character recognition and speech synthesis capabilities. Thirty participants with reading disabilities aged 10-18 were trained on the operation of the technology and given two weeks to practice…
Effects of Lexical Tone Contour on Mandarin Sentence Intelligibility
ERIC Educational Resources Information Center
Chen, Fei; Wong, Lena L. N.; Hu, Yi
2014-01-01
Purpose: This study examined the effects of lexical tone contour on the intelligibility of Mandarin sentences in quiet and in noise. Method: A text-to-speech synthesis engine was used to synthesize Mandarin sentences with each word carrying the original lexical tone, flat tone, or a tone randomly selected from the 4 Mandarin lexical tones. The…
Speech recognition systems on the Cell Broadband Engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liu, Y; Jones, H; Vaidya, S
In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.
ERIC Educational Resources Information Center
Whitmire, Kathleen A.; Rivers, Kenyatta O.; Mele-McCarthy, Joan A.; Staskowski, Maureen
2014-01-01
Speech-language pathologists are faced with demands for evidence to support practice. Federal legislation requires high-quality evidence for decisions regarding school-based services as part of evidence-based practice. The purpose of this article is to discuss the limited scientific evidence for making appropriate decisions about speech-language…
Federal Register 2010, 2011, 2012, 2013, 2014
2013-10-23
...] Telecommunications Relay Services and Speech-to-Speech Services for Individuals With Hearing and Speech Disabilities... for telecommunications relay services (TRS) by eliminating standards for Internet-based relay services... comments, identified by CG Docket No. 03-123, by any of the following methods: Electronic Filers: Comments...
Contemporary Reflections on Speech-Based Language Learning
ERIC Educational Resources Information Center
Gustafson, Marianne
2009-01-01
In "The Relation of Language to Mental Development and of Speech to Language Teaching," S.G. Davidson displayed several timeless insights into the role of speech in developing language and reasons for using speech as the basis for instruction for children who are deaf and hard of hearing. His understanding that speech includes more than merely…
Developing a Weighted Measure of Speech Sound Accuracy
Preston, Jonathan L.; Ramsdell, Heather L.; Oller, D. Kimbrough; Edwards, Mary Louise; Tobin, Stephen J.
2010-01-01
Purpose The purpose is to develop a system for numerically quantifying a speaker’s phonetic accuracy through transcription-based measures. With a focus on normal and disordered speech in children, we describe a system for differentially weighting speech sound errors based on various levels of phonetic accuracy with a Weighted Speech Sound Accuracy (WSSA) score. We then evaluate the reliability and validity of this measure. Method Phonetic transcriptions are analyzed from several samples of child speech, including preschoolers and young adolescents with and without speech sound disorders and typically developing toddlers. The new measure of phonetic accuracy is compared to existing measures, is used to discriminate typical and disordered speech production, and is evaluated to determine whether it is sensitive to changes in phonetic accuracy over time. Results Initial psychometric data indicate that WSSA scores correlate with other measures of phonetic accuracy as well as listeners’ judgments of severity of a child’s speech disorder. The measure separates children with and without speech sound disorders. WSSA scores also capture growth in phonetic accuracy in toddler’s speech over time. Conclusion Results provide preliminary support for the WSSA as a valid and reliable measure of phonetic accuracy in children’s speech. PMID:20699344
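The differential weighting idea can be illustrated with a toy scoring function; the weights below are invented for illustration and are not the published WSSA weights.

```python
# Hypothetical weights: heavier penalties for less accurate productions.
WEIGHTS = {"correct": 0.0, "distortion": 1.0,
           "substitution": 2.0, "omission": 3.0}

def weighted_accuracy(transcribed_sounds):
    """Score in [0, 1]: 1 minus the weighted error mass normalized by
    the worst possible error mass for the same number of sounds."""
    total = sum(WEIGHTS[s] for s in transcribed_sounds)
    worst = len(transcribed_sounds) * max(WEIGHTS.values())
    return 1.0 - total / worst

print(weighted_accuracy(["correct", "distortion", "substitution"]))  # ~0.667
```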
Deep neural network and noise classification-based speech enhancement
NASA Astrophysics Data System (ADS)
Shi, Wenhua; Zhang, Xiongwei; Zou, Xia; Han, Wei
2017-07-01
In this paper, a speech enhancement method using noise classification and Deep Neural Network (DNN) was proposed. Gaussian mixture model (GMM) was employed to determine the noise type in speech-absent frames. DNN was used to model the relationship between noisy observation and clean speech. Once the noise type was determined, the corresponding DNN model was applied to enhance the noisy speech. GMM was trained with mel-frequency cepstrum coefficients (MFCC) and the parameters were estimated with an iterative expectation-maximization (EM) algorithm. Noise type was updated by spectrum entropy-based voice activity detection (VAD). Experimental results demonstrate that the proposed method could achieve better objective speech quality and smaller distortion under stationary and non-stationary conditions.
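A sketch of the noise-classification front end, using scikit-learn's GaussianMixture on MFCC features; the variable names and number of mixture components are assumptions, and the DNN enhancement stage is omitted.

```python
from sklearn.mixture import GaussianMixture

def train_noise_models(mfcc_by_noise_type, n_components=8):
    """Fit one GMM per noise type on MFCCs from speech-absent frames.
    `mfcc_by_noise_type` maps a label to an (n_frames, n_mfcc) array."""
    return {label: GaussianMixture(n_components).fit(feats)
            for label, feats in mfcc_by_noise_type.items()}

def classify_noise(models, mfcc_frames):
    """Pick the noise type with the highest average log-likelihood;
    the corresponding DNN model would then enhance the noisy speech."""
    return max(models, key=lambda label: models[label].score(mfcc_frames))
```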
Bentsen, Thomas; May, Tobias; Kressner, Abigail A; Dau, Torsten
2018-01-01
Computational speech segregation attempts to automatically separate speech from noise. This is challenging in conditions with interfering talkers and low signal-to-noise ratios. Recent approaches have adopted deep neural networks and successfully demonstrated speech intelligibility improvements. A selection of components may be responsible for the success with these state-of-the-art approaches: the system architecture, a time frame concatenation technique and the learning objective. The aim of this study was to explore the roles and the relative contributions of these components by measuring speech intelligibility in normal-hearing listeners. A substantial improvement of 25.4 percentage points in speech intelligibility scores was found going from a subband-based architecture, in which a Gaussian Mixture Model-based classifier predicts the distributions of speech and noise for each frequency channel, to a state-of-the-art deep neural network-based architecture. Another improvement of 13.9 percentage points was obtained by changing the learning objective from the ideal binary mask, in which individual time-frequency units are labeled as either speech- or noise-dominated, to the ideal ratio mask, where the units are assigned a continuous value between zero and one. Therefore, both components play significant roles and by combining them, speech intelligibility improvements were obtained in a six-talker condition at a low signal-to-noise ratio.
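The two learning objectives compared above can be stated in a few lines of numpy: the ideal binary mask labels each T-F unit, while the ideal ratio mask assigns it a continuous value.

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """1 where the local SNR exceeds the criterion, else 0."""
    return (10 * np.log10(speech_power / noise_power) > lc_db).astype(float)

def ideal_ratio_mask(speech_power, noise_power):
    """Continuous value in [0, 1] per time-frequency unit."""
    return speech_power / (speech_power + noise_power)
```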
Kaipa, Ramesh; Jones, Richard D; Robb, Michael P
2016-07-01
The benefits of different practice conditions in limb-based rehabilitation of motor disorders are well documented. Conversely, the role of practice structure in the treatment of motor-based speech disorders has only been minimally investigated. Considering this limitation, the current study aimed to investigate the effectiveness of selected practice conditions in spatial and temporal learning of novel speech utterances in individuals with Parkinson's disease (PD). Participants included 16 individuals with PD who were randomly and equally assigned to constant, variable, random, and blocked practice conditions. Participants in all four groups practiced a speech phrase for two consecutive days, and reproduced the speech phrase on the third day without further practice or feedback. There were no significant differences (p > 0.05) between participants across the four practice conditions with respect to either spatial or temporal learning of the speech phrase. Overall, PD participants demonstrated diminished spatial and temporal learning in comparison to healthy controls. Tests of strength of association between participants' demographic/clinical characteristics and speech-motor learning outcomes did not reveal any significant correlations. The findings from the current study suggest that repeated practice facilitates speech-motor learning in individuals with PD irrespective of the type of practice. Clinicians need to be cautious in applying practice conditions to treat speech deficits associated with PD based on the findings of non-speech-motor learning tasks.
School-Based Speech-Language Pathologists' Use of iPads
ERIC Educational Resources Information Center
Romane, Garvin Philippe
2017-01-01
This study explored school-based speech-language pathologists' (SLPs') use of iPads and apps for speech and language instruction, specifically for articulation, language, and vocabulary goals. A mostly quantitative-based survey was administered to approximately 2,800 SLPs in a K-12 setting; the final sample consisted of 189 licensed SLPs. Overall,…
Severity-Based Adaptation with Limited Data for ASR to Aid Dysarthric Speakers
Mustafa, Mumtaz Begum; Salim, Siti Salwah; Mohamed, Noraini; Al-Qatab, Bassam; Siong, Chng Eng
2014-01-01
Automatic speech recognition (ASR) is currently used in many assistive technologies, such as helping individuals with speech impairment in their communication ability. One challenge in ASR for speech-impaired individuals is the difficulty in obtaining a good speech database of impaired speakers for building an effective speech acoustic model. Because there are very few existing databases of impaired speech, which are also limited in size, the obvious solution to build a speech acoustic model of impaired speech is by employing adaptation techniques. However, issues that have not been addressed in existing studies in the area of adaptation for speech impairment are as follows: (1) identifying the most effective adaptation technique for impaired speech; and (2) the use of suitable source models to build an effective impaired-speech acoustic model. This research investigates the above-mentioned two issues on dysarthria, a type of speech impairment affecting millions of people. We applied both unimpaired and impaired speech as the source model with well-known adaptation techniques like maximum likelihood linear regression (MLLR) and constrained MLLR (C-MLLR). The recognition accuracy of each impaired speech acoustic model is measured in terms of word error rate (WER), with further assessments, including phoneme insertion, substitution and deletion rates. Unimpaired speech when combined with limited high-quality speech-impaired data improves performance of ASR systems in recognising severely impaired dysarthric speech. The C-MLLR adaptation technique was also found to be better than MLLR in recognising mildly and moderately impaired speech based on the statistical analysis of the WER. It was found that phoneme substitution was the biggest contributing factor in WER in dysarthric speech for all levels of severity. The results show that the speech acoustic models derived from suitable adaptation techniques improve the performance of ASR systems in recognising impaired speech with limited adaptation data. PMID:24466004
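At its core, MLLR adapts an acoustic model by applying a shared linear transform to its Gaussian means. The sketch below shows only that application step; estimating the transform from the (limited) dysarthric adaptation data is the part omitted, and all shapes are illustrative.

```python
import numpy as np

def mllr_adapt_means(means, A, b):
    """Apply the MLLR mean transform mu' = A @ mu + b to every Gaussian
    mean; `means` is (n_gaussians, dim)."""
    return means @ A.T + b

rng = np.random.default_rng(1)
means = rng.standard_normal((4, 3))        # toy model: 4 Gaussians, dim 3
A, b = 0.9 * np.eye(3), np.full(3, 0.1)    # toy transform parameters
adapted = mllr_adapt_means(means, A, b)
```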
New Ideas for Speech Recognition and Related Technologies
DOE Office of Scientific and Technical Information (OSTI.GOV)
Holzrichter, J F
The ideas relating to the use of organ motion sensors for the purposes of speech recognition were first described by the author in spring 1994. During the past year, a series of productive collaborations between the author, Tom McEwan, and Larry Ng ensued and have led to demonstrations, new sensor ideas, and algorithmic descriptions of a large number of speech recognition concepts. This document summarizes the basic concepts of recognizing speech once organ motions have been obtained. Micro-power radars and their uses for the measurement of body organ motions, such as those of the heart and lungs, have been demonstrated by Tom McEwan over the past two years. McEwan and I conducted a series of experiments, using these instruments, on vocal organ motions beginning in late spring, during which we observed motions of vocal folds (i.e., cords), tongue, jaw, and related organs that are very useful for speech recognition and other purposes. These will be reviewed in a separate paper. Since late summer 1994, Lawrence Ng and I have worked to make many of the initial recognition ideas more rigorous and to investigate the applications of these new ideas to new speech recognition algorithms, to speech coding, and to speech synthesis. I introduce some of those ideas in section IV of this document, and we describe them more completely in the document following this one, UCRL-UR-120311. For the design and operation of micro-power radars and their application to body organ motions, the reader may contact Tom McEwan directly. The capability for using EM sensors (i.e., radar units) to measure body organ motions and positions has been available for decades. Impediments to their use appear to have been size, excessive power, lack of resolution, and lack of understanding of the value of organ motion measurements, especially as applied to speech related technologies. However, with the invention of very low power, portable systems as demonstrated by McEwan at LLNL, researchers have begun to think differently about practical applications of such radars. In particular, his demonstrations of heart and lung motions have opened up many new areas of application for human and animal measurements.
Fifty years of progress in speech and speaker recognition
NASA Astrophysics Data System (ADS)
Furui, Sadaoki
2004-10-01
Speech and speaker recognition technology has made very significant progress in the past 50 years. The progress can be summarized by the following changes: (1) from template matching to corpus-based statistical modeling, e.g., HMM and n-grams, (2) from filter bank/spectral resonance to cepstral features (cepstrum + Δcepstrum + ΔΔcepstrum), (3) from heuristic time-normalization to DTW/DP matching, (4) from "distance"-based to likelihood-based methods, (5) from maximum likelihood to discriminative approaches, e.g., MCE/GPD and MMI, (6) from isolated word to continuous speech recognition, (7) from small vocabulary to large vocabulary recognition, (8) from context-independent units to context-dependent units for recognition, (9) from clean speech to noisy/telephone speech recognition, (10) from single speaker to speaker-independent/adaptive recognition, (11) from monologue to dialogue/conversation recognition, (12) from read speech to spontaneous speech recognition, (13) from recognition to understanding, (14) from single-modality (audio signal only) to multi-modal (audio/visual) speech recognition, (15) from hardware recognizer to software recognizer, and (16) from no commercial application to many practical commercial applications. Most of these advances have taken place in both the fields of speech recognition and speaker recognition. The majority of technological changes have been directed toward the purpose of increasing the robustness of recognition, including many other additional important techniques not noted above.
ERIC Educational Resources Information Center
Young, Victoria; Mihailidis, Alex
2010-01-01
Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…
An acoustic feature-based similarity scoring system for speech rehabilitation assistance.
Syauqy, Dahnial; Wu, Chao-Min; Setyawati, Onny
2016-08-01
The purpose of this study is to develop a tool to assist speech therapy and rehabilitation, focused on automatic scoring based on comparing the patient's speech with normal speech on several aspects, including pitch, vowels, voiced-unvoiced segments, strident fricatives, and sound intensity. The pitch estimation employed a cepstrum-based algorithm for its robustness; the vowel classification used a multilayer perceptron (MLP) to classify vowels from pitch and formants; and the strident fricative detection was based on the major peak spectral intensity, its location, and the existence of pitch in the segment. In order to evaluate the performance of the system, this study analyzed eight patients' speech recordings (four males, four females; 4-58 years old), which had been recorded in a previous study in cooperation with Taipei Veterans General Hospital and Taoyuan General Hospital. The experimental results on the pitch algorithm showed that the cepstrum method had a 5.3% gross pitch error over a total of 2086 frames. On the vowel classification algorithm, the MLP method provided 93% accuracy (men), 87% (women), and 84% (children). In total, 156 of the tool's grading results (81%) were consistent with 192 audio and visual observations made by four experienced respondents. Implications for Rehabilitation: Difficulties in communication may limit the ability of a person to transfer and exchange information. The fact that speech is one of the primary means of communication has encouraged the need for speech diagnosis and rehabilitation. Advances in computer-assisted speech therapy (CAST) improve the quality and time efficiency of the diagnosis and treatment of speech disorders. The present study developed a tool to assist speech therapy and rehabilitation that provides a simple interface, letting the assessment be done even by the patient without particular knowledge of speech processing, while also providing deeper analysis of the speech that can be useful to the speech therapist.
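The cepstrum-based pitch strategy mentioned above can be sketched in a few lines: the pitch period shows up as a peak in the real cepstrum within the plausible quefrency range. The window choice and search range below are assumptions, not the tool's exact settings.

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate pitch (Hz) from the peak of the real cepstrum."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-10))
    qmin, qmax = int(fs / fmax), int(fs / fmin)   # quefrency search range
    peak = qmin + np.argmax(cepstrum[qmin:qmax])
    return fs / peak

fs = 16000
t = np.arange(1024) / fs
print(cepstral_pitch(np.sin(2 * np.pi * 120 * t), fs))  # close to 120 Hz
```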
Automatic Speech Recognition from Neural Signals: A Focused Review.
Herff, Christian; Schultz, Tanja
2016-01-01
Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to either loud environments, bothering bystanders or incapabilities to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable to not speak but to simply envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data, with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques applied to neural signals, we discuss the Brain-to-text system.
ERIC Educational Resources Information Center
Rutgers, The State Univ., New Brunswick, NJ. Dept. of Vocational-Technical Education.
Seventy-eight educators from 13 northeastern states and Washington, D.C., participated in the 3-day conference focused on topics of interest to both state supervisors and teacher educators. Major speeches were (1) "A Review of Research in Agricultural Education in the North Atlantic Region" by G.M. Love, (2) "Review and Synthesis of…
ERIC Educational Resources Information Center
Scott Instruments Corp., Denton, TX.
This project was designed to develop techniques for adding low-cost speech synthesis to educational software. Four tasks were identified for the study: (1) select a microcomputer with a built-in analog-to-digital converter that is currently being used in educational environments; (2) determine the feasibility of implementing expansion and playback…
Address entry while driving: speech recognition versus a touch-screen keyboard.
Tsimhoni, Omer; Smith, Daniel; Green, Paul
2004-01-01
A driving simulator experiment was conducted to determine the effects of entering addresses into a navigation system during driving. Participants drove on roads of varying visual demand while entering addresses. Three address entry methods were explored: word-based speech recognition, character-based speech recognition, and typing on a touch-screen keyboard. For each method, vehicle control and task measures, glance timing, and subjective ratings were examined. During driving, word-based speech recognition yielded the shortest total task time (15.3 s), followed by character-based speech recognition (41.0 s) and touch-screen keyboard (86.0 s). The standard deviation of lateral position when performing keyboard entry (0.21 m) was 60% higher than that for all other address entry methods (0.13 m). Degradation of vehicle control associated with address entry using a touch screen suggests that the use of speech recognition is favorable. Speech recognition systems with visual feedback, however, even with excellent accuracy, are not without performance consequences. Applications of this research include the design of in-vehicle navigation systems as well as other systems requiring significant driver input, such as E-mail, the Internet, and text messaging.
Monkey vocal tracts are speech-ready.
Fitch, W Tecumseh; de Boer, Bart; Mathur, Neil; Ghazanfar, Asif A
2016-12-01
For four decades, the inability of nonhuman primates to produce human speech sounds has been claimed to stem from limitations in their vocal tract anatomy, a conclusion based on plaster casts made from the vocal tract of a monkey cadaver. We used x-ray videos to quantify vocal tract dynamics in living macaques during vocalization, facial displays, and feeding. We demonstrate that the macaque vocal tract could easily produce an adequate range of speech sounds to support spoken language, showing that previous techniques based on postmortem samples drastically underestimated primate vocal capabilities. Our findings imply that the evolution of human speech capabilities required neural changes rather than modifications of vocal anatomy. Macaques have a speech-ready vocal tract but lack a speech-ready brain to control it.
Nakai, Yasushi; Takiguchi, Tetsuya; Matsui, Gakuyo; Yamaoka, Noriko; Takada, Satoshi
2017-10-01
Abnormal prosody is often evident in the voice intonations of individuals with autism spectrum disorders. We compared a machine-learning-based voice analysis with human hearing judgments made by 10 speech therapists for classifying children with autism spectrum disorders (n = 30) and typical development (n = 51). Using stimuli limited to single-word utterances, machine-learning-based voice analysis was superior to speech therapist judgments. There was a significantly higher true-positive than false-negative rate for machine-learning-based voice analysis but not for speech therapists. Results are discussed in terms of some artificiality of clinician judgments based on single-word utterances, and the objectivity machine-learning-based voice analysis adds to judging abnormal prosody.
A Case Study of a Collaborative Speech-Language Pathologist
ERIC Educational Resources Information Center
Ritzman, Mitzi J.; Sanger, Dixie; Coufal, Kathy L.
2006-01-01
This study explored how a school-based speech-language pathologist implemented a classroom-based service delivery model that focused on collaborative practices in classroom settings. The study used ethnographic observations and interviews with 1 speech-language pathologist to provide insights into how she implemented collaborative consultation and…
Choosing and Using Text-to-Speech Software
ERIC Educational Resources Information Center
Peters, Tom; Bell, Lori
2007-01-01
This article describes a computer-based technology for generating speech called text-to-speech (TTS). This software is ready for widespread use by libraries, other organizations, and individual users. It offers the affordable ability to turn just about any electronic text that is not image-based into an artificially spoken communication. The…
ERIC Educational Resources Information Center
Hill, Anne J.; Theodoros, Deborah G.; Russell, Trevor G.; Cahill, Louise M.; Ward, Elizabeth C.; Clark, Kathy M.
2006-01-01
Purpose: This pilot study explored the feasibility and effectiveness of an Internet-based telerehabilitation application for the assessment of motor speech disorders in adults with acquired neurological impairment. Method: Using a counterbalanced, repeated measures research design, 2 speech-language pathologists assessed 19 speakers with…
Elements of a Plan-Based Theory of Speech Acts. Technical Report No. 141.
ERIC Educational Resources Information Center
Cohen, Philip R.; Perrault, C. Raymond
This report proposes that people often plan their speech acts to affect their listeners' beliefs, goals, and emotional states and that such language use can be modeled by viewing speech acts as operators in a planning system, allowing both physical and speech acts to be integrated into plans. Methodological issues of how speech acts should be…
Tilsen, Sam; Arvaniti, Amalia
2013-07-01
This study presents a method for analyzing speech rhythm using empirical mode decomposition of the speech amplitude envelope, which allows for extraction and quantification of syllabic- and supra-syllabic time-scale components of the envelope. The method of empirical mode decomposition of a vocalic energy amplitude envelope is illustrated in detail, and several types of rhythm metrics derived from this method are presented. Spontaneous speech extracted from the Buckeye Corpus is used to assess the effect of utterance length on metrics, and it is shown how metrics representing variability in the supra-syllabic time-scale components of the envelope can be used to identify stretches of speech with targeted rhythmic characteristics. Furthermore, the envelope-based metrics are used to characterize cross-linguistic differences in speech rhythm in the UC San Diego Speech Lab corpus of English, German, Greek, Italian, Korean, and Spanish speech elicited in read sentences, read passages, and spontaneous speech. The envelope-based metrics exhibit significant effects of language and elicitation method that argue for a nuanced view of cross-linguistic rhythm patterns.
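The core of the method, an amplitude envelope decomposed into oscillatory modes, can be sketched in a few lines of Python. The fragment below is an illustration only: it assumes the third-party PyEMD package (PyPI "EMD-signal"), and the band edges, smoothing cutoff, and toy signal are placeholders rather than the paper's actual front end.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert
from PyEMD import EMD  # assumed dependency

def amplitude_envelope(x, fs, lo=400.0, hi=4000.0, smooth_hz=10.0):
    """Band-limit the signal, take the Hilbert magnitude, then low-pass
    to obtain a slowly varying amplitude envelope."""
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    band = filtfilt(b, a, x)
    b, a = butter(4, smooth_hz / (fs / 2))
    return filtfilt(b, a, np.abs(hilbert(band)))

fs = 16000
t = np.arange(0, 2.0, 1 / fs)
# Toy "speech": a carrier modulated at a syllabic (~4 Hz) and a slower (~1 Hz) rate.
x = (1 + 0.5 * np.sin(2 * np.pi * 4 * t)) * (1 + 0.3 * np.sin(2 * np.pi * 1 * t)) \
    * np.sin(2 * np.pi * 1000 * t)

env = amplitude_envelope(x, fs)
imfs = EMD().emd(env)  # rows ordered fast to slow; the last row carries the residual trend
# Variability of a slow, supra-syllabic component as one candidate rhythm metric:
print("IMFs:", imfs.shape[0], " std of slowest mode:", np.std(imfs[-1]))
```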
[Restoration of speech function in oncological patients with maxillary defects].
Matiakin, E G; Chuchkov, V M; Akhundov, A A; Azizian, R I; Romanov, I S; Chuchkov, M V; Agapov, V V
2009-01-01
Speech quality was evaluated in 188 patients with acquired maxillary defects. Prosthetic treatment of 29 patients was preceded by pharmacopsychotherapy. Sixty-three patients had lessons with a logopedist and 66 practiced self-tuition based on a specially developed test. Thirty patients were examined for quality of speech without preliminary preparation. Speech quality was assessed by auditory and spectral analysis. The main forms of impaired speech quality in the patients with maxillary defects were marked rhinophonia and impaired articulation. The proposed analytical tests were based on a combination of "difficult" vowels and consonants. The use of a removable prosthesis with an obturator failed to correct the affected speech function but created prerequisites for the formation of a correct speech stereotype. Results of the study suggest a relationship between the quality of speech in subjects with maxillary defects and their intellectual faculties, as well as their desire to overcome this drawback. The proposed tests are designed to activate the neuromuscular apparatus responsible for the generation of speech. Lessons with a speech therapist give a powerful emotional incentive to the patients and promote their efforts toward restoration of speaking ability. Pharmacopsychotherapy and self-control are other efficacious tools for improving speech quality in patients with maxillary defects.
Childhood apraxia of speech: A survey of praxis and typical speech characteristics.
Malmenholt, Ann; Lohmander, Anette; McAllister, Anita
2017-07-01
The purpose of this study was to investigate current knowledge of the diagnosis of childhood apraxia of speech (CAS) in Sweden and to compare speech characteristics and symptoms to those of earlier survey findings in mainly English-speaking populations. In a web-based questionnaire, 178 Swedish speech-language pathologists (SLPs) anonymously answered questions about their perception of typical speech characteristics for CAS. They graded their own assessment skills and estimated clinical occurrence. The seven top speech characteristics reported as typical for children with CAS were: inconsistent speech production (85%), sequencing difficulties (71%), oro-motor deficits (63%), vowel errors (62%), voicing errors (61%), consonant cluster deletions (54%), and prosodic disturbance (53%). Motor-programming deficits, described as a lack of automatization of speech movements, were perceived by 82%. All listed characteristics were consistent with the American Speech-Language-Hearing Association (ASHA) consensus-based features, Strand's 10-point checklist, and the diagnostic model proposed by Ozanne. The mode for estimated clinical occurrence was 5%. The number of suspected cases of CAS in the clinical caseload was approximately one new patient per year per SLP. The results support and add to findings from studies of CAS in English-speaking children, with similar speech characteristics regarded as typical. Possibly, these findings could contribute to cross-linguistic consensus on CAS characteristics.
Jørgensen, Søren; Dau, Torsten
2011-09-01
A model for predicting the intelligibility of processed noisy speech is proposed. The speech-based envelope power spectrum model has a structure similar to that of the model of Ewert and Dau [(2000). J. Acoust. Soc. Am. 108, 1181-1196], developed to account for modulation detection and masking data. The model estimates the speech-to-noise envelope power ratio, SNRenv, at the output of a modulation filterbank and relates this metric to speech intelligibility using the concept of an ideal observer. Predictions were compared to data on the intelligibility of speech presented in stationary speech-shaped noise. The model was further tested in conditions with noisy speech subjected to reverberation and spectral subtraction. Good agreement between predictions and data was found in all cases. For spectral subtraction, an analysis of the model's internal representation of the stimuli revealed that the predicted decrease of intelligibility was caused by the estimated noise envelope power exceeding that of the speech. The classical concept of the speech transmission index fails in this condition. The results strongly suggest that the signal-to-noise ratio at the output of a modulation frequency selective process provides a key measure of speech intelligibility. © 2011 Acoustical Society of America
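To make the SNRenv idea concrete, here is a deliberately simplified Python sketch: it compares the envelope power of the noisy speech with that of the noise alone in a handful of modulation bands. The band edges, filter orders, and the root-sum-square combination across bands are illustrative placeholders, not the published model's actual stages.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

ENV_FS = 100  # envelope sampling rate after decimation (Hz)

def envelope(x, fs):
    """Hilbert envelope, low-passed at 30 Hz and decimated to ENV_FS."""
    b, a = butter(2, 30.0 / (fs / 2))
    env = filtfilt(b, a, np.abs(hilbert(x)))
    return env[:: fs // ENV_FS]

def band_env_power(env, f_lo, f_hi):
    """AC envelope power in one modulation band, normalized by DC power."""
    b, a = butter(2, [f_lo / (ENV_FS / 2), f_hi / (ENV_FS / 2)], btype="band")
    ac = filtfilt(b, a, env - env.mean())
    return np.mean(ac ** 2) / (env.mean() ** 2 + 1e-12)

def snr_env(noisy, noise, fs, bands=((1, 2), (2, 4), (4, 8), (8, 16))):
    env_mix, env_n = envelope(noisy, fs), envelope(noise, fs)
    ratios = []
    for f_lo, f_hi in bands:
        p_mix = band_env_power(env_mix, f_lo, f_hi)
        p_n = band_env_power(env_n, f_lo, f_hi)
        # Speech envelope power estimated by subtraction, floored to stay positive.
        ratios.append(max(p_mix - p_n, 1e-4) / max(p_n, 1e-4))
    return np.sqrt(np.sum(np.square(ratios)))  # combine across bands
```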
Speech-Language Dissociations, Distractibility, and Childhood Stuttering
Conture, Edward G.; Walden, Tedra A.; Lambert, Warren E.
2015-01-01
Purpose This study investigated the relation among speech-language dissociations, attentional distractibility, and childhood stuttering. Method Participants were 82 preschool-age children who stutter (CWS) and 120 who do not stutter (CWNS). Correlation-based statistics (Bates, Appelbaum, Salcedo, Saygin, & Pizzamiglio, 2003) identified dissociations across 5 norm-based speech-language subtests. The Behavioral Style Questionnaire Distractibility subscale measured attentional distractibility. Analyses addressed (a) between-groups differences in the number of children exhibiting speech-language dissociations; (b) between-groups distractibility differences; (c) the relation between distractibility and speech-language dissociations; and (d) whether interactions between distractibility and dissociations predicted the frequency of total, stuttered, and nonstuttered disfluencies. Results More preschool-age CWS exhibited speech-language dissociations compared with CWNS, and more boys exhibited dissociations compared with girls. In addition, male CWS were less distractible than female CWS and female CWNS. For CWS, but not CWNS, less distractibility (i.e., greater attention) was associated with more speech-language dissociations. Last, interactions between distractibility and dissociations did not predict speech disfluencies in CWS or CWNS. Conclusions The present findings suggest that for preschool-age CWS, attentional processes are associated with speech-language dissociations. Future investigations are warranted to better understand the directionality of effect of this association (e.g., inefficient attentional processes → speech-language dissociations vs. inefficient attentional processes ← speech-language dissociations). PMID:26126203
Tóth, László; Hoffmann, Ildikó; Gosztolya, Gábor; Vincze, Veronika; Szatlóczki, Gréta; Bánréti, Zoltán; Pákáski, Magdolna; Kálmán, János
2018-01-01
Background: Even today the reliable diagnosis of the prodromal stages of Alzheimer's disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method based on the analysis of spontaneous speech production during a memory task. In the future, this could form the basis of an Internet-based interactive screening software for the recognition of MCI. Methods: Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. Spontaneous speech was provoked by asking the patients to recall the content of 2 short black-and-white films (one direct, one delayed) and to answer one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. Results: The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process - that is, using the ASR-based features in combination with machine learning - was able to separate the two classes with an F1-score of 78.8%. Conclusion: The temporal analysis of spontaneous speech can be exploited in implementing a new, automatic detection-based tool for screening MCI for the community. PMID:29165085
Pulse Vector-Excitation Speech Encoder
NASA Technical Reports Server (NTRS)
Davidson, Grant; Gersho, Allen
1989-01-01
Proposed pulse vector-excitation speech encoder (PVXC) encodes analog speech signals into digital representation for transmission or storage at rates below 5 kilobits per second. Produces high quality of reconstructed speech, but with less computation than required by comparable speech-encoding systems. Has some characteristics of multipulse linear predictive coding (MPLPC) and of code-excited linear prediction (CELP). System uses mathematical model of vocal tract in conjunction with set of excitation vectors and perceptually-based error criterion to synthesize natural-sounding speech.
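The essence of vector-excitation coding can be shown with a toy exhaustive codebook search in Python: each candidate excitation is passed through an all-pole (vocal tract) synthesis filter, and the gain-scaled candidate with the smallest waveform error wins. The perceptually weighted error criterion of the actual PVXC system is omitted, so treat this purely as a sketch under simplified assumptions.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
FRAME, CODEBOOK_SIZE = 40, 64
codebook = rng.standard_normal((CODEBOOK_SIZE, FRAME))  # stochastic excitations

def best_excitation(target, lpc_den):
    """Exhaustive search for the best (index, gain) under squared waveform error."""
    best_i, best_gain, best_err = -1, 0.0, np.inf
    for i, c in enumerate(codebook):
        synth = lfilter([1.0], lpc_den, c)  # excitation through the vocal-tract model
        gain = np.dot(target, synth) / (np.dot(synth, synth) + 1e-12)
        err = np.sum((target - gain * synth) ** 2)
        if err < best_err:
            best_i, best_gain, best_err = i, gain, err
    return best_i, best_gain, best_err

lpc_den = np.array([1.0, -0.9])  # toy first-order all-pole synthesis filter
target = lfilter([1.0], lpc_den, rng.standard_normal(FRAME))  # frame to encode
idx, gain, err = best_excitation(target, lpc_den)
print(f"codeword {idx}, gain {gain:.2f}, residual energy {err:.2f}")
# Only the codeword index and gain need be transmitted per frame.
```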
NASA Astrophysics Data System (ADS)
Thoonsaengngam, Rattapol; Tangsangiumvisai, Nisachon
This paper proposes an enhanced method for estimating the a priori Signal-to-Disturbance Ratio (SDR) to be employed in the Acoustic Echo and Noise Suppression (AENS) system for full-duplex hands-free communications. The proposed a priori SDR estimator builds upon the Two-Step Noise Reduction (TSNR) algorithm to suppress the background noise while preserving speech spectral components. In addition, a practical approach for accurately determining the Echo Spectrum Variance (ESV) is presented, based upon an assumed linear relationship between the power spectra of the far-end speech and the acoustic echo. The ESV estimate is then employed to alleviate the acoustic echo problem. The performance of the AENS system that employs these two proposed estimation techniques is evaluated through Echo Attenuation (EA), Noise Attenuation (NA), and two speech distortion measures. Simulation results based upon real speech signals confirm that the improved AENS system efficiently mitigates acoustic echo and background noise while preserving speech quality and intelligibility.
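For readers unfamiliar with the TSNR scheme the abstract builds on, a per-frame Python sketch follows. It shows the two steps on a single STFT frame, with the decision-directed smoothing constant alpha and the disturbance power spectrum assumed given; in the AENS context the disturbance power would be the combined echo-plus-noise estimate rather than background noise alone.

```python
import numpy as np

def tsnr_gain(X, X_prev, G_prev, disturb_psd, alpha=0.98):
    """Two-step a priori SNR/SDR estimate for one complex STFT frame.
    X, X_prev: current and previous frames; G_prev: previous gains;
    disturb_psd: estimated disturbance (noise + echo) power spectrum."""
    post = np.maximum(np.abs(X) ** 2 / disturb_psd - 1.0, 0.0)
    # Step 1: decision-directed estimate (smooth, but lags by one frame).
    snr_dd = alpha * np.abs(G_prev * X_prev) ** 2 / disturb_psd + (1 - alpha) * post
    g_dd = snr_dd / (1.0 + snr_dd)
    # Step 2: re-estimate from the step-1 filtered spectrum, which removes
    # the one-frame bias while keeping musical noise low.
    snr_tsnr = np.abs(g_dd * X) ** 2 / disturb_psd
    g_tsnr = snr_tsnr / (1.0 + snr_tsnr)
    return g_tsnr, snr_tsnr
```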
Winn, Matthew B.; Won, Jong Ho; Moon, Il Joon
2016-01-01
Objectives This study was conducted to measure auditory perception by cochlear implant users in the spectral and temporal domains, using tests of either categorization (using speech-based cues) or discrimination (using conventional psychoacoustic tests). We hypothesized that traditional nonlinguistic tests assessing spectral and temporal auditory resolution would correspond to speech-based measures assessing specific aspects of phonetic categorization assumed to depend on spectral and temporal auditory resolution. We further hypothesized that speech-based categorization performance would ultimately be a superior predictor of speech recognition performance, because of the fundamental nature of speech recognition as categorization. Design Nineteen CI listeners and 10 listeners with normal hearing (NH) participated in a suite of tasks that included spectral ripple discrimination (SRD), temporal modulation detection (TMD), and syllable categorization, which was split into a spectral-cue-based task (targeting the /ba/-/da/ contrast) and a timing-cue-based task (targeting the /b/-/p/ and /d/-/t/ contrasts). Speech sounds were manipulated in order to contain specific spectral or temporal modulations (formant transitions or voice onset time, respectively) that could be categorized. Categorization responses were quantified using logistic regression in order to assess perceptual sensitivity to acoustic phonetic cues. Word recognition testing was also conducted for CI listeners. Results CI users were generally less successful at utilizing both spectral and temporal cues for categorization compared to listeners with normal hearing. For the CI listener group, SRD was significantly correlated with the categorization of formant transitions; both were correlated with better word recognition. TMD using 100 Hz and 10 Hz modulated noise was not correlated with the CI subjects’ categorization of VOT, nor with word recognition. Word recognition was correlated more closely with categorization of the controlled speech cues than with performance on the psychophysical discrimination tasks. Conclusions When evaluating people with cochlear implants, controlled speech-based stimuli are feasible to use in tests of auditory cue categorization, to complement traditional measures of auditory discrimination. Stimuli based on specific speech cues correspond to counterpart non-linguistic measures of discrimination, but potentially show better correspondence with speech perception more generally. The ubiquity of the spectral (formant transition) and temporal (VOT) stimulus dimensions across languages highlights the potential to use this testing approach even in cases where English is not the native language. PMID:27438871
Expertise with artificial non-speech sounds recruits speech-sensitive cortical regions
Leech, Robert; Holt, Lori L.; Devlin, Joseph T.; Dick, Frederic
2009-01-01
Regions of the human temporal lobe show greater activation for speech than for other sounds. These differences may reflect intrinsically specialized domain-specific adaptations for processing speech, or they may be driven by the significant expertise we have in listening to the speech signal. To test the expertise hypothesis, we used a video-game-based paradigm that tacitly trained listeners to categorize acoustically complex, artificial non-linguistic sounds. Before and after training, we used functional MRI to measure how expertise with these sounds modulated temporal lobe activation. Participants’ ability to explicitly categorize the non-speech sounds predicted the change in pre- to post-training activation in speech-sensitive regions of the left posterior superior temporal sulcus, suggesting that emergent auditory expertise may help drive this functional regionalization. Thus, seemingly domain-specific patterns of neural activation in higher cortical regions may be driven in part by experience-based restructuring of high-dimensional perceptual space. PMID:19386919
NASA Astrophysics Data System (ADS)
Bell, Alan; Jurafsky, Daniel; Fosler-Lussier, Eric; Girand, Cynthia; Gregory, Michelle; Gildea, Daniel
2003-02-01
Function words, especially frequently occurring ones such as the, that, and, and of, vary widely in pronunciation. Understanding this variation is essential both for cognitive modeling of lexical production and for computer speech recognition and synthesis. This study investigates which factors affect the forms of function words, especially whether they have a fuller pronunciation (e.g., [ði], [ðæt], [ænd], [ʌv]) or a more reduced or lenited pronunciation (e.g., [ðə], [ðɨt], [n̩], [ə]). It is based on over 8000 occurrences of the ten most frequent English function words in a 4-h sample from conversations from the Switchboard corpus. Ordinary linear and logistic regression models were used to examine variation in the length of the words, in the form of their vowel (basic, full, or reduced), and whether final obstruents were present or not. For all these measures, after controlling for segmental context, rate of speech, and other important factors, there are strong independent effects that made high-frequency monosyllabic function words more likely to be longer or have a fuller form (1) when neighboring disfluencies (such as filled pauses uh and um) indicate that the speaker was encountering problems in planning the utterance; (2) when the word is unexpected, i.e., less predictable in context; (3) when the word is either utterance initial or utterance final. Looking at the phenomenon in a different way, frequent function words are more likely to be shorter and to have less-full forms in fluent speech, in predictable positions or multiword collocations, and utterance internally. Also considered are other factors such as sex (women are more likely to use fuller forms, even after controlling for rate of speech, for example), and some of the differences among the ten function words in their response to the factors.
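The regression setup described above can be schematized with statsmodels; the data frame, file name, and column names below are hypothetical placeholders standing in for the study's token-level variables.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical token table: one row per function-word token.
#   full_form: 1 if the vowel surfaces in its full form, 0 if reduced
#   rate: local speech rate (syllables/s)
#   log_pred: log conditional predictability of the word in context
#   position: utterance-initial / -medial / -final
#   near_disfluency: 1 if adjacent to a filled pause or other disfluency
tokens = pd.read_csv("function_word_tokens.csv")  # hypothetical file

model = smf.logit(
    "full_form ~ rate + log_pred + C(position) + near_disfluency",
    data=tokens,
).fit()
print(model.summary())  # signs of the coefficients mirror effects (1)-(3) above
```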
Towards Artificial Speech Therapy: A Neural System for Impaired Speech Segmentation.
Iliya, Sunday; Neri, Ferrante
2016-09-01
This paper presents a neural system-based technique for segmenting short impaired speech utterances into silent, unvoiced, and voiced sections. Moreover, the proposed technique identifies those points of the (voiced) speech where the spectrum becomes steady. The resulting technique thus aims at detecting the limited section of the speech which contains the information about the potential impairment of the speech. This section is of interest to the speech therapist as it corresponds to the possibly incorrect movements of speech organs (lower lip and tongue with respect to the vocal tract). Two segmentation models to detect and identify the various sections of the disordered (impaired) speech signals have been developed and compared. The first makes use of a combination of four artificial neural networks. The second is based on a support vector machine (SVM). The SVM has been trained by means of an ad hoc nested algorithm whose outer layer is a metaheuristic while the inner layer is a convex optimization algorithm. Several metaheuristics have been tested and compared, leading to the conclusion that some variants of the compact differential evolution (CDE) algorithm appear to be well-suited to address this problem. Numerical results show that the SVM model with a radial basis function is capable of effective detection of the portion of speech that is of interest to a therapist. The best performance has been achieved when the system is trained by the nested algorithm whose outer layer is a hybrid population-based/CDE method. A population-based approach displays the best performance for the isolation of silence/noise sections and the detection of unvoiced sections. On the other hand, a compact approach appears to be clearly well-suited to detect the beginning of the steady state of the voiced signal. Both of the proposed segmentation models outperformed two modern segmentation techniques based on Gaussian mixture models and deep learning.
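As a rough stand-in for the SVM-based segmenter, the sketch below labels 20-ms frames as silent, unvoiced, or voiced from two classic features (log energy and zero-crossing rate) with an RBF-kernel SVM. The paper's knowledge-based features and its nested metaheuristic tuning of C and gamma are replaced here by placeholders and a plain grid search.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def frame_features(x, fs, frame_ms=20):
    """Per-frame log energy and zero-crossing rate."""
    n = int(fs * frame_ms / 1000)
    frames = x[: len(x) // n * n].reshape(-1, n)
    log_e = np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([log_e, zcr])

def train_segmenter(X_train, y_train):
    """X_train: feature rows per frame; y_train: 0=silent, 1=unvoiced, 2=voiced."""
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {"svc__C": [1, 10, 100], "svc__gamma": ["scale", 0.1, 1.0]}
    return GridSearchCV(svm, grid, cv=3).fit(X_train, y_train)
```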
Na, Sung Dae; Wei, Qun; Seong, Ki Woong; Cho, Jin Ho; Kim, Myoung Nam
2018-01-01
The conventional methods of speech enhancement, noise reduction, and voice activity detection are based on suppressing the noise or non-speech components of the target air-conduction signals. However, air-conducted speech is hard to differentiate from babble or white noise. To overcome this problem, the proposed algorithm uses bone-conduction speech signals and soft thresholding based on the Shannon entropy principle and the cross-correlation of the air- and bone-conduction signals. A new algorithm for speech detection and noise reduction is proposed, which uses Shannon entropy and cross-correlation with the bone-conduction speech signals to threshold the wavelet packet coefficients of the noisy speech. Each threshold is generated by the entropy and cross-correlation approaches in the bands of the wavelet packet decomposition. The method was evaluated in MATLAB simulations with objective quality measures (PESQ, RMSE, correlation, and SNR), and the air- and bone-conduction speech signals and their spectra were compared to verify feasibility. The results confirm the high performance of the proposed method, making it well suited to future applications in communication devices and in noisy environments such as construction sites and military operations.
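A simplified Python sketch of the thresholding mechanics follows, assuming the PyWavelets package: in each wavelet-packet band, the air-conduction coefficients are soft-thresholded more aggressively when the band is entropy-wise noise-like and correlates poorly with the bone-conduction reference. The exact threshold rule of the paper is more elaborate; the formula below is an illustrative stand-in.

```python
import numpy as np
import pywt  # assumed dependency (PyWavelets)

def denoise(air, bone, wavelet="db4", level=4):
    wp_a = pywt.WaveletPacket(air, wavelet, maxlevel=level)
    wp_b = pywt.WaveletPacket(bone, wavelet, maxlevel=level)
    for node_a, node_b in zip(wp_a.get_level(level), wp_b.get_level(level)):
        a, b = node_a.data, node_b.data
        # Normalized cross-correlation with the bone-conducted band:
        rho = abs(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        # Shannon entropy of the band's normalized coefficient energies:
        p = a ** 2 / (np.sum(a ** 2) + 1e-12)
        h = -np.sum(p * np.log2(p + 1e-12)) / np.log2(len(a))
        # Noise-like (high entropy) and poorly correlated bands get larger thresholds.
        thr = np.std(a) * h * (1.0 - rho)
        node_a.data = pywt.threshold(a, thr, mode="soft")
    return wp_a.reconstruct(update=False)
```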
Gilbert, Kathryn E
2013-02-01
Recent attempts to regulate Crisis Pregnancy Centers, pseudoclinics that surreptitiously aim to dissuade pregnant women from choosing abortion, have confronted the thorny problem of how to define commercial speech. The Supreme Court has offered three potential answers to this definitional quandary. This Note uses the Crisis Pregnancy Center cases to demonstrate that courts should use one of these solutions, the factor-based approach of Bolger v. Youngs Drugs Products Corp., to define commercial speech in the Crisis Pregnancy Center cases and elsewhere. In principle and in application, the Bolger factor-based approach succeeds in structuring commercial speech analysis at the margins of the doctrine.
Speech graphs provide a quantitative measure of thought disorder in psychosis.
Mota, Natalia B; Vasconcelos, Nivaldo A P; Lemos, Nathalia; Pieretti, Ana C; Kinouchi, Osame; Cecchi, Guillermo A; Copelli, Mauro; Ribeiro, Sidarta
2012-01-01
Psychosis has various causes, including mania and schizophrenia. Since the differential diagnosis of psychosis is exclusively based on subjective assessments of oral interviews with patients, an objective quantification of the speech disturbances that characterize mania and schizophrenia is in order. In principle, such quantification could be achieved by the analysis of speech graphs. A graph represents a network with nodes connected by edges; in speech graphs, nodes correspond to words and edges correspond to semantic and grammatical relationships. To quantify speech differences related to psychosis, interviews with schizophrenics, manics, and normal subjects were recorded and represented as graphs. Manics scored significantly higher than schizophrenics in ten graph measures. Psychopathological symptoms such as logorrhea, poor speech, and flight of thoughts were grasped by the analysis even when verbosity differences were discounted. Binary classifiers based on speech graph measures sorted schizophrenics from manics with up to 93.8% sensitivity and 93.7% specificity. In contrast, sorting based on the scores of two standard psychiatric scales (BPRS and PANSS) reached only 62.5% sensitivity and specificity. The results demonstrate that alterations of the thought process manifested in the speech of psychotic patients can be objectively measured using graph-theoretical tools, developed to capture specific features of the normal and dysfunctional flow of thought, such as divergence and recurrence. The quantitative analysis of speech graphs is not redundant with standard psychometric scales but rather complementary, as it yields a very accurate sorting of schizophrenics and manics. Overall, the results point to automated psychiatric diagnosis based not on what is said, but on how it is said.
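The graph construction itself is compact; a minimal Python version with networkx is shown below. Nodes are word types and directed edges link consecutive words; the self-loop, reciprocal-pair, and connectivity counts printed here stand in for the paper's fuller measure set (and its semantic edges are not reproduced).

```python
import networkx as nx

def speech_graph(transcript):
    """Directed word graph: one node per word type, one edge per word bigram."""
    words = transcript.lower().split()
    g = nx.DiGraph()
    g.add_edges_from(zip(words, words[1:]))
    return g

g = speech_graph("i had a dream a dream i had")
print("nodes:", g.number_of_nodes(), "edges:", g.number_of_edges())
# Recurrence-related measures of the kind used to separate mania from schizophrenia:
print("self-loops:", nx.number_of_selfloops(g))
print("reciprocal pairs:",
      sum(1 for u, v in g.edges if u != v and g.has_edge(v, u)) // 2)
print("largest strongly connected component:",
      len(max(nx.strongly_connected_components(g), key=len)))
```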
Micro-Based Speech Recognition: Instructional Innovation for Handicapped Learners.
ERIC Educational Resources Information Center
Horn, Carin E.; Scott, Brian L.
A new voice based learning system (VBLS), which allows the handicapped user to interact with a microcomputer by voice commands, is described. Speech or voice recognition is the computerized process of identifying a spoken word or phrase, including those resulting from speech impediments. This new technology is helpful to the severely physically…
ERIC Educational Resources Information Center
Nozari, Nazbanou; Dell, Gary S.; Schwartz, Myrna F.
2011-01-01
Despite the existence of speech errors, verbal communication is successful because speakers can detect (and correct) their errors. The standard theory of speech-error detection, the perceptual-loop account, posits that the comprehension system monitors production output for errors. Such a comprehension-based monitor, however, cannot explain the…
Using Web Speech Technology with Language Learning Applications
ERIC Educational Resources Information Center
Daniels, Paul
2015-01-01
In this article, the author presents the history of human-to-computer interaction based upon the design of sophisticated computerized speech recognition algorithms. Advancements such as the arrival of cloud-based computing and software like Google's Web Speech API allows anyone with an Internet connection and Chrome browser to take advantage of…
Evidence-Based Practice for Children with Speech Sound Disorders: Part 1 Narrative Review
ERIC Educational Resources Information Center
Baker, Elise; McLeod, Sharynne
2011-01-01
Purpose: This article provides a comprehensive narrative review of intervention studies for children with speech sound disorders (SSD). Its companion paper (Baker & McLeod, 2011) provides a tutorial and clinical example of how speech-language pathologists (SLPs) can engage in evidence-based practice (EBP) for this clinical population. Method:…
Visemic Processing in Audiovisual Discrimination of Natural Speech: A Simultaneous fMRI-EEG Study
ERIC Educational Resources Information Center
Dubois, Cyril; Otzenberger, Helene; Gounot, Daniel; Sock, Rudolph; Metz-Lutz, Marie-Noelle
2012-01-01
In a noisy environment, visual perception of articulatory movements improves natural speech intelligibility. Parallel to phonemic processing based on auditory signal, visemic processing constitutes a counterpart based on "visemes", the distinctive visual units of speech. Aiming at investigating the neural substrates of visemic processing in a…
Voxel-based morphometry of auditory and speech-related cortex in stutterers.
Beal, Deryk S; Gracco, Vincent L; Lafaille, Sophie J; De Nil, Luc F
2007-08-06
Stutterers demonstrate unique functional neural activation patterns during speech production, including reduced auditory activation, relative to nonstutterers. The extent to which these functional differences are accompanied by abnormal morphology of the brain in stutterers is unclear. This study examined the neuroanatomical differences in speech-related cortex between stutterers and nonstutterers using voxel-based morphometry. Results revealed significant differences in localized grey matter and white matter densities of left and right hemisphere regions involved in auditory processing and speech production.
Muthusamy, Hariharan; Polat, Kemal; Yaacob, Sazali
2015-01-01
In recent years, many research works have been published using speech-related features for speech emotion recognition; however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features, and relative wavelet packet energy and entropy features were extracted from the emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were utilized to gauge the proposed method. An extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted and the results show that the proposed method significantly improves speech emotion recognition performance compared to previous works published in the literature. PMID:25799141
Speech and Hearing Science, Anatomy and Physiology.
ERIC Educational Resources Information Center
Zemlin, Willard R.
Written for those interested in speech pathology and audiology, the text presents the anatomical, physiological, and neurological bases for speech and hearing. Anatomical nomenclature used in the speech and hearing sciences is introduced and the breathing mechanism is defined and discussed in terms of the respiratory passage, the framework and…
Interventions for Speech Sound Disorders in Children
ERIC Educational Resources Information Center
Williams, A. Lynn, Ed.; McLeod, Sharynne, Ed.; McCauley, Rebecca J., Ed.
2010-01-01
With detailed discussion and invaluable video footage of 23 treatment interventions for speech sound disorders (SSDs) in children, this textbook and DVD set should be part of every speech-language pathologist's professional preparation. Focusing on children with functional or motor-based speech disorders from early childhood through the early…
Developing a Weighted Measure of Speech Sound Accuracy
ERIC Educational Resources Information Center
Preston, Jonathan L.; Ramsdell, Heather L.; Oller, D. Kimbrough; Edwards, Mary Louise; Tobin, Stephen J.
2011-01-01
Purpose: To develop a system for numerically quantifying a speaker's phonetic accuracy through transcription-based measures. With a focus on normal and disordered speech in children, the authors describe a system for differentially weighting speech sound errors on the basis of various levels of phonetic accuracy using a Weighted Speech Sound…
Automated Assessment of Speech Fluency for L2 English Learners
ERIC Educational Resources Information Center
Yoon, Su-Youn
2009-01-01
This dissertation provides an automated scoring method of speech fluency for second language learners of English (L2 learners) that uses speech recognition technology. Non-standard pronunciation, frequent disfluencies, faulty grammar, and inappropriate lexical choices are crucial characteristics of L2 learners' speech. Due to the ease of…
A novel probabilistic framework for event-based speech recognition
NASA Astrophysics Data System (ADS)
Juneja, Amit; Espy-Wilson, Carol
2003-10-01
One of the reasons for unsatisfactory performance of the state-of-the-art automatic speech recognition (ASR) systems is the inferior acoustic modeling of low-level acoustic-phonetic information in the speech signal. An acoustic-phonetic approach to ASR, on the other hand, explicitly targets linguistic information in the speech signal, but such a system for continuous speech recognition (CSR) is not known to exist. A probabilistic and statistical framework for CSR based on the idea of the representation of speech sounds by bundles of binary valued articulatory phonetic features is proposed. Multiple probabilistic sequences of linguistically motivated landmarks are obtained using binary classifiers of manner phonetic features-syllabic, sonorant and continuant-and the knowledge-based acoustic parameters (APs) that are acoustic correlates of those features. The landmarks are then used for the extraction of knowledge-based APs for source and place phonetic features and their binary classification. Probabilistic landmark sequences are constrained using manner class language models for isolated or connected word recognition. The proposed method could overcome the disadvantages encountered by the early acoustic-phonetic knowledge-based systems that led the ASR community to switch to systems highly dependent on statistical pattern analysis methods and probabilistic language or grammar models.
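One small piece of such a system, turning a frame-wise manner-feature probability track into candidate landmarks at threshold crossings, can be illustrated as follows; the probability values are made up, and in practice classifiers over the knowledge-based APs would supply them.

```python
import numpy as np

def landmarks(prob, name, thresh=0.5):
    """Return (frame, '+name'/'-name') events where prob crosses thresh."""
    above = prob >= thresh
    events = []
    for t in range(1, len(prob)):
        if above[t] != above[t - 1]:
            events.append((t, ("+" if above[t] else "-") + name))
    return events

# Hypothetical frame-wise probability of the binary feature "syllabic":
prob_syllabic = np.array([0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1])
print(landmarks(prob_syllabic, "syllabic"))
# -> [(2, '+syllabic'), (5, '-syllabic')]
```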
Central Presbycusis: A Review and Evaluation of the Evidence
Humes, Larry E.; Dubno, Judy R.; Gordon-Salant, Sandra; Lister, Jennifer J.; Cacace, Anthony T.; Cruickshanks, Karen J.; Gates, George A.; Wilson, Richard H.; Wingfield, Arthur
2018-01-01
Background The authors reviewed the evidence regarding the existence of age-related declines in central auditory processes and the consequences of any such declines for everyday communication. Purpose This report summarizes the review process and presents its findings. Data Collection and Analysis The authors reviewed 165 articles germane to central presbycusis. Of the 165 articles, 132 articles with a focus on human behavioral measures for either speech or nonspeech stimuli were selected for further analysis. Results For 76 smaller-scale studies of speech understanding in older adults reviewed, the following findings emerged: (1) the three most commonly studied behavioral measures were speech in competition, temporally distorted speech, and binaural speech perception (especially dichotic listening); (2) for speech in competition and temporally degraded speech, hearing loss proved to have a significant negative effect on performance in most of the laboratory studies; (3) significant negative effects of age, unconfounded by hearing loss, were observed in most of the studies of speech in competing speech, time-compressed speech, and binaural speech perception; and (4) the influence of cognitive processing on speech understanding has been examined much less frequently, but when included, significant positive associations with speech understanding were observed. For 36 smaller-scale studies of the perception of nonspeech stimuli by older adults reviewed, the following findings emerged: (1) the three most frequently studied behavioral measures were gap detection, temporal discrimination, and temporal-order discrimination or identification; (2) hearing loss was seldom a significant factor; and (3) negative effects of age were almost always observed. For 18 studies reviewed that made use of test batteries and medium-to-large sample sizes, the following findings emerged: (1) all studies included speech-based measures of auditory processing; (2) 4 of the 18 studies included nonspeech stimuli; (3) for the speech-based measures, monaural speech in a competing-speech background, dichotic speech, and monaural time-compressed speech were investigated most frequently; (4) the most frequently used tests were the Synthetic Sentence Identification (SSI) test with Ipsilateral Competing Message (ICM), the Dichotic Sentence Identification (DSI) test, and time-compressed speech; (5) many of these studies using speech-based measures reported significant effects of age, but most of these studies were confounded by declines in hearing, cognition, or both; (6) for nonspeech auditory-processing measures, the focus was on measures of temporal processing in all four studies; (7) effects of cognition on nonspeech measures of auditory processing have been studied less frequently, with mixed results, whereas the effects of hearing loss on performance were minimal due to judicious selection of stimuli; and (8) there is a paucity of observational studies using test batteries and longitudinal designs. Conclusions Based on this review of the scientific literature, there is insufficient evidence to confirm the existence of central presbycusis as an isolated entity. On the other hand, recent evidence has been accumulating in support of the existence of central presbycusis as a multifactorial condition that involves age- and/or disease-related changes in the auditory system and in the brain. Moreover, there is a clear need for additional research in this area. PMID:22967738
Foster, Abby; Worrall, Linda; Rose, Miranda; O'Halloran, Robyn
2015-07-01
An evidence-practice gap has been identified in current acute aphasia management practice, with the provision of services to people with aphasia in the acute hospital widely considered in the literature to be inconsistent with best-practice recommendations. The reasons for this evidence-practice gap are unclear; however, speech pathologists practising in this setting have articulated a sense of dissonance regarding their limited service provision to this population. A clearer understanding of why this evidence-practice gap exists is essential in order to support and promote evidence-based approaches to the care of people with aphasia in acute care settings. To provide an understanding of speech pathologists' conceptualization of evidence-based practice for acute post-stroke aphasia, and its implementation. This study adopted a phenomenological approach, underpinned by a social constructivist paradigm. In-depth interviews were conducted with 14 Australian speech pathologists, recruited using a purposive sampling technique. An inductive thematic analysis of the data was undertaken. A single, overarching theme emerged from the data. Speech pathologists demonstrated a sense of disempowerment as a result of their relationship with evidence-based practice for acute aphasia management. Three subthemes contributed to this theme. The first described a restricted conceptualization of evidence-based practice. The second revealed speech pathologists' strained relationships with the research literature. The third elucidated a sense of professional unease over their perceived inability to enact evidence-based clinical recommendations, despite their desire to do so. Speech pathologists identified a current knowledge-practice gap in their management of aphasia in acute hospital settings. Speech pathologists place significant emphasis on the research evidence; however, their engagement with the research is limited, in part because it is perceived to lack clinical utility. A sense of professional dissonance arises from the conflict between a desire to provide best practice and the perceived barriers to implementing evidence-based recommendations clinically, resulting in evidence-based practice becoming a disempowering concept for some. © 2015 Royal College of Speech and Language Therapists.
Design Automation for Streaming Systems
2005-12-16
which are FIFO buffered channels. We develop a process network model for streaming systems (TDFPN) and a hardware description language with built-in ... and may include an automatic address generator. A complete synthesis system would provide separate segment operator implementations for every ...
Speech Synthesis Using Perceptually Motivated Features
2012-01-23
with others a few years prior (with the concurrence of the project's program manager, Willard Larkin). The Perceptual Flow of Phonetic Information and ... In "The Perceptual Flow of Phonetic Processing," consonant confusion matrices are analyzed for patterns of phonetic-feature decoding errors conditioned ... (decoding) is also observed. From these conditional probability patterns, it is proposed that they reflect a temporal flow of perceptual processing
The Pedagogical Use of Mobile Speech Synthesis (TTS): Focus on French Liaison
ERIC Educational Resources Information Center
Liakin, Denis; Cardoso, Walcir; Liakina, Natallia
2017-01-01
We examine the impact of the pedagogical use of mobile TTS on the L2 acquisition of French liaison, a process by which a word-final consonant is pronounced at the beginning of the following word if the latter is vowel-initial (e.g., petit ami -> peti[ta]mi "boyfriend"). The study compares three groups of L2 French students learning…
Buss, Emily; Leibold, Lori J.; Porter, Heather L.; Grose, John H.
2017-01-01
Children perform more poorly than adults on a wide range of masked speech perception paradigms, but this effect is particularly pronounced when the masker itself is also composed of speech. The present study evaluated two factors that might contribute to this effect: the ability to perceptually isolate the target from masker speech, and the ability to recognize target speech based on sparse cues (glimpsing). Speech reception thresholds (SRTs) were estimated for closed-set, disyllabic word recognition in children (5–16 years) and adults in a one- or two-talker masker. Speech maskers were 60 dB sound pressure level (SPL), and they were either presented alone or in combination with a 50-dB-SPL speech-shaped noise masker. There was an age effect overall, but performance was adult-like at a younger age for the one-talker than the two-talker masker. Noise tended to elevate SRTs, particularly for older children and adults, and when summed with the one-talker masker. Removing time-frequency epochs associated with a poor target-to-masker ratio markedly improved SRTs, with larger effects for younger listeners; the age effect was not eliminated, however. Results were interpreted as indicating that development of speech-in-speech recognition is likely impacted by development of both perceptual masking and the ability to recognize speech based on sparse cues. PMID:28464682
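The glimpsing manipulation lends itself to a short sketch: compute STFTs of the separately available target and masker, keep only the time-frequency bins whose target-to-masker ratio clears a criterion, and resynthesize. The window length and 0-dB criterion below are placeholders, not the study's exact parameters.

```python
import numpy as np
from scipy.signal import stft, istft

def glimpsed_mixture(target, masker, fs, tmr_db=0.0, nperseg=512):
    """Zero out time-frequency epochs whose target-to-masker ratio is poor."""
    _, _, T = stft(target, fs, nperseg=nperseg)
    _, _, M = stft(masker, fs, nperseg=nperseg)
    tmr = 20 * (np.log10(np.abs(T) + 1e-12) - np.log10(np.abs(M) + 1e-12))
    keep = tmr > tmr_db  # ideal binary mask computed from the separate signals
    _, y = istft((T + M) * keep, fs, nperseg=nperseg)
    return y  # mixture retaining only the "glimpses"
```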
NASA Astrophysics Data System (ADS)
Nishiura, Takanobu; Nakamura, Satoshi
2003-10-01
Humans communicate with each other through speech by focusing on the target speech among environmental sounds in real acoustic environments. We can easily identify the target sound from other environmental sounds. For hands-free speech recognition, the identification of the target speech from environmental sounds is imperative. This mechanism may also be important for a self-moving robot to sense the acoustic environments and communicate with humans. Therefore, this paper first proposes hidden Markov model (HMM)-based environmental sound source identification. Environmental sounds are modeled by three states of HMMs and evaluated using 92 kinds of environmental sounds. The identification accuracy was 95.4%. This paper also proposes a new HMM composition method that composes speech HMMs and an HMM of categorized environmental sounds for robust environmental sound-added speech recognition. As a result of the evaluation experiments, we confirmed that the proposed HMM composition outperforms the conventional HMM composition with speech HMMs and a noise (environmental sound) HMM trained using noise periods prior to the target speech in a captured signal. [Work supported by Ministry of Public Management, Home Affairs, Posts and Telecommunications of Japan.]
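A bare-bones version of the identification step might look like the following, assuming the hmmlearn package: one three-state Gaussian HMM per environmental sound class, trained on per-frame feature sequences (e.g., MFCCs), with identification by maximum log-likelihood. The HMM-composition step for sound-added speech recognition is not shown.

```python
import numpy as np
from hmmlearn import hmm  # assumed dependency

def train_models(class_sequences, n_states=3):
    """class_sequences: {label: list of (n_frames, n_dims) feature arrays}."""
    models = {}
    for label, seqs in class_sequences.items():
        X = np.vstack(seqs)                 # concatenated training sequences
        lengths = [len(s) for s in seqs]    # per-sequence lengths for fitting
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
        models[label] = m.fit(X, lengths)
    return models

def identify(models, features):
    """Pick the class whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(features))
```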
Pathological speech signal analysis and classification using empirical mode decomposition.
Kaleem, Muhammad; Ghoraani, Behnaz; Guergachi, Aziz; Krishnan, Sridhar
2013-07-01
Automated classification of normal and pathological speech signals can provide an objective and accurate mechanism for pathological speech diagnosis, and is an active area of research. A large part of this research is based on analysis of acoustic measures extracted from sustained vowels. However, sustained vowels do not reflect real-world attributes of voice as effectively as continuous speech, which can take into account important attributes of speech such as rapid voice onset and termination, changes in voice frequency and amplitude, and sudden discontinuities in speech. This paper presents a methodology based on empirical mode decomposition (EMD) for classification of continuous normal and pathological speech signals obtained from a well-known database. EMD is used to decompose randomly chosen portions of speech signals into intrinsic mode functions, which are then analyzed to extract meaningful temporal and spectral features, including true instantaneous features which can capture discriminative information in signals hidden at local time-scales. A total of six features are extracted, and a linear classifier is used with the feature vector to classify continuous speech portions obtained from a database consisting of 51 normal and 161 pathological speakers. A classification accuracy of 95.7% is obtained, thus demonstrating the effectiveness of the methodology.
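The pipeline can be outlined in Python (again assuming the PyEMD package): decompose a speech portion into intrinsic mode functions, summarize each by simple instantaneous-amplitude and instantaneous-frequency statistics, and feed a linear classifier. The generic statistics below are placeholders for the six features actually used in the paper.

```python
import numpy as np
from PyEMD import EMD                      # assumed dependency
from scipy.signal import hilbert
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def imf_features(segment, n_imfs=4):
    """Per-IMF spread of instantaneous amplitude and frequency."""
    imfs = EMD().emd(segment)[:n_imfs]
    feats = []
    for imf in imfs:
        analytic = hilbert(imf)
        inst_amp = np.abs(analytic)
        inst_freq = np.diff(np.unwrap(np.angle(analytic)))  # rad/sample
        feats.extend([np.std(inst_amp), np.std(inst_freq)])
    return np.array(feats)

# Hypothetical usage on labeled speech portions (0 = normal, 1 = pathological):
# X = np.array([imf_features(s) for s in segments])
# clf = LinearDiscriminantAnalysis().fit(X, labels)
```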
Speech and Speech-Related Quality of Life After Late Palate Repair: A Patient's Perspective.
Schönmeyr, Björn; Wendby, Lisa; Sharma, Mitali; Jacobson, Lia; Restrepo, Carolina; Campbell, Alex
2015-07-01
Many patients with cleft palate deformities worldwide receive treatment at a later age than is recommended for normal speech to develop. The outcomes after late palate repairs in terms of speech and quality of life (QOL) still remain largely unstudied. In the current study, questionnaires were used to assess the patients' perception of speech and QOL before and after primary palate repair. All of the patients were operated at a cleft center in northeast India and had a cleft palate with a normal lip or with a cleft lip that had been previously repaired. A total of 134 patients (7-35 years) were interviewed preoperatively and 46 patients (7-32 years) were assessed in the postoperative survey. The survey showed that scores based on the speech handicap index, concerning speech and speech-related QOL, did not improve postoperatively. In fact, the questionnaires indicated that the speech became more unpredictable (P < 0.01) and that nasal regurgitation became worse (P < 0.01) for some patients after surgery. A total of 78% of the patients were still satisfied with the surgery and all of the patients reported that their self-confidence had improved after the operation. Thus, the majority of interviewed patients who underwent late primary palate repair were satisfied with the surgery. At the same time, speech and speech-related QOL did not improve according to the speech handicap index-based survey. Speech predictability may even become worse and nasal regurgitation may increase after late palate repair, according to these results.
Communication media and the dead: from the Stone Age to Facebook.
Walter, Tony
2015-07-03
This article argues as follows: (i) The presence of the dead within a society depends in part on available communication technologies, specifically speech, stone, sculpture, writing, printing, photography and phonography (including the mass media), and most recently the internet. (ii) Each communication technology affords possibilities for the dead to construct and legitimate particular social groups and institutions - from the oral construction of kinship, to the megalithic legitimation of the territorial rights of chiefdoms, to the written word's construction of world religions and nations, to the photographic and phonographic construction of celebrity-based neo-tribalism, and to the digital reconstruction of family and friendship. (iii) Historically, concerns about the dead have on a number of occasions aided the development of new communication technologies - the causal connection between the two can go both ways. The argument is based primarily on critical synthesis of existing research literature.
What Makes a Caseload (Un) Manageable? School-Based Speech-Language Pathologists Speak
ERIC Educational Resources Information Center
Katz, Lauren A.; Maag, Abby; Fallon, Karen A.; Blenkarn, Katie; Smith, Megan K.
2010-01-01
Purpose: Large caseload sizes and a shortage of speech-language pathologists (SLPs) are ongoing concerns in the field of speech and language. This study was conducted to identify current mean caseload size for school-based SLPs, a threshold at which caseload size begins to be perceived as unmanageable, and variables contributing to school-based…
O-I-C: An Orality-Based Procedure for Teaching Interactive Communication in the Basic Course.
ERIC Educational Resources Information Center
Haynes, W. Lance
In order to improve instruction in basic speech courses, a program was developed adapting creative problem solving to speech preparation and to interactive speech communication. The program, called O-I-C (Orientation, Incubation, and Composition) and based on Howell's five levels of competence and their implications, begins with a thorough study…
ERIC Educational Resources Information Center
Spek, B.; Wieringa-de Waard, M.; Lucas, C.; van Dijk, N.
2013-01-01
Background: The importance and value of the principles of evidence-based practice (EBP) in the decision-making process is recognized by speech-language therapists (SLTs) worldwide and as a result curricula for speech-language therapy students incorporated EBP principles. However, the willingness actually to use EBP principles in their future…
ERIC Educational Resources Information Center
Payne, Elinor; Post, Brechtje; Astruc, Lluisa; Prieto, Pilar; Vanrell, Maria del Mar
2012-01-01
Interval-based rhythm metrics were applied to the speech of English, Catalan and Spanish 2, 4 and 6 year-olds, and compared with the (adult-directed) speech of their mothers. Results reveal that child speech does not fall into a well-defined rhythmic class: for all three languages, it is more "vocalic" (higher %V) than adult speech and…
Treatment of Children with Speech Oral Placement Disorders (OPDs): A Paradigm Emerges
ERIC Educational Resources Information Center
Bahr, Diane; Rosenfeld-Johnson, Sara
2010-01-01
Epidemiological research was used to develop the Speech Disorders Classification System (SDCS). The SDCS is an important speech diagnostic paradigm in the field of speech-language pathology. This paradigm could be expanded and refined to also address treatment while meeting the standards of evidence-based practice. The article assists that process…
Lu, Huanhuan; Wang, Fuzhong; Zhang, Huichun
2016-04-01
Traditional speech detection methods regard noise as a jamming signal to be filtered out, but against a strong noise background these methods lose part of the original speech signal while eliminating the noise. Stochastic resonance can use noise energy to amplify a weak signal and suppress the noise. Based on stochastic resonance theory, a new method for extracting weak speech signals by adaptive stochastic resonance is proposed. This method, combined with twice sampling, realizes the detection of weak speech signals in strong noise. The system parameters a and b are adjusted adaptively by evaluating the signal-to-noise ratio of the output signal, and the weak speech signal is then optimally detected. Experimental simulation analysis showed that, under a strong noise background, the output signal-to-noise ratio increased from an initial value of -7 dB to about 0.86 dB, a signal-to-noise ratio gain of 7.86 dB. This method clearly raises the signal-to-noise ratio of the output speech signals and offers a new approach to detecting weak speech signals in strong noise environments.
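For readers unfamiliar with the technique, a minimal sketch of the idea follows: a bistable system dx/dt = a*x - b*x^3 driven by the noisy input, with the parameters a and b tuned by maximizing an output-SNR proxy. The test signal, sample rate, and grid-search adaptation are illustrative assumptions, not the authors' implementation.

```python
# Sketch of weak-signal detection via a bistable stochastic-resonance system.
import numpy as np

fs = 8000                                   # sample rate (Hz), assumed
t = np.arange(0, 1.0, 1.0 / fs)
weak = 0.1 * np.sin(2 * np.pi * 40 * t)     # weak low-frequency component
noisy = weak + 0.8 * np.random.randn(t.size)

def bistable_sr(s, a, b, dt):
    """Euler integration of dx/dt = a*x - b*x**3 + s(t)."""
    x = np.zeros_like(s)
    for n in range(1, s.size):
        x[n] = x[n - 1] + dt * (a * x[n - 1] - b * x[n - 1] ** 3 + s[n - 1])
    return x

def output_snr(x, f0, fs):
    """Crude SNR proxy: spectral power at f0 versus all other bins."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    k = int(round(f0 * x.size / fs))
    return 10 * np.log10(spec[k] / (spec.sum() - spec[k] + 1e-12))

# "Adaptive" tuning reduced to a grid search over (a, b) maximizing output SNR.
best = max(((a, b) for a in (0.5, 1.0, 2.0) for b in (0.5, 1.0, 2.0)),
           key=lambda p: output_snr(bistable_sr(noisy, *p, 1.0 / fs), 40, fs))
print("selected (a, b):", best)
```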
Motor-based intervention protocols in treatment of childhood apraxia of speech (CAS)
Maas, Edwin; Gildersleeve-Neumann, Christina; Jakielski, Kathy J.; Stoeckel, Ruth
2014-01-01
This paper reviews current trends in treatment for childhood apraxia of speech (CAS), with a particular emphasis on motor-based intervention protocols. The paper first briefly discusses how CAS fits into the typology of speech sound disorders, followed by a discussion of the potential relevance of principles derived from the motor learning literature for CAS treatment. Next, different motor-based treatment protocols are reviewed, along with their evidence base. The paper concludes with a summary and discussion of future research needs. PMID:25313348
Effects of emotion on different phoneme classes
NASA Astrophysics Data System (ADS)
Lee, Chul Min; Yildirim, Serdar; Bulut, Murtaza; Busso, Carlos; Kazemzadeh, Abe; Lee, Sungbok; Narayanan, Shrikanth
2004-10-01
This study investigates the effects of emotion on different phoneme classes using short-term spectral features. In the research on emotion in speech, most studies have focused on prosodic features of speech. In this study, based on the hypothesis that different emotions have varying effects on the properties of the different speech sounds, we investigate the usefulness of phoneme-class level acoustic modeling for automatic emotion classification. Hidden Markov models (HMM) based on short-term spectral features for five broad phonetic classes are used for this purpose using data obtained from recordings of two actresses. Each speaker produces 211 sentences with four different emotions (neutral, sad, angry, happy). Using the speech material we trained and compared the performances of two sets of HMM classifiers: a generic set of "emotional speech" HMMs (one for each emotion) and a set of broad phonetic-class based HMMs (vowel, glide, nasal, stop, fricative) for each emotion type considered. Comparison of classification results indicates that different phoneme classes were affected differently by emotional change and that the vowel sounds are the most important indicator of emotions in speech. Detailed results and their implications on the underlying speech articulation will be discussed.
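A hedged sketch of the classifier layout described above: one Gaussian HMM per (emotion, phonetic class) pair, with classification by maximum summed log-likelihood. Feature extraction and the frame/class segmentation are assumed inputs; state counts and training settings are illustrative.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

EMOTIONS = ["neutral", "sad", "angry", "happy"]
CLASSES = ["vowel", "glide", "nasal", "stop", "fricative"]

def train_models(train_data):
    """train_data[(emotion, cls)] -> list of (n_frames, n_features) arrays."""
    models = {}
    for key, seqs in train_data.items():
        X = np.vstack(seqs)
        lengths = [s.shape[0] for s in seqs]
        m = GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
        models[key] = m.fit(X, lengths)
    return models

def classify(models, segments):
    """segments: list of (cls, frames); sum per-emotion log-likelihoods."""
    scores = {e: 0.0 for e in EMOTIONS}
    for cls, frames in segments:
        for e in EMOTIONS:
            scores[e] += models[(e, cls)].score(frames)
    return max(scores, key=scores.get)
```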
Speech enhancement using the modified phase-opponency model.
Deshmukh, Om D; Espy-Wilson, Carol Y; Carney, Laurel H
2007-06-01
In this paper we present a model called the Modified Phase-Opponency (MPO) model for single-channel speech enhancement when the speech is corrupted by additive noise. The MPO model is based on the auditory PO model, proposed for detection of tones in noise. The PO model includes a physiologically realistic mechanism for processing the information in neural discharge times and exploits the frequency-dependent phase properties of the tuned filters in the auditory periphery by using a cross-auditory-nerve-fiber coincidence detection for extracting temporal cues. The MPO model alters the components of the PO model such that the basic functionality of the PO model is maintained but the properties of the model can be analyzed and modified independently. The MPO-based speech enhancement scheme does not need to estimate the noise characteristics nor does it assume that the noise satisfies any statistical model. The MPO technique leads to the lowest value of the LPC-based objective measures and the highest value of the perceptual evaluation of speech quality measure compared to other methods when the speech signals are corrupted by fluctuating noise. Combining the MPO speech enhancement technique with our aperiodicity, periodicity, and pitch detector further improves its performance.
Joint Spatial-Spectral Feature Space Clustering for Speech Activity Detection from ECoG Signals
Kanas, Vasileios G.; Mporas, Iosif; Benz, Heather L.; Sgarbas, Kyriakos N.; Bezerianos, Anastasios; Crone, Nathan E.
2014-01-01
Brain machine interfaces for speech restoration have been extensively studied for more than two decades. The success of such a system will depend in part on selecting the best brain recording sites and signal features corresponding to speech production. The purpose of this study was to detect speech activity automatically from electrocorticographic signals based on joint spatial-frequency clustering of the ECoG feature space. For this study, the ECoG signals were recorded while a subject performed two different syllable repetition tasks. We found that the optimal frequency resolution to detect speech activity from ECoG signals was 8 Hz, achieving 98.8% accuracy by employing support vector machines (SVM) as a classifier. We also defined the cortical areas that held the most information about the discrimination of speech and non-speech time intervals. Additionally, the results shed light on the distinct cortical areas associated with the two syllable repetition tasks and may contribute to the development of portable ECoG-based communication. PMID:24658248
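An illustrative sketch (not the study's code) of speech-activity detection from ECoG band-power features at the reported 8 Hz spectral resolution, with a linear SVM standing in for the classifier. Channel count, sample rate, and labels are assumed placeholders.

```python
import numpy as np
from scipy.signal import welch
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

fs = 1000                                    # ECoG sample rate (Hz), assumed

def band_power_features(windows):
    """windows: (n_trials, n_channels, n_samples) -> joint spatial-spectral vectors.

    nperseg = fs // 8 gives the 8 Hz frequency resolution reported as optimal.
    """
    feats = []
    for trial in windows:
        f, pxx = welch(trial, fs=fs, nperseg=fs // 8, axis=-1)
        feats.append(np.log(pxx).reshape(-1))
    return np.array(feats)

# X: ECoG windows, y: 1 = speech, 0 = non-speech (placeholder data here)
X = band_power_features(np.random.randn(40, 16, fs))
y = np.repeat([0, 1], 20)
print(cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean())
```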
ERIC Educational Resources Information Center
Baker, Elise; McLeod, Sharynne
2011-01-01
Purpose: This article provides both a tutorial and a clinical example of how speech-language pathologists (SLPs) can conduct evidence-based practice (EBP) when working with children with speech sound disorders (SSDs). It is a companion paper to the narrative review of 134 intervention studies for children who have an SSD (Baker & McLeod, 2011).…
ERIC Educational Resources Information Center
Huettig, Falk; Hartsuiker, Robert J.
2010-01-01
Theories of verbal self-monitoring generally assume an internal (pre-articulatory) monitoring channel, but there is debate about whether this channel relies on speech perception or on production-internal mechanisms. Perception-based theories predict that listening to one's own inner speech has similar behavioural consequences as listening to…
Public Speaking Anxiety: Comparing Face-to-Face and Web-Based Speeches
ERIC Educational Resources Information Center
Campbell, Scott; Larson, James
2013-01-01
This study determines whether or not students experience a different level of anxiety when giving a speech to a group of people in a traditional face-to-face classroom setting versus a speech given to an audience (visible on a projected screen) into a camera using distance or web-based technology. The study included approximately 70 students.…
The influence of speech rate and accent on access and use of semantic information.
Sajin, Stanislav M; Connine, Cynthia M
2017-04-01
Circumstances in which the speech input is presented in sub-optimal conditions generally lead to processing costs affecting spoken word recognition. The current study indicates that some processing demands imposed by listening to difficult speech can be mitigated by feedback from semantic knowledge. A set of lexical decision experiments examined how foreign accented speech and word duration impact access to semantic knowledge in spoken word recognition. Results indicate that when listeners process accented speech, the reliance on semantic information increases. Speech rate was not observed to influence semantic access, except in the setting in which unusually slow accented speech was presented. These findings support interactive activation models of spoken word recognition in which attention is modulated based on speech demands.
New Ways in Teaching Connected Speech. New Ways Series
ERIC Educational Resources Information Center
Brown, James Dean, Ed.
2012-01-01
Connected speech is based on a set of rules used to modify pronunciations so that words connect and flow more smoothly in natural speech (hafta versus have to). Native speakers of English tend to feel that connected speech is friendlier, more natural, more sympathetic, and more personal. Is there any reason why learners of English would prefer to…
ERIC Educational Resources Information Center
Crowe, Kathryn; Cumming, Tamara; McCormack, Jane; Baker, Elise; McLeod, Sharynne; Wren, Yvonne; Roulstone, Sue; Masso, Sarah
2017-01-01
Early childhood educators are frequently called on to support preschool-aged children with speech sound disorders and to engage these children in activities that target their speech production. This study explored factors that acted as facilitators and/or barriers to the provision of computer-based support for children with speech sound disorders…
Zourmand, Alireza; Mirhassani, Seyed Mostafa; Ting, Hua-Nong; Bux, Shaik Ismail; Ng, Kwan Hoong; Bilgen, Mehmet; Jalaludin, Mohd Amin
2014-07-25
The phonetic properties of six Malay vowels are investigated using magnetic resonance imaging (MRI) to visualize the vocal tract in order to obtain dynamic articulatory parameters during speech production. To resolve image blurring due to the tongue movement during the scanning process, a method based on active contour extraction is used to track tongue contours. The proposed method efficiently tracks tongue contours despite the partial blurring of MRI images. Consequently, the articulatory parameters are effectively measured as tongue movement is observed, and the specific shape of the tongue and its position for all six uttered Malay vowels are determined. Speech rehabilitation procedures demand some kind of visually perceivable prototype of speech articulation. To investigate the validity of the measured articulatory parameters based on the acoustic theory of speech production, an acoustic analysis of the vowels uttered by subjects has been performed. As the acoustic speech and articulatory parameters of uttered speech were examined, a correlation between formant frequencies and articulatory parameters was observed. The experiments reported a positive correlation between the constriction location of the tongue body and the first formant frequency, as well as a negative correlation between the constriction location of the tongue tip and the second formant frequency. The results demonstrate that the proposed method is an effective tool for the dynamic study of speech production.
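A sketch of tongue-contour tracking on a single MRI slice with an active contour (snake), in the spirit of the method described; the image file, initial contour, and smoothing weights are illustrative assumptions.

```python
import numpy as np
from skimage import io, filters
from skimage.segmentation import active_contour

img = io.imread("midsagittal_frame.png", as_gray=True)   # hypothetical frame
smooth = filters.gaussian(img, sigma=2)                   # counter partial blur

# Initialize the snake as an arc roughly along the expected tongue surface.
s = np.linspace(0, np.pi, 200)
init = np.column_stack([80 + 40 * np.sin(s), 60 + 50 * np.cos(s)])

snake = active_contour(smooth, init, alpha=0.015, beta=10, gamma=0.001)
# 'snake' is an (N, 2) array of (row, col) points tracing the tongue contour;
# tracking across frames would re-use each frame's result to seed the next.
```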
Enhancing speech recognition using improved particle swarm optimization based hidden Markov model.
Selvaraj, Lokesh; Ganesan, Balakrishnan
2014-01-01
Enhancing speech recognition is the primary intention of this work. In this paper a novel speech recognition method based on vector quantization and improved particle swarm optimization (IPSO) is suggested. The suggested methodology contains four stages, namely, (i) denoising, (ii) feature mining, (iii) vector quantization, and (iv) an IPSO-based hidden Markov model (HMM) technique (IP-HMM). At first, the speech signals are denoised using a median filter. Next, characteristics such as peak, pitch spectrum, Mel-frequency cepstral coefficients (MFCC), mean, standard deviation, and minimum and maximum of the signal are extracted from the denoised signal. Following that, to accomplish the training process, the extracted characteristics are given to genetic-algorithm-based codebook generation in vector quantization. The initial populations for the genetic algorithm are created by selecting random code vectors from the training set, and IP-HMM carries out the recognition; the novelty here lies in the crossover step of the genetic operations. The proposed speech recognition technique offers 97.14% accuracy.
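A hedged sketch of the front end of such a pipeline: median-filter denoising, MFCC extraction, and a vector-quantization codebook. KMeans stands in for the paper's genetic-algorithm codebook search, and the file name is assumed.

```python
import librosa
import numpy as np
from scipy.signal import medfilt
from sklearn.cluster import KMeans

y, sr = librosa.load("utterance.wav", sr=16000)    # hypothetical recording
y = medfilt(y, kernel_size=5)                      # (i) denoising

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (ii) feature mining

vq = KMeans(n_clusters=64, n_init=10).fit(mfcc)    # (iii) VQ codebook
codes = vq.predict(mfcc)                           # discrete symbol stream
# (iv) In the paper these symbols feed an HMM whose training is tuned by
# improved particle swarm optimization; that stage is omitted in this sketch.
```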
Harlander, Niklas; Rosenkranz, Tobias; Hohmann, Volker
2012-08-01
Single channel noise reduction has been well investigated and seems to have reached its limits in terms of speech intelligibility improvement; however, the quality of such schemes can still be advanced. This study tests to what extent novel model-based processing schemes might improve performance, in particular for non-stationary noise conditions. Two prototype model-based algorithms, a speech-model-based and an auditory-model-based algorithm, were compared to a state-of-the-art non-parametric minimum statistics algorithm. A speech intelligibility test, preference rating, and listening effort scaling were performed. Additionally, three objective quality measures for the signal, background, and overall distortions were applied. For a better comparison of all algorithms, particular attention was given to the use of a similar Wiener-based gain rule. The perceptual investigation was performed with fourteen hearing-impaired subjects. The results revealed that the non-parametric algorithm and the auditory-model-based algorithm did not affect speech intelligibility, whereas the speech-model-based algorithm slightly decreased intelligibility. In terms of subjective quality, both model-based algorithms performed better than the unprocessed condition and the reference, in particular for highly non-stationary noise environments. The data support the hypothesis that model-based algorithms are promising for improving performance in non-stationary noise conditions.
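A minimal sketch of the Wiener-based gain rule such schemes share: G(k) = xi / (1 + xi), with a decision-directed a priori SNR estimate. Taking the noise PSD from leading noise-only frames is a simplifying assumption, not any of the compared algorithms.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(x, fs, alpha=0.98):
    f, t, X = stft(x, fs=fs, nperseg=512)
    noise_psd = np.mean(np.abs(X[:, :5]) ** 2, axis=1, keepdims=True)
    S_prev = np.zeros((X.shape[0], 1))
    out = np.zeros_like(X)
    for m in range(X.shape[1]):
        P = np.abs(X[:, m:m+1]) ** 2
        gamma = P / noise_psd                          # a posteriori SNR
        xi = alpha * S_prev / noise_psd + (1 - alpha) * np.maximum(gamma - 1, 0)
        G = xi / (1 + xi)                              # Wiener gain
        out[:, m:m+1] = G * X[:, m:m+1]
        S_prev = np.abs(out[:, m:m+1]) ** 2
    return istft(out, fs=fs, nperseg=512)[1]
```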
Dwivedi, Raghav C; St Rose, Suzanne; Chisholm, Edward J; Bisase, Brian; Amen, Furrat; Nutting, Christopher M; Clarke, Peter M; Kerawala, Cyrus J; Rhys-Evans, Peter H; Harrington, Kevin J; Kazi, Rehan
2012-06-01
The aim of this study was to explore post-treatment speech impairments using the English version of the Speech Handicap Index (SHI), the first speech-specific questionnaire, in a cohort of oral cavity (OC) and oropharyngeal (OP) cancer patients. Sixty-three consecutive OC and OP cancer patients in follow-up participated in this study. Descriptive analyses are presented as percentages, while the Mann-Whitney U-test and Kruskal-Wallis test were used for the quantitative variables. Statistical Package for the Social Sciences 15 (SPSS Inc., Chicago, IL) was used for the statistical analyses. Over a third (36.1%) of patients reported their speech as either average or bad. Speech intelligibility and articulation were the main speech concerns for 58.8% and 52.9% of OC and 31.6% and 34.2% of OP cancer patients, respectively, while feeling incompetent and being less outgoing were the speech-related psychosocial concerns for 64.7% and 23.5% of OC and 15.8% and 18.4% of OP cancer patients, respectively. Worse speech outcomes were noted for oral tongue and base of tongue cancers vs. tonsillar cancers, with mean (SD) values of 56.7 (31.3) and 52.0 (38.4) vs. 10.9 (14.8) (P<0.001), and for late vs. early T stage cancers, 65.0 (29.9) vs. 29.3 (32.7) (P<0.005). The English version of the SHI is a reliable, valid and useful tool for the evaluation of speech in HNC patients. Over one-third of OC and OP cancer patients reported speech problems in their day-to-day life. Advanced T-stage tumors affecting the oral tongue or base of tongue are particularly associated with poor speech outcomes. Copyright © 2012 Elsevier Ltd. All rights reserved.
Rhythmic patterning in Malaysian and Singapore English.
Tan, Rachel Siew Kuang; Low, Ee-Ling
2014-06-01
Previous work on the rhythm of Malaysian English has been based on impressionistic observations. This paper uses acoustic analysis to measure the rhythmic patterns of Malaysian English. Recordings of the read speech and spontaneous speech of 10 Malaysian English speakers were analyzed and compared with recordings of an equivalent sample of Singaporean English speakers. Analysis was done using two rhythmic indexes, the PVI and VarcoV. It was found that although the rhythm of the read speech of the Singaporean speakers was syllable-based, as described by previous studies, the rhythm of the Malaysian speakers was even more syllable-based. Analysis of the syllables in specific utterances showed that Malaysian speakers did not reduce vowels as much as Singaporean speakers. Results for the spontaneous speech confirmed the findings for the read speech; that is, the same rhythmic patterning was found in contexts that normally trigger vowel reduction.
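The two rhythm indexes used above in compact form; the input durations are vowel interval lengths in milliseconds, and the values below are made up for illustration, not the study's data.

```python
import numpy as np

def npvi(d):
    """Normalized Pairwise Variability Index over successive intervals."""
    d = np.asarray(d, dtype=float)
    pairs = np.abs(d[1:] - d[:-1]) / ((d[1:] + d[:-1]) / 2)
    return 100 * pairs.mean()

def varco_v(d):
    """VarcoV: standard deviation of vowel durations normalized by the mean."""
    d = np.asarray(d, dtype=float)
    return 100 * d.std() / d.mean()

vowels = [62, 75, 58, 130, 64, 71]    # illustrative vowel durations (ms)
print(npvi(vowels), varco_v(vowels))  # higher values = more "stress-based"
```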
Assessing Auditory Discrimination Skill of Malay Children Using Computer-based Method.
Ting, H; Yunus, J; Mohd Nordin, M Z
2005-01-01
The purpose of this paper is to investigate the auditory discrimination skill of Malay children using a computer-based method. Currently, most auditory discrimination assessments are conducted manually by a speech-language pathologist. These conventional tests are general tests of sound discrimination, which do not reflect the client's specific speech sound errors. Thus, we propose a computer-based Malay auditory discrimination test to automate the whole assessment process as well as to customize the test according to the client's specific speech error sounds. The ability to discriminate voiced and unvoiced Malay speech sounds was studied in Malay children aged between 7 and 10 years old. The study showed no major difficulty for the children in discriminating the Malay speech sounds except differentiating the /g/-/k/ sounds. On average, the 7-year-old children failed to discriminate the /g/-/k/ sounds.
Evolution of language: Lessons from the genome.
Fisher, Simon E
2017-02-01
The post-genomic era is an exciting time for researchers interested in the biology of speech and language. Substantive advances in molecular methodologies have opened up entire vistas of investigation that were not previously possible, or in some cases even imagined. Speculations concerning the origins of human cognitive traits are being transformed into empirically addressable questions, generating specific hypotheses that can be explicitly tested using data collected from both the natural world and experimental settings. In this article, I discuss a number of promising lines of research in this area. For example, the field has begun to identify genes implicated in speech and language skills, including not just disorders but also the normal range of abilities. Such genes provide powerful entry points for gaining insights into neural bases and evolutionary origins, using sophisticated experimental tools from molecular neuroscience and developmental neurobiology. At the same time, sequencing of ancient hominin genomes is giving us an unprecedented view of the molecular genetic changes that have occurred during the evolution of our species. Synthesis of data from these complementary sources offers an opportunity to robustly evaluate alternative accounts of language evolution. Of course, this endeavour remains challenging on many fronts, as I also highlight in the article. Nonetheless, such an integrated approach holds great potential for untangling the complexities of the capacities that make us human.
Walking the talk--speech activates the leg motor cortex.
Liuzzi, Gianpiero; Ellger, Tanja; Flöel, Agnes; Breitenstein, Caterina; Jansen, Andreas; Knecht, Stefan
2008-09-01
Speech may have evolved from earlier modes of communication based on gestures. Consistent with such a motor theory of speech, cortical orofacial and hand motor areas are activated by both speech production and speech perception. However, the extent of speech-related activation of the motor cortex remains unclear. Therefore, we examined if reading and listening to continuous prose also activates non-brachiofacial motor representations like the leg motor cortex. We found corticospinal excitability of bilateral leg muscle representations to be enhanced by speech production and silent reading. Control experiments showed that speech production yielded stronger facilitation of the leg motor system than non-verbal tongue-mouth mobilization and silent reading more than a visuo-attentional task thus indicating speech-specificity of the effect. In the frame of the motor theory of speech this finding suggests that the system of gestural communication, from which speech may have evolved, is not confined to the hand but includes gestural movements of other body parts as well.
"Who" is saying "what"? Brain-based decoding of human voice and speech.
Formisano, Elia; De Martino, Federico; Bonte, Milene; Goebel, Rainer
2008-11-07
Can we decipher speech content ("what" is being said) and speaker identity ("who" is saying it) from observations of brain activity of a listener? Here, we combine functional magnetic resonance imaging with a data-mining algorithm and retrieve what and whom a person is listening to from the neural fingerprints that speech and voice signals elicit in the listener's auditory cortex. These cortical fingerprints are spatially distributed and insensitive to acoustic variations of the input so as to permit the brain-based recognition of learned speech from unknown speakers and of learned voices from previously unheard utterances. Our findings unravel the detailed cortical layout and computational properties of the neural populations at the basis of human speech recognition and speaker identification.
Proceedings of the Second Joint Technology Workshop on Neural Networks and Fuzzy Logic, volume 2
NASA Technical Reports Server (NTRS)
Lea, Robert N. (Editor); Villarreal, James A. (Editor)
1991-01-01
Documented here are papers presented at the Neural Networks and Fuzzy Logic Workshop sponsored by NASA and the University of Texas, Houston. Topics addressed included adaptive systems, learning algorithms, network architectures, vision, robotics, neurobiological connections, speech recognition and synthesis, fuzzy set theory and application, control and dynamics processing, space applications, fuzzy logic and neural network computers, approximate reasoning, and multiobject decision making.
Nordberg, A; Miniscalco, C; Lohmander, A; Himmelmann, K
2013-02-01
To describe speech ability in a population-based study of children with cerebral palsy (CP), in relation to CP subtype, motor function, cognitive level and neuroimaging findings. A retrospective chart review of 129 children (66 girls, 63 boys) with CP, born in 1999-2002, was carried out. Speech ability and background information, such as type of CP, motor function, cognitive level and neuroimaging data, were collected and analysed. Speech disorders were found in 21% of the children and were present in all types of CP. Forty-one per cent of the children with speech disorders also had mental retardation, and 42% were able to walk independently. A further 32% of the children were nonverbal, and maldevelopment and basal ganglia lesions were most common in this group. The remaining 47% had no speech disorders, and this group was most likely to display white matter lesions of immaturity. More than half of the children in this CP cohort had a speech disorder (21%) or were nonverbal (32%). Speech ability was related to the type of CP, gross motor function, the presence of mental retardation and the localization of brain maldevelopment and lesions. Neuroimaging results differed between the three speech ability groups. ©2012 The Author(s)/Acta Paediatrica ©2012 Foundation Acta Paediatrica.
Creatine deficiency syndromes.
Schulze, Andreas
2013-01-01
The lack of creatine in the central nervous system causes a severe but treatable neurological disease. Three inherited defects, AGAT, GAMT, and CrT deficiency, compromising the synthesis and transport of creatine have been discovered recently. Together these so-called creatine deficiency syndromes (CDS) might represent the most frequent metabolic disorders with a primarily neurological phenotype. Patients with CDS present with global developmental delays, mental retardation, speech impairment especially affecting active language, seizures, extrapyramidal movement disorder, and autism spectrum disorder. The two defects in creatine synthesis, AGAT and GAMT, are autosomal recessive disorders. They can be diagnosed by analysis of creatine, guanidinoacetate, and creatinine in body fluids. Treatment is available and, especially when introduced in infancy, has a good outcome. The defect of creatine transport, CrT, is an X-linked condition and perhaps one of the most frequent reasons for X-linked mental retardation. Diagnosis is made by an increased ratio of creatine to creatinine in urine, but successful treatment still needs to be explored. CDS are under-diagnosed because they are easy to miss in a standard diagnostic workup. Because CDS represent a frequent and treatable cause of cognitive and neurological impairment, they warrant consideration in the workup for genetic mental retardation syndromes, for intractable seizure disorders, and for neurological diseases with a predominant lack of active speech. Copyright © 2013 Elsevier B.V. All rights reserved.
Panchapagesan, Sankaran; Alwan, Abeer
2011-01-01
In this paper, a quantitative study of acoustic-to-articulatory inversion for vowel speech sounds by analysis-by-synthesis using the Maeda articulatory model is performed. For chain matrix calculation of vocal tract (VT) acoustics, the chain matrix derivatives with respect to area function are calculated and used in a quasi-Newton method for optimizing articulatory trajectories. The cost function includes a distance measure between natural and synthesized first three formants, and parameter regularization and continuity terms. Calibration of the Maeda model to two speakers, one male and one female, from the University of Wisconsin x-ray microbeam (XRMB) database, using a cost function, is discussed. Model adaptation includes scaling the overall VT and the pharyngeal region and modifying the outer VT outline using measured palate and pharyngeal traces. The inversion optimization is initialized by a fast search of an articulatory codebook, which was pruned using XRMB data to improve inversion results. Good agreement between estimated midsagittal VT outlines and measured XRMB tongue pellet positions was achieved for several vowels and diphthongs for the male speaker, with average pellet-VT outline distances around 0.15 cm, smooth articulatory trajectories, and less than 1% average error in the first three formants. PMID:21476670
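A schematic of the analysis-by-synthesis inversion loop described above: a quasi-Newton search over articulatory parameters minimizing formant distance plus regularization and continuity terms. The forward model `synth_formants` is a hypothetical stand-in for the Maeda chain-matrix synthesis, and all weights are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def synth_formants(p):
    """Placeholder forward model; the real one chains VT chain matrices."""
    return np.array([500.0, 1500.0, 2500.0]) + 200.0 * p[:3]

def inversion_cost(p, target_f, p_prev, lam_reg=1e-3, lam_cont=1e-2):
    f = synth_formants(p)
    dist = np.sum(((f - target_f) / target_f) ** 2)   # formant distance
    reg = lam_reg * np.sum(p ** 2)                    # parameter regularization
    cont = lam_cont * np.sum((p - p_prev) ** 2)       # trajectory continuity
    return dist + reg + cont

target = np.array([600.0, 1200.0, 2400.0])  # measured F1-F3 (Hz), illustrative
p_prev = np.zeros(7)                        # previous frame's Maeda parameters
res = minimize(inversion_cost, p_prev, args=(target, p_prev), method="BFGS")
print(res.x)                                # estimated articulatory parameters
```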
[A research in speech endpoint detection based on boxes-coupling generalization dimension].
Wang, Zimei; Yang, Cuirong; Wu, Wei; Fan, Yingle
2008-06-01
In this paper, a new method of calculating the generalized dimension, based on a boxes-coupling principle, is proposed to overcome edge effects and to improve the capability of speech endpoint detection based on the original method of calculating the generalized dimension. The new method has been applied to speech endpoint detection. Firstly, the length of the overlapping border was determined and, by calculating the generalized dimension with the speech signal covered by overlapped boxes, three-dimensional feature vectors comprising the box dimension, the information dimension and the correlation dimension were obtained. Secondly, in light of the relation between feature distance and similarity degree, feature extraction was conducted using a common distance. Lastly, a bi-threshold method was used to classify the speech signals. The experimental results indicated that, in comparison with the original generalized dimension (OGD) and the spectral entropy (SE) algorithm, the proposed method is more robust and effective for detecting speech signals containing different kinds of noise at different signal-to-noise ratios (SNR), especially at low SNR.
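A hedged sketch of one of the three features, the box-counting (q = 0 generalized) dimension of a speech frame. The overlapped-box "coupling" is reduced here to a configurable box overlap, which is an illustrative simplification of the paper's scheme rather than its algorithm.

```python
import numpy as np

def box_counting_dim(x, sizes=(4, 8, 16, 32, 64), overlap=0.5):
    """Box-counting dimension of the waveform seen as a curve in the unit square."""
    x = (x - x.min()) / (np.ptp(x) + 1e-12)    # normalize amplitude to [0, 1]
    n = x.size
    counts, epss = [], []
    for s in sizes:
        eps = s / n                            # box side in normalized units
        step = max(1, int(s * (1 - overlap)))  # overlapped boxes along time
        total = 0
        for start in range(0, n - s + 1, step):
            seg = x[start:start + s]
            total += int(seg.max() / eps) - int(seg.min() / eps) + 1
        counts.append(total)
        epss.append(eps)
    # dimension = negative slope of log N(eps) against log(eps)
    return -np.polyfit(np.log(epss), np.log(counts), 1)[0]

frame = np.random.randn(256)                   # placeholder speech frame
print(box_counting_dim(frame))
```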
Audio visual speech source separation via improved context dependent association model
NASA Astrophysics Data System (ADS)
Kazemi, Alireza; Boostani, Reza; Sobhanmanesh, Fariborz
2014-12-01
In this paper, we exploit the non-linear relation between a speech source and its associated lip video as a source of extra information to propose an improved audio-visual speech source separation (AVSS) algorithm. The audio-visual association is modeled using a neural associator which estimates the visual lip parameters from a temporal context of acoustic observation frames. We define an objective function based on mean square error (MSE) measure between estimated and target visual parameters. This function is minimized for estimation of the de-mixing vector/filters to separate the relevant source from linear instantaneous or time-domain convolutive mixtures. We have also proposed a hybrid criterion which uses AV coherency together with kurtosis as a non-Gaussianity measure. Experimental results are presented and compared in terms of visually relevant speech detection accuracy and output signal-to-interference ratio (SIR) of source separation. The suggested audio-visual model significantly improves relevant speech classification accuracy compared to existing GMM-based model and the proposed AVSS algorithm improves the speech separation quality compared to reference ICA- and AVSS-based methods.
Robust estimators for speech enhancement in real environments
NASA Astrophysics Data System (ADS)
Sandoval-Ibarra, Yuma; Diaz-Ramirez, Victor H.; Kober, Vitaly
2015-09-01
Common statistical estimators for speech enhancement rely on several assumptions about the stationarity of speech signals and noise. These assumptions may not always be valid in real life due to the nonstationary characteristics of speech and noise processes. We propose new estimators, based on existing estimators, that incorporate the computation of rank-order statistics. The proposed estimators are better adapted to the non-stationary characteristics of speech signals and noise processes. Through computer simulations we show that the proposed estimators yield a better performance in terms of objective metrics than that of known estimators when speech signals are contaminated with airport, babble, restaurant, and train-station noise.
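A hedged sketch of what a rank-order-statistics component might look like: a per-bin running median (and MAD) over a sliding window of past magnitude spectra, used in place of a mean-based noise PSD estimate. The window length and the median+MAD combination are illustrative assumptions, not the proposed estimators themselves.

```python
import numpy as np
from scipy.signal import stft

def rank_order_noise_psd(x, fs, win_frames=40):
    f, t, X = stft(x, fs=fs, nperseg=512)
    mag2 = np.abs(X) ** 2
    noise = np.zeros_like(mag2)
    for m in range(mag2.shape[1]):
        past = mag2[:, max(0, m - win_frames):m + 1]
        med = np.median(past, axis=1)            # rank-order location estimate
        mad = np.median(np.abs(past - med[:, None]), axis=1)
        noise[:, m] = med + 1.4826 * mad         # robust noise-floor estimate
    return noise                                 # feed into a gain rule
```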
Development and Perceptual Evaluation of Amplitude-Based F0 Control in Electrolarynx Speech
ERIC Educational Resources Information Center
Saikachi, Yoko; Stevens, Kenneth N.; Hillman, Robert E.
2009-01-01
Purpose: Current electrolarynx (EL) devices produce a mechanical speech quality that has been largely attributed to the lack of natural fundamental frequency (F0) variation. In order to improve the quality of EL speech, in the present study the authors aimed to develop and evaluate an automatic F0 control scheme, in which F0 was modulated based on…
Constructing Adequate Non-Speech Analogues: What Is Special about Speech Anyway?
ERIC Educational Resources Information Center
Rosen, Stuart; Iverson, Paul
2007-01-01
Vouloumanos and Werker (2007) claim that human neonates have a (possibly innate) bias to listen to speech based on a preference for natural speech utterances over sine-wave analogues. We argue that this bias more likely arises from the strikingly different saliency of voice melody in the two kinds of sounds, a bias that has already been shown to…
ERIC Educational Resources Information Center
Ashtiani, Farshid Tayari; Zafarghandi, Amir Mahdavi
2015-01-01
The present study was an attempt to investigate the impact of English verbal songs on connected speech aspects of adult English learners' speech production. 40 participants were selected based on the results of their performance in a piloted and validated version of NELSON test given to 60 intermediate English learners in a language institute in…
ERIC Educational Resources Information Center
Koehlinger, Keegan M.
2015-01-01
Clinical Question: Would a preschool-aged child with childhood apraxia of speech (CAS) benefit from a singular approach--such as motor planning, sensory cueing, linguistic and rhythmic--or a combined approach in order to increase intelligibility of spoken language? Method: Systematic Review. Study Sources: ASHA Wire, Google Scholar, Speech Bite.…
Human phoneme recognition depending on speech-intrinsic variability.
Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger
2010-11-01
The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).
Everyday listeners' impressions of speech produced by individuals with adductor spasmodic dysphonia.
Nagle, Kathleen F; Eadie, Tanya L; Yorkston, Kathryn M
2015-01-01
Individuals with adductor spasmodic dysphonia (ADSD) have reported that unfamiliar communication partners appear to judge them as sneaky, nervous or not intelligent, apparently based on the quality of their speech; however, there is minimal research into the actual everyday perspective of listening to ADSD speech. The purpose of this study was to investigate the impressions of listeners hearing ADSD speech for the first time using a mixed-methods design. Everyday listeners were interviewed following sessions in which they made ratings of ADSD speech. A semi-structured interview approach was used and data were analyzed using thematic content analysis. Three major themes emerged: (1) everyday listeners make judgments about speakers with ADSD; (2) ADSD speech does not sound normal to everyday listeners; and (3) rating overall severity is difficult for everyday listeners. Participants described ADSD speech similarly to existing literature; however, some listeners inaccurately extrapolated speaker attributes based solely on speech samples. Listeners may draw erroneous conclusions about individuals with ADSD and these biases may affect the communicative success of these individuals. Results have implications for counseling individuals with ADSD, as well as the need for education and awareness about ADSD. Copyright © 2015 Elsevier Inc. All rights reserved.
Autonomic Correlates of Speech Versus Nonspeech Tasks in Children and Adults
Arnold, Hayley S.; MacPherson, Megan K.; Smith, Anne
2015-01-01
Purpose To assess autonomic arousal associated with speech and nonspeech tasks in school-age children and young adults. Method Measures of autonomic arousal (electrodermal level, electrodermal response amplitude, blood pulse volume, and heart rate) were recorded prior to, during, and after the performance of speech and nonspeech tasks by twenty 7- to 9-year-old children and twenty 18- to 22-year-old adults. Results Across age groups, autonomic arousal was higher for speech tasks compared with nonspeech tasks, based on peak electrodermal response amplitude and blood pulse volume. Children demonstrated greater relative arousal, based on heart rate and blood pulse volume, for nonspeech oral motor tasks than adults but showed similar mean arousal levels for speech tasks as adults. Children demonstrated sex differences in autonomic arousal; specifically, autonomic arousal remained high for school-age boys but not girls in a more complex open-ended narrative task that followed a simple sentence production task. Conclusions Speech tasks elicit greater autonomic arousal than nonspeech tasks, and children demonstrate greater autonomic arousal for nonspeech oral motor tasks than adults. Sex differences in autonomic arousal associated with speech tasks in school-age children are discussed relative to speech-language differences between boys and girls. PMID:24686989
Development of a Low-Cost, Noninvasive, Portable Visual Speech Recognition Program.
Kohlberg, Gavriel D; Gal, Ya'akov Kobi; Lalwani, Anil K
2016-09-01
Loss of speech following tracheostomy and laryngectomy severely limits communication to simple gestures and facial expressions that are largely ineffective. To facilitate communication in these patients, we seek to develop a low-cost, noninvasive, portable, and simple visual speech recognition program (VSRP) to convert articulatory facial movements into speech. A Microsoft Kinect-based VSRP was developed to capture spatial coordinates of lip movements and translate them into speech. The articulatory speech movements associated with 12 sentences were used to train an artificial neural network classifier. The accuracy of the classifier was then evaluated on a separate, previously unseen set of articulatory speech movements. The VSRP was successfully implemented and tested in 5 subjects. It achieved an accuracy rate of 77.2% (65.0%-87.6% for the 5 speakers) on a 12-sentence data set. The mean time to classify an individual sentence was 2.03 milliseconds (1.91-2.16). We have demonstrated the feasibility of a low-cost, noninvasive, portable VSRP based on Kinect to accurately predict speech from articulation movements in clinically trivial time. This VSRP could be used as a novel communication device for aphonic patients. © The Author(s) 2016.
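An illustrative stand-in for the VSRP classifier stage: a small neural network over flattened lip-landmark trajectories, one class per sentence. The Kinect capture and landmark extraction are outside this sketch, and the trial counts, frame counts, and landmark geometry are assumed.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

N_SENTENCES, REPS, FRAMES, LANDMARKS = 12, 10, 30, 18   # assumed geometry

# X: (trials, frames * landmarks * 3) flattened (x, y, z) lip coordinates
X = np.random.randn(N_SENTENCES * REPS, FRAMES * LANDMARKS * 3)
y = np.repeat(np.arange(N_SENTENCES), REPS)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.25)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X_tr, y_tr)
print("sentence accuracy:", clf.score(X_te, y_te))
```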
Simulation of talking faces in the human brain improves auditory speech recognition
von Kriegstein, Katharina; Dogan, Özgür; Grüter, Martina; Giraud, Anne-Lise; Kell, Christian A.; Grüter, Thomas; Kleinschmidt, Andreas; Kiebel, Stefan J.
2008-01-01
Human face-to-face communication is essentially audiovisual. Typically, people talk to us face-to-face, providing concurrent auditory and visual input. Understanding someone is easier when there is visual input, because visual cues like mouth and tongue movements provide complementary information about speech content. Here, we hypothesized that, even in the absence of visual input, the brain optimizes both auditory-only speech and speaker recognition by harvesting speaker-specific predictions and constraints from distinct visual face-processing areas. To test this hypothesis, we performed behavioral and neuroimaging experiments in two groups: subjects with a face recognition deficit (prosopagnosia) and matched controls. The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for this person. In both prosopagnosics and controls, behavioral improvement in auditory-only speech recognition was based on an area typically involved in face-movement processing. Improvement in speaker recognition was only present in controls and was based on an area involved in face-identity processing. These findings challenge current unisensory models of speech processing, because they show that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. We suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face. PMID:18436648
NASA Technical Reports Server (NTRS)
Sternfeld, H., Jr.; Doyle, L. B.
1978-01-01
The relationship between the internal noise environment of helicopters and the ability of personnel to understand commands and instructions was studied. A test program was conducted to relate speech intelligibility to a standard measurement called the Articulation Index. An acoustical simulator was used to provide noise environments typical of Army helicopters. Speech material (command sentences and phonetically balanced word lists) was presented at several voice levels in each helicopter environment. Recommended helicopter internal noise criteria, based on speech communication, were derived, and the effectiveness of hearing protection devices was evaluated.
Designing speech-based interfaces for telepresence robots for people with disabilities.
Tsui, Katherine M; Flynn, Kelsey; McHugh, Amelia; Yanco, Holly A; Kontak, David
2013-06-01
People with cognitive and/or motor impairments may benefit from using telepresence robots to engage in social activities. To date, these robots, their user interfaces, and their navigation behaviors have not been designed for operation by people with disabilities. We conducted an experiment in which participants (n=12) used a telepresence robot in a scavenger hunt task to determine how they would use speech to command the robot. Based upon the results, we present design guidelines for speech-based interfaces for telepresence robots.
An integrated approach to improving noisy speech perception
NASA Astrophysics Data System (ADS)
Koval, Serguei; Stolbov, Mikhail; Smirnova, Natalia; Khitrov, Mikhail
2002-05-01
For a number of practical purposes and tasks, experts have to decode speech recordings of very poor quality. A combination of techniques is proposed to improve intelligibility and quality of distorted speech messages and thus facilitate their comprehension. Along with the application of noise cancellation and speech signal enhancement techniques removing and/or reducing various kinds of distortions and interference (primarily unmasking and normalization in time and frequency fields), the approach incorporates optimal listener expert tactics based on selective listening, nonstandard binaural listening, accounting for short-term and long-term human ear adaptation to noisy speech, as well as some methods of speech signal enhancement to support speech decoding during listening. The approach integrating the suggested techniques ensures high-quality ultimate results and has successfully been applied by Speech Technology Center experts and by numerous other users, mainly forensic institutions, to perform noisy speech records decoding for courts, law enforcement and emergency services, accident investigation bodies, etc.
Asad, Areej Nimer; Purdy, Suzanne C; Ballard, Elaine; Fairgray, Liz; Bowen, Caroline
2018-04-27
In this descriptive study, phonological processes were examined in the speech of children aged 5;0-7;6 (years; months) with mild to profound hearing loss using hearing aids (HAs) and cochlear implants (CIs), in comparison to their peers. A second aim was to compare phonological processes of HA and CI users. Children with hearing loss (CWHL, N = 25) were compared to children with normal hearing (CWNH, N = 30) with similar age, gender, linguistic, and socioeconomic backgrounds. Speech samples obtained from a list of 88 words, derived from three standardized speech tests, were analyzed using the CASALA (Computer Aided Speech and Language Analysis) program to evaluate participants' phonological systems, based on lax (a process appeared at least twice in the speech of at least two children) and strict (a process appeared at least five times in the speech of at least two children) counting criteria. Developmental phonological processes were eliminated in the speech of younger and older CWNH while eleven developmental phonological processes persisted in the speech of both age groups of CWHL. CWHL showed a similar trend of age of elimination to CWNH, but at a slower rate. Children with HAs and CIs produced similar phonological processes. Final consonant deletion, weak syllable deletion, backing, and glottal replacement were present in the speech of HA users, affecting their overall speech intelligibility. Developmental and non-developmental phonological processes persist in the speech of children with mild to profound hearing loss compared to their peers with typical hearing. The findings indicate that it is important for clinicians to consider phonological assessment in pre-school CWHL and the use of evidence-based speech therapy in order to reduce non-developmental and non-age-appropriate developmental processes, thereby enhancing their speech intelligibility. Copyright © 2018 Elsevier Inc. All rights reserved.
Speech-Like Rhythm in a Voiced and Voiceless Orangutan Call
Lameira, Adriano R.; Hardus, Madeleine E.; Bartlett, Adrian M.; Shumaker, Robert W.; Wich, Serge A.; Menken, Steph B. J.
2015-01-01
The evolutionary origins of speech remain obscure. Recently, it was proposed that speech derived from monkey facial signals which exhibit a speech-like rhythm of ∼5 open-close lip cycles per second. In monkeys, these signals may also be vocalized, offering a plausible evolutionary stepping stone towards speech. Three essential predictions remain, however, to be tested to assess this hypothesis' validity: (i) Great apes, our closest relatives, should likewise produce 5 Hz-rhythm signals, (ii) speech-like rhythm should involve calls articulatorily similar to consonants and vowels given that speech rhythm is the direct product of stringing together these two basic elements, and (iii) speech-like rhythm should be experience-based. Via cinematic analyses we demonstrate that an ex-entertainment orangutan produces two calls at a speech-like rhythm, coined “clicks” and “faux-speech.” Like voiceless consonants, clicks required no vocal fold action, but did involve independent manoeuvring over lips and tongue. In parallel to vowels, faux-speech showed harmonic and formant modulations, implying vocal fold and supralaryngeal action. This rhythm was several times faster than orangutan chewing rates, as observed in monkeys and humans. Critically, this rhythm was seven-fold faster than, and contextually distinct from, any other known rhythmic calls described to date in the largest database of the orangutan repertoire ever assembled. The first two predictions advanced by this study are validated and, based on parsimony and exclusion of potential alternative explanations, initial support is given to the third prediction. Irrespective of the putative origins of these calls and underlying mechanisms, our findings demonstrate irrevocably that great apes are not respiratorily, articulatorily, or neurologically constrained for the production of consonant- and vowel-like calls at speech rhythm. Orangutan clicks and faux-speech confirm the importance of rhythmic speech antecedents within the primate lineage, and highlight potential articulatory homologies between great ape calls and human consonants and vowels. PMID:25569211
A Window into the Intoxicated Mind? Speech as an Index of Psychoactive Drug Effects
Bedi, Gillinder; Cecchi, Guillermo A; Slezak, Diego F; Carrillo, Facundo; Sigman, Mariano; de Wit, Harriet
2014-01-01
Abused drugs can profoundly alter mental states in ways that may motivate drug use. These effects are usually assessed with self-report, an approach that is vulnerable to biases. Analyzing speech during intoxication may present a more direct, objective measure, offering a unique ‘window' into the mind. Here, we employed computational analyses of speech semantic and topological structure after ±3,4-methylenedioxymethamphetamine (MDMA; ‘ecstasy') and methamphetamine in 13 ecstasy users. In 4 sessions, participants completed a 10-min speech task after MDMA (0.75 and 1.5 mg/kg), methamphetamine (20 mg), or placebo. Latent Semantic Analyses identified the semantic proximity between speech content and concepts relevant to drug effects. Graph-based analyses identified topological speech characteristics. Group-level drug effects on semantic distances and topology were assessed. Machine-learning analyses (with leave-one-out cross-validation) assessed whether speech characteristics could predict drug condition in the individual subject. Speech after MDMA (1.5 mg/kg) had greater semantic proximity than placebo to the concepts friend, support, intimacy, and rapport. Speech on MDMA (0.75 mg/kg) had greater proximity to empathy than placebo. Conversely, speech on methamphetamine was further from compassion than placebo. Classifiers discriminated between MDMA (1.5 mg/kg) and placebo with 88% accuracy, and MDMA (1.5 mg/kg) and methamphetamine with 84% accuracy. For the two MDMA doses, the classifier performed at chance. These data suggest that automated semantic speech analyses can capture subtle alterations in mental state, accurately discriminating between drugs. The findings also illustrate the potential for automated speech-based approaches to characterize clinically relevant alterations to mental state, including those occurring in psychiatric illness. PMID:24694926
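A hedged sketch of the semantic-proximity step: an LSA space built with TF-IDF plus truncated SVD, then cosine similarity between a speech transcript and a concept word such as "empathy". The corpus, transcript, and component count are placeholders, not the study's materials.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["placeholder background documents used to learn the semantic space"]
transcript = "we talked about friends and feeling close and supported"
concept = "empathy"

docs = corpus + [transcript, concept]
vec = TfidfVectorizer().fit(docs)
lsa = TruncatedSVD(n_components=2).fit(vec.transform(docs))

def embed(text):
    # Project a text into the low-rank LSA space.
    return lsa.transform(vec.transform([text]))

proximity = cosine_similarity(embed(transcript), embed(concept))[0, 0]
print("semantic proximity to 'empathy':", proximity)
```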
The Role of Visual Image and Perception in Speech Development of Children with Speech Pathology
ERIC Educational Resources Information Center
Tsvetkova, L. S.; Kuznetsova, T. M.
1977-01-01
The assumption that the naming function can be underdeveloped because of defects in the word's gnostic base was investigated with 125 children (4-14 years old) with speech, language, or emotional disorders. (Author/DB)
Speech deterioration in amyotrophic lateral sclerosis (ALS) after manifestation of bulbar symptoms.
Makkonen, Tanja; Ruottinen, Hanna; Puhto, Riitta; Helminen, Mika; Palmio, Johanna
2018-03-01
The symptoms and their progression in amyotrophic lateral sclerosis (ALS) are typically studied after the diagnosis has been confirmed. However, many people with ALS already have severe dysarthria and loss of adequate speech at the time of diagnosis. Speech-and-language therapy interventions should be targeted in a timely manner, based on communicative need in ALS. The aims were to investigate how long natural speech remains functional and to identify the changes in the speech of persons with ALS. Altogether 30 consecutive participants were studied and divided into two groups based on the initial type of ALS, bulbar or spinal. Their speech disorder was evaluated on severity, articulation rate and intelligibility during the 2-year follow-up. The ability to speak deteriorated to the point of necessitating augmentative and alternative communication (AAC) methods for 60% of the participants. Their speech remained adequate on average for 18 months from the first bulbar symptom. Severity, articulation rate and intelligibility declined in nearly all participants during the study. To begin with, speech deteriorated more in the bulbar group than in the spinal group, and the difference remained throughout the follow-up, with some exceptions. The onset of bulbar symptoms indicated the time to loss of speech better than the ALS diagnosis or the first speech therapy evaluation. In clinical work, it is important to take the initial type of ALS into consideration when determining the urgency of AAC measures, as people with bulbar-onset ALS are more susceptible to delayed evaluation and AAC intervention. © 2017 Royal College of Speech and Language Therapists.
Kim, Soo Ji; Jo, Uiri
2013-01-01
Based on the anatomical and functional commonality between singing and speech, various types of musical elements have been employed in music therapy research for speech rehabilitation. This study aimed to develop an accent-based music speech protocol to address voice problems of stroke patients with mixed dysarthria. Subjects were 6 stroke patients with mixed dysarthria who received individual music therapy sessions. Each session was conducted for 30 minutes, and 12 sessions including pre- and post-test were administered for each patient. For examining the protocol's efficacy, the measures of maximum phonation time (MPT), fundamental frequency (F0), average intensity (dB), jitter, shimmer, noise-to-harmonics ratio (NHR), and diadochokinesis (DDK) were compared between pre- and post-test and analyzed with a paired sample t-test. The results showed that the measures of MPT, F0, dB, and sequential motion rates (SMR) were significantly increased after administering the protocol. Also, there were statistically significant differences in the measures of shimmer and alternating motion rates (AMR) of the syllable /kʌ/ between pre- and post-test. The results indicated that the accent-based music speech protocol may improve speech motor coordination including respiration, phonation, articulation, resonance, and prosody of patients with dysarthria. This suggests the possibility of utilizing the music speech protocol to maximize immediate treatment effects in the course of a long-term treatment for patients with dysarthria.
Kraaijenga, S A C; Oskam, I M; van Son, R J J H; Hamming-Vrieze, O; Hilgers, F J M; van den Brekel, M W M; van der Molen, L
2016-04-01
Assessment of long-term objective and subjective voice, speech, articulation, and quality of life in patients with head and neck cancer (HNC) treated with concurrent chemoradiotherapy (CRT) for advanced, stage IV disease. Twenty-two disease-free survivors, treated with cisplatin-based CRT for inoperable HNC (1999-2004), were evaluated 10 years post-treatment. A standard Dutch text was recorded. Perceptual analysis of voice, speech, and articulation was conducted by two expert listeners (SLPs). An experimental expert system based on automatic speech recognition was also used. Patients' perception of voice and speech and related quality of life was assessed with the Voice Handicap Index (VHI) and Speech Handicap Index (SHI) questionnaires. At a median follow-up of 11 years, perceptual evaluation showed abnormal scores in up to 64% of cases, depending on the outcome parameter analyzed. Automatic assessment of voice and speech parameters correlated moderately to strongly with perceptual outcome scores. Patient-reported problems with voice (VHI>15) and speech (SHI>6) in daily life were present in 68% and 77% of patients, respectively. Patients treated with IMRT showed significantly less impairment than those treated with conventional radiotherapy. More than 10 years after organ-preservation treatment, voice and speech problems are common in this patient cohort, as assessed with perceptual evaluation, automatic speech recognition, and validated structured questionnaires. There were fewer complaints in patients treated with IMRT than with conventional radiotherapy. Copyright © 2016 Elsevier Ltd. All rights reserved.
Two different phenomena in basic motor speech performance in premanifest Huntington disease.
Skodda, Sabine; Grönheit, Wenke; Lukas, Carsten; Bellenberg, Barbara; von Hein, Sarah M; Hoffmann, Rainer; Saft, Carsten
2016-03-09
Dysarthria is a common feature in Huntington disease (HD). The aim of this cross-sectional pilot study was the description and objective analysis of different speech parameters, with special emphasis on the timing of connected speech and nonspeech verbal utterances, in premanifest HD (preHD). A total of 28 preHD mutation carriers and 28 age- and sex-matched healthy speakers performed a reading task and several syllable repetition tasks. Results of computerized acoustic analysis of different variables for the measurement of speech rate and regularity were correlated with clinical measures and MRI-based brain atrophy assessment by voxel-based morphometry. Compared with healthy controls, preHD carriers showed an impaired capacity to repeat single syllables steadily, with higher variability (variance 1: Cohen d = 1.46). Notably, speech rate was increased compared to controls and correlated with the volume of certain brain areas known to be involved in sensory-motor speech networks (net speech rate: Cohen d = 1.19). Furthermore, speech rate correlated with disease burden score, probability of disease onset, estimated years to onset, and clinical measures such as the cognitive score. Measurement of speech rate and regularity might be a helpful additional tool for monitoring subclinical functional disability in preHD. As one possible cause of the higher performance in preHD, we discuss huntingtin-dependent, temporarily advantageous developmental processes of the brain. © 2016 American Academy of Neurology.
Speech recognition: Acoustic-phonetic knowledge acquisition and representation
NASA Astrophysics Data System (ADS)
Zue, Victor W.
1988-09-01
The long-term research goal is to develop and implement speaker-independent continuous speech recognition systems. It is believed that the proper utilization of speech-specific knowledge is essential for such advanced systems. This research is thus directed toward the acquisition, quantification, and representation of acoustic-phonetic and lexical knowledge, and the application of this knowledge to speech recognition algorithms. In addition, we are exploring new speech recognition alternatives based on artificial intelligence and connectionist techniques. We developed a statistical model for predicting the acoustic realization of stop consonants in various positions in the syllable template. A unification-based grammatical formalism was developed for incorporating this model into the lexical access algorithm. We provided an information-theoretic justification for the hierarchical structure of the syllable template. We analyzed segmental durations for vowels and fricatives in continuous speech. Based on contextual information, we developed durational models for vowels and fricatives that account for over 70 percent of the variance, using data from multiple, unknown speakers. We rigorously evaluated the ability of human spectrogram readers to identify stop consonants spoken by many talkers and in a variety of phonetic contexts. Incorporating the declarative knowledge used by the readers, we developed a knowledge-based system for stop identification and achieved system performance comparable to that of the readers.
Noise and pitch interact during the cortical segregation of concurrent speech.
Bidelman, Gavin M; Yellamsetty, Anusha
2017-08-01
Behavioral studies reveal listeners exploit intrinsic differences in voice fundamental frequency (F0) to segregate concurrent speech sounds, the so-called "F0-benefit." More favorable signal-to-noise ratio (SNR) in the environment, an extrinsic acoustic factor, similarly benefits the parsing of simultaneous speech. Here, we examined the neurobiological substrates of these two cues in the perceptual segregation of concurrent speech mixtures. We recorded event-related brain potentials (ERPs) while listeners performed a speeded double-vowel identification task. Listeners heard two concurrent vowels whose F0 differed by zero or four semitones presented in either clean (no noise) or noise-degraded (+5 dB SNR) conditions. Behaviorally, listeners were more accurate in correctly identifying both vowels for larger F0 separations, but the F0-benefit was more pronounced at more favorable SNRs (i.e., F0 × SNR interaction). Analysis of the ERPs revealed that only the P2 wave (∼200 ms) showed a similar F0 × SNR interaction as behavior and was correlated with listeners' perceptual F0-benefit. Neural classifiers applied to the ERPs further suggested that speech sounds are segregated neurally within 200 ms based on SNR, whereas segregation based on pitch occurs later in time (400-700 ms). The earlier timing of extrinsic SNR-based compared to intrinsic F0-based segregation implies that the cortical extraction of speech from noise is more efficient than differentiating speech based on pitch cues alone, which may recruit additional cortical processes. Findings indicate that noise and pitch differences interact relatively early in cerebral cortex and that the brain arrives at the identities of concurrent speech mixtures as early as ∼200 ms. Copyright © 2017 Elsevier B.V. All rights reserved.
Auditory-Perceptual Learning Improves Speech Motor Adaptation in Children
Shiller, Douglas M.; Rochon, Marie-Lyne
2015-01-01
Auditory feedback plays an important role in children's speech development by providing the child with information about speech outcomes that is used to learn and fine-tune speech motor plans. The use of auditory feedback in speech motor learning has been extensively studied in adults by examining oral motor responses to manipulations of auditory feedback during speech production. Children are also capable of adapting speech motor patterns to perceived changes in auditory feedback; however, it is not known whether their capacity for motor learning is limited by immature auditory-perceptual abilities. Here, the link between speech perceptual ability and the capacity for motor learning was explored in two groups of 5-7-year-old children who underwent a period of auditory perceptual training followed by tests of speech motor adaptation to altered auditory feedback. One group received perceptual training on a speech acoustic property relevant to the motor task while a control group received perceptual training on an irrelevant speech contrast. Learned perceptual improvements led to an enhancement in speech motor adaptation (proportional to the perceptual change) only for the experimental group. The results indicate that children's ability to perceive relevant speech acoustic properties has a direct influence on their capacity for sensory-based speech motor adaptation. PMID:24842067
Başkent, Deniz; Eiler, Cheryl L; Edwards, Brent
2007-06-01
To present a comprehensive analysis of the feasibility of genetic algorithms (GA) for finding the best fit of hearing aids or cochlear implants for individual users in clinical or research settings, where the algorithm is solely driven by subjective human input. Due to varying pathology, the best settings of an auditory device differ for each user. It is also likely that listening preferences vary at the same time. The settings of a device customized for a particular user can only be evaluated by the user. When optimization algorithms are used for fitting purposes, this situation poses a difficulty for a systematic and quantitative evaluation of the suitability of the fitting parameters produced by the algorithm. In the present study, an artificial listening environment was generated by distorting speech using a noiseband vocoder. The settings produced by the GA for this listening problem could objectively be evaluated by measuring speech recognition and comparing the performance to the best vocoder condition where speech was least distorted. Nine normal-hearing subjects participated in the study. The parameters to be optimized were the number of vocoder channels, the shift between the input frequency range and the synthesis frequency range, and the compression-expansion of the input frequency range over the synthesis frequency range. The subjects listened to pairs of sentences processed with the vocoder, and entered a preference for the sentence with better intelligibility. The GA modified the solutions iteratively according to the subject preferences. The program converged when the user ranked the same set of parameters as the best in three consecutive steps. The results produced by the GA were analyzed for quality by measuring speech intelligibility, for test-retest reliability by running the GA three times with each subject, and for convergence properties. Speech recognition scores averaged across subjects were similar for the best vocoder solution and for the solutions produced by the GA. The average number of iterations was 8 and the average convergence time was 25.5 minutes. The settings produced by different GA runs for the same subject were slightly different; however, speech recognition scores measured with these settings were similar. Individual data from subjects showed that in each run, a small number of GA solutions produced poorer speech intelligibility than for the best setting. This was probably a result of the combination of the inherent randomness of the GA, the convergence criterion used in the present study, and possible errors that the users might have made during the paired comparisons. On the other hand, the effect of these errors was probably small compared to the other two factors, as a comparison between subjective preferences and objective measures showed that for many subjects the two were in good agreement. The results showed that the GA was able to produce good solutions by using listener preferences in a relatively short time. For practical applications, the program can be made more robust by running the GA twice or by not using an automatic stopping criterion, and it can be made faster by optimizing the number of the paired comparisons completed in each iteration.
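As a concrete illustration of the preference-driven loop described above, here is a minimal Python sketch of a genetic algorithm whose only fitness signal is paired listener comparisons. The parameter names and ranges are illustrative assumptions, not the study's actual values, and `prefer` stands in for the listener's button press (or any simulated intelligibility judgment):

```python
import random

# Illustrative vocoder parameter ranges (assumptions, not the study's values).
RANGES = {"channels": (2, 16), "shift": (-6.0, 6.0), "compression": (0.5, 2.0)}

def random_setting():
    return {k: random.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}

def crossover(a, b):
    return {k: random.choice((a[k], b[k])) for k in RANGES}

def mutate(s, rate=0.3):
    out = dict(s)
    for k, (lo, hi) in RANGES.items():
        if random.random() < rate:
            out[k] = min(hi, max(lo, out[k] + random.gauss(0, 0.1 * (hi - lo))))
    return out

def run_ga(prefer, pop_size=8, max_iters=50):
    """prefer(a, b) returns whichever setting the listener judged more
    intelligible after hearing a sentence pair. Converges when the same
    setting tops three consecutive iterations, as in the study."""
    population = [random_setting() for _ in range(pop_size)]
    history = []
    for _ in range(max_iters):
        random.shuffle(population)
        winners = [prefer(a, b) for a, b in zip(population[0::2], population[1::2])]
        best = winners[0]
        for w in winners[1:]:            # playoff among the round winners
            best = prefer(best, w)
        history.append(best)
        if len(history) >= 3 and history[-1] == history[-2] == history[-3]:
            break
        children = [mutate(crossover(*random.sample(winners, 2)))
                    for _ in range(pop_size - len(winners))]
        population = winners + children
    return history[-1]
```

As the abstract itself suggests, dropping the automatic stopping criterion or running the loop twice trades time for robustness against occasional listener errors.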
NASA Astrophysics Data System (ADS)
Pishravian, Arash; Aghabozorgi Sahaf, Masoud Reza
2012-12-01
In this paper, speech-music separation using blind source separation is discussed. The separation algorithm is based on mutual information minimization, with the natural gradient algorithm used for the minimization. This requires estimating the score function from samples of the observed signals (the mixture of speech and music). The accuracy and speed of this estimation affect both the quality of the separated signals and the processing time of the algorithm. In the presented algorithm, the score function is estimated by Gaussian-mixture-based kernel density estimation. Experimental results on speech-music separation, compared with a separation algorithm based on the minimum mean square error estimator, indicate better performance and shorter processing time.
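A minimal sketch of the natural-gradient update for mutual-information-based separation may clarify the method. The score function is approximated here by a fixed tanh nonlinearity as a stand-in; the paper instead estimates it from the observations with Gaussian-mixture kernel density estimation:

```python
import numpy as np

def natural_gradient_separation(x, lr=0.01, iters=500):
    """Blind source separation by mutual-information minimization.
    x: (n_sources, n_samples) array of mixed signals (e.g., speech + music).
    Applies the natural-gradient update dW = lr * (I - phi(y) y^T / N) W.
    phi is a tanh stand-in for the data-driven score function of the paper."""
    n, N = x.shape
    W = np.eye(n)
    for _ in range(iters):
        y = W @ x
        phi = np.tanh(y)                       # stand-in score function
        dW = (np.eye(n) - phi @ y.T / N) @ W   # natural gradient of MI
        W += lr * dW
    return W @ x, W
```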
Noise-robust speech recognition through auditory feature detection and spike sequence decoding.
Schafer, Phillip B; Jin, Dezhe Z
2014-03-01
Speech recognition in noisy conditions is a major challenge for computer systems, but the human brain performs it routinely and accurately. Automatic speech recognition (ASR) systems that are inspired by neuroscience can potentially bridge the performance gap between humans and machines. We present a system for noise-robust isolated word recognition that works by decoding sequences of spikes from a population of simulated auditory feature-detecting neurons. Each neuron is trained to respond selectively to a brief spectrotemporal pattern, or feature, drawn from the simulated auditory nerve response to speech. The neural population conveys the time-dependent structure of a sound by its sequence of spikes. We compare two methods for decoding the spike sequences: one using a hidden Markov model-based recognizer, the other using a novel template-based recognition scheme. In the latter case, words are recognized by comparing their spike sequences to template sequences obtained from clean training data, using a similarity measure based on the length of the longest common subsequence. Using isolated spoken digits from the AURORA-2 database, we show that our combined system outperforms a state-of-the-art robust speech recognizer at low signal-to-noise ratios. Both the spike-based encoding scheme and the template-based decoding offer gains in noise robustness over traditional speech recognition methods. Our system highlights potential advantages of spike-based acoustic coding and provides a biologically motivated framework for robust ASR development.
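The template-based decoder hinges on the longest-common-subsequence (LCS) similarity. A small sketch, assuming spike sequences are represented as lists of feature-neuron labels and using a length-normalized LCS score (the paper's exact normalization may differ):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two label sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ai in enumerate(a, 1):
        for j, bj in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ai == bj else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def classify(spikes, templates):
    """templates: dict mapping word -> list of clean-speech spike sequences.
    Returns the word whose templates are most similar to the input."""
    def sim(a, b):
        return lcs_length(a, b) / max(len(a), len(b), 1)  # normalized LCS
    return max(templates, key=lambda w: max(sim(spikes, t) for t in templates[w]))
```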
Dimension-Based Statistical Learning Affects Both Speech Perception and Production
ERIC Educational Resources Information Center
Lehet, Matthew; Holt, Lori L.
2017-01-01
Multiple acoustic dimensions signal speech categories. However, dimensions vary in their informativeness; some are more diagnostic of category membership than others. Speech categorization reflects these dimensional regularities such that diagnostic dimensions carry more "perceptual weight" and more effectively signal category membership…
Performing speech recognition research with HyperCard
NASA Technical Reports Server (NTRS)
Shepherd, Chip
1993-01-01
The purpose of this paper is to describe a HyperCard-based system for performing speech recognition research and to instruct Human Factors professionals on how to use the system to obtain detailed data about the user interface of a prototype speech recognition application.
Asynchronous sampling of speech with some vocoder experimental results
NASA Technical Reports Server (NTRS)
Babcock, M. L.
1972-01-01
The method of asynchronously sampling speech is based upon the derivatives of the acoustical speech signal. The following results are apparent from experiments to date: (1) It is possible to represent speech by a string of pulses of uniform amplitude, where the only information contained in the string is the spacing of the pulses in time; (2) the string of pulses may be produced in a simple analog manner; (3) the first derivative of the original speech waveform is the most important for the encoding process; (4) the resulting pulse train can be utilized to control an acoustical signal production system to regenerate the intelligence of the original speech.
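One plausible digital reading of results (1)-(3), offered only as an illustration: since the first derivative is key to the encoding, uniform pulses can be emitted where the derivative changes sign (i.e., at waveform extrema), so that all information lies in the inter-pulse spacing. This is an assumption about the scheme, not a reconstruction of the original analog circuit:

```python
import numpy as np

def asynchronous_pulses(speech, fs):
    """Emit uniform-amplitude pulses at sign changes of the first derivative
    of the waveform (local extrema). The resulting pulse train carries
    information only in its timing, as in the experiments described above.
    Returns pulse times in seconds."""
    d = np.diff(np.asarray(speech, dtype=float))
    crossings = np.where(np.signbit(d[:-1]) != np.signbit(d[1:]))[0] + 1
    return crossings / fs
```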
Faulkner, Andrew; Rosen, Stuart; Green, Tim
2012-10-01
Two experimental groups were trained for 2 h with live or recorded speech that was noise-vocoded and spectrally shifted, drawn from the same text and talker. These two groups showed equivalent improvements in performance for vocoded and shifted sentences, and the group trained with recorded speech showed consistently greater improvements than untrained controls. Another group, trained with unshifted noise-vocoded speech, improved no more than untrained controls. Computer-based training thus appears at least as effective as labor-intensive live-voice training for improving the perception of spectrally shifted noise-vocoded speech and, by implication, for the training of users of cochlear implants.
SAM: speech-aware applications in medicine to support structured data entry.
Wormek, A. K.; Ingenerf, J.; Orthner, H. F.
1997-01-01
In the last two years, improvement in speech recognition technology has directed the medical community's interest to porting and using such innovations in clinical systems. The acceptance of speech recognition systems in clinical domains increases with recognition speed, a large medical vocabulary, high accuracy, continuous speech recognition, and speaker independence. Although some commercial speech engines approach these requirements, the greatest benefit can be achieved by adapting a speech recognizer to a specific medical application. Our work has two goals: first, to develop a speech-aware core component able to establish connections to speech recognition engines from different vendors; this is realized in SAM. Second, with applications based on SAM, we want to support the physician in his or her routine clinical care activities. Within the STAMP project (STAndardized Multimedia report generator in Pathology), we extend SAM by combining a structured data entry approach with speech recognition technology. Another speech-aware application, in the field of diabetes care, is connected to a terminology server; the server delivers a controlled vocabulary that can be used for speech recognition. PMID:9357730
Audibility-based predictions of speech recognition for children and adults with normal hearing.
McCreery, Ryan W; Stelmachowicz, Patricia G
2011-12-01
This study investigated the relationship between audibility and predictions of speech recognition for children and adults with normal hearing. The Speech Intelligibility Index (SII) is used to quantify the audibility of speech signals and can be applied to transfer functions to predict speech recognition scores. Although the SII is used clinically with children, relatively few studies have evaluated SII predictions of children's speech recognition directly. Children have required more audibility than adults to reach maximum levels of speech understanding in previous studies. Furthermore, children may require greater bandwidth than adults for optimal speech understanding, which could influence frequency-importance functions used to calculate the SII. Speech recognition was measured for 116 children and 19 adults with normal hearing. Stimulus bandwidth and background noise level were varied systematically in order to evaluate speech recognition as predicted by the SII and derive frequency-importance functions for children and adults. Results suggested that children required greater audibility to reach the same level of speech understanding as adults. However, differences in performance between adults and children did not vary across frequency bands. © 2011 Acoustical Society of America
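For readers unfamiliar with the SII computation referenced here, a simplified sketch follows: band audibility maps the band SNR linearly from -15 to +15 dB onto [0, 1] and is weighted by a band-importance function (after ANSI S3.5-1997; the level-distortion and masking terms of the full standard are omitted). The four-band example values are illustrative only:

```python
def speech_intelligibility_index(snr_db, importance):
    """Simplified SII: sum of band-importance weights times band audibility.
    snr_db, importance: equal-length sequences over frequency bands;
    the importance weights should sum to 1."""
    sii = 0.0
    for snr, weight in zip(snr_db, importance):
        audibility = min(1.0, max(0.0, (snr + 15.0) / 30.0))  # clip to [0, 1]
        sii += weight * audibility
    return sii

# Example: a flat +3 dB SNR across four equally important bands -> SII = 0.6
print(speech_intelligibility_index([3, 3, 3, 3], [0.25, 0.25, 0.25, 0.25]))
```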
Impact of a Moving Noise Masker on Speech Perception in Cochlear Implant Users
Weissgerber, Tobias; Rader, Tobias; Baumann, Uwe
2015-01-01
Objectives: Previous studies investigating speech perception in noise have typically been conducted with static masker positions. The aim of this study was to investigate the effect of spatial separation of source and masker (spatial release from masking, SRM) in a moving masker setup and to evaluate the impact of adaptive beamforming in comparison with fixed directional microphones in cochlear implant (CI) users. Design: Speech reception thresholds (SRT) were measured in S0N0 and in a moving masker setup (S0Nmove) in 12 normal hearing participants and 14 CI users (7 subjects bilateral, 7 bimodal with a hearing aid in the contralateral ear). Speech processor settings were a moderately directional microphone, a fixed beamformer, or an adaptive beamformer. The moving noise source was generated by means of wave field synthesis and was smoothly moved in the shape of a half-circle from one ear to the contralateral ear. Noise was presented in either of two conditions: continuous or modulated. Results: SRTs in the S0Nmove setup were significantly improved compared to the S0N0 setup for both the normal hearing control group and the bilateral group in continuous noise, and for the control group in modulated noise. There was no effect of subject group. A significant effect of directional sensitivity was found in the S0Nmove setup. In the bilateral group, the adaptive beamformer achieved lower SRTs than the fixed beamformer setting. Adaptive beamforming improved SRT substantially in both CI user groups, by about 3 dB (bimodal group) and 8 dB (bilateral group) depending on masker type. Conclusions: CI users showed SRM comparable to that of normal hearing subjects. In everyday listening situations with spatial separation of source and masker, directional microphones significantly improved speech perception, with individual improvements of up to 15 dB SNR. Users of bilateral speech processors with both directional microphones obtained the highest benefit. PMID:25970594
Kurup, Ravi Kumar; Kurup, Parameswara Achutha
2003-06-01
The isoprenoid pathway produces three key metabolites: endogenous digoxin, dolichol, and ubiquinone. Since endogenous digoxin can regulate neurotransmitter transport and dolichols can modulate glycoconjugate synthesis important in synaptic connectivity, the pathway was assessed in patients with dyslexia, delayed recovery from global aphasia consequent to a dominant hemispheric thrombotic infarct, and developmental delay of the speech milestone. The pathway was also studied in right hemispheric, left hemispheric, and bihemispheric dominance to determine the role of hemispheric dominance in the pathogenesis of speech disorders. Plasma/serum activity of HMG CoA reductase; levels of magnesium, digoxin, dolichol, and ubiquinone; tryptophan/tyrosine catabolic patterns; and RBC (Na+)-K+ ATPase activity were measured in the above-mentioned groups. Glycoconjugate metabolism and membrane composition were also studied. The study showed that in dyslexia, developmental delay of the speech milestone, and delayed recovery from global aphasia there was an upregulated isoprenoid pathway with increased digoxin and dolichol levels. Membrane (Na+)-K+ ATPase activity and serum magnesium and ubiquinone levels were low. Tryptophan catabolites were increased and tyrosine catabolites, including dopamine, were decreased in the serum, contributing to speech dysfunction. There was an increase in the carbohydrate residues of glycoproteins, glycosaminoglycans, and glycolipids, as well as increased activity of GAG-degrading enzymes and glycohydrolases in the serum. The cholesterol:phospholipid ratio of the RBC membrane increased, and membrane glycoconjugates decreased. All of these could contribute to altered synaptic activity in these disorders. The patterns correlated with those obtained in right hemispheric chemical dominance, which may therefore play a role in the genesis of these disorders. Hemispheric chemical dominance has no correlation with handedness or the dichotic listening test.
Yang, Ying; Liu, Yue-Hui; Fu, Ming-Fu; Li, Chun-Lin; Wang, Li-Yan; Wang, Qi; Sun, Xi-Bin
2015-08-20
Data on early auditory and speech development in home-based early intervention for infants and toddlers with hearing loss younger than 2 years are still sparse in China. This study aimed to observe the development of audition and speech in deaf infants and toddlers who were fitted with hearing aids and/or received cochlear implantation between the chronological ages of 7-24 months, and to analyze the effect of chronological age and habilitation time on auditory and speech development in the course of home-based early intervention. This longitudinal study included 55 hearing-impaired children with severe and profound binaural deafness, who were divided into Group A (7-12 months), Group B (13-18 months), and Group C (19-24 months) based on chronological age. The Categories of Auditory Performance (CAP) and the Speech Intelligibility Rating scale (SIR) were used to evaluate auditory and speech development at baseline and at 3, 6, 9, 12, 18, and 24 months of habilitation. Descriptive statistics were used for demographic features, and outcomes were analyzed by repeated-measures analysis of variance. With 24 months of hearing intervention, 78% of the patients were able to understand common phrases and conversation without lip-reading, and 96% were intelligible to a listener. All three groups showed rapid growth in each period of habilitation. CAP and SIR scores developed rapidly within 24 months after device fitting in Group A, which showed much better auditory and speech abilities than Group B (P < 0.05) and Group C (P < 0.05). Group B achieved better results than Group C, although no significant differences were observed between Groups B and C (P > 0.05). The data suggest that early hearing intervention and home-based habilitation benefit auditory and speech development. Chronological age and habilitation time may be major factors in aural-verbal outcomes in hearing-impaired children, and the development of audition and speech may be most critical in the first year of habilitation after fitting of the auxiliary device.
Neumann, K; Holler-Zittlau, I; van Minnen, S; Sick, U; Zaretsky, Y; Euler, H A
2011-01-01
The German Kindersprachscreening (KiSS) is a universal speech and language screening test for large-scale identification of Hessian kindergarten children requiring special educational language training or clinical speech/language therapy. To calculate the procedural screening validity, 257 children (aged 4.0 to 4.5 years) were tested using KiSS and four language tests (Reynell Developmental Language Scales III, Patholinguistische Diagnostik, PLAKSS, AWST-R). The majority or consensus judgements of three speech-language professionals, based on the language test results, served as the reference criterion. The base (fail) rates of the professionals were either self-determined or preset based on known prevalence rates. Screening validity was higher for preset than for self-determined base rates, owing to higher inter-judge agreement. The confusion matrices of the overall KiSS index classification (speech-language abnormalities with educational or clinical needs) against the fixed-base-rate expert judgement of language impairment, including fluency or voice disorders, yielded a sensitivity of 88% and a specificity of 78%; for language impairment alone, 84% and 75%, respectively. Specificity for disorders requiring clinical diagnostics in the KiSS (language impairment alone or combined with fluency/voice disorders) relative to the test-based consensus expert judgement was about 93%. Sensitivities were unsatisfactory because the differentiation between educational and clinical needs requires improvement. Since judgement concordance between the speech-language professionals was only moderate, the development of a comprehensive German reference test for speech and language disorders, with evidence-based algorithmic decision rules rather than subjective clinical judgement, is advocated.
The Role of Corticostriatal Systems in Speech Category Learning
Yi, Han-Gyol; Maddox, W. Todd; Mumford, Jeanette A.; Chandrasekaran, Bharath
2016-01-01
One of the most difficult category learning problems for humans is learning nonnative speech categories. While feedback-based category training can enhance speech learning, the mechanisms underlying these benefits are unclear. In this functional magnetic resonance imaging study, we investigated neural and computational mechanisms underlying feedback-dependent speech category learning in adults. Positive feedback activated a large corticostriatal network including the dorsolateral prefrontal cortex, inferior parietal lobule, middle temporal gyrus, caudate, putamen, and the ventral striatum. Successful learning was contingent upon the activity of domain-general category learning systems: the fast-learning reflective system, involving the dorsolateral prefrontal cortex that develops and tests explicit rules based on the feedback content, and the slow-learning reflexive system, involving the putamen in which the stimuli are implicitly associated with category responses based on the reward value in feedback. Computational modeling of response strategies revealed significant use of reflective strategies early in training and greater use of reflexive strategies later in training. Reflexive strategy use was associated with increased activation in the putamen. Our results demonstrate a critical role for the reflexive corticostriatal learning system as a function of response strategy and proficiency during speech category learning. Keywords: category learning, fMRI, corticostriatal systems, speech, putamen PMID:25331600
Fang, Chunying; Li, Haifeng; Ma, Lin; Zhang, Mancai
2017-01-01
Pathological speech usually refers to speech distortion resulting from illness or other biological insults. The assessment of pathological speech plays an important role in assisting experts, but automatic evaluation of speech intelligibility is difficult because such speech is usually nonstationary and mutational. In this paper, we develop a feature extraction and reduction scheme and describe a multigranularity combined feature set optimized by a hierarchical visual method. A novel method of generating the feature set based on the S-transform and chaotic analysis is proposed: basic acoustic features (BAFS, 430 dimensions), local spectral characteristics in the form of Mel S-transform cepstrum coefficients (MSCC, 84 dimensions), and chaotic features (12 dimensions). Finally, radar charts and the F-score are used to optimize the features through hierarchical visual fusion. The feature set could be reduced from 526 to 96 dimensions on the NKI-CCRT corpus and to 104 dimensions on the SVD corpus. The experimental results show that the new features with a support vector machine (SVM) give the best performance, with a recognition rate of 84.4% on the NKI-CCRT corpus and 78.7% on the SVD corpus. The proposed method is thus shown to be effective and reliable for pathological speech intelligibility evaluation.
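The F-score half of the feature-reduction step can be sketched as follows, using the common two-class formulation (ratio of between-class to within-class scatter; Chen and Lin, 2006). The choice of k = 96 echoes the reduced NKI-CCRT dimensionality reported above; the radar-chart visual fusion is not reproduced here:

```python
import numpy as np

def f_scores(X_pos, X_neg):
    """F-score of each feature column: between-class scatter divided by
    within-class scatter; larger values mark more discriminative features.
    X_pos, X_neg: (n_samples, n_features) arrays for the two classes."""
    mu = np.vstack([X_pos, X_neg]).mean(axis=0)
    num = (X_pos.mean(axis=0) - mu) ** 2 + (X_neg.mean(axis=0) - mu) ** 2
    den = X_pos.var(axis=0, ddof=1) + X_neg.var(axis=0, ddof=1)
    return num / den

def select_top_k(X_pos, X_neg, k=96):
    """Indices of the k highest-scoring feature dimensions."""
    return np.argsort(f_scores(X_pos, X_neg))[::-1][:k]
```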
Hoth, S
2016-08-01
The Freiburg speech intelligibility test according to DIN 45621 was introduced around 60 years ago. For decades, and still today, the Freiburg test has been a standard whose relevance extends far beyond pure audiometry. It is used primarily to determine the speech perception threshold (based on two-digit numbers) and the ability to discriminate speech at suprathreshold presentation levels (based on monosyllabic nouns). Moreover, it serves as a measure of the degree of disability, of the requirement for and success of technical hearing aids (auxiliaries directives), and of the compensation for disability and handicap (Königstein recommendation). In differential audiological diagnostics, the Freiburg test contributes to the distinction between low- and high-frequency hearing loss, as well as to the identification of conductive, sensory, neural, and central disorders. Currently, the phonemic and perceptual balance of the monosyllabic test lists is the subject of critical discussion. Obvious deficiencies exist for testing speech recognition in noise; in this respect, alternatives such as sentence or rhyme tests with closed-answer inventories are discussed.
Zhang, Xiaoheng; Wang, Lirui; Cao, Yao; Wang, Pin; Zhang, Cheng; Yang, Liuyang; Li, Yongming; Zhang, Yanling; Cheng, Oumei
2018-02-01
Diagnosis of Parkinson's disease (PD) based on speech data has proved to be an effective approach in recent years. However, current research focuses on feature extraction and classifier design and does not consider instance selection. Our previous work showed that instance selection can improve classification accuracy; however, no attention has been paid to the relationship between speech samples and features until now. Therefore, a new PD diagnosis algorithm is proposed in this paper that simultaneously selects speech samples and features, based on a relevant-feature-weighting algorithm and a multiple kernel method, so as to exploit their synergy and thereby improve classification accuracy. Experimental results showed that the proposed algorithm yields a clear improvement in classification accuracy: a mean classification accuracy of 82.5%, which was 30.5% higher than that of the related algorithm. Moreover, the proposed algorithm detected synergy effects between speech samples and features, which is valuable for speech marker extraction.
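To make the multiple-kernel idea concrete, here is a hedged sketch in which per-feature-group RBF kernels are combined with relevance weights and fed to a precomputed-kernel SVM. The joint learning of sample weights and kernel weights described above is omitted; `feature_groups` and `betas` are illustrative placeholders, not the paper's notation:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(X, Y, gamma=0.1):
    """Gaussian kernel matrix between row-sample matrices X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def fit_multikernel_svm(X, y, feature_groups, betas, gamma=0.1):
    """Weighted sum of per-feature-group RBF kernels, then an SVM on the
    precomputed Gram matrix. feature_groups: lists of column indices;
    betas: nonnegative relevance weights, one per group (placeholders for
    the learned weights described in the abstract)."""
    K = sum(b * rbf_kernel(X[:, g], X[:, g], gamma)
            for g, b in zip(feature_groups, betas))
    # For prediction, build the test-vs-train cross-kernel the same way
    # and call clf.predict(K_test) with shape (n_test, n_train).
    clf = SVC(kernel="precomputed").fit(K, y)
    return clf
```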
Brain-to-text: decoding spoken phrases from phone representations in the brain
Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja
2015-01-01
It has long been speculated whether communication between humans and machines based on natural speech-related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones, or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech. PMID:26124702
Geravanchizadeh, Masoud; Fallah, Ali
2015-12-01
A binaural and psychoacoustically motivated intelligibility model, based on a well-known monaural microscopic model, is proposed. This model simulates a phoneme recognition task in the presence of spatially distributed speech-shaped noise in anechoic scenarios. In the proposed model, binaural advantage effects are considered by generating a feature vector for a dynamic-time-warping speech recognizer. This vector consists of three subvectors: two monaural subvectors to model better-ear hearing, and a binaural subvector to simulate the binaural unmasking effect. The binaural unit of the model is based on equalization-cancellation theory. The model operates blindly, meaning separate recordings of speech and noise are not required for the predictions. Speech intelligibility tests were conducted with 12 normal-hearing listeners by collecting speech reception thresholds (SRTs) in the presence of single and multiple sources of speech-shaped noise. Comparison of the model predictions with the measured binaural SRTs, and with the predictions of a macroscopic binaural model called extended equalization-cancellation, shows that this approach predicts intelligibility in anechoic scenarios with good precision. The square of the correlation coefficient (r²) and the mean absolute error between the model predictions and the measurements are 0.98 and 0.62 dB, respectively.
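The recognizer at the core of such a model is a dynamic-time-warping (DTW) template matcher. A compact sketch of the standard DTW recursion over feature-frame sequences, assuming Euclidean frame distances (the model's actual front end is the three-part binaural vector described above):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two feature sequences,
    each of shape (frames, dims), using Euclidean frame costs."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(features, templates):
    """templates: dict mapping phoneme label -> template feature sequence.
    Returns the label with the smallest warped distance."""
    return min(templates, key=lambda p: dtw_distance(features, templates[p]))
```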
Systems concept for speech technology application in general aviation
NASA Technical Reports Server (NTRS)
North, R. A.; Bergeron, H.
1984-01-01
The application potential of voice recognition and synthesis circuits for general aviation, single-pilot IFR (SPIFR) situations is examined. The viewpoint of the pilot was central to the workload analyses and the assessment of the effectiveness of the voice systems. A twin-engine, high-performance general aviation aircraft on a cross-country fixed route was employed as the study model. No actual control movements were considered, and other possible functions were scored by three IFR-rated instructors. The voice systems were concluded to be helpful in alleviating visual and manual workloads during take-off, approach, and landing, particularly for data retrieval and entry tasks. Voice synthesis was an aid in alerting the pilot to in-flight problems. It is expected that usable systems will be available within 5 years.
Lee, Yune-Sang; Turkeltaub, Peter; Granger, Richard; Raizada, Rajeev D S
2012-03-14
Although much effort has been directed toward understanding the neural basis of speech processing, the neural processes involved in the categorical perception of speech have been relatively less studied, and many questions remain open. In this functional magnetic resonance imaging (fMRI) study, we probed the cortical regions mediating categorical speech perception using an advanced brain-mapping technique, whole-brain multivariate pattern-based analysis (MVPA). Normal healthy human subjects (native English speakers) were scanned while they listened to 10 consonant-vowel syllables along the /ba/-/da/ continuum. Outside of the scanner, individuals' own category boundaries were measured to divide the fMRI data into /ba/ and /da/ conditions per subject. The whole-brain MVPA revealed that Broca's area and the left pre-supplementary motor area evoked distinct neural activity patterns between the two perceptual categories (/ba/ vs /da/). Broca's area was also found when the same analysis was applied to another dataset (Raizada and Poldrack, 2007), which previously yielded the supramarginal gyrus using a univariate adaptation-fMRI paradigm. The consistent MVPA findings from two independent datasets strongly indicate that Broca's area participates in categorical speech perception, with a possible role of translating speech signals into articulatory codes. The difference in results between univariate and multivariate pattern-based analyses of the same data suggest that processes in different cortical areas along the dorsal speech perception stream are distributed on different spatial scales.
Ballard, Kirrie J.; Savage, Sharon; Leyton, Cristian E.; Vogel, Adam P.; Hornberger, Michael; Hodges, John R.
2014-01-01
Differentiation of the logopenic (lvPPA) and nonfluent/agrammatic (nfvPPA) variants of primary progressive aphasia is important yet remains challenging, since it hinges on expert-based evaluation of speech and language production. In this study, acoustic measures of speech in conjunction with voxel-based morphometry were used to determine the success of the measures as an adjunct to diagnosis and to explore the neural basis of apraxia of speech in nfvPPA. Forty-one patients (21 lvPPA, 20 nfvPPA) were recruited from a consecutive sample with suspected frontotemporal dementia. Patients were diagnosed using the current gold standard of expert perceptual judgment, based on the presence/absence of particular speech features during speaking tasks. Seventeen healthy age-matched adults served as controls. MRI scans were available for 11 control and 37 PPA cases; 23 of the PPA cases underwent amyloid ligand PET imaging. Measures corresponding to perceptual features of apraxia of speech were periods of silence during reading and relative vowel duration and intensity in polysyllable word repetition. Discriminant function analyses revealed that a measure of relative vowel duration differentiated nfvPPA cases from both control and lvPPA cases (r² = 0.47), with 88% agreement with expert judgment of the presence of apraxia of speech in nfvPPA cases. VBM analysis showed that relative vowel duration covaried with grey matter intensity in areas critical for speech motor planning and programming (precentral gyrus, supplementary motor area, and inferior frontal gyrus bilaterally), which were affected only in the nfvPPA group. This bilateral involvement of frontal speech networks in nfvPPA potentially affects access to compensatory mechanisms involving right-hemisphere homologues. Measures of silences during reading also discriminated the PPA and control groups but did not increase predictive accuracy. Findings suggest that a measure of relative vowel duration from a polysyllable word repetition task may be sufficient for detecting most cases of apraxia of speech and distinguishing between nfvPPA and lvPPA. PMID:24587083
Ding, Nai; Pan, Xunyi; Luo, Cheng; Su, Naifei; Zhang, Wen; Zhang, Jianfeng
2018-01-31
How the brain groups sequential sensory events into chunks is a fundamental question in cognitive neuroscience. This study investigates whether top-down attention or specific tasks are required for the brain to apply lexical knowledge to group syllables into words. Neural responses tracking the syllabic and word rhythms of a rhythmic speech sequence were concurrently monitored using electroencephalography (EEG). The participants performed different tasks, attending to either the rhythmic speech sequence or a distractor, which was another speech stream or a nonlinguistic auditory/visual stimulus. Attention to speech, but not a lexical-meaning-related task, was required for reliable neural tracking of words, even when the distractor was a nonlinguistic stimulus presented cross-modally. Neural tracking of syllables, however, was reliably observed in all tested conditions. These results strongly suggest that neural encoding of individual auditory events (i.e., syllables) is automatic, while knowledge-based construction of temporal chunks (i.e., words) crucially relies on top-down attention. SIGNIFICANCE STATEMENT Why we cannot understand speech when not paying attention is an old question in psychology and cognitive neuroscience. Speech processing is a complex process that involves multiple stages, e.g., hearing and analyzing the speech sound, recognizing words, and combining words into phrases and sentences. The current study investigates which speech-processing stage is blocked when we do not listen carefully. We show that the brain can reliably encode syllables, basic units of speech sounds, even when we do not pay attention. Nevertheless, when distracted, the brain cannot group syllables into multisyllabic words, which are basic units for speech meaning. Therefore, the process of converting speech sound into meaning crucially relies on attention. Copyright © 2018 the authors.
Caballero-Morales, Santiago-Omar
2013-01-01
An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists of phoneme-level acoustic modelling of emotion-specific vowels. A standard phoneme-based automatic speech recognition (ASR) system was built with hidden Markov models (HMMs), in which separate phoneme HMMs were trained for the consonants and for the emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). The emotional state of a spoken sentence is then estimated by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87-100% was achieved for the recognition of the emotional state of Mexican Spanish speech. PMID:23935410
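The counting step can be illustrated in a few lines of Python. The tag format (vowel labels suffixed with an emotion, e.g. "a_anger") is a hypothetical convention for this sketch, not the paper's notation:

```python
from collections import Counter

def estimate_emotion(asr_phones):
    """Count the emotion-specific vowels in the recognizer's output for a
    sentence and return the most frequent emotion label; consonants are
    untagged and ignored. Falls back to 'neutral' if no tagged vowel occurs."""
    votes = Counter(p.split("_", 1)[1] for p in asr_phones if "_" in p)
    return votes.most_common(1)[0][0] if votes else "neutral"

# Example: two anger-vowels outvote one neutral-vowel -> "anger"
print(estimate_emotion(["k", "a_anger", "s", "a_anger", "e_neutral"]))
```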
Synthesized Speech Output and Children: A Scoping Review
ERIC Educational Resources Information Center
Drager, Kathryn D. R.; Reichle, Joe; Pinkoski, Carrie
2010-01-01
Purpose: Many computer-based augmentative and alternative communication systems in use by children have speech output. This article (a) provides a scoping review of the literature addressing the intelligibility and listener comprehension of synthesized speech output with children and (b) discusses future research directions. Method: Studies…
Fingerspelled and Printed Words Are Recoded into a Speech-based Code in Short-term Memory.
Sehyr, Zed Sevcikova; Petrich, Jennifer; Emmorey, Karen
2017-01-01
We conducted three immediate serial recall experiments that manipulated type of stimulus presentation (printed or fingerspelled words) and word similarity (speech-based or manual). Matched deaf American Sign Language signers and hearing non-signers participated (mean reading age = 14-15 years). Speech-based similarity effects were found for both stimulus types indicating that deaf signers recoded both printed and fingerspelled words into a speech-based phonological code. A manual similarity effect was not observed for printed words indicating that print was not recoded into fingerspelling (FS). A manual similarity effect was observed for fingerspelled words when similarity was based on joint angles rather than on handshape compactness. However, a follow-up experiment suggested that the manual similarity effect was due to perceptual confusion at encoding. Overall, these findings suggest that FS is strongly linked to English phonology for deaf adult signers who are relatively skilled readers. This link between fingerspelled words and English phonology allows for the use of a more efficient speech-based code for retaining fingerspelled words in short-term memory and may strengthen the representation of English vocabulary. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
On the importance of early reflections for speech in rooms.
Bradley, J S; Sato, H; Picard, M
2003-06-01
This paper presents the results of new studies based on speech intelligibility tests in simulated sound fields and analyses of impulse response measurements in rooms used for speech communication. The speech intelligibility test results confirm the importance of early reflections for achieving good conditions for speech in rooms. The addition of early reflections increased the effective signal-to-noise ratio and related speech intelligibility scores for both impaired and nonimpaired listeners. The new results also show that for common conditions where the direct sound is reduced, it is only possible to understand speech because of the presence of early reflections. Analyses of measured impulse responses in rooms intended for speech show that early reflections can increase the effective signal-to-noise ratio by up to 9 dB. A room acoustics computer model is used to demonstrate that the relative importance of early reflections can be influenced by the room acoustics design.
Perceptual analysis of speech following traumatic brain injury in childhood.
Cahill, Louise M; Murdoch, Bruce E; Theodoros, Deborah G
2002-05-01
To investigate perceptually the speech dimensions, oromotor function, and speech intelligibility of a group of individuals with traumatic brain injury (TBI) acquired in childhood. The speech of 24 children with TBI was analysed perceptually and compared with that of a group of non-neurologically impaired children matched for age and sex. The 16 dysarthric TBI subjects were significantly less intelligible than the control subjects, and demonstrated significant impairment in 12 of the 33 speech dimensions rated. In addition, the eight non-dysarthric TBI subjects were significantly impaired in many areas of oromotor function on the Frenchay Dysarthria Assessment, indicating some degree of pre-clinical speech impairment. The results of the perceptual analysis are discussed in terms of the possible underlying pathophysiological bases of the deviant speech features identified, and the need for a comprehensive instrumental assessment, to more accurately determine the level of breakdown in the speech production mechanism in children following TBI.
Subjective comparison and evaluation of speech enhancement algorithms
Hu, Yi; Loizou, Philipos C.
2007-01-01
Making meaningful comparisons between the performance of the various speech enhancement algorithms proposed over the years has been elusive due to the lack of a common speech database, differences in the types of noise used, and differences in testing methodology. To facilitate such comparisons, we report on the development of a noisy speech corpus suitable for evaluation of speech enhancement algorithms. This corpus is subsequently used for the subjective evaluation of 13 speech enhancement methods encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model-based, and Wiener-type algorithms. The subjective evaluation was performed by Dynastat, Inc. using the ITU-T P.835 methodology, designed to evaluate speech quality along three dimensions: signal distortion, noise distortion, and overall quality. This paper reports the results of the subjective tests. PMID:18046463
Methods of Improving Speech Intelligibility for Listeners with Hearing Resolution Deficit
2012-01-01
Methods developed for real-time time-scale modification (TSM) of the speech signal are presented. They are based on a non-uniform, speech-rate-dependent SOLA (synchronous overlap-and-add) algorithm. The influence of the proposed methods on the intelligibility of speech was investigated for two separate groups of listeners, i.e., hearing-impaired children and elderly listeners. It was shown that for speech with an average rate equal to or higher than 6.48 vowels/s, all of the proposed methods have a statistically significant impact on the improvement of speech intelligibility for hearing-impaired children with reduced hearing resolution, and one of the proposed methods significantly improves comprehension of speech in the group of elderly listeners with reduced hearing resolution. PMID:23009662
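For orientation, a uniform-rate SOLA sketch is given below: analysis frames taken every hop_a input samples are overlap-added into the output every alpha * hop_a samples, with the splice point refined by a cross-correlation search so that pitch pulses stay aligned. The non-uniform, speech-rate-dependent control of alpha used in the study is not reproduced here, and the frame sizes are illustrative (the sketch assumes a moderate alpha, roughly 0.5-2):

```python
import numpy as np

def sola_stretch(x, alpha, frame=1024, hop_a=256, search=64):
    """Uniform-rate SOLA time-scale modification (alpha > 1 slows speech)."""
    x = np.asarray(x, dtype=float)
    hop_s = int(round(alpha * hop_a))        # synthesis hop
    out = x[:frame].copy()
    m = 1
    while m * hop_a + frame <= len(x):
        seg = x[m * hop_a : m * hop_a + frame]
        t_nom = m * hop_s                    # nominal paste position
        lo = max(0, t_nom - search, len(out) - frame)  # keep overlap <= frame
        hi = min(len(out) - 1, t_nom + search)
        # Pick the paste position maximizing correlation over the overlap.
        best_t, best_c = lo, -np.inf
        for t in range(lo, hi + 1):
            ov = len(out) - t
            c = np.dot(out[t:], seg[:ov]) / ov
            if c > best_c:
                best_c, best_t = c, t
        ov = len(out) - best_t
        fade = np.linspace(0.0, 1.0, ov)     # linear crossfade in the overlap
        out[best_t:] = (1.0 - fade) * out[best_t:] + fade * seg[:ov]
        out = np.concatenate([out, seg[ov:]])
        m += 1
    return out
```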
Feenaughty, Lynda; Tjaden, Kris; Benedict, Ralph H. B.; Weinstock-Guttman, Bianca
2017-01-01
This preliminary study investigated how cognitive-linguistic status in multiple sclerosis (MS) is reflected in two speech tasks (i.e. oral reading, narrative) that differ in cognitive-linguistic demand. Twenty individuals with MS were selected to comprise High and Low performance groups based on clinical tests of executive function and information processing speed and efficiency. Ten healthy controls were included for comparison. Speech samples were audio-recorded and measures of global speech timing were obtained. Results indicated predicted differences in global speech timing (i.e. speech rate and pause characteristics) for speech tasks differing in cognitive-linguistic demand, but the magnitude of these task-related differences was similar for all speaker groups. Findings suggest that assumptions concerning the cognitive-linguistic demands of reading aloud as compared to spontaneous speech may need to be re-considered for individuals with cognitive impairment. Qualitative trends suggest that additional studies investigating the association between cognitive-linguistic and speech motor variables in MS are warranted. PMID:23294227
Visual speech information: a help or hindrance in perceptual processing of dysarthric speech.
Borrie, Stephanie A
2015-03-01
This study investigated the influence of visual speech information on perceptual processing of neurologically degraded speech. Fifty listeners identified spastic dysarthric speech under both audio (A) and audiovisual (AV) conditions. Condition comparisons revealed that the addition of visual speech information enhanced processing of the neurologically degraded input in terms of (a) acuity (percent phonemes correct) of vowels and consonants and (b) recognition (percent words correct) of predictive and nonpredictive phrases. Listeners exploited stress-based segmentation strategies more readily in AV conditions, suggesting that the perceptual benefit associated with adding visual speech information to the auditory signal (the AV advantage) has both segmental and suprasegmental origins. Results also revealed that the magnitude of the AV advantage can be predicted, to some degree, by the extent to which an individual utilizes syllabic stress cues to inform word recognition in AV conditions. Findings inform the development of a listener-specific model of speech perception that applies to the processing of dysarthric speech in everyday communication contexts.
Visual Feedback of Tongue Movement for Novel Speech Sound Learning
Katz, William F.; Mehta, Sonya
2015-01-01
Pronunciation training studies have yielded important information concerning the processing of audiovisual (AV) information. Second language (L2) learners show increased reliance on bottom-up, multimodal input for speech perception (compared to monolingual individuals). However, little is known about the role of viewing one's own speech articulation processes during speech training. The current study investigated whether real-time, visual feedback for tongue movement can improve a speaker's learning of non-native speech sounds. An interactive 3D tongue visualization system based on electromagnetic articulography (EMA) was used in a speech training experiment. Native speakers of American English produced a novel speech sound (/ɖ/; a voiced, coronal, palatal stop) before, during, and after trials in which they viewed their own speech movements using the 3D model. Talkers' productions were evaluated using kinematic (tongue-tip spatial positioning) and acoustic (burst spectra) measures. The results indicated a rapid gain in accuracy associated with visual feedback training. The findings are discussed with respect to neural models for multimodal speech processing. PMID:26635571
Chivukula, Srinivas; Pikul, Brian K; Black, Keith L; Pouratian, Nader; Bookheimer, Susan Y
2018-05-18
We evaluated plasticity in speech supplementary motor area (SMA) tissue in two patients using functional magnetic resonance imaging (fMRI), following resection of tumors in or associated with the dominant-hemisphere speech SMA. Patient A underwent resection of an anaplastic astrocytoma NOS associated with the left speech SMA, experienced SMA-syndrome-related mutism postoperatively, but recovered fully 14 months later. fMRI performed 32 months after surgery demonstrated a migration of speech SMA to homologous contralateral hemispheric tissue. Patient B underwent resection of an oligodendroglioma NOS in the left speech SMA and postoperatively experienced speech hesitancy, latency, and poor fluency, which gradually resolved over 18 months. fMRI performed 64 months after surgery showed a reorganization of speech SMA to the contralateral hemisphere. These data support the hypothesis of dynamic, time-based plasticity in speech SMA tissue and may represent a noninvasive neural marker for SMA syndrome recovery. Copyright © 2018 Elsevier Inc. All rights reserved.
Kim, Heejung; Hahm, Jarang; Lee, Hyekyoung; Kang, Eunjoo; Kang, Hyejin; Lee, Dong Soo
2015-05-01
The human brain naturally integrates audiovisual information to improve speech perception. However, in noisy environments, understanding speech is difficult and may require much effort. Although a brain network is presumed to be engaged in speech perception, it is unclear how speech-related brain regions are connected during natural bimodal audiovisual or unimodal speech perception with counterpart irrelevant noise. To investigate the topological changes of speech-related brain networks at all possible thresholds, we used a persistent homological framework through hierarchical clustering, based on single-linkage distance, to analyze the connected components of the functional network during speech perception using functional magnetic resonance imaging. For speech perception, bimodal (audiovisual speech cue) or unimodal speech cues with counterpart irrelevant noise (auditory white noise or visual gum-chewing) were delivered to 15 subjects. In terms of positive relationships, similar connected components were observed in the bimodal and unimodal speech conditions during filtration. However, during speech perception with congruent audiovisual stimuli, tighter couplings of the left anterior temporal gyrus-anterior insula component and of the right premotor-visual component were observed than in the auditory-only and visual-only speech cue conditions, respectively. Interestingly, visual speech perceived under white noise showed tight negative coupling among the left inferior frontal region, right anterior cingulate, left anterior insula, and bilateral visual regions, including the right middle temporal gyrus and right fusiform components. In conclusion, the speech brain network is tightly positively or negatively connected, reflecting efficient or effortful processes during natural audiovisual integration or lip-reading, respectively, in speech perception.
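As a rough illustration of the filtration idea, the sketch below applies single-linkage clustering to a correlation-derived distance matrix and reads the merge heights as the thresholds at which connected components of the network join (a Betti-0 barcode). The correlation matrix is random stand-in data, not the study's fMRI connectivity.

```python
# Minimal sketch, assuming a precomputed region-by-region correlation matrix;
# random data stands in for the study's fMRI connectivity.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
corr = np.corrcoef(rng.standard_normal((10, 100)))  # 10 regions, 100 time points
dist = 1.0 - corr                                   # distance for positive coupling
np.fill_diagonal(dist, 0.0)

# Single-linkage merge heights are exactly the filtration thresholds at which
# connected components of the thresholded network merge (the Betti-0 barcode).
Z = linkage(squareform(dist, checks=False), method="single")
for i, (a, b, height, size) in enumerate(Z):
    print(f"merge {i}: component of size {int(size)} forms at threshold {height:.3f}")
```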
Speech emotion recognition methods: A literature review
NASA Astrophysics Data System (ADS)
Basharirad, Babak; Moradhaseli, Mohammadreza
2017-10-01
Recently, research attention to emotional speech signals in human-machine interfaces has increased due to the availability of high computational capability. Many systems have been proposed in the literature to identify emotional states through speech. Selecting suitable feature sets, designing proper classification methods, and preparing an appropriate dataset are the main key issues of speech emotion recognition systems. This paper critically analyzes the currently available approaches to speech emotion recognition based on three evaluation parameters (feature set, classification of features, and accuracy). In addition, this paper evaluates the performance and limitations of available methods. Furthermore, it highlights promising current directions for the improvement of speech emotion recognition systems.
NASA Astrophysics Data System (ADS)
Scharenborg, Odette; ten Bosch, Louis; Boves, Lou; Norris, Dennis
2003-12-01
This letter evaluates potential benefits of combining human speech recognition (HSR) and automatic speech recognition by building a joint model of an automatic phone recognizer (APR) and a computational model of HSR, viz., Shortlist [Norris, Cognition 52, 189-234 (1994)]. Experiments based on "real-life" speech highlight critical limitations posed by some of the simplifying assumptions made in models of human speech recognition. These limitations could be overcome by avoiding hard phone decisions at the output side of the APR, and by using a match between the input and the internal lexicon that flexibly copes with deviations from canonical phonemic representations.
Blind speech separation system for humanoid robot with FastICA for audio filtering and separation
NASA Astrophysics Data System (ADS)
Budiharto, Widodo; Santoso Gunawan, Alexander Agung
2016-07-01
Nowadays, there are many developments in building intelligent humanoid robots, mainly in order to handle voice and image input. In this research, we propose a blind speech separation system using FastICA for audio filtering and separation that can be used in education or entertainment. Our main problem is to separate multiple speech sources and also to filter out irrelevant noises. After the speech separation step, the results are integrated with our previous speech and face recognition system, which is based on a Bioloid GP robot and a Raspberry Pi 2 as controller. The experimental results show that the accuracy of our blind speech separation system is about 88% in command and query recognition cases.
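For readers unfamiliar with FastICA, the sketch below shows the core separation step on synthetic instantaneous two-microphone mixtures using scikit-learn; real robot audio involves convolutive mixing, which typically requires frequency-domain ICA on top of this.

```python
# Minimal FastICA sketch on synthetic mixtures; the waveforms are stand-ins
# for two simultaneous talkers, not the paper's recordings.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 8000)
s1 = np.sign(np.sin(2 * np.pi * 5 * t))      # stand-in source 1
s2 = np.sin(2 * np.pi * 13 * t)              # stand-in source 2
S = np.c_[s1, s2]
A = np.array([[1.0, 0.6], [0.4, 1.0]])       # unknown mixing matrix
X = S @ A.T                                  # two-microphone observations

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                 # recovered sources (up to order/scale)
```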
Price, Cathy J.
2012-01-01
The anatomy of language has been investigated with PET or fMRI for more than 20 years. Here I attempt to provide an overview of the brain areas associated with heard speech, speech production and reading. The conclusions of many hundreds of studies were considered, grouped according to the type of processing, and reported in the order that they were published. Many findings have been replicated time and time again leading to some consistent and undisputable conclusions. These are summarised in an anatomical model that indicates the location of the language areas and the most consistent functions that have been assigned to them. The implications for cognitive models of language processing are also considered. In particular, a distinction can be made between processes that are localized to specific structures (e.g. sensory and motor processing) and processes where specialisation arises in the distributed pattern of activation over many different areas that each participate in multiple functions. For example, phonological processing of heard speech is supported by the functional integration of auditory processing and articulation; and orthographic processing is supported by the functional integration of visual processing, articulation and semantics. Future studies will undoubtedly be able to improve the spatial precision with which functional regions can be dissociated but the greatest challenge will be to understand how different brain regions interact with one another in their attempts to comprehend and produce language. PMID:22584224
Shih, Ludy C; Piel, Jordan; Warren, Amanda; Kraics, Lauren; Silver, Althea; Vanderhorst, Veronique; Simon, David K; Tarsy, Daniel
2012-06-01
Parkinson's disease-related speech and voice impairments have a significant impact on quality of life measures. LSVT(®)LOUD voice and speech therapy (Lee Silverman Voice Therapy) has demonstrated scientific efficacy and clinical effectiveness, but musically based voice and speech therapy has been underexplored as a potentially useful method of rehabilitation. We undertook a pilot, open-label study of a group-based singing intervention, consisting of twelve 90-min weekly sessions led by a voice and speech therapist/singing instructor. The primary outcome measure of vocal loudness, as measured by sound pressure level (SPL) at 50 cm during connected speech, was not significantly different one week after the intervention or at 13 weeks after the intervention. A number of secondary measures reflecting pitch range, phonation time and maximum loudness also were unchanged. Voice-related quality of life (VRQOL) and voice handicap index (VHI) also were unchanged. This study suggests that a group singing therapy intervention at this intensity and frequency does not result in significant improvement in objective and subject-rated measures of voice and speech impairment. Copyright © 2012 Elsevier Ltd. All rights reserved.
Fu, Szu-Wei; Li, Pei-Chun; Lai, Ying-Hui; Yang, Cheng-Chien; Hsieh, Li-Chun; Tsao, Yu
2017-11-01
Objective: This paper focuses on machine learning based voice conversion (VC) techniques for improving the speech intelligibility of surgical patients who have had parts of their articulators removed. Because of the removal of parts of the articulator, a patient's speech may be distorted and difficult to understand. To overcome this problem, VC methods can be applied to convert the distorted speech such that it is clear and more intelligible. To design an effective VC method, two key points must be considered: 1) the amount of training data may be limited (because speaking for a long time is usually difficult for postoperative patients); 2) rapid conversion is desirable (for better communication). Methods: We propose a novel joint dictionary learning based non-negative matrix factorization (JD-NMF) algorithm. Compared to conventional VC techniques, JD-NMF can perform VC efficiently and effectively with only a small amount of training data. Results: The experimental results demonstrate that the proposed JD-NMF method not only achieves notably higher short-time objective intelligibility (STOI) scores (a standardized objective intelligibility evaluation metric) than those obtained using the original unconverted speech but is also significantly more efficient and effective than a conventional exemplar-based NMF VC method. Conclusion: The proposed JD-NMF method may outperform the state-of-the-art exemplar-based NMF VC method in terms of STOI scores under the desired scenario. Significance: We confirmed the advantages of the proposed joint training criterion for the NMF-based VC. Moreover, we verified that the proposed JD-NMF can effectively improve the speech intelligibility scores of oral surgery patients.
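A minimal sketch of the joint-dictionary idea follows, assuming parallel source/target spectrogram frames. It uses a plain NMF on stacked spectra rather than the authors' exact JD-NMF training, and random matrices stand in for real data.

```python
# Sketch: stacked-spectrogram NMF for voice conversion. Not the paper's
# algorithm verbatim; a generic illustration of shared activations.
import numpy as np
from sklearn.decomposition import NMF
from scipy.optimize import nnls

F, T, K = 64, 200, 20
rng = np.random.default_rng(2)
V_src = rng.random((F, T))                # parallel training spectrograms
V_tgt = rng.random((F, T))                # (stand-ins for aligned frame pairs)

# Joint factorization: [V_src; V_tgt] ~= [W_src; W_tgt] H, so the source and
# target dictionaries share one activation matrix H.
model = NMF(n_components=K, init="random", random_state=0, max_iter=300)
W_joint = model.fit_transform(np.vstack([V_src, V_tgt]))
W_src, W_tgt = W_joint[:F], W_joint[F:]

# Conversion: encode new source frames against W_src, decode with W_tgt.
V_new = rng.random((F, 50))
H_new = np.column_stack([nnls(W_src, v)[0] for v in V_new.T])
V_converted = W_tgt @ H_new
```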
Performance Evaluation of Glottal Inverse Filtering Algorithms Using a Physiologically Based Articulatory Speech Synthesizer
Chien, Yu-Ren; Mehta, Daryush D.; Guðnason, Jón; Zañartu, Matías; Quatieri, Thomas F.
2017-01-05
Glottal inverse filtering aims to ... of inverse filtering performance has been challenging due to the practical difficulty in measuring the true glottal signals while speech signals are
Analyzing crowdsourced ratings of speech-based take-over requests for automated driving.
Bazilinskyy, P; de Winter, J C F
2017-10-01
Take-over requests in automated driving should fit the urgency of the traffic situation. The robustness of various published research findings on the valuations of speech-based warning messages is unclear. This research aimed to establish how people value speech-based take-over requests as a function of speech rate, background noise, spoken phrase, and speaker's gender and emotional tone. By means of crowdsourcing, 2669 participants from 95 countries listened to a random 10 out of 140 take-over requests, and rated each take-over request on urgency, commandingness, pleasantness, and ease of understanding. Our results replicate several published findings, in particular that an increase in speech rate results in a monotonic increase of perceived urgency. The female voice was easier to understand than a male voice when there was a high level of background noise, a finding that contradicts the literature. Moreover, a take-over request spoken with Indian accent was found to be easier to understand by participants from India than by participants from other countries. Our results replicate effects in the literature regarding speech-based warnings, and shed new light on effects of background noise, gender, and nationality. The results may have implications for the selection of appropriate take-over requests in automated driving. Additionally, our study demonstrates the promise of crowdsourcing for testing human factors and ergonomics theories with large sample sizes. Copyright © 2017 Elsevier Ltd. All rights reserved.
Tracking Speech Sound Acquisition
ERIC Educational Resources Information Center
Powell, Thomas W.
2011-01-01
This article describes a procedure to aid in the clinical appraisal of child speech. The approach, based on the work by Dinnsen, Chin, Elbert, and Powell (1990; Some constraints on functionally disordered phonologies: Phonetic inventories and phonotactics. "Journal of Speech and Hearing Research", 33, 28-37), uses a railway idiom to track gains in…
Children Mix Direct and Indirect Speech: Evidence from Pronoun Comprehension
ERIC Educational Resources Information Center
Köder, Franziska; Maier, Emar
2016-01-01
This study investigates children's acquisition of the distinction between direct speech (Elephant said, "I get the football") and indirect speech ("Elephant said that he gets the football"), by measuring children's interpretation of first, second, and third person pronouns. Based on evidence from various linguistic sources, we…
Measuring Speech Comprehensibility in Students with Down Syndrome
ERIC Educational Resources Information Center
Yoder, Paul J.; Woynaroski, Tiffany; Camarata, Stephen
2016-01-01
Purpose: There is an ongoing need to develop assessments of spontaneous speech that focus on whether the child's utterances are comprehensible to listeners. This study sought to identify the attributes of a stable ratings-based measure of speech comprehensibility, which enabled examining the criterion-related validity of an orthography-based…
Cortical Bases of Speech Perception: Evidence from Functional Lesion Studies
ERIC Educational Resources Information Center
Boatman, Dana
2004-01-01
Functional lesion studies have yielded new information about the cortical organization of speech perception in the human brain. We will review a number of recent findings, focusing on studies of speech perception that use the techniques of electrocortical mapping by cortical stimulation and hemispheric anesthetization by intracarotid amobarbital.…
The Hierarchical Cortical Organization of Human Speech Processing
de Heer, Wendy A.; Huth, Alexander G.; Griffiths, Thomas L.
2017-01-01
Speech comprehension requires that the brain extract semantic meaning from the spectral features represented at the cochlea. To investigate this process, we performed an fMRI experiment in which five men and two women passively listened to several hours of natural narrative speech. We then used voxelwise modeling to predict BOLD responses based on three different feature spaces that represent the spectral, articulatory, and semantic properties of speech. The amount of variance explained by each feature space was then assessed using a separate validation dataset. Because some responses might be explained equally well by more than one feature space, we used a variance partitioning analysis to determine the fraction of the variance that was uniquely explained by each feature space. Consistent with previous studies, we found that speech comprehension involves hierarchical representations starting in primary auditory areas and moving laterally on the temporal lobe: spectral features are found in the core of A1, mixtures of spectral and articulatory in STG, mixtures of articulatory and semantic in STS, and semantic in STS and beyond. Our data also show that both hemispheres are equally and actively involved in speech perception and interpretation. Further, responses as early in the auditory hierarchy as in STS are more correlated with semantic than spectral representations. These results illustrate the importance of using natural speech in neurolinguistic research. Our methodology also provides an efficient way to simultaneously test multiple specific hypotheses about the representations of speech without using block designs and segmented or synthetic speech. SIGNIFICANCE STATEMENT To investigate the processing steps performed by the human brain to transform natural speech sound into meaningful language, we used models based on a hierarchical set of speech features to predict BOLD responses of individual voxels recorded in an fMRI experiment while subjects listened to natural speech. Both cerebral hemispheres were actively involved in speech processing in large and equal amounts. Also, the transformation from spectral features to semantic elements occurs early in the cortical speech-processing stream. Our experimental and analytical approaches are important alternatives and complements to standard approaches that use segmented speech and block designs, which report more laterality in speech processing and associated semantic processing to higher levels of cortex than reported here. PMID:28588065
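The variance partitioning logic, shown here for two feature spaces rather than the study's three, reduces to comparing held-out R² of nested ridge models; the sketch below uses random stand-in features, and all names and dimensions are illustrative only.

```python
# Two-space variance partitioning sketch with ridge regression; random
# stand-ins replace the study's spectral/articulatory/semantic features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
n = 500
X_spec = rng.standard_normal((n, 10))        # "spectral" features (stand-in)
X_sem = rng.standard_normal((n, 10))         # "semantic" features (stand-in)
y = X_spec @ rng.standard_normal(10) + 0.5 * (X_sem @ rng.standard_normal(10))
y += rng.standard_normal(n)                  # voxel response with noise

tr, te = slice(0, 400), slice(400, 500)      # separate validation split

def held_out_r2(X):
    model = Ridge(alpha=1.0).fit(X[tr], y[tr])
    return r2_score(y[te], model.predict(X[te]))

r2_joint = held_out_r2(np.hstack([X_spec, X_sem]))
unique_spec = r2_joint - held_out_r2(X_sem)  # variance only spectral explains
unique_sem = r2_joint - held_out_r2(X_spec)  # variance only semantic explains
shared = r2_joint - unique_spec - unique_sem
```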
Freedom of racist speech: Ego and expressive threats.
White, Mark H; Crandall, Christian S
2017-09-01
Do claims of "free speech" provide cover for prejudice? We investigate whether this defense of racist or hate speech serves as a justification for prejudice. In a series of 8 studies (N = 1,624), we found that explicit racial prejudice is a reliable predictor of the "free speech defense" of racist expression. Participants endorsed free speech values for singing racist songs or posting racist comments on social media; people high in prejudice endorsed free speech more than people low in prejudice (meta-analytic r = .43). This endorsement was not principled: high levels of prejudice did not predict endorsement of free speech values when identical speech was directed at coworkers or the police. Participants low in explicit racial prejudice actively avoided endorsing free speech values in racialized conditions compared to nonracial conditions, but participants high in racial prejudice increased their endorsement of free speech values in racialized conditions. Three experiments failed to find evidence that defense of racist speech by the highly prejudiced was based in self-relevant or self-protective motives. Two experiments found evidence that the free speech argument protected participants' own freedom to express their attitudes; the defense of others' racist speech seems motivated more by threats to autonomy than threats to self-regard. These studies serve as an elaboration of the Justification-Suppression Model (Crandall & Eshleman, 2003) of prejudice expression. The justification of racist speech by endorsing fundamental political values can serve to buffer racial and hate speech from normative disapproval. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Maas, Edwin; Mailend, Marja-Liisa
2012-10-01
The purpose of this article is to present an argument for the use of online reaction time (RT) methods in the study of apraxia of speech (AOS), and to review the existing small literature in this area and the contributions it has made to our fundamental understanding of speech planning (and its deficits) in AOS. Following a brief description of the limitations of offline perceptual methods, we provide a narrative review of various types of RT paradigms from the (speech) motor programming and psycholinguistic literatures and their (thus far limited) application to AOS. On the basis of the review of the literature, we conclude that with careful consideration of potential challenges and caveats, RT approaches hold great promise to advance our understanding of AOS, in particular with respect to the speech planning processes that generate the speech signal before initiation. A deeper understanding of the nature and time course of speech planning and its disruptions in AOS may enhance diagnosis and treatment for AOS. Only a handful of published studies on apraxia of speech have used reaction time methods. However, these studies have provided deeper insight into speech planning impairments in AOS based on a variety of experimental paradigms.
ERIC Educational Resources Information Center
Ramsey, E. Michele
2004-01-01
Objective: To integrate speaking practice with rhetorical theory. Type of speech: Persuasive. Point value: 100 points (i.e., 30 points based on peer evaluations, 30 points based on individual performance, 40 points based on the group presentation), which is 25% of course grade. Requirements: (a) References: 7-10; (b) Length: 20-30 minutes; (c)…
Speech recognition in advanced rotorcraft - Using speech controls to reduce manual control overload
NASA Technical Reports Server (NTRS)
Vidulich, Michael A.; Bortolussi, Michael R.
1988-01-01
An experiment has been conducted to ascertain the usefulness of helicopter pilot speech controls and their effect on time-sharing performance, under the impetus of multiple-resource theories of attention which predict that time-sharing should be more efficient with mixed manual and speech controls than with all-manual ones. The test simulation involved an advanced, single-pilot scout/attack helicopter. Performance and subjective workload levels obtained supported the claimed utility of speech recognition-based controls; specifically, time-sharing performance was improved while preparing a data-burst transmission of information during helicopter hover.
NASA Technical Reports Server (NTRS)
An, S. H.; Yao, K.
1986-01-01
The lattice algorithm has been employed in numerous adaptive filtering applications such as speech analysis/synthesis, noise canceling, spectral analysis, and channel equalization. In this paper its application to adaptive-array processing is discussed. The advantages are a fast convergence rate as well as computational accuracy independent of the noise and interference conditions. The results produced by this technique are compared to those obtained by the direct matrix inverse method.
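As a compact illustration of the adaptive noise-canceling setting (using a simple transversal NLMS update rather than the lattice structure the paper discusses), a sketch on synthetic data:

```python
# NLMS noise canceller sketch; a transversal stand-in for the lattice
# algorithm, on synthetic data.
import numpy as np

rng = np.random.default_rng(4)
n, taps, mu = 5000, 8, 0.5
noise = rng.standard_normal(n)                         # reference noise input
signal = np.sin(2 * np.pi * 0.01 * np.arange(n))       # desired signal
primary = signal + np.convolve(noise, [0.8, -0.3, 0.2], mode="same")

w = np.zeros(taps)
cleaned = np.zeros(n)
for i in range(taps, n):
    x = noise[i - taps:i][::-1]                        # tap-delay line
    e = primary[i] - w @ x                             # error = signal estimate
    w += mu * e * x / (x @ x + 1e-8)                   # normalized LMS update
    cleaned[i] = e
```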
V2S: Voice to Sign Language Translation System for Malaysian Deaf People
NASA Astrophysics Data System (ADS)
Mean Foong, Oi; Low, Tang Jung; La, Wai Wan
The process of learning and understanding sign language may be cumbersome to some, and therefore this paper proposes a solution to this problem by providing a voice (English language) to sign language translation system using speech and image processing techniques. Speech processing, which includes speech recognition, is the study of recognizing the words being spoken, regardless of who the speaker is. This project uses template-based recognition as the main approach, in which the V2S system first needs to be trained with speech patterns based on some generic spectral parameter set. These spectral parameter sets are then stored as templates in a database. The system performs the recognition process by matching the parameter set of the input speech with the stored templates to finally display the sign language in video format. Empirical results show that the system has an 80.3% recognition rate.
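The template-matching step can be illustrated with dynamic time warping over MFCCs; this is a generic sketch of template-based recognition (the paper does not specify its spectral parameter set, so the MFCC features and the cost normalization here are assumptions).

```python
# Generic template-based recognizer sketch: nearest template by DTW cost.
import numpy as np
import librosa

def mfcc(y, sr=16000):
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

def recognize(y, templates, sr=16000):
    """templates: dict mapping a word to a stored MFCC matrix."""
    query = mfcc(y, sr)
    costs = {}
    for word, tpl in templates.items():
        D, _ = librosa.sequence.dtw(X=query, Y=tpl, metric="euclidean")
        costs[word] = D[-1, -1] / (query.shape[1] + tpl.shape[1])  # length-normalized
    return min(costs, key=costs.get)
```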
Fluid-acoustic interactions and their impact on pathological voiced speech
NASA Astrophysics Data System (ADS)
Erath, Byron D.; Zanartu, Matias; Peterson, Sean D.; Plesniak, Michael W.
2011-11-01
Voiced speech is produced by vibration of the vocal fold structures. Vocal fold dynamics arise from aerodynamic pressure loadings, tissue properties, and acoustic modulation of the driving pressures. Recent speech science advancements have produced a physiologically-realistic fluid flow solver (BLEAP) capable of prescribing asymmetric intraglottal flow attachment that can be easily assimilated into reduced order models of speech. The BLEAP flow solver is extended to incorporate acoustic loading and sound propagation in the vocal tract by implementing a wave reflection analog approach for sound propagation based on the governing BLEAP equations. This enhanced physiological description of the physics of voiced speech is implemented into a two-mass model of speech. The impact of fluid-acoustic interactions on vocal fold dynamics is elucidated for both normal and pathological speech through linear and nonlinear analysis techniques. Supported by NSF Grant CBET-1036280.
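For context, a wave-reflection analog propagates forward- and backward-travelling pressure waves through tube sections and scatters them at area discontinuities. The sketch below is a textbook Kelly-Lochbaum-style digital waveguide with hypothetical section areas and boundary reflection coefficients, not the authors' BLEAP-coupled implementation; sign conventions vary across texts.

```python
# Textbook-style wave-reflection (Kelly-Lochbaum) sketch; areas and boundary
# reflections are hypothetical, and this is not the BLEAP coupling itself.
import numpy as np

areas = np.array([2.6, 1.8, 2.3, 4.0, 5.0])              # tube section areas
k = (areas[1:] - areas[:-1]) / (areas[1:] + areas[:-1])  # junction reflection coeffs
n = len(areas)
rg, rl = 0.9, -0.9                                       # glottis / lip reflections

f = np.zeros(n)                                          # right-going waves
b = np.zeros(n)                                          # left-going waves
out = []
for t in range(400):
    src = 1.0 if t == 0 else 0.0                         # impulse at the glottis
    f_new, b_new = np.zeros(n), np.zeros(n)
    f_new[0] = src + rg * b[0]                           # glottal boundary
    for j in range(n - 1):                               # scattering junctions
        f_new[j + 1] = (1 + k[j]) * f[j] - k[j] * b[j + 1]
        b_new[j] = k[j] * f[j] + (1 - k[j]) * b[j + 1]
    b_new[-1] = rl * f[-1]                               # lip boundary
    out.append((1 + rl) * f[-1])                         # radiated pressure sample
    f, b = f_new, b_new
```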
NASA Astrophysics Data System (ADS)
Nomura, Yukihiro; Lu, Jianming; Sekiya, Hiroo; Yahagi, Takashi
This paper presents a speech enhancement method based on classification between speech-dominant and noise-dominant components. In our system, a new scheme for classifying each band as speech-dominant or noise-dominant is proposed. The proposed classification uses the standard deviation of the spectrum of the observed signal in each band. We introduce two oversubtraction factors, one for speech-dominant and one for noise-dominant components, and spectral subtraction is carried out after the classification. The proposed method is tested on several noise types from the Noisex-92 database. Based on segmental SNR, the Itakura-Saito distance measure, inspection of spectrograms, and listening tests, the proposed system is shown to be effective in reducing background noise. Moreover, the enhanced speech produced by our system contains less musical noise and distortion than that of conventional systems.
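The subtraction step itself can be sketched as follows; the per-band classification and the two oversubtraction factors described above are reduced here to a single factor alpha and a spectral floor beta, both hypothetical values.

```python
# Basic oversubtraction spectral subtraction sketch (single alpha; the
# paper's per-band speech/noise classification is omitted).
import numpy as np

def spectral_subtract(noisy, noise_psd, alpha=2.0, beta=0.01,
                      n_fft=512, hop=128):
    # noise_psd: length n_fft//2 + 1, e.g. averaged |FFT|^2 of noise-only frames
    win = np.hanning(n_fft)
    out = np.zeros(len(noisy) + n_fft)
    for i in range(0, len(noisy) - n_fft, hop):
        X = np.fft.rfft(win * noisy[i:i + n_fft])
        mag2 = np.abs(X) ** 2
        clean2 = np.maximum(mag2 - alpha * noise_psd, beta * mag2)   # spectral floor
        Y = np.sqrt(clean2) * np.exp(1j * np.angle(X))               # keep noisy phase
        out[i:i + n_fft] += win * np.fft.irfft(Y, n=n_fft)           # overlap-add
    return out[:len(noisy)]
```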
NASA Astrophysics Data System (ADS)
Viswanathan, V. R.; Karnofsky, K. F.; Stevens, K. N.; Alakel, M. N.
1983-12-01
The use of multiple sensors to transduce speech was investigated. A database of speech and noise was collected from a number of transducers located on and around the head of the speaker. The transducers included pressure, first-order gradient, and second-order gradient microphones and an accelerometer. The effort analyzed these data and evaluated the performance of a multiple-sensor configuration. The conclusion was that multiple-transducer configurations can provide a signal containing more usable speech information than that provided by a single microphone.
Structural brain aging and speech production: a surface-based brain morphometry study.
Tremblay, Pascale; Deschamps, Isabelle
2016-07-01
While there has been a growing number of studies examining the neurofunctional correlates of speech production over the past decade, the neurostructural correlates of this immensely important human behaviour remain less well understood, despite the fact that previous studies have established links between brain structure and behaviour, including speech and language. In the present study, we thus examined, for the first time, the relationship between surface-based cortical thickness (CT) and three different behavioural indexes of sublexical speech production: response duration, reaction times and articulatory accuracy, in healthy young and older adults during the production of simple and complex meaningless sequences of syllables (e.g., /pa-pa-pa/ vs. /pa-ta-ka/). The results show that each behavioural speech measure was sensitive to the complexity of the sequences, as indicated by slower reaction times, longer response durations and decreased articulatory accuracy in both groups for the complex sequences. Older adults produced longer speech responses, particularly during the production of complex sequences. Unique age-independent and age-dependent relationships between brain structure and each of these behavioural measures were found in several cortical and subcortical regions known for their involvement in speech production, including the bilateral anterior insula, the left primary motor area, the rostral supramarginal gyrus, the right inferior frontal sulcus, the bilateral putamen and caudate, and in some regions less typically associated with speech production, such as the posterior cingulate cortex.
Altieri, Nicholas; Pisoni, David B.; Townsend, James T.
2012-01-01
Summerfield (1987) proposed several accounts of audiovisual speech perception, a field of research that has burgeoned in recent years. The proposed accounts included the integration of discrete phonetic features, vectors describing the values of independent acoustical and optical parameters, the filter function of the vocal tract, and articulatory dynamics of the vocal tract. The latter two accounts assume that the representations of audiovisual speech perception are based on abstract gestures, while the former two assume that the representations consist of symbolic or featural information obtained from visual and auditory modalities. Recent converging evidence from several different disciplines reveals that the general framework of Summerfield’s feature-based theories should be expanded. An updated framework building upon the feature-based theories is presented. We propose a processing model arguing that auditory and visual brain circuits provide facilitatory information when the inputs are correctly timed, and that auditory and visual speech representations do not necessarily undergo translation into a common code during information processing. Future research on multisensory processing in speech perception should investigate the connections between auditory and visual brain regions, and utilize dynamic modeling tools to further understand the timing and information processing mechanisms involved in audiovisual speech integration. PMID:21968081
Wang, Kun-Ching
2015-01-14
The classification of emotional speech is widely considered in speech-related research on human-computer interaction (HCI). The purpose of this paper is to present a novel feature extraction based on multi-resolution texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis for the characterization and classification of different emotions in a speech signal. The motivation is that emotions have different intensity values in different frequency bands. In terms of human visual perception, the texture properties of a multi-resolution emotional speech spectrogram should form a good feature set for emotion classification in speech. Furthermore, multi-resolution texture analysis can give a clearer discrimination between emotions than uniform-resolution analysis. In order to provide high accuracy of emotional discrimination, especially in real life, an acoustic activity detection (AAD) algorithm must be applied in the MRTII-based feature extraction. Considering the presence of many blended emotions in real life, this paper makes use of two corpora of naturally-occurring dialogs recorded in real-life call centers. Compared with the traditional Mel-scale Frequency Cepstral Coefficients (MFCC) and state-of-the-art features, the MRTII features also improve the correct classification rates of the proposed systems across different language databases. Experimental results show that the proposed MRTII-based feature information, inspired by human visual perception of the spectrogram image, provides significant classification gains for real-life emotional recognition in speech.
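One plausible reading of multi-resolution spectrogram texture analysis is sketched below using local binary patterns at several image scales; the operators, scales, and histogram sizes are our assumptions rather than the paper's exact MRTII pipeline, and the AAD front end is omitted.

```python
# Assumed multi-resolution texture features from a spectrogram image
# (LBP histograms at three scales); not the exact MRTII operators.
import numpy as np
import librosa
from skimage.feature import local_binary_pattern
from skimage.transform import resize

def texture_features(y, scales=(1.0, 0.5, 0.25)):
    S = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    S = (S - S.min()) / (S.max() - S.min() + 1e-9)     # spectrogram as [0, 1] image
    feats = []
    for s in scales:                                   # multi-resolution analysis
        img = resize(S, (max(int(S.shape[0] * s), 8), max(int(S.shape[1] * s), 8)))
        lbp = local_binary_pattern(img, P=8, R=1, method="uniform")
        hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
        feats.append(hist)
    return np.concatenate(feats)                       # classifier input vector
```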
Barista: A Framework for Concurrent Speech Processing by USC-SAIL
Can, Doğan; Gibson, James; Vaz, Colin; Georgiou, Panayiotis G.; Narayanan, Shrikanth S.
2016-01-01
We present Barista, an open-source framework for concurrent speech processing based on the Kaldi speech recognition toolkit and the libcppa actor library. With Barista, we aim to provide an easy-to-use, extensible framework for constructing highly customizable concurrent (and/or distributed) networks for a variety of speech processing tasks. Each Barista network specifies a flow of data between simple actors, concurrent entities communicating by message passing, modeled after Kaldi tools. Leveraging the fast and reliable concurrency and distribution mechanisms provided by libcppa, Barista lets demanding speech processing tasks, such as real-time speech recognizers and complex training workflows, be scheduled and executed on parallel (and/or distributed) hardware. Barista is released under the Apache License v2.0. PMID:27610047
Goehring, Tobias; Bolner, Federico; Monaghan, Jessica J M; van Dijk, Bas; Zarowski, Andrzej; Bleeck, Stefan
2017-02-01
Speech understanding in noisy environments is still one of the major challenges for cochlear implant (CI) users in everyday life. We evaluated a speech enhancement algorithm based on neural networks (NNSE) for improving speech intelligibility in noise for CI users. The algorithm decomposes the noisy speech signal into time-frequency units, extracts a set of auditory-inspired features and feeds them to the neural network to produce an estimation of which frequency channels contain more perceptually important information (higher signal-to-noise ratio, SNR). This estimate is used to attenuate noise-dominated and retain speech-dominated CI channels for electrical stimulation, as in traditional n-of-m CI coding strategies. The proposed algorithm was evaluated by measuring the speech-in-noise performance of 14 CI users using three types of background noise. Two NNSE algorithms were compared: a speaker-dependent algorithm, that was trained on the target speaker used for testing, and a speaker-independent algorithm, that was trained on different speakers. Significant improvements in the intelligibility of speech in stationary and fluctuating noises were found relative to the unprocessed condition for the speaker-dependent algorithm in all noise types and for the speaker-independent algorithm in 2 out of 3 noise types. The NNSE algorithms used noise-specific neural networks that generalized to novel segments of the same noise type and worked over a range of SNRs. The proposed algorithm has the potential to improve the intelligibility of speech in noise for CI users while meeting the requirements of low computational complexity and processing delay for application in CI devices. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
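Schematically, the algorithm reduces to regressing per-channel SNR from auditory-inspired features and gating channels on the estimate; the toy sketch below uses a small MLP and random stand-in features (the architecture, feature dimensions, and threshold are all assumptions, not the authors' design).

```python
# Toy sketch of NN-based channel selection: estimate per-channel SNR from
# features, keep speech-dominated channels. Random data stands in for the
# auditory-inspired features and SNR labels.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)
n_obs, n_feats = 2000, 16
X = rng.standard_normal((n_obs, n_feats))       # features per (frame, channel)
snr = X @ rng.standard_normal(n_feats)          # stand-in SNR targets (dB)

net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X[:1500], snr[:1500])

snr_est = net.predict(X[1500:])
keep = snr_est > 0.0    # retain channels estimated as speech-dominated (n-of-m)
```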
Objective measurement of motor speech characteristics in the healthy pediatric population.
Wong, A W; Allegro, J; Tirado, Y; Chadha, N; Campisi, P
2011-12-01
To obtain objective measurements of motor speech characteristics in normal children, using a computer-based motor speech software program. Cross-sectional, observational design in a university-based ambulatory pediatric otolaryngology clinic. Participants included 112 subjects (54 females and 58 males) aged 4-18 years. Participants with previously diagnosed hearing loss, voice and motor disorders, and children unable to repeat a passage in English were excluded. Voice samples were recorded and analysed using the Motor Speech Profile (MSP) software (KayPENTAX, Lincoln Park, NJ). The MSP produced measures of diadochokinetics, second formant transition, intonation, and syllabic rates. Demographic data, including sex, age, and cigarette smoke exposure, were obtained. Normative data for several motor speech characteristics were derived for children ranging from age 4 to 18 years. A number of age-dependent changes were identified, including an increase in average diadochokinetic rate (p<0.001) and standard syllabic duration (p<0.001) with age. There were no identified differences in motor speech characteristics between males and females across the measured age range. Variations in fundamental frequency (F0) during speech did not change significantly with age for both males and females. To our knowledge, this is the first pediatric normative database for the MSP program. The MSP is suitable for testing children and can be used to study developmental changes in motor speech. The analysis demonstrated that males and females behave similarly and show the same relationship with age for the motor speech characteristics studied. This normative database will provide essential comparative data for future studies exploring alterations in motor speech that may occur with hearing, voice, and motor disorders and to assess the results of targeted therapies. Copyright © 2011 Elsevier Ireland Ltd. All rights reserved.
ERIC Educational Resources Information Center
Chen, Howard Hao-Jan
2011-01-01
Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…
Clinic-Laboratory Design Based on Function and Philosophy at Purdue University.
ERIC Educational Resources Information Center
Hanley, T. D.; Steer, M. D.
This report describes the design of a new clinic and laboratory for speech and hearing to accommodate three basic programs: (1) clinical training of undergraduate and graduate student majors, (2) services made available to the speech and hearing handicapped, and (3) research in speech pathology, audiology, psycho-acoustics, and…
ERIC Educational Resources Information Center
Camarata, Stephen; Yoder, Paul; Camarata, Mary
2006-01-01
Children with Down syndrome often display speech-comprehensibility and grammatical deficits beyond what would be predicted based upon general mental age. Historically, speech-comprehensibility has often been treated using traditional articulation therapy and oral-motor training so there may be little or no coordination of grammatical and…
Reading, Why Not? Literacy Skills in Children with Motor and Speech Impairments
ERIC Educational Resources Information Center
Ferreira, Janna; Ronnberg, Jerker; Gustafson, Stefan; Wengelin, Asa
2007-01-01
In this study, 12 participants with various levels of motor and speech deficits were tested to explore their reading skills in relation to letter knowledge, speech level, auditory discrimination, phonological awareness, language skills, digit span, and nonverbal IQ. Two subgroups, based on a median split of reading performance, are described: the…
Speech Deterioration in Amyotrophic Lateral Sclerosis (ALS) after Manifestation of Bulbar Symptoms
ERIC Educational Resources Information Center
Makkonen, Tanja; Ruottinen, Hanna; Puhto, Riitta; Helminen, Mika; Palmio, Johanna
2018-01-01
Background: The symptoms and their progression in amyotrophic lateral sclerosis (ALS) are typically studied after the diagnosis has been confirmed. However, many people with ALS already have severe dysarthria and loss of adequate speech at the time of diagnosis. Speech-and-language therapy interventions should be targeted timely based on…
Speech Pathology and Dialect Differences. Dialects and Educational Equity.
ERIC Educational Resources Information Center
Wolfram, Walt
Discussions in speech and language pathology often contain references to language differences and the ways these differences compare with speech and language disorders. There is ongoing research on the regional varieties of English, and within the past decade, information on social and ethnic variation in language has been accumulating. Based on…
Re-Evaluation and the Core Curriculum: How Will Speech Communication Fare?
ERIC Educational Resources Information Center
Madson, Lynda P.; Myers, Russel M.
This paper discusses the effects of reestablished or redefined core curriculum requirements on college speech communication programs, based on the survey responses of speech communication faculty members at 104 four-year schools. The following conclusions and recommendations are presented as a result of the survey: Although many employers consider…
Cascading Influences on the Production of Speech: Evidence from Articulation
ERIC Educational Resources Information Center
McMillan, Corey T.; Corley, Martin
2010-01-01
Recent investigations have supported the suggestion that phonological speech errors may reflect the simultaneous activation of more than one phonemic representation. This presents a challenge for speech error evidence which is based on the assumption of well-formedness, because we may continue to perceive well-formed errors, even when they are not…
Speech Development of Autistic Children by Interactive Computer Games
ERIC Educational Resources Information Center
Rahman, Mustafizur; Ferdous, S. M.; Ahmed, Syed Ishtiaque; Anwar, Anika
2011-01-01
Purpose: Speech disorder is one of the most common problems found with autistic children. The purpose of this paper is to investigate the introduction of computer-based interactive games along with the traditional therapies in order to help improve the speech of autistic children. Design/methodology/approach: From analysis of the works of Ivar…
Technology and Speech Training: An Affair to Remember.
ERIC Educational Resources Information Center
Levitt, Harry
1989-01-01
A history of speech training technology is presented, from the simple hand-held mirror to complicated computer-based systems and tactile devices, and subsequent papers in this theme issue are introduced. Both the advantages and problems of technological aids are addressed. Simplicity in the application and use of speech training aids is stressed.…
The Contributions of Speech Communication Scholarship to the Study of Terrorism: Review and Preview.
ERIC Educational Resources Information Center
Dowling, Ralph E.
Based on the premise that existing research into terrorism shows great promise, this paper notes that, despite widespread recognition of terrorism's communicative dimensions, few studies have been done from within the discipline of speech communication. The paper defines the discipline of speech communication and rhetorical studies, reviews the…
Lousada, M; Jesus, Luis M T; Hall, A; Joffe, V
2014-01-01
The effectiveness of two treatment approaches (phonological therapy and articulation therapy) for treatment of 14 children, aged 4;0-6;7 years, with phonologically based speech-sound disorder (SSD) has been previously analysed with severity outcome measures (percentage of consonants correct score, percentage occurrence of phonological processes and phonetic inventory). Considering that the ultimate goal of intervention for children with phonologically based SSD is to improve intelligibility, it is curious that intervention studies focusing on children's phonology do not routinely use intelligibility as an outcome measure. It is therefore important that the impact of interventions on speech intelligibility is explored. This paper investigates the effectiveness of the two treatment approaches (phonological therapy and articulation therapy) using intelligibility measures, both in single words and in continuous speech, as the primary outcome. Fourteen children with phonologically based SSD participated in the intervention. The children were randomly assigned to phonological therapy or articulation therapy (seven children in each group). Two assessment methods were used for measuring intelligibility: a word identification task (for single words) and a rating scale (for continuous speech). Twenty-one unfamiliar adults listened and judged the children's intelligibility. Reliability analyses showed overall high agreement between listeners across both methods. Significant improvements were noted in intelligibility in both single words (paired t(6)=4.409, p=0.005) and continuous speech (asymptotic Z=2.371, p=0.018) for the group receiving phonology therapy pre- to post-treatment, but no differences in intelligibility were found for those receiving the articulation therapy pre- to post-treatment, either for single words (paired t(6)=1.763, p=0.128) or continuous speech (asymptotic Z=1.442, p=0.149). Intelligibility measures were sensitive enough to show changes in the phonological therapy group but not in the articulation therapy group. These findings emphasize the importance of using intelligibility as an outcome measure to complement the results obtained with other severity measures when exploring the effectiveness of speech interventions. This study presents new evidence for the effectiveness of phonological therapy in improving intelligibility with children with SSD. © 2014 Royal College of Speech and Language Therapists.
LANDMARK-BASED SPEECH RECOGNITION: REPORT OF THE 2004 JOHNS HOPKINS SUMMER WORKSHOP.
Hasegawa-Johnson, Mark; Baker, James; Borys, Sarah; Chen, Ken; Coogan, Emily; Greenberg, Steven; Juneja, Amit; Kirchhoff, Katrin; Livescu, Karen; Mohan, Srividya; Muller, Jennifer; Sonmez, Kemal; Wang, Tianyu
2005-01-01
Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.
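The first stage shared by all three systems, SVM-based landmark detection, can be caricatured in a few lines; the features and labels below are synthetic stand-ins, and real systems train one detector per distinctive feature on multiframe acoustic observations.

```python
# Toy SVM landmark detector: frame features in, landmark probability out.
# Synthetic data stands in for multiframe acoustic observations.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
frames = rng.standard_normal((1000, 40))                # e.g., log-mel spectral slices
labels = (frames[:, 0] + frames[:, 1] > 0).astype(int)  # stand-in landmark labels

svm = SVC(kernel="rbf", probability=True).fit(frames[:800], labels[:800])
p_landmark = svm.predict_proba(frames[800:])[:, 1]      # fed to pronunciation models
```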
Vilela, Nadia; Barrozo, Tatiane Faria; Pagan-Neves, Luciana de Oliveira; Sanches, Seisse Gabriela Gandolfi; Wertzner, Haydée Fiszbein; Carvallo, Renata Mota Mamede
2016-02-01
To identify a cutoff value based on the Percentage of Consonants Correct-Revised index that could indicate the likelihood of a child with a speech-sound disorder also having a (central) auditory processing disorder. Language, audiological and (central) auditory processing evaluations were administered. The participants were 27 subjects with speech-sound disorders aged 7 to 10 years and 11 months who were divided into two different groups according to their (central) auditory processing evaluation results. When a (central) auditory processing disorder was present in association with a speech disorder, the children tended to have lower scores on phonological assessments. A greater severity of speech disorder was related to a greater probability of the child having a (central) auditory processing disorder. The use of a cutoff value for the Percentage of Consonants Correct-Revised index successfully distinguished between children with and without a (central) auditory processing disorder. The severity of speech-sound disorder in children was influenced by the presence of (central) auditory processing disorder. The attempt to identify a cutoff value based on a severity index was successful.
Neural Spike-Train Analyses of the Speech-Based Envelope Power Spectrum Model
Rallapalli, Varsha H.
2016-01-01
Diagnosing and treating hearing impairment is challenging because people with similar degrees of sensorineural hearing loss (SNHL) often have different speech-recognition abilities. The speech-based envelope power spectrum model (sEPSM) has demonstrated that the signal-to-noise ratio in the envelope domain (SNRenv), computed from a modulation filter bank, provides a robust speech-intelligibility measure across a wider range of degraded conditions than many long-standing models. In the sEPSM, noise (N) is assumed to: (a) reduce S + N envelope power by filling in dips within clean speech (S) and (b) introduce an envelope noise floor from intrinsic fluctuations in the noise itself. While the promise of SNRenv has been demonstrated for normal-hearing listeners, it has not been thoroughly extended to hearing-impaired listeners because of limited physiological knowledge of how SNHL affects speech-in-noise envelope coding relative to noise alone. Here, envelope coding of speech-in-noise stimuli was quantified from auditory-nerve model spike trains using shuffled correlograms, which were analyzed in the modulation-frequency domain to compute modulation-band estimates of neural SNRenv. Preliminary spike-train analyses show strong similarities to the sEPSM, demonstrating the feasibility of neural SNRenv computations. Results suggest that individual differences can occur based on differential degrees of outer- and inner-hair-cell dysfunction in listeners currently diagnosed into the single audiological SNHL category. The predicted acoustic-SNR dependence in individual differences suggests that the SNR-dependent rate of susceptibility could be an important metric in diagnosing individual differences. Future measurements of the neural SNRenv in animal studies with various forms of SNHL will provide valuable insight for understanding individual differences in speech-in-noise intelligibility.
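An acoustic-domain caricature of the SNRenv computation (Hilbert envelopes and a few modulation bands, rather than the paper's spike-train correlograms) might look like this; the band edges and the power floor are assumptions.

```python
# Crude acoustic SNRenv sketch: envelope power of speech-plus-noise vs.
# noise alone in a few modulation bands. The paper computes the neural
# version from spike-train correlograms instead.
import numpy as np
from scipy.signal import hilbert, butter, sosfilt

def env_power(x, fs, lo, hi):
    env = np.abs(hilbert(x))                # Hilbert envelope
    env = env - env.mean()                  # keep the modulation (AC) component
    sos = butter(2, [lo, hi], btype="band", fs=fs, output="sos")
    return np.mean(sosfilt(sos, env) ** 2)

def snr_env(speech_plus_noise, noise, fs,
            bands=((1, 2), (2, 4), (4, 8), (8, 16))):
    out = []
    for lo, hi in bands:
        p_n = env_power(noise, fs, lo, hi) + 1e-12
        p_sn = env_power(speech_plus_noise, fs, lo, hi)
        out.append(max(p_sn - p_n, 0.0) / p_n)   # per-band SNRenv
    return out
```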
Automatic evaluation of hypernasality based on a cleft palate speech database.
He, Ling; Zhang, Jing; Liu, Qi; Yin, Heng; Lech, Margaret; Huang, Yunzhi
2015-05-01
Hypernasality is one of the most typical characteristics of cleft palate (CP) speech, and the outcome of hypernasality grading decides the necessity of follow-up surgery. Currently, the evaluation of CP speech is carried out by experienced speech therapists; however, the result strongly depends on their clinical experience and subjective judgment. This work proposes an automatic evaluation system for hypernasality grading in CP speech. The database tested in this work was collected by the Hospital of Stomatology, Sichuan University, which has the largest number of CP patients in China. Based on the production process of hypernasality, source sound pulse and vocal tract filter features are presented. These features include pitch, the first and second energy-amplified frequency bands, cepstrum-based features, MFCCs, and short-time energy in sub-bands. These features, combined with a KNN classifier, are applied to automatically classify four grades of hypernasality: normal, mild, moderate and severe. The experimental results show that the proposed system achieves good performance; the classification rate across the four hypernasality grades reaches 80.4%. The sensitivity of the proposed features to gender is also discussed.
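The classification back end is conceptually simple; a sketch with MFCC-mean features and scikit-learn's KNN follows (the feature choice, k, and the random stand-in data are assumptions; the paper combines several source and vocal-tract features).

```python
# KNN hypernasality-grading sketch: grades 0-3 = normal/mild/moderate/severe.
# Random vectors stand in for real acoustic feature extraction.
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_mean(y, sr):
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

rng = np.random.default_rng(7)
X_train = rng.random((40, 13))              # one feature vector per recording
y_train = np.repeat([0, 1, 2, 3], 10)       # graded labels

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
grade = knn.predict(rng.random((1, 13)))    # grade for a new recording
```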
A Dynamic Compressive Gammachirp Auditory Filterbank
Irino, Toshio; Patterson, Roy D.
2008-01-01
It is now common to use knowledge about human auditory processing in the development of audio signal processors. Until recently, however, such systems were limited by their linearity. The auditory filter system is known to be level-dependent as evidenced by psychophysical data on masking, compression, and two-tone suppression. However, there were no analysis/synthesis schemes with nonlinear filterbanks. This paper describes such a scheme based on the compressive gammachirp (cGC) auditory filter. It was developed to extend the gammatone filter concept to accommodate the changes in psychophysical filter shape that are observed to occur with changes in stimulus level in simultaneous, tone-in-noise masking. In models of simultaneous noise masking, the temporal dynamics of the filtering can be ignored. Analysis/synthesis systems, however, are intended for use with speech sounds where the glottal cycle can be long with respect to auditory time constants, and so they require specification of the temporal dynamics of the auditory filter. In this paper, we describe a fast-acting level control circuit for the cGC filter and show how psychophysical data involving two-tone suppression and compression can be used to estimate the parameter values for this dynamic version of the cGC filter (referred to as the "dcGC" filter). One important advantage of analysis/synthesis systems with a dcGC filterbank is that they can inherit previously refined signal processing algorithms developed with conventional short-time Fourier transforms (STFTs) and linear filterbanks. PMID:19330044
Neuroscience-inspired computational systems for speech recognition under noisy conditions
NASA Astrophysics Data System (ADS)
Schafer, Phillip B.
Humans routinely recognize speech in challenging acoustic environments with background music, engine sounds, competing talkers, and other acoustic noise. However, today's automatic speech recognition (ASR) systems perform poorly in such environments. In this dissertation, I present novel methods for ASR designed to approach human-level performance by emulating the brain's processing of sounds. I exploit recent advances in auditory neuroscience to compute neuron-based representations of speech, and design novel methods for decoding these representations to produce word transcriptions. I begin by considering speech representations modeled on the spectrotemporal receptive fields of auditory neurons. These representations can be tuned to optimize a variety of objective functions, which characterize the response properties of a neural population. I propose an objective function that explicitly optimizes the noise invariance of the neural responses, and find that it gives improved performance on an ASR task in noise compared to other objectives. The method as a whole, however, fails to significantly close the performance gap with humans. I next consider speech representations that make use of spiking model neurons. The neurons in this method are feature detectors that selectively respond to spectrotemporal patterns within short time windows in speech. I consider a number of methods for training the response properties of the neurons. In particular, I present a method using linear support vector machines (SVMs) and show that this method produces spikes that are robust to additive noise. I compute the spectrotemporal receptive fields of the neurons for comparison with previous physiological results. To decode the spike-based speech representations, I propose two methods designed to work on isolated word recordings. The first method uses a classical ASR technique based on the hidden Markov model. The second method is a novel template-based recognition scheme that takes advantage of the neural representation's invariance in noise. The scheme centers on a speech similarity measure based on the longest common subsequence between spike sequences. The combined encoding and decoding scheme outperforms a benchmark system in extremely noisy acoustic conditions. Finally, I consider methods for decoding spike representations of continuous speech. To help guide the alignment of templates to words, I design a syllable detection scheme that robustly marks the locations of syllabic nuclei. The scheme combines SVM-based training with a peak selection algorithm designed to improve noise tolerance. By incorporating syllable information into the ASR system, I obtain strong recognition results in noisy conditions, although the performance in noiseless conditions is below the state of the art. The work presented here constitutes a novel approach to the problem of ASR that can be applied in the many challenging acoustic environments in which we use computer technologies today. The proposed spike-based processing methods can potentially be exploited in efficient hardware implementations and could significantly reduce the computational costs of ASR. The work also provides a framework for understanding the advantages of spike-based acoustic coding in the human brain.
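The longest-common-subsequence similarity at the heart of the template scheme is standard dynamic programming; a sketch follows, with spike sequences represented as lists of symbols (e.g., neuron IDs in firing order) and the length normalization being our choice rather than the dissertation's.

```python
# LCS-based similarity between two spike sequences (symbol lists).
def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1],
                                                               dp[i + 1][j])
    return dp[-1][-1]

def spike_similarity(a, b):
    return lcs_len(a, b) / max(len(a), len(b))   # 1.0 means identical ordering

print(spike_similarity([3, 1, 4, 1, 5], [3, 4, 1, 5, 9]))   # -> 0.8
```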
ERIC Educational Resources Information Center
Kurjan, Randy Moskowitz
2000-01-01
This article discusses the role of speech-language pathologists in serving preschool children with dysphagia. Current approaches to feeding and swallowing intervention, etiologies and programs, transdisciplinary teaming, developmental and feeding evaluation, and types of service delivery models (home-based and center-based) for preschool children…
Emotion to emotion speech conversion in phoneme level
NASA Astrophysics Data System (ADS)
Bulut, Murtaza; Yildirim, Serdar; Busso, Carlos; Lee, Chul Min; Kazemzadeh, Ebrahim; Lee, Sungbok; Narayanan, Shrikanth
2004-10-01
Having the ability to synthesize emotional speech can make human-machine interaction more natural in spoken dialogue management. This study investigates the effectiveness of phoneme-level prosodic and spectral modification for emotion-to-emotion speech conversion. The prosody modification is performed with the TD-PSOLA algorithm (Moulines and Charpentier, 1990). We also transform the spectral envelopes of source phonemes to match those of target phonemes using an LPC-based spectral transformation approach (Kain, 2001). Prosodic speech parameters (F0, duration, and energy) for target phonemes are estimated from statistics obtained from the analysis of an emotional speech database of happy, angry, sad, and neutral utterances collected from actors. Listening experiments conducted with native American English speakers indicate that modification of prosody only or spectrum only is not sufficient to elicit the targeted emotions. Simultaneous modification of both prosody and spectrum results in higher acceptance rates of target emotions, suggesting that modeling the spectral patterns that reflect the underlying articulation is as important as modeling prosody for synthesizing emotional speech of good quality. We are investigating suprasegmental-level modifications for further improvement in speech quality and expressiveness.
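TD-PSOLA itself is well documented; for readers unfamiliar with it, a deliberately naive time-scale version is sketched below, assuming pitch marks are already available (the hard part in practice). This illustrates the overlap-add principle only; it is not the authors' implementation.

    import numpy as np

    def psola_time_scale(x, marks, rate):
        # Naive TD-PSOLA duration change: rate > 1 shortens, rate < 1 lengthens.
        # `marks` are ascending glottal-pulse sample indices (assumed given).
        # Output grains stay one source period apart, so pitch is preserved.
        out = np.zeros(int(len(x) / rate) + len(x))
        src = 0.0                      # fractional position in the mark list
        out_mark = marks[0]            # output position of the next grain centre
        last = 0
        while src < len(marks) - 2:
            i = int(src)
            period = marks[i + 1] - marks[i]
            centre = marks[i]
            lo, hi = centre - period, centre + period
            if lo >= 0 and hi <= len(x):
                grain = x[lo:hi] * np.hanning(hi - lo)   # two-period grain
                pos = out_mark - period
                if pos >= 0 and pos + len(grain) <= len(out):
                    out[pos:pos + len(grain)] += grain
                    last = pos + len(grain)
            out_mark += period         # keep the source pitch period
            src += rate                # consume source marks faster or slower
        return out[:last]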
Chen, Yung-Yue
2018-05-08
Mobile devices are often used in daily life for speech communication. The speech quality of mobile devices is degraded by the environmental noise surrounding their users. Unfortunately, an effective background-noise reduction solution is not easily developed for this speech enhancement problem. For these reasons, a methodology is systematically proposed to eliminate the effects of background noise on the speech communication of mobile devices. This methodology integrates a dual microphone array with a background-noise elimination algorithm. The proposed background-noise elimination algorithm includes a whitening process, a speech modelling method and an H₂ estimator. Due to the adoption of the dual microphone array, a low-cost design can be obtained for the speech enhancement of mobile devices. Practical tests have proven that the proposed method is immune to random background noises, and that noiseless speech can be obtained after executing this denoising process.
The somatotopy of speech: Phonation and articulation in the human motor cortex
Brown, Steven; Laird, Angela R.; Pfordresher, Peter Q.; Thelen, Sarah M.; Turkeltaub, Peter; Liotti, Mario
2010-01-01
A sizable literature on the neuroimaging of speech production has reliably shown activations in the orofacial region of the primary motor cortex. These activations have invariably been interpreted as reflecting “mouth” functioning and thus articulation. We used functional magnetic resonance imaging to compare an overt speech task with tongue movement, lip movement, and vowel phonation. The results showed that the strongest motor activation for speech was the somatotopic larynx area of the motor cortex, thus reflecting the significant contribution of phonation to speech production. In order to analyze further the phonatory component of speech, we performed a voxel-based meta-analysis of neuroimaging studies of syllable-singing (11 studies) and compared the results with a previously-published meta-analysis of oral reading (11 studies), showing again a strong overlap in the larynx motor area. Overall, these findings highlight the under-recognized presence of phonation in imaging studies of speech production, and support the role of the larynx motor cortex in mediating the “melodicity” of speech. PMID:19162389
Davidow, Jason H
2014-01-01
Metronome-paced speech results in the elimination, or substantial reduction, of stuttering moments. The cause of fluency during this fluency-inducing condition is unknown. Several investigations have reported changes in speech pattern characteristics from a control condition to a metronome-paced speech condition, but failure to control speech rate between conditions limits our ability to determine if the changes were necessary for fluency. This study examined the effect of speech rate on several speech production variables during one-syllable-per-beat metronomic speech in order to determine changes that may be important for fluency during this fluency-inducing condition. Thirteen persons who stutter (PWS), aged 18-62 years, completed a series of speaking tasks. Several speech production variables were compared between conditions produced at different metronome beat rates, and between a control condition and a metronome-paced speech condition produced at a rate equal to the control condition. Vowel duration, voice onset time, pressure rise time and phonated intervals were significantly impacted by metronome beat rate. Voice onset time and the percentage of short (30-100 ms) phonated intervals significantly decreased from the control condition to the equivalent rate metronome-paced speech condition. A reduction in the percentage of short phonated intervals may be important for fluency during syllable-based metronome-paced speech for PWS. Future studies should continue examining the necessity of this reduction. In addition, speech rate must be controlled in future fluency-inducing condition studies, including neuroimaging investigations, in order for this research to make a substantial contribution to finding the fluency-inducing mechanism of fluency-inducing conditions. © 2013 Royal College of Speech and Language Therapists.
Moral Blow to the Marine Corps: The Repeal of the Don’t Ask Don’t Tell Policy
2011-04-08
else individually has to accept homosexual acts as acceptable (based on freedom of speech and freedom of religion). The DOD report stated that of...free speech rights being curtailed would lead them to withdraw their endorsement. The issue here is really freedom of speech. Chaplains already...effects regardless of the intent. Freedom of speech and religion are our most important rights as Americans. By trying to protect one group are we
Speech Perception and Short Term Memory Deficits in Persistent Developmental Speech Disorder
Kenney, Mary Kay; Barac-Cikoja, Dragana; Finnegan, Kimberly; Jeffries, Neal; Ludlow, Christy L.
2008-01-01
Children with developmental speech disorders may have additional deficits in speech perception and/or short-term memory. To determine whether these are only transient developmental delays that can accompany the disorder in childhood or persist as part of the speech disorder, adults with a persistent familial speech disorder were tested on speech perception and short-term memory. Nine adults with a persistent familial developmental speech disorder without language impairment were compared with 20 controls on tasks requiring the discrimination of fine acoustic cues for word identification and on measures of verbal and nonverbal short-term memory. Significant group differences were found in the slopes of the discrimination curves for first formant transitions for word identification with stop gaps of 40 and 20 ms with effect sizes of 1.60 and 1.56. Significant group differences also occurred on tests of nonverbal rhythm and tonal memory, and verbal short-term memory with effect sizes of 2.38, 1.56 and 1.73. No group differences occurred in the use of stop gap durations for word identification. Because frequency-based speech perception and short-term verbal and nonverbal memory deficits both persisted into adulthood in the speech-impaired adults, these deficits may be involved in the persistence of speech disorders without language impairment. PMID:15896836
Speech-recognition interfaces for music information retrieval
NASA Astrophysics Data System (ADS)
Goto, Masataka
2005-09-01
This paper describes two hands-free music information retrieval (MIR) systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. Our MIR-based jukebox systems employ two different speech-recognition interfaces for MIR, speech completion and speech spotter, which exploit intentionally controlled nonverbal speech information in original ways. The first is a music retrieval system with the speech-completion interface that is suitable for music stores and car-driving situations. When a user only remembers part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface that can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces. (Video clips: http://staff.aist.go.jp/m.goto/MIR/speech-if.html)
Developing a weighted measure of speech sound accuracy.
Preston, Jonathan L; Ramsdell, Heather L; Oller, D Kimbrough; Edwards, Mary Louise; Tobin, Stephen J
2011-02-01
To develop a system for numerically quantifying a speaker's phonetic accuracy through transcription-based measures. With a focus on normal and disordered speech in children, the authors describe a system for differentially weighting speech sound errors on the basis of various levels of phonetic accuracy using a Weighted Speech Sound Accuracy (WSSA) score. The authors then evaluate the reliability and validity of this measure. Phonetic transcriptions were analyzed from several samples of child speech, including preschoolers and young adolescents with and without speech sound disorders and typically developing toddlers. The new measure of phonetic accuracy was validated against existing measures, was used to discriminate typical and disordered speech production, and was evaluated to examine sensitivity to changes in phonetic accuracy over time. Reliability between transcribers and consistency of scores among different word sets and testing points are compared. Initial psychometric data indicate that WSSA scores correlate with other measures of phonetic accuracy as well as listeners' judgments of the severity of a child's speech disorder. The measure separates children with and without speech sound disorders and captures growth in phonetic accuracy in toddlers' speech over time. The measure correlates highly across transcribers, word lists, and testing points. Results provide preliminary support for the WSSA as a valid and reliable measure of phonetic accuracy in children's speech.
Real-time spectrum estimation–based dual-channel speech-enhancement algorithm for cochlear implant
2012-01-01
Background Improvement of the cochlear implant (CI) front-end signal acquisition is needed to increase speech recognition in noisy environments. To suppress directional noise, we introduce a speech-enhancement algorithm based on microphone array beamforming and spectral estimation. The experimental results indicate that this method is robust to directional mobile noise and strongly enhances the desired speech, thereby improving the performance of CI devices in a noisy environment. Methods The spectrum estimation and array beamforming methods were combined to suppress the ambient noise. The directivity coefficient was estimated in the noise-only intervals and updated to track the mobile noise. Results The proposed algorithm was realized in the CI speech strategy. For the actual parameters, we use a Maxflat filter to obtain fractional sampling points and a cepstrum method to differentiate desired-speech frames from noise frames. Broadband adjustment coefficients were added to compensate for the energy loss in the low-frequency band. Discussions The approximation of the directivity coefficient is tested and the errors are discussed. We also analyze the algorithm's constraints for noise estimation and distortion in CI processing. The performance of the proposed algorithm is analyzed and further compared with other prevalent methods. Conclusions A hardware platform was constructed for the experiments. The speech-enhancement results showed that our algorithm can suppress non-stationary noise with a high SNR improvement. Excellent performance of the proposed algorithm was obtained in the speech-enhancement experiments and mobile testing, and signal distortion results indicate that this algorithm is robust, with high SNR improvement and low speech distortion. PMID:23006896
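The cepstrum-based speech/noise frame decision mentioned in the Results is typically a voicing test: voiced speech produces a strong cepstral peak at the pitch quefrency, while broadband noise does not. A minimal sketch (the threshold and F0 search band are assumed tuning values, not figures from the paper):

    import numpy as np

    def is_speech_frame(frame, fs, f0_band=(60.0, 400.0), thresh=0.1):
        # Real cepstrum of the windowed frame; peak-pick in the pitch lag range.
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-12
        cep = np.fft.irfft(np.log(spec))
        q_lo = int(fs / f0_band[1])                     # highest F0 -> shortest lag
        q_hi = min(int(fs / f0_band[0]), len(cep) // 2)
        return np.max(np.abs(cep[q_lo:q_hi])) > thresh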
A nationwide survey of nonspeech oral motor exercise use: implications for evidence-based practice.
Lof, Gregory L; Watson, Maggie M
2008-07-01
A nationwide survey was conducted to determine if speech-language pathologists (SLPs) use nonspeech oral motor exercises (NSOMEs) to address children's speech sound problems. For those SLPs who used NSOMEs, the survey also identified (a) the types of NSOMEs used by the SLPs, (b) the SLPs' underlying beliefs about why they use NSOMEs, (c) clinicians' training for these exercises, (d) the application of NSOMEs across various clinical populations, and (e) specific tasks/procedures/tools that are used for intervention. A total of 2,000 surveys were mailed to a randomly selected subgroup of SLPs, obtained from the American Speech-Language-Hearing Association (ASHA) membership roster, who self-identified that they worked in various settings with children who have speech sound problems. The questions required answers using both forced-choice and Likert-type scales. The response rate was 27.5% (537 out of 2,000). Of these respondents, 85% reported using NSOMEs to deal with children's speech sound production problems. Those SLPs reported that the research literature supports the use of NSOMEs, and that they learned to use these techniques from continuing education events. They also stated that NSOMEs can help improve the speech of children from disparate etiologies, and that "warming up" and strengthening the articulators are important components of speech sound therapy. There are theoretical and research data that challenge both the use of NSOMEs and the efficacy of such exercises in resolving speech sound problems. SLPs need to follow the concepts of evidence-based practice in order to determine if these exercises are actually effective in bringing about changes in speech productions.
Phonological working memory and FOXP2.
Schulze, Katrin; Vargha-Khadem, Faraneh; Mishkin, Mortimer
2018-01-08
The discovery and description of the affected members of the KE family (aKE) initiated research on how genes enable the unique human trait of speech and language. Many aspects of this genetic influence on speech-related cognitive mechanisms are still elusive, e.g., whether and how cognitive processes not directly involved in speech production are affected. In the current study we investigated the effect of the FOXP2 mutation on working memory (WM). Half the members of the multigenerational KE family have an inherited speech-language disorder, characterised as a verbal and orofacial dyspraxia caused by a mutation of the FOXP2 gene. The core phenotype of the affected KE members (aKE) is a deficiency in repeating words, especially complex non-words, and in coordinating oromotor sequences generally. Execution of oromotor sequences and repetition of phonological sequences both require WM, but to date the aKE's memory ability in this domain has not been examined in detail. To do so we used a test series based on the Baddeley and Hitch WM model, which posits that the central executive (CE), important for planning and manipulating information, works in conjunction with two modality-specific components: the phonological loop (PL), specialized for processing speech-based information, and the visuospatial sketchpad (VSSP), dedicated to processing visual and spatial information. We compared WM performance related to CE, PL, and VSSP function in five aKE and 15 healthy controls (including three unaffected members of the KE family who do not have the FOXP2 mutation). The aKE scored significantly below this control group on the PL component, but not on the VSSP or CE components. Further, the aKE were impaired relative to the controls not only in motor (i.e., articulatory) output but also on the recognition-based PL subtest (word-list matching), which does not require speech production. These results suggest that the aKE's impaired phonological WM may be due to a defect in subvocal rehearsal of speech-based material, and that this defect may be due in turn to compromised speech-based representations. Copyright © 2017 The Authors. Published by Elsevier Ltd. All rights reserved.
Using DEDICOM for completely unsupervised part-of-speech tagging.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chew, Peter A.; Bader, Brett William; Rozovskaya, Alla
A standard and widespread approach to part-of-speech tagging is based on Hidden Markov Models (HMMs). An alternative approach, pioneered by Schuetze (1993), induces parts of speech from scratch using singular value decomposition (SVD). We introduce DEDICOM as an alternative to SVD for part-of-speech induction. DEDICOM retains the advantages of SVD in that it is completely unsupervised: no prior knowledge is required to induce either the tagset or the associations of terms with tags. However, unlike SVD, it is also fully compatible with the HMM framework, in that it can be used to estimate emission- and transition-probability matrices which can then be used as the input for an HMM. We apply the DEDICOM method to the CONLL corpus (CONLL 2000) and compare the output of DEDICOM to the part-of-speech tags given in the corpus, and find that the correlation (almost 0.5) is quite high. Using DEDICOM, we also estimate part-of-speech ambiguity for each term, and find that these estimates correlate highly with part-of-speech ambiguity as measured in the original corpus (around 0.88). Finally, we show how the output of DEDICOM can be evaluated and compared against the more familiar output of supervised HMM-based tagging.
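Fitting DEDICOM requires an iterative alternating procedure, but the Schuetze-style SVD baseline the abstract builds on is short enough to sketch. The snippet below is a generic illustration (vocabulary size, dimensionality, and cluster count are placeholders), not the authors' code:

    import numpy as np
    from collections import Counter
    from sklearn.cluster import KMeans

    def induce_tags(tokens, n_tags=10, n_dims=20, vocab_size=500):
        # Word-by-context co-occurrence matrix over left/right neighbours,
        # reduced by SVD and clustered into induced part-of-speech classes.
        vocab = [w for w, _ in Counter(tokens).most_common(vocab_size)]
        idx = {w: i for i, w in enumerate(vocab)}
        M = np.zeros((len(vocab), 2 * len(vocab)))
        for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
            if word not in idx:
                continue                # out-of-vocabulary words are skipped
            if prev in idx:
                M[idx[word], idx[prev]] += 1                 # left context
            if nxt in idx:
                M[idx[word], len(vocab) + idx[nxt]] += 1     # right context
        U, s, _ = np.linalg.svd(np.log1p(M), full_matrices=False)
        vectors = U[:, :n_dims] * s[:n_dims]
        labels = KMeans(n_clusters=n_tags, n_init=10).fit_predict(vectors)
        return dict(zip(vocab, labels))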
Schlosser, Ralf W; Koul, Rajinder K
2015-01-01
The purpose of this scoping review was to (a) map the research evidence on the effectiveness of augmentative and alternative communication (AAC) interventions using speech output technologies (e.g., speech-generating devices, mobile technologies with AAC-specific applications, talking word processors) for individuals with autism spectrum disorders, (b) identify gaps in the existing literature, and (c) posit directions for future research. Outcomes related to speech, language, and communication were considered. A total of 48 studies (47 single case experimental designs and 1 randomized control trial) involving 187 individuals were included. Results were reviewed in terms of three study groupings: (a) studies that evaluated the effectiveness of treatment packages involving speech output, (b) studies comparing one treatment package with speech output to other AAC modalities, and (c) studies comparing the presence with the absence of speech output. The state of the evidence base is discussed and several directions for future research are posited.
Speech disorders did not correlate with age at onset of Parkinson's disease.
Dias, Alice Estevo; Barbosa, Maira Tonidandel; Limongi, João Carlos Papaterra; Barbosa, Egberto Reis
2016-02-01
Speech disorders are common manifestations of Parkinson's disease. Objective To compare speech articulation in patients according to age at onset of the disease. Methods Fifty patients were divided into two groups: Group I consisted of 30 patients with age at onset between 40 and 55 years; Group II consisted of 20 patients with age at onset after 65 years. All patients were evaluated based on Unified Parkinson's Disease Rating Scale scores, the Hoehn and Yahr scale, and speech evaluation by perceptual and acoustical analysis. Results There was no statistically significant difference between the two groups regarding neurological involvement and speech characteristics. Correlation analysis indicated differences in speech articulation in relation to staging and axial scores of rigidity and bradykinesia for middle- and late-onset groups. Conclusions Impairment of speech articulation did not correlate with age at onset of the disease, but was positively related to disease duration and higher scores in both groups.
Science-based practice and the speech-language pathologist.
Lof, Gregory L
2011-06-01
Evidence-based practice (EBP) is a well-established concept in the field of speech-language pathology. However, evidence from research may not be the primary information that practitioners use to guide their treatment selection from the many potential options. There are various alternative therapy procedures that are strongly promoted, so clinicians must become skilled at distinguishing pseudoscience from science in order to determine whether a treatment is legitimate or actually quackery. In order to advance the use of EBP, clinicians can gather practice-based evidence (PBE) by using the scientific method. By adhering to the principles of science, speech-language pathologists can incorporate science-based practice (SBP) into all aspects of their clinical work.
Racism, Group Defamation, and Freedom of Speech on Campus.
ERIC Educational Resources Information Center
Laramee, William A.
1991-01-01
Examines racism on college campuses. Discusses group defamation and freedom of speech within that context. Concludes that, in this period of racial unrest and conflict, a reappraisal is in order of the delicate balance between protection from group and class defamation on the one hand and free speech on the other, using law as an important base from which to…
ERIC Educational Resources Information Center
Harrison, Emily; Wood, Clare; Holliman, Andrew J.; Vousden, Janet I.
2018-01-01
Despite empirical evidence of a relationship between sensitivity to speech rhythm and reading, there have been few studies that have examined the impact of rhythmic training on reading attainment, and no intervention study has focused on speech rhythm sensitivity specifically to enhance reading skills. Seventy-three typically developing 4- to…
ERIC Educational Resources Information Center
Shenker, Rosalee C.
2006-01-01
Background: There will always be a place for stuttering treatments designed to eliminate or reduce stuttered speech. When those treatments are required, direct speech measures of treatment process and outcome are needed in clinical practice. Aims: Based on the contents of published clinical trials of such treatments, three "core" measures of…
ERIC Educational Resources Information Center
Ludlow, Christy L.; Hoit, Jeannette; Kent, Raymond; Ramig, Lorraine O.; Shrivastav, Rahul; Strand, Edythe; Yorkston, Kathryn; Sapienza, Christine M.
2008-01-01
Purpose: To review the principles of neural plasticity and make recommendations for research on the neural bases for rehabilitation of neurogenic speech disorders. Method: A working group in speech motor control and disorders developed this report, which examines the potential relevance of basic research on the brain mechanisms involved in neural…
APPLICATION OF MOWRER'S AUTISTIC THEORY TO THE SPEECH HABILITATION OF MENTALLY RETARDED PUPILS.
ERIC Educational Resources Information Center
RIGRODSKY, S.; AND OTHERS
A SPEECH THERAPY METHOD FOR MENTAL RETARDATES WAS DEVELOPED AND EVALUATED. THE METHOD WAS BASED UPON THE ESTABLISHMENT OF FAVORABLE ASSOCIATIONS IN THE CHILD BETWEEN THE WORDS AND SOUNDS OF LANGUAGE AND THE PRODUCER OF THE LANGUAGE, USING STIMULUS-REWARD AND SITUATION-REWARD PRINCIPLES. TRADITIONAL METHODS OF SPEECH THERAPY WERE ADMINISTERED,…
ERIC Educational Resources Information Center
Vallin, Marlene Boyd
A study tested those theories upon which instruction and curriculum in speech and public communication are based. The study investigated the relationship of mode of delivery on ratings of individual speech characteristics as well as the relationship of these perceptions of effectiveness in a public communication setting. Twenty-four videotapes of…
ERIC Educational Resources Information Center
Van Sickle, Angela
2016-01-01
Clinical Question: Would individuals with acquired apraxia of speech (AOS) demonstrate greater improvements for speech production with an articulatory kinematic approach or a rate/rhythm approach? Method: EBP Intervention Comparison Review. Study Sources: ASHA journal, Google Scholar, PubMed, CINAHL Plus with Full Text, Web of Science, Ovid, and…
Basilakos, Alexandra; Rorden, Chris; Bonilha, Leonardo; Moser, Dana; Fridriksson, Julius
2015-01-01
Background and Purpose Acquired apraxia of speech (AOS) is a motor speech disorder caused by brain damage. AOS often co-occurs with aphasia, a language disorder in which patients may also demonstrate speech production errors. The overlap of speech production deficits in both disorders has raised questions regarding whether AOS emerges from a unique pattern of brain damage or as a sub-element of the aphasic syndrome. The purpose of this study was to determine whether speech production errors in AOS and aphasia are associated with distinctive patterns of brain injury. Methods Forty-three patients with a history of a single left-hemisphere stroke underwent comprehensive speech and language testing. The Apraxia of Speech Rating Scale was used to rate speech errors specific to AOS versus speech errors that can also be associated with AOS and/or aphasia. Localized brain damage was identified using structural MRI, and voxel-based lesion-impairment mapping was used to evaluate the relationship between speech errors specific to AOS, those that can occur in AOS and/or aphasia, and brain damage. Results The pattern of brain damage associated with AOS was most strongly associated with damage to cortical motor regions, with additional involvement of somatosensory areas. Speech production deficits that could be attributed to AOS and/or aphasia were associated with damage to the temporal lobe and the inferior pre-central frontal regions. Conclusion AOS likely occurs in conjunction with aphasia due to the proximity of the brain areas supporting speech and language, but the neurobiological substrate for each disorder differs. PMID:25908457
Anderson, Carolyn; Cohen, Wendy
2012-01-01
Children's speech sound development is assessed by comparing speech production with the typical development of speech sounds based on a child's age and developmental profile. One widely used method of sampling is to elicit a single-word sample along with connected speech. Words produced spontaneously rather than imitated may give a more accurate indication of a child's speech development. A published word complexity measure can be used to score later-developing speech sounds and more complex word patterns. There is a need for a screening word list that is quick to administer and reliably differentiates children with typically developing speech from children with patterns of delayed/disordered speech. To identify a short word list based on word complexity that could be spontaneously named by most typically developing children aged 3;00-5;05 years. One hundred and five children aged between 3;00 and 5;05 years from three local authority nursery schools took part in the study. Items from a published speech assessment were modified and extended to include a range of phonemic targets in different word positions in 78 monosyllabic and polysyllabic words. The 78 words were ranked both by phonemic/phonetic complexity as measured by word complexity and by ease of spontaneous production. The ten most complex words (hereafter Triage 10) were named spontaneously by more than 90% of the children. There was no significant difference between the complexity measures for five identified age groups when the data were examined in 6-month groups. A qualitative analysis revealed eight children with profiles of phonological delay or disorder. When these children were considered separately, there was a statistically significant difference (p < 0.005) between the mean word complexity measure of the group compared with the mean for the remaining children in all other age groups. The Triage 10 words reliably differentiated children with typically developing speech from those with delayed or disordered speech patterns. The Triage 10 words can be used as a screening tool for triage and general assessment and have the potential to monitor progress during intervention. Further testing is being undertaken to establish reliability with children referred to speech and language therapy services. © 2012 Royal College of Speech and Language Therapists.
Proceedings of the Second Joint Technology Workshop on Neural Networks and Fuzzy Logic, volume 1
NASA Technical Reports Server (NTRS)
Lea, Robert N. (Editor); Villarreal, James (Editor)
1991-01-01
Documented here are papers presented at the Neural Networks and Fuzzy Logic Workshop sponsored by NASA and the University of Houston, Clear Lake. The workshop was held April 11 to 13 at the Johnson Space Flight Center. Technical topics addressed included adaptive systems, learning algorithms, network architectures, vision, robotics, neurobiological connections, speech recognition and synthesis, fuzzy set theory and application, control and dynamics processing, space applications, fuzzy logic and neural network computers, approximate reasoning, and multiobject decision making.
NASA Astrophysics Data System (ADS)
Přibil, Jiří; Přibilová, Anna; Frollo, Ivan
2017-12-01
The paper focuses on two methods for evaluating the success of enhancement of speech signals recorded in an open-air magnetic resonance imager during phonation for 3D human vocal tract modeling. The first approach enables a comparison based on statistical analysis with ANOVA and hypothesis tests. The second method is based on classification by Gaussian mixture models (GMM). The performed experiments have confirmed that the proposed ANOVA and GMM classifiers for automatic evaluation of speech quality are functional and produce results fully comparable with the standard evaluation based on the listening test method.
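The GMM side of such an evaluation follows the usual recipe: fit one mixture per class, then assign a recording to the class with the higher total log-likelihood. A minimal scikit-learn sketch (the feature extraction and component count are assumptions, not the paper's configuration):

    from sklearn.mixture import GaussianMixture

    def train_gmms(features_by_class, n_components=4):
        # features_by_class maps a label (e.g. "enhanced", "noisy") to an
        # (n_frames, n_features) array of per-frame feature vectors.
        return {label: GaussianMixture(n_components=n_components,
                                       covariance_type="diag").fit(X)
                for label, X in features_by_class.items()}

    def classify(gmms, X):
        # Sum per-frame log-likelihoods and pick the best-scoring class.
        return max(gmms, key=lambda label: gmms[label].score_samples(X).sum())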
Automatic concept extraction from spoken medical reports.
Happe, André; Pouliquen, Bruno; Burgun, Anita; Cuggia, Marc; Le Beux, Pierre
2003-07-01
The objective of this project is to investigate methods whereby a combination of speech recognition and automated indexing methods substitute for current transcription and indexing practices. We based our study on existing speech recognition software programs and on NOMINDEX, a tool that extracts MeSH concepts from medical text in natural language and that is mainly based on a French medical lexicon and on the UMLS. For each document, the process consists of three steps: (1) dictation and digital audio recording, (2) speech recognition, (3) automatic indexing. The evaluation consisted of a comparison between the set of concepts extracted by NOMINDEX after the speech recognition phase and the set of keywords manually extracted from the initial document. The method was evaluated on a set of 28 patient discharge summaries extracted from the MENELAS corpus in French, corresponding to in-patients admitted for coronarography. The overall precision was 73% and the overall recall was 90%. Indexing errors were mainly due to word sense ambiguity and abbreviations. A specific issue was the fact that the standard French translation of MeSH terms lacks diacritics. A preliminary evaluation of speech recognition tools showed that the rate of accurate recognition was higher than 98%. Only 3% of the indexing errors were generated by inadequate speech recognition. We discuss several areas to focus on to improve this prototype. However, the very low rate of indexing errors due to speech recognition errors highlights the potential benefits of combining speech recognition techniques and automatic indexing.
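The reported 73% precision and 90% recall follow the standard set-based definitions, which for concept indexing reduce to a few lines (treating concepts, e.g. MeSH identifiers, as set elements):

    def precision_recall(extracted, reference):
        # `extracted`: concepts found automatically; `reference`: manual keywords.
        extracted, reference = set(extracted), set(reference)
        tp = len(extracted & reference)
        precision = tp / len(extracted) if extracted else 0.0
        recall = tp / len(reference) if reference else 0.0
        return precision, recall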
Speech-driven environmental control systems--a qualitative analysis of users' perceptions.
Judge, Simon; Robertson, Zoë; Hawley, Mark; Enderby, Pam
2009-05-01
To explore users' experiences and perceptions of speech-driven environmental control systems (SPECS) as part of a larger project aiming to develop a new SPECS. The motivation for this part of the project was to add to the evidence base for the use of SPECS and to determine the key design specifications for a new speech-driven system from a user's perspective. Semi-structured interviews were conducted with 12 users of SPECS from around the United Kingdom. These interviews were transcribed and analysed using a qualitative method based on framework analysis. Reliability is the main influence on the use of SPECS. All the participants gave examples of occasions when their speech-driven system was unreliable; in some instances, this unreliability was reported as not being a problem (e.g., for changing television channels); however, it was perceived as a problem for more safety-critical functions (e.g., opening a door). Reliability was cited by participants as the reason for using a switch-operated system as backup. Benefits of speech-driven systems focused on speech operation enabling access when other methods were not possible, quicker operation, and better aesthetics. Overall, there was a perception of increased independence from the use of speech-driven environmental control. In general, speech was considered a useful method of operating environmental controls by the participants interviewed; however, their perceptions regarding reliability often influenced their decision to have backup or alternative systems for certain functions.
NASA Astrophysics Data System (ADS)
Kattoju, Ravi Kiran; Barber, Daniel J.; Abich, Julian; Harris, Jonathan
2016-05-01
With the increasing necessity for intuitive Soldier-robot communication in military operations and advancements in interactive technologies, autonomous robots have transitioned from assistance tools to functional and operational teammates able to service an array of military operations. Despite improvements in gesture and speech recognition technologies, their effectiveness in supporting Soldier-robot communication is still uncertain. The purpose of the present study was to evaluate the performance of gesture and speech interface technologies to facilitate Soldier-robot communication during a spatial-navigation task with an autonomous robot. Semantically based gesture and speech spatial-navigation commands leveraged existing lexicons for visual and verbal communication from the U.S. Army field manual for visual signaling and a previously established Squad Level Vocabulary (SLV). Speech commands were recorded by a lapel microphone and a Microsoft Kinect, and classified by commercial off-the-shelf automatic speech recognition (ASR) software. Visual signals were captured and classified using a custom wireless gesture glove and software. Participants in the experiment commanded a robot to complete a simulated ISR mission in a scaled-down urban scenario by delivering a sequence of gesture and speech commands, both individually and simultaneously, to the robot. Performance and reliability of the gesture and speech hardware interfaces and recognition tools were analyzed and reported. Analysis of the experimental results demonstrated that the employed gesture technology has significant potential for enabling bidirectional Soldier-robot team dialogue, based on the high classification accuracy and minimal training required to perform gesture commands.
ERIC Educational Resources Information Center
Teten, Amy F.; DeVeney, Shari L.; Friehe, Mary J.
2016-01-01
Purpose: The purpose of this survey was to determine the self-perceived competence levels in voice disorders of practicing school-based speech-language pathologists (SLPs) and identify correlated variables. Method: Participants were 153 master's level, school-based SLPs with a Nebraska teaching certificate and/or licensure who completed a survey,…
School-Based Speech-Language Pathologists' Perspectives on Dysphagia Management in the Schools
ERIC Educational Resources Information Center
Bailey, Rita L.; Stoner, Julia B.; Angell, Maureen E.; Fetzer, Alycia
2008-01-01
Purpose: Although provision of dysphagia services is within the scope of practice of speech-language pathologists (SLPs), little is known about the perspectives of school-based SLPs in relation to these services. The purpose of this study was to examine SLPs' perspectives related to school-based management of students with dysphagia. Method: Focus…
Wang, Kun-Ching
2015-01-01
The classification of emotional speech is mostly considered in speech-related research on human-computer interaction (HCI). The purpose of this paper is to present a novel feature extraction method based on multi-resolution texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis for the characterization and classification of different emotions in a speech signal. The motivation is that emotions have different intensity values in different frequency bands. In terms of human visual perception, the texture properties of the multi-resolution spectrogram of emotional speech should form a good feature set for emotion classification in speech. Furthermore, multi-resolution texture analysis can give a clearer discrimination between emotions than uniform-resolution analysis. In order to provide high accuracy of emotional discrimination, especially in real life, an acoustic activity detection (AAD) algorithm must be applied in the MRTII-based feature extraction. Considering the presence of many blended emotions in real life, this paper makes use of two corpora of naturally-occurring dialogs recorded in real-life call centers. Compared with traditional Mel-scale Frequency Cepstral Coefficients (MFCC) and state-of-the-art features, the MRTII features also improve the correct classification rates of the proposed systems across different language databases. Experimental results show that the proposed MRTII-based feature information, inspired by human visual perception of the spectrogram image, can provide significant classification for real-life emotional recognition in speech. PMID:25594590
DOT National Transportation Integrated Search
2013-05-01
This project proposed to explore the potential of a user-friendly, natural speech based information inquiry application as one means of increasing public transit utilization. It suggested that a key challenge to expanding transit ridership is to e...
A Mis-recognized Medical Vocabulary Correction System for Speech-based Electronic Medical Record
Seo, Hwa Jeong; Kim, Ju Han; Sakabe, Nagamasa
2002-01-01
Speech recognition as an input tool for the electronic medical record (EMR) enables efficient data entry at the point of care. However, the recognition accuracy for medical vocabulary is much poorer than that for doctor-patient dialogue. We developed a mis-recognized medical vocabulary correction system based on syllable-by-syllable comparison of speech-recognized text against a medical vocabulary database. Using specialty medical vocabulary, the algorithm detects and corrects mis-recognized medical vocabulary in narrative text. Our preliminary evaluation showed 94% accuracy in mis-recognized medical vocabulary correction.
Zourmand, Alireza; Ting, Hua-Nong; Mirhassani, Seyed Mostafa
2013-03-01
Speech is one of the most prevalent communication mediums for humans. Identifying the gender of a child speaker based on his/her speech is crucial in telecommunication and speech therapy. This article investigates the use of fundamental and formant frequencies from sustained vowel phonation to distinguish the gender of Malay children aged between 7 and 12 years. The Euclidean minimum distance and a multilayer perceptron were used to classify the gender of 360 Malay children based on different combinations of fundamental and formant frequencies (F0, F1, F2, and F3). The Euclidean minimum distance with normalized frequency data achieved a classification accuracy of 79.44%, higher than that obtained with non-normalized frequency data. Age-dependent modeling was used to improve the accuracy of gender classification. The Euclidean distance method obtained an optimal classification accuracy of 84.17% across all age groups. The accuracy was further increased to 99.81% using a multilayer perceptron based on mel-frequency cepstral coefficients. Copyright © 2013 The Voice Foundation. Published by Mosby, Inc. All rights reserved.
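The Euclidean minimum-distance rule used here is a nearest-centroid classifier over the frequency features. A minimal sketch (the feature layout, rows of (F0, F1, F2, F3) values, is an assumption):

    import numpy as np

    def train_centroids(X, y):
        # One mean feature vector per class; with age-dependent modeling,
        # a separate set of centroids would be kept for each age group.
        return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

    def classify(x, centroids):
        # Assign the class whose centroid is nearest in Euclidean distance.
        return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))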
Envelope and intensity based prediction of psychoacoustic masking and speech intelligibility.
Biberger, Thomas; Ewert, Stephan D
2016-08-01
Human auditory perception and speech intelligibility have been successfully described based on the two concepts of spectral masking and amplitude modulation (AM) masking. The power-spectrum model (PSM) [Patterson and Moore (1986). Frequency Selectivity in Hearing, pp. 123-177] accounts for effects of spectral masking and critical bandwidth, while the envelope power-spectrum model (EPSM) [Ewert and Dau (2000). J. Acoust. Soc. Am. 108, 1181-1196] has been successfully applied to AM masking and discrimination. Both models extract the long-term (envelope) power to calculate signal-to-noise ratios (SNR). Recently, the EPSM has been applied to speech intelligibility (SI) considering the short-term envelope SNR on various time scales (multi-resolution speech-based envelope power-spectrum model; mr-sEPSM) to account for SI in fluctuating noise [Jørgensen, Ewert, and Dau (2013). J. Acoust. Soc. Am. 134, 436-446]. Here, a generalized auditory model is suggested combining the classical PSM and the mr-sEPSM to jointly account for psychoacoustics and speech intelligibility. The model was extended to consider the local AM depth in conditions with slowly varying signal levels, and the relative role of long-term and short-term SNR was assessed. The suggested generalized power-spectrum model is shown to account for a large variety of psychoacoustic data and to predict speech intelligibility in various types of background noise.
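The envelope-SNR idea can be illustrated in a single band, although the actual model uses gammatone and modulation filterbanks and short-term, multi-resolution segments. A grossly simplified sketch under those caveats:

    import numpy as np
    from scipy.signal import hilbert

    def envelope_power(x, fs, fcut=150.0):
        # AC-coupled Hilbert-envelope power, normalised by the squared DC
        # envelope, loosely following the EPSM definition (one band only).
        env = np.abs(hilbert(x))
        dc = env.mean()
        spec = np.fft.rfft(env - dc)
        spec[np.fft.rfftfreq(len(env), 1.0 / fs) > fcut] = 0.0  # crude low-pass
        env_lp = np.fft.irfft(spec, n=len(env))
        return np.mean(env_lp ** 2) / (dc ** 2 + 1e-12)

    def envelope_snr_db(mixture, noise, fs):
        # Modulation power attributable to speech relative to the noise alone.
        p_mix, p_noise = envelope_power(mixture, fs), envelope_power(noise, fs)
        return 10.0 * np.log10(max(p_mix - p_noise, 1e-6) / max(p_noise, 1e-6))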
The Temporal Prediction of Stress in Speech and Its Relation to Musical Beat Perception.
Beier, Eleonora J; Ferreira, Fernanda
2018-01-01
While rhythmic expectancies are thought to be at the base of beat perception in music, the extent to which stress patterns in speech are similarly represented and predicted during on-line language comprehension is debated. The temporal prediction of stress may be advantageous to speech processing, as stress patterns aid segmentation and mark new information in utterances. However, while linguistic stress patterns may be organized into hierarchical metrical structures similarly to musical meter, they do not typically present the same degree of periodicity. We review the theoretical background for the idea that stress patterns are predicted and address the following questions: First, what is the evidence that listeners can predict the temporal location of stress based on preceding rhythm? If they can, is it thanks to neural entrainment mechanisms similar to those utilized for musical beat perception? And lastly, what linguistic factors other than rhythm may account for the prediction of stress in natural speech? We conclude that while expectancies based on the periodic presentation of stresses are at play in some of the current literature, other processes are likely to affect the prediction of stress in more naturalistic, less isochronous speech. Specifically, aspects of prosody other than amplitude changes (e.g., intonation) as well as lexical, syntactic and information structural constraints on the realization of stress may all contribute to the probabilistic expectation of stress in speech.
Biofeedback in dysphonia - progress and challenges.
Amorim, Geová Oliveira de; Balata, Patrícia Maria Mendes; Vieira, Laís Guimarães; Moura, Thaís; Silva, Hilton Justino da
There is evidence that all the complex machinery involved in speech acts together with the auditory system, and that its adjustments can be altered. To present the evidence on biofeedback application for the treatment of vocal disorders, emphasizing muscle tension dysphonia. A systematic review was conducted in the Scielo, Lilacs, PubMed and Web of Science databases, using combinations of descriptors, and admitting as inclusion criteria: articles published in journals with an editorial committee, reporting cases or experimental or quasi-experimental research on the use of real-time biofeedback as an additional source of treatment monitoring for muscle tension dysphonia or for vocal training. Thirty-three articles were identified in the databases, and seven were included in the qualitative synthesis. Early studies of electromyographic biofeedback applied to speech therapy were promising and pointed to a new method that enabled good results in muscle tension dysphonia. Nonetheless, the discussion of the results lacked the physiological evidence that could serve as their basis. The search for such explanations has become a challenge for speech therapists, and determined two research lines: one dedicated to the improvement of the electromyographic biofeedback methodology for voice disorders, to reduce confounding variables, and the other dedicated to the research of neural processes involved in changing the muscle engram of normal and dysphonic patients. There is evidence that electromyographic biofeedback promotes changes in the neural networks responsible for speech, and can change behavior toward vocal emissions of good quality. Copyright © 2017 Associação Brasileira de Otorrinolaringologia e Cirurgia Cérvico-Facial. Published by Elsevier Editora Ltda. All rights reserved.
Audio Classification in Speech and Music: A Comparison between a Statistical and a Neural Approach
NASA Astrophysics Data System (ADS)
Bugatti, Alessandro; Flammini, Alessandra; Migliorati, Pierangelo
2002-12-01
We focus on the problem of audio classification into speech and music for multimedia applications. In particular, we present a comparison between two different techniques for speech/music discrimination. The first method is based on the zero-crossing rate and Bayesian classification. It is very simple from a computational point of view, and gives good results in the case of pure music or speech. The simulation results show that some performance degradation arises when the music segment also contains speech superimposed on music, or strong rhythmic components. To overcome these problems, we propose a second method that uses more features and is based on neural networks (specifically a multi-layer perceptron). In this case we obtain better performance, at the expense of a limited growth in computational complexity. In practice, the proposed neural network is simple to implement if a suitable polynomial is used as the activation function, and a real-time implementation is possible even on low-cost embedded systems.
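The first of the two discriminators is simple enough to sketch end to end: a zero-crossing-rate feature per frame and a two-class Gaussian likelihood decision. The class statistics are assumed to come from labelled training data; this is an illustration of the approach, not the authors' system:

    import numpy as np

    def zcr(frame):
        # Fraction of adjacent-sample sign changes in one frame.
        return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

    def log_gauss(x, mu, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

    def speech_or_music(frames, speech_stats, music_stats):
        # Naive Bayes over per-frame ZCR; `*_stats` are (mean, variance)
        # pairs estimated from labelled speech and music training frames.
        feats = np.array([zcr(f) for f in frames])
        ll_speech = log_gauss(feats, *speech_stats).sum()
        ll_music = log_gauss(feats, *music_stats).sum()
        return "speech" if ll_speech > ll_music else "music"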
Phonologically-based biomarkers for major depressive disorder
NASA Astrophysics Data System (ADS)
Trevino, Andrea Carolina; Quatieri, Thomas Francis; Malyska, Nicolas
2011-12-01
Of increasing importance in the civilian and military population is the recognition of major depressive disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we introduce vocal biomarkers that are derived automatically from phonologically-based measures of speech rate. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a 6-week duration. We find that dissecting average measures of speech rate into phone-specific characteristics and, in particular, combined phone-duration measures uncovers stronger relationships between speech rate and depression severity than global measures previously reported for a speech-rate biomarker. Results of this study are supported by correlation of our measures with depression severity and classification of depression state with these vocal measures. Our approach provides a general framework for analyzing individual symptom categories through phonological units, and supports the premise that speaking rate can be an indicator of psychomotor retardation severity.
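Dissecting a global speech-rate measure into phone-specific durations presupposes a phone-level alignment. A minimal sketch of both views, assuming the alignment is a list of (phone_label, start_sec, end_sec) tuples (an assumed format, not the study's data structure):

    from collections import defaultdict

    def phone_duration_means(alignment):
        # Mean duration per phone label: the phone-specific view.
        durations = defaultdict(list)
        for phone, start, end in alignment:
            durations[phone].append(end - start)
        return {p: sum(d) / len(d) for p, d in durations.items()}

    def speaking_rate(alignment):
        # Global phones-per-second rate: the coarse measure being refined.
        total = alignment[-1][2] - alignment[0][1]
        return len(alignment) / total if total > 0 else 0.0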
Vanryckeghem, Martine; Matthews, Michael; Xu, Peixin
2017-11-08
The aim of this study was to evaluate the usefulness of the Speech Situation Checklist for adults who stutter (SSC) in differentiating people who stutter (PWS) from speakers with no stutter based on self-reports of anxiety and speech disruption in communicative settings. The SSC's psychometric properties were examined, norms were established, and suggestions for treatment were formulated. The SSC was administered to 88 PWS seeking treatment and 209 speakers with no stutter between the ages of 18 and 62. The SSC consists of 2 sections investigating negative emotional reaction and speech disruption in 38 speech situations that are identical in both sections. The SSC-Emotional Reaction and SSC-Speech Disruption data show that these self-report tests differentiate PWS from speakers with no stutter to a statistically significant extent and have great discriminative value. The tests have good internal reliability, content, and construct validity. Age and gender do not affect the scores of the PWS. The SSC-Emotional Reaction and SSC-Speech Disruption seem to be powerful measures to investigate negative emotion and speech breakdown in an array of speech situations. The item scores give direction to treatment by suggesting speech situations that need a clinician's attention in terms of generalization and carry-over of within-clinic therapeutic gains into in vivo settings.
Alderson-Day, Ben; Fernyhough, Charles
2015-01-01
Inner speech is often reported to be a common and central part of inner experience, but its true prevalence is unclear. Many questionnaire-based measures appear to lack convergent validity and it has been claimed that they overestimate inner speech in comparison to experience sampling methods (which involve collecting data at random timepoints). The present study compared self-reporting of inner speech collected via a general questionnaire and experience sampling, using data from a custom-made smartphone app (Inner Life). Fifty-one university students completed a generalized self-report measure of inner speech (the Varieties of Inner Speech Questionnaire, VISQ) and responded to at least seven random alerts to report on incidences of inner speech over a 2-week period. Correlations and pairwise comparisons were used to compare generalized endorsements and randomly sampled scores for each VISQ subscale. Significant correlations were observed between general and randomly sampled measures for only two of the four VISQ subscales, and endorsements of inner speech with evaluative or motivational characteristics did not correlate at all across different measures. Endorsement of inner speech items was significantly lower for random sampling compared to generalized self-report, for all VISQ subscales. Exploratory analysis indicated that specific inner speech characteristics were also related to anxiety and future-oriented thinking. PMID:25964773
Sengupta, Ranit
2015-01-01
Despite recent progress in our understanding of sensorimotor integration in speech learning, a comprehensive framework to investigate its neural basis at behaviorally relevant timescales is lacking. Structural and functional imaging studies in humans have helped us identify brain networks that support speech but fail to capture the precise spatiotemporal coordination within those networks that takes place during speech learning. Here we use neuronal oscillations to investigate interactions within speech motor networks in a paradigm of speech motor adaptation under altered feedback, with continuous EEG recording, in which subjects adapted to real-time auditory perturbation of a target vowel sound. As subjects adapted to the task, concurrent changes were observed in the theta-gamma phase coherence during speech planning at several distinct scalp regions, consistent with the establishment of a feedforward map. In particular, there was an increase in coherence over the central region and a decrease over the fronto-temporal regions, revealing a redistribution of coherence over an interacting network of brain regions that may be a general feature of error-based motor learning. Our findings have implications for understanding the neural basis of speech motor learning and could elucidate how transient breakdown of neuronal communication within speech networks relates to speech disorders. PMID:25632078
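Cross-frequency measures of this kind are commonly computed from band-passed Hilbert phases and amplitudes. Below is a generic mean-vector-length estimate of theta-gamma phase-amplitude coupling (a Canolty-style stand-in, not the exact coherence measure used in the paper; the band edges are typical defaults):

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    def bandpass(x, fs, lo, hi, order=4):
        b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        return filtfilt(b, a, x)

    def theta_gamma_mvl(eeg, fs):
        # Gamma amplitude weighted by theta phase; the magnitude of the mean
        # complex vector grows when gamma bursts lock to a theta phase.
        theta_phase = np.angle(hilbert(bandpass(eeg, fs, 4.0, 8.0)))
        gamma_amp = np.abs(hilbert(bandpass(eeg, fs, 30.0, 80.0)))
        mvl = np.abs(np.mean(gamma_amp * np.exp(1j * theta_phase)))
        return mvl / (gamma_amp.mean() + 1e-12)   # amplitude-normalised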
Speech effort measurement and stuttering: investigating the chorus reading effect.
Ingham, Roger J; Warner, Allison; Byrd, Anne; Cotton, John
2006-06-01
The purpose of this study was to investigate the effect of chorus reading (CR) on speech effort during oral reading by adult stuttering speakers and control participants. The effect of a strategy highlighting speech effort measurement was also investigated. Twelve persistent stuttering (PS) adults and 12 normally fluent control participants completed 1-min base rate (BR; non-chorus) readings and CRs within a BR/CR/BR/CR/BR experimental design. Participants self-rated speech effort on a 9-point scale after each reading trial. Stuttering frequency, speech rate, and speech naturalness measures were also obtained. Instructions highlighting speech effort ratings during BR and CR phases were introduced after the first CR. CR improved speech effort ratings for the PS group, but the control group showed a reverse trend. The two groups' effort ratings did not differ significantly during CR phases, but the PS group's ratings were significantly poorer than the control group's during BR phases. The highlighting strategy did not significantly change effort ratings. The findings show that CR produces not only stutter-free and natural-sounding speech but also reliable reductions in speech effort. However, these reductions do not reach effort levels equivalent to those achieved by normally fluent speakers, thereby qualifying its use as a gold standard of achievable normal fluency for PS speakers.
Moharir, Madhavi; Barnett, Noel; Taras, Jillian; Cole, Martha; Ford-Jones, E Lee; Levin, Leo
2014-01-01
Failure to recognize and intervene early in speech and language delays can lead to multifaceted and potentially severe consequences for early child development and later literacy skills. While routine evaluations of speech and language during well-child visits are recommended, there is no standardized (office) approach to facilitate this. Furthermore, extensive wait times for speech and language pathology consultation represent valuable lost time for the child and family. Using speech and language expertise, and paediatric collaboration, key content for an office-based tool was developed. The tool aimed to help physicians achieve three main goals: early and accurate identification of speech and language delays as well as children at risk for literacy challenges; appropriate referral to speech and language services when required; and teaching and, thus, empowering parents to create rich and responsive language environments at home. Using this tool, in combination with the Canadian Paediatric Society’s Read, Speak, Sing and Grow Literacy Initiative, physicians will be better positioned to offer practical strategies to caregivers to enhance children’s speech and language capabilities. The tool represents a strategy to evaluate speech and language delays. It depicts age-specific linguistic/phonetic milestones and suggests interventions. The tool represents a practical interim treatment while the family is waiting for formal speech and language therapy consultation. PMID:24627648
The speech perception skills of children with and without speech sound disorder.
Hearnshaw, Stephanie; Baker, Elise; Munro, Natalie
To investigate whether Australian-English-speaking children with and without speech sound disorder (SSD) differ in their overall speech perception accuracy. Additionally, to investigate differences in the perception of specific phonemes and the association between speech perception and speech production skills. Twenty-five Australian-English-speaking children aged 48-60 months participated in this study. The SSD group included 12 children and the typically developing (TD) group included 13 children. Children completed routine speech and language assessments in addition to an experimental Australian-English lexical and phonetic judgement task based on Rvachew's Speech Assessment and Interactive Learning System (SAILS) program (Rvachew, 2009). This task included eight words across four word-initial phonemes: /k, ɹ, ʃ, s/. Children with SSD showed significantly poorer perceptual accuracy on the lexical and phonetic judgement task compared with TD peers. The phonemes /ɹ/ and /s/ were most frequently perceived in error across both groups. Additionally, the phoneme /ɹ/ was most commonly produced in error. There was also a positive correlation between overall speech perception and speech production scores. Children with SSD perceived speech less accurately than their typically developing peers. The findings suggest that an Australian-English variation of a lexical and phonetic judgement task similar to the SAILS program is promising and worthy of a larger-scale study. Copyright © 2017 Elsevier Inc. All rights reserved.
ERIC Educational Resources Information Center
Ferati, Mexhid Adem
2012-01-01
To access interactive systems, blind and visually impaired users can leverage their auditory senses by using non-speech sounds. The current structure of non-speech sounds, however, is geared toward conveying user interface operations (e.g., opening a file) rather than large theme-based information (e.g., a history passage) and, thus, is ill-suited…
ERIC Educational Resources Information Center
McCormack, Jane; McLeod, Sharynne; Harrison, Linda J.; McAllister, Lindy
2010-01-01
Purpose: To explore the application of the Activities and Participation component of the International Classification of Functioning, Disability and Health - Children and Youth (ICF-CY, World Health Organization, 2007) as a framework for investigating the perceived impact of speech impairment in childhood. Method: A 32-item questionnaire based on…
Soller, R. William; Chan, Philip; Higa, Amy
2012-01-01
Background Language barriers are significant hurdles for chronic disease patients in achieving self-management goals of therapy, particularly in settings where practitioners have limited nonprimary language skills and in-person translators may not always be available. S-MINDS© (Speaking Multilingual Interactive Natural Dialog System), a concept-based speech translation approach developed by Fluential Inc., can be applied to bridge the technologic gaps that limit the complexity and length of utterances that can be recognized and translated by devices, and it has the potential to broaden access to translation services in clinical settings. Methods The prototype translation system was evaluated prospectively for accuracy and patient satisfaction in underserved Spanish-speaking patients with diabetes and limited English proficiency and was compared with other commercial systems for robustness against degradation of translation due to ambient noise and speech patterns. Results Accuracy in translating the English–Spanish–English communication string from practitioner to device to patient to device to practitioner was high (97–100%). Patient satisfaction was high (means of 4.7–4.9 over four domains on a 5-point Likert scale). The device outperformed three other commercial speech translation systems in terms of accuracy during fast speech utterances, under quiet and noisy fluent speech conditions, and when challenged with various speech disfluencies (i.e., fillers, false starts, stutters, repairs, and long pauses). Conclusions A concept-based English–Spanish speech translation system has been successfully developed in prototype form that can accept long utterances (up to 20 words) with limited to no degradation in accuracy. The functionality of the system is superior to leading commercial speech translation systems. PMID:22920821
Strom, Mark A; Silverberg, Jonathan I
2016-01-01
To determine if eczema is associated with an increased risk of a speech disorder. We analyzed data on 354,416 children and adolescents from 19 US population-based cohorts: the 2003-2004 and 2007-2008 National Survey of Children's Health and 1997-2013 National Health Interview Survey, each prospective, questionnaire-based cohorts. In multivariate survey logistic regression models adjusting for sociodemographics and comorbid allergic disease, eczema was significantly associated with higher odds of speech disorder in 12 of 19 cohorts (P < .05). The pooled prevalence of speech disorder in children with eczema was 4.7% (95% CI 4.5%-5.0%) compared with 2.2% (95% CI 2.2%-2.3%) in children without eczema. In pooled multivariate analysis, eczema was associated with increased odds of speech disorder (aOR [95% CI] 1.81 [1.57-2.05], P < .001). In a single study assessing eczema severity, mild (1.36 [1.02-1.81], P = .03) and severe eczema (3.56 [1.70-7.48], P < .001) were associated with higher odds of speech disorder. History of eczema was associated with moderate (2.35 [1.34-4.10], P = .003) and severe (2.28 [1.11-4.72], P = .03) speech disorder. Finally, significant interactions were found, such that children with both eczema and attention deficit disorder with or without hyperactivity or sleep disturbance had vastly increased risk of speech disorders than either by itself. Pediatric eczema may be associated with increased risk of speech disorder. Further, prospective studies are needed to characterize the exact nature of this association. Copyright © 2016 Elsevier Inc. All rights reserved.
Tsai, Ching-Shu; Chen, Vincent Chin-Hung; Yang, Yao-Hsu; Hung, Tai-Hsin; Lu, Mong-Liang; Huang, Kuo-You; Gossop, Michael
2017-01-01
Manifestations of Mycoplasma pneumoniae infection can range from self-limiting upper respiratory symptoms to various neurological complications, including speech and language impairment. However, an association between Mycoplasma pneumoniae infection and speech and language impairment has not been sufficiently explored. In this study, we aimed to investigate the association between Mycoplasma pneumoniae infection and subsequent speech and language impairment in a nationwide population-based sample using Taiwan's National Health Insurance Research Database. We identified 5,406 children with Mycoplasma pneumoniae infection (International Classification of Disease, Revision 9, Clinical Modification code 4830) and compared them to 21,624 age-, sex-, urban- and income-matched controls on subsequent speech and language impairment. The mean follow-up interval for all subjects was 6.44 years (standard deviation = 2.42 years); the mean latency between the initial Mycoplasma pneumoniae infection and the presence of speech and language impairment was 1.96 years (standard deviation = 1.64 years). The results showed that Mycoplasma pneumoniae infection was significantly associated with a greater incidence of speech and language impairment [hazard ratio (HR) = 1.49, 95% CI: 1.23-1.80]. In addition, the hazard ratio of subsequent speech and language impairment was significantly increased in the groups younger than 6 years but not in the groups aged 6 years and over (HR = 1.43, 95% CI: 1.09-1.88 for the 0-3 years group; HR = 1.67, 95% CI: 1.25-2.23 for the 4-5 years group; HR = 1.14, 95% CI: 0.54-2.39 for the 6-7 years group; and HR = 0.83, 95% CI: 0.23-2.92 for the 8-18 years group). In conclusion, Mycoplasma pneumoniae infection is temporally associated with incident speech and language impairment.
Jeong, J-W; Sundaram, S; Behen, M E; Chugani, H T
2016-06-01
Pure speech delay is a common developmental disorder which, according to some estimates, affects 5%-8% of the population. Speech delay may not only be an isolated condition but can also be part of a broader condition such as global developmental delay. The present study investigated whether diffusion tensor imaging tractography-based connectome analysis can differentiate global developmental delay from speech delay in young children. Twelve children with pure speech delay (39.1 ± 20.9 months of age, 9 boys), 14 children with global developmental delay (39.3 ± 18.2 months of age, 12 boys), and 10 children with typical development (38.5 ± 20.5 months of age, 7 boys) underwent 3T DTI. For each subject, whole-brain connectome analysis was performed by using 116 cortical ROIs. The following network metrics were measured at individual regions: strength (number of the shortest paths), efficiency (measures of global and local integration), cluster coefficient (a measure of local aggregation), and betweenness (a measure of centrality). Compared with typical development, global and local efficiency were significantly reduced in both global developmental delay and speech delay (P < .0001). The nodal strength of the cognitive network is reduced in global developmental delay, whereas the nodal strength of the language network is reduced in speech delay. This finding yielded a high accuracy of >83% ± 4% in discriminating global developmental delay from speech delay. The network abnormalities identified in the present study may underlie the neurocognitive and behavioral consequences commonly identified in children with global developmental delay and speech delay. Further validation studies in larger samples are required. © 2016 by American Journal of Neuroradiology.
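As an aside for readers who want to reproduce this style of analysis, the graph metrics named above are standard and can be computed with NetworkX. The sketch below is illustrative only (it is not the authors' pipeline), and the comments flag where NetworkX's defaults diverge from weighted-connectome practice.

```python
import networkx as nx

def connectome_metrics(G):
    """Compute the network metrics named in the study on a weighted,
    undirected connectome graph whose nodes are cortical ROIs.
    Caveats: NetworkX's efficiency measures ignore edge weights, and
    betweenness treats 'weight' as a distance, so tract strengths would
    first need converting to lengths (e.g., 1/weight) to be faithful."""
    return {
        "strength": dict(G.degree(weight="weight")),      # nodal strength
        "global_efficiency": nx.global_efficiency(G),     # global integration
        "local_efficiency": nx.local_efficiency(G),       # local integration
        "clustering": nx.clustering(G, weight="weight"),  # local aggregation
        "betweenness": nx.betweenness_centrality(G),      # centrality
    }

# Usage sketch: G = nx.from_numpy_array(conn_mat) for a 116 x 116 matrix.
```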
Visual activity predicts auditory recovery from deafness after adult cochlear implantation.
Strelnikov, Kuzma; Rouger, Julien; Demonet, Jean-François; Lagleyre, Sebastien; Fraysse, Bernard; Deguine, Olivier; Barone, Pascal
2013-12-01
Modern cochlear implantation technologies allow deaf patients to understand auditory speech; however, the implants deliver only a coarse auditory input, and patients must use long-term adaptive processes to achieve coherent percepts. In adults with post-lingual deafness, the greatest progress in speech recovery is observed during the first year after cochlear implantation, but there is a large range of variability in the level of cochlear implant outcomes and in the temporal evolution of recovery. It has been proposed that when profoundly deaf subjects receive a cochlear implant, visual cross-modal reorganization of the brain is deleterious for auditory speech recovery. We tested this hypothesis in post-lingually deaf adults by analysing whether brain activity shortly after implantation correlated with the level of auditory recovery 6 months later. Based on brain activity induced by a speech-processing task, we found strong positive correlations in areas outside the auditory cortex. The highest positive correlations were found in the occipital cortex involved in visual processing, as well as in the posterior-temporal cortex known for audio-visual integration. The other area that positively correlated with auditory speech recovery was localized in the left inferior frontal area known for speech processing. Our results demonstrate that the functional level of the visual modality is related to the proficiency of auditory recovery. Based on the positive correlation of visual activity with auditory speech recovery, we suggest that the visual modality may facilitate the perception of a word's auditory counterpart in communicative situations. The link demonstrated between visual activity and auditory speech perception indicates that visuoauditory synergy is crucial for cross-modal plasticity and for fostering speech-comprehension recovery in adult cochlear-implanted deaf patients.
Use of intonation contours for speech recognition in noise by cochlear implant recipients.
Meister, Hartmut; Landwehr, Markus; Pyschny, Verena; Grugel, Linda; Walger, Martin
2011-05-01
The corruption of intonation contours has detrimental effects on sentence-based speech recognition in normal-hearing listeners [Binns and Culling (2007). J. Acoust. Soc. Am. 122, 1765-1776]. This paper examines whether this finding also applies to cochlear implant (CI) recipients. The subjects' F0 discrimination and speech perception in the presence of noise were measured, using sentences with regular and inverted F0 contours. The results revealed that speech recognition for regular contours was significantly better than for inverted contours. This difference was related to the subjects' F0 discrimination, providing further evidence that the perception of intonation patterns is important for CI-mediated speech recognition in noise.
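The abstract does not spell out how the inverted F0 contours were generated; a common construction is to reflect each voiced frame's F0 about the utterance mean so that rises become falls. A minimal NumPy sketch of that assumption (the function name and the log-domain choice are ours, not the paper's):

```python
import numpy as np

def invert_f0_contour(f0_hz):
    """Reflect an F0 contour about its mean in the log domain, turning
    pitch rises into falls. Unvoiced frames (F0 == 0) are left untouched."""
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0 > 0
    log_f0 = np.log(f0[voiced])
    inverted = f0.copy()
    inverted[voiced] = np.exp(2.0 * log_f0.mean() - log_f0)
    return inverted
```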
Speech recognition: Acoustic phonetic and lexical knowledge representation
NASA Astrophysics Data System (ADS)
Zue, V. W.
1983-02-01
The purpose of this program is to develop a speech data base facility under which the acoustic characteristics of speech sounds in various contexts can be studied conveniently; investigate the phonological properties of a large lexicon of, say, 10,000 words, and determine to what extent the phonotactic constraints can be utilized in speech recognition; study the acoustic cues that are used to mark word boundaries; develop a test bed in the form of a large-vocabulary, isolated-word recognition (IWR) system to study the interactions of acoustic, phonetic and lexical knowledge; and develop a limited continuous speech recognition system with the goal of recognizing any English word from its spelling in order to assess the interactions of higher-level knowledge sources.
Reflective practice in speech-language pathology: a scoping review.
Caty, Marie-Ève; Kinsella, Elizabeth Anne; Doyle, Philip C
2015-01-01
Within the profession of speech-language pathology, there is limited information related to both conceptual and empirical perspectives of reflective practice. This review considers the key concepts and approaches to reflection and reflective practice that have been published in the speech-language pathology literature in order to identify potential research gaps. A scoping review was conducted using Arksey and O'Malley's (2005) framework. A total of 42 relevant publications were selected for review. The resulting literature mapping revealed that scholarship on reflection and reflective practice in speech-language pathology is limited. Our conceptual mapping pointed to the use of both multiple and generic terms and a lack of conceptual clarity about reflection and reflective practice in speech-language pathology. Two predominant approaches to reflection and reflective practice were identified: written reflection and reflective discussion. Both educational and clinical practice contexts were associated with reflection and reflective practice. Publications reviewed were primarily concerned with reflection and reflective practice by novices and expert practitioners. Based on this review, we posit that there is considerable need for conceptual and empirical work with a goal to support university- and work-based educational initiatives involving reflection and reflective practice in speech-language pathology.
Tone classification of syllable-segmented Thai speech based on multilayer perceptron
NASA Astrophysics Data System (ADS)
Satravaha, Nuttavudh; Klinkhachorn, Powsiri; Lass, Norman
2002-05-01
Thai is a monosyllabic tonal language that uses tone to convey lexical information about the meaning of a syllable. Thus to completely recognize a spoken Thai syllable, a speech recognition system not only has to recognize a base syllable but also must correctly identify a tone. Hence, tone classification of Thai speech is an essential part of a Thai speech recognition system. Thai has five distinctive tones ("mid," "low," "falling," "high," and "rising") and each tone is represented by a single fundamental frequency (F0) pattern. However, several factors, including tonal coarticulation, stress, intonation, and speaker variability, affect the F0 pattern of a syllable in continuous Thai speech. In this study, an efficient method for tone classification of syllable-segmented Thai speech, which incorporates the effects of tonal coarticulation, stress, and intonation, as well as a method to perform automatic syllable segmentation, were developed. Acoustic parameters were used as the main discriminating parameters. The F0 contour of a segmented syllable was normalized by using a z-score transformation before being presented to a tone classifier. The proposed system was evaluated on 920 test utterances spoken by 8 speakers. A recognition rate of 91.36% was achieved by the proposed system.
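The z-score normalization step is straightforward to make concrete. A small NumPy sketch (the function name and the toy contour are illustrative, not the paper's code):

```python
import numpy as np

def zscore_f0(f0_contour):
    """Z-score an F0 contour so that tone shape, rather than a speaker's
    absolute pitch range, drives the classifier."""
    f0 = np.asarray(f0_contour, dtype=float)
    return (f0 - f0.mean()) / f0.std()

# A falling-tone-like contour becomes a zero-mean, unit-variance pattern
# suitable as input to a multilayer perceptron tone classifier.
features = zscore_f0([220.0, 215.0, 205.0, 190.0, 170.0])
```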
Steel, Joanne; Ferguson, Alison; Spencer, Elizabeth; Togher, Leanne
2013-01-01
To investigate speech pathologists' current practice with adults who are in post-traumatic amnesia (PTA). Speech pathologists with experience of adults in PTA were invited to take part in an online survey through Australian professional email/internet-based interest groups. Forty-five speech pathologists responded to the online survey. The majority of respondents (78%) reported using informal, observational assessment methods commencing at initial contact with people in PTA or when patients' level of alertness allowed and initiating formal assessment on emergence from PTA. Seven respondents (19%) reported undertaking no assessment during PTA. Clinicians described using a range of techniques to monitor cognitive-communication during PTA, including static, dynamic, functional and impairment-based methods. The study confirmed that speech pathologists have a key role in the multidisciplinary team caring for the person in PTA, especially with family education and facilitating interactions with the rehabilitation team and family. Decision-making around timing and means of assessment of cognitive-communication during PTA appeared primarily reliant on speech pathologists' professional experience and the culture of their workplace. The findings support the need for further research into the nature of cognitive-communication disorder and resolution over this period.
Legal Issues and Computer Use by School-Based Audiologists and Speech-Language Pathologists.
ERIC Educational Resources Information Center
Wynne, Michael K.; Hurst, David S.
1995-01-01
This article reviews ethical and legal issues regarding school-based integration and application of technologies, particularly when used by speech-language pathologists and audiologists. Four issues are addressed: (1) software copyright and licensed use; (2) information access and the right to privacy; (3) computer-assisted or…
Enhancing Computer-Based Lessons for Effective Speech Education.
ERIC Educational Resources Information Center
Hemphill, Michael R.; Standerfer, Christina C.
1987-01-01
Assesses the advantages of computer-based instruction on speech education. Concludes that, while it offers tremendous flexibility to the instructor--especially in dynamic lesson design, feedback, graphics, and artificial intelligence--there is no inherent advantage to the use of computer technology in the classroom, unless the student interacts…
Evidence-Based Practice in Communication Disorders: Progress Not Perfection
ERIC Educational Resources Information Center
Kent, Ray D.
2006-01-01
Purpose: This commentary is written in response to a companion paper by Nan Bernstein Ratner ("Evidence-Based Practice: An Examination of its Ramifications for the Practice of Speech-Language Pathology"). Method: The comments reflect my experience as Vice President for Research and Technology of the American Speech-Language-Hearing Association…
Speech-Language Pathologists' and Teachers' Perceptions of Classroom-Based Interventions.
ERIC Educational Resources Information Center
Beck, Ann R.; Dennis, Marcia
1997-01-01
Speech-language pathologists (N=21) and teachers (N=54) were surveyed regarding their perceptions of classroom-based interventions. The two groups agreed about the primary advantages and disadvantages of most interventions, the primary areas of difference being classroom management and ease of data collection. Other findings indicated few…
Improved Speech Coding Based on Open-Loop Parameter Estimation
NASA Technical Reports Server (NTRS)
Juang, Jer-Nan; Chen, Ya-Chin; Longman, Richard W.
2000-01-01
A nonlinear optimization algorithm for linear predictive speech coding was developed previously that not only optimizes the linear model coefficients for the open-loop predictor, but performs the optimization including the effects of quantization of the transmitted residual. It also simultaneously optimizes the quantization levels used for each speech segment. In this paper, we present an improved method for initialization of this nonlinear algorithm and demonstrate substantial improvements in performance. In addition, the new procedure produces monotonically improving speech quality with increasing numbers of bits used in the transmitted error residual. Examples of speech encoding and decoding are given for 8 speech segments, and signal-to-noise ratios as high as 47 dB are produced. As in typical linear predictive coding, the optimization is done on the open-loop speech analysis model. Here we demonstrate that minimizing the error of the closed-loop speech reconstruction, instead of the simpler open-loop optimization, is likely to produce negligible improvement in speech quality. The examples suggest that the algorithm here is close to giving the best performance obtainable from a linear model, for the chosen order with the chosen number of bits for the codebook.
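For orientation, the open-loop analysis that this work builds on is conventional linear predictive coding: fit predictor coefficients to each frame, then quantize the prediction residual. The sketch below implements the standard autocorrelation method with the Levinson-Durbin recursion; it is a baseline illustration under our own naming, not the paper's joint coefficient/quantizer optimization.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for predictor
    coefficients a (a[0] == 1) and the final prediction-error energy."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                # reflection coefficient
        a_new = a.copy()
        for j in range(1, i):
            a_new[j] += k * a[i - j]
        a_new[i] = k
        a, err = a_new, err * (1.0 - k * k)
    return a, err

def open_loop_lpc(frame, order=10):
    """Open-loop analysis: fit the predictor to the (non-silent) frame and
    form the residual e[n] = sum_k a[k] s[n-k] that a coder would quantize."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a, _ = levinson_durbin(r[:order + 1], order)
    residual = np.convolve(frame, a)[:len(frame)]
    return a, residual
```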
Obstructive sleep apnea severity estimation: Fusion of speech-based systems.
Ben Or, D; Dafna, E; Tarasiuk, A; Zigel, Y
2016-08-01
Obstructive sleep apnea (OSA) is a common sleep-related breathing disorder. Previous studies associated OSA with anatomical abnormalities of the upper respiratory tract that may be reflected in the acoustic characteristics of speech. We tested the hypothesis that the speech signal carries essential information that can assist in early assessment of OSA severity by estimating the apnea-hypopnea index (AHI). 198 men referred to routine polysomnography (PSG) were recorded shortly prior to sleep onset while reading a one-minute speech protocol. The different parts of the speech recordings, i.e., sustained vowels, short-time frames of fluent speech, and the speech recording as a whole, underwent separate analyses using sustained-vowel features, short-term features, and long-term features, respectively. Applying support vector regression and regression trees, these features were used to estimate AHI. The fusion of the outputs of the three subsystems resulted in a diagnostic agreement of 67.3% between the speech-estimated AHI and the PSG-determined AHI, and an absolute error rate of 10.8 events/hr. Speech signal analysis may assist in the estimation of AHI, thus allowing the development of a noninvasive tool for OSA screening.
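To make the fusion idea concrete, here is a hedged scikit-learn sketch: one regressor per feature stream, with a simple average standing in for the paper's (unspecified here) fusion rule. The data are synthetic and the feature dimensions are invented.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical feature matrices for the three subsystems (sustained-vowel,
# short-term, long-term features) and PSG-derived AHI targets.
rng = np.random.default_rng(0)
X_vowel, X_short, X_long = (rng.normal(size=(198, d)) for d in (12, 20, 8))
ahi = rng.gamma(2.0, 10.0, size=198)

# Train one support vector regressor per feature stream.
models = [SVR(kernel="rbf").fit(X, ahi) for X in (X_vowel, X_short, X_long)]

# Fuse the three AHI estimates by averaging (in practice this would be
# evaluated on held-out recordings, not the training set).
estimates = np.column_stack([m.predict(X)
                             for m, X in zip(models, (X_vowel, X_short, X_long))])
ahi_fused = estimates.mean(axis=1)
```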
Speech and language development in 2-year-old children with cerebral palsy.
Hustad, Katherine C; Allison, Kristen; McFadd, Emily; Riehle, Katherine
2014-06-01
We examined early speech and language development in children who had cerebral palsy. Questions addressed whether children could be classified into early profile groups on the basis of speech and language skills and whether there were differences on selected speech and language measures among groups. Speech and language assessments were completed on 27 children with CP who were between the ages of 24 and 30 months (mean age 27.1 months; SD 1.8). We examined several measures of expressive and receptive language, along with speech intelligibility. Two-step cluster analysis was used to identify homogeneous groups of children based on their performance on the seven dependent variables characterizing speech and language performance. Three groups of children identified were those not yet talking (44% of the sample); those whose talking abilities appeared to be emerging (41% of the sample); and those who were established talkers (15% of the sample). Group differences were evident on all variables except receptive language skills. 85% of 2-year-old children with CP in this study had clinical speech and/or language delays relative to age expectations. Findings suggest that children with CP should receive speech and language assessment and treatment at or before 2 years of age.
Investigation of habitual pitch during free play activities for preschool-aged children.
Chen, Yang; Kimelman, Mikael D Z; Micco, Katie
2009-01-01
This study was designed to compare the habitual pitch measured in two different speech activities (a free play activity and a traditionally used structured speech activity) for normally developing preschool-aged children, to explore to what extent preschoolers vary their vocal pitch among different speech environments. Habitual pitch measurements were conducted for 10 normally developing children (2 boys, 8 girls) between the ages of 31 months and 71 months during two different activities: (1) free play; and (2) structured speech. Speech samples were recorded using a throat microphone connected to a wireless transmitter in both activities. The habitual pitch (in Hz) was measured for all collected speech samples using voice analysis software (Real-Time Pitch). Significantly higher habitual pitch was found during free play than during structured speech activities. In addition, no significant difference in habitual pitch was found across a variety of structured speech activities. The findings suggest that the vocal usage of preschoolers is more effortful during free play than during structured activities. It is recommended that a comprehensive evaluation of a young child's voice be based on speech/voice samples collected from both free play and structured activities.
A Deep Ensemble Learning Method for Monaural Speech Separation.
Zhang, Xiao-Lei; Wang, DeLiang
2016-03-01
Monaural speech separation is a fundamental problem in robust speech processing. Recently, deep neural network (DNN)-based speech separation methods, which predict either clean speech or an ideal time-frequency mask, have demonstrated remarkable performance improvement. However, a single DNN with a given window length does not leverage contextual information sufficiently, and the differences between the two optimization objectives are not well understood. In this paper, we propose a deep ensemble method, named multicontext networks, to address monaural speech separation. The first multicontext network averages the outputs of multiple DNNs whose inputs employ different window lengths. The second multicontext network is a stack of multiple DNNs. Each DNN in a module of the stack takes the concatenation of original acoustic features and expansion of the soft output of the lower module as its input, and predicts the ratio mask of the target speaker; the DNNs in the same module employ different contexts. We have conducted extensive experiments with three speech corpora. The results demonstrate the effectiveness of the proposed method. We have also compared the two optimization objectives systematically and found that predicting the ideal time-frequency mask is more efficient in utilizing clean training speech, while predicting clean speech is less sensitive to SNR variations.
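A minimal sketch of the first multicontext network's ensembling step, under our own assumptions about array shapes and a generic `model.predict` interface (the DNN architectures, features, and training procedure of the paper are not reproduced):

```python
import numpy as np

def multicontext_average(frames, models, half_windows):
    """Average the time-frequency mask estimates of several DNNs whose
    inputs stack different amounts of temporal context (the first
    multicontext network). `frames` is (n_frames, n_features)."""
    masks = []
    for model, w in zip(models, half_windows):
        # Stack frames t-w .. t+w for every t (np.roll wraps at the edges,
        # a simplification; real systems pad instead).
        ctx = np.concatenate([np.roll(frames, -s, axis=0)
                              for s in range(-w, w + 1)], axis=1)
        masks.append(model.predict(ctx))  # one mask estimate per DNN
    return np.mean(masks, axis=0)         # ensemble average
```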
Mefferd, Antje S.
2016-01-01
The degree of speech movement pattern consistency can provide information about speech motor control. Although tongue motor control is particularly important because of the tongue's primary contribution to the speech acoustic signal, capturing tongue movements during speech remains difficult and costly. This study sought to determine if formant movements could be used to estimate tongue movement pattern consistency indirectly. Two age groups (seven young adults and seven older adults) and six speech conditions (typical, slow, loud, clear, fast, bite block speech) were selected to elicit an age- and task-dependent performance range in tongue movement pattern consistency. Kinematic and acoustic spatiotemporal indexes (STI) were calculated based on sentence-length tongue movement and formant movement signals, respectively. Kinematic and acoustic STI values showed strong associations across talkers and moderate to strong associations for each talker across speech tasks; although, in cases where task-related tongue motor performance changes were relatively small, the acoustic STI values were poorly associated with kinematic STI values. These findings suggest that, depending on the sensitivity needs, formant movement pattern consistency could be used in lieu of direct kinematic analysis to indirectly examine speech motor control. PMID:27908069
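The spatiotemporal index (STI) used here follows a standard recipe: linearly time-normalize repeated trajectories to a common length, amplitude-normalize each to zero mean and unit variance, and sum the pointwise standard deviations across repetitions. A sketch, assuming each trial is a 1-D array (names are ours):

```python
import numpy as np
from scipy.interpolate import interp1d

def spatiotemporal_index(trials, n_points=50):
    """STI: lower values indicate more consistent movement patterns.
    `trials` is a list of 1-D trajectories (e.g., repeated productions of
    one sentence); 50 points follows the classic 2%-interval convention."""
    normed = []
    for y in trials:
        t = np.linspace(0.0, 1.0, len(y))
        y_rs = interp1d(t, y)(np.linspace(0.0, 1.0, n_points))  # time-normalize
        normed.append((y_rs - y_rs.mean()) / y_rs.std())        # amplitude-normalize
    return float(np.sum(np.std(np.vstack(normed), axis=0)))
```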
Speech deficits in serious mental illness: a cognitive resource issue?
Cohen, Alex S; McGovern, Jessica E; Dinzeo, Thomas J; Covington, Michael A
2014-12-01
Speech deficits, notably those involved in psychomotor retardation, blunted affect, alogia and poverty of content of speech, are pronounced in a wide range of serious mental illnesses (e.g., schizophrenia, unipolar depression, bipolar disorders). The present project evaluated the degree to which these deficits manifest as a function of cognitive resource limitations. We examined natural speech from 52 patients meeting criteria for serious mental illness (SMI; i.e., severe functional deficits with a concomitant diagnosis of schizophrenia, unipolar and/or bipolar affective disorders) and 30 non-psychiatric controls using a range of objective, computer-based measures tapping speech production ("alogia"), variability ("blunted vocal affect") and content ("poverty of content of speech"). Subjects produced natural speech during a baseline condition and while engaging in an experimentally manipulated, cognitively effortful task. For correlational analysis, cognitive ability was measured using a standardized battery. Generally speaking, speech deficits did not differ as a function of SMI diagnosis. However, every speech production and content measure was significantly abnormal in the SMI versus control groups. Speech variability measures generally did not differ between groups. For both patients and controls as a group, speech during the cognitively effortful task was sparser and less rich in content. Relative to controls, patients were abnormal under cognitive load with respect only to average pause length. Correlations between the speech variables and cognitive ability were significant only for this same variable: average pause length. Results suggest that certain speech deficits, notably involving pause length, may manifest as a function of cognitive resource limitations. Implications for treatment, research and assessment are discussed. Copyright © 2014 Elsevier B.V. All rights reserved.
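Average pause length, the one measure that tracked cognitive load here, can be approximated from a recording with a simple energy-based silence detector. The sketch below is a generic stand-in rather than the study's computer-based measure; the frame sizes and the -40 dB threshold are assumptions:

```python
import numpy as np

def average_pause_length(wave, sr, frame_ms=25, hop_ms=10, db_floor=-40.0):
    """Return the mean pause duration in seconds: frames whose RMS falls
    below `db_floor` relative to the loudest frame count as silence."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    rms = np.array([np.sqrt(np.mean(wave[i:i + frame] ** 2)) + 1e-12
                    for i in range(0, len(wave) - frame, hop)])
    silent = 20.0 * np.log10(rms / rms.max()) < db_floor
    pauses, run = [], 0
    for s in silent:                  # collect runs of consecutive silent frames
        if s:
            run += 1
        elif run:
            pauses.append(run)
            run = 0
    if run:
        pauses.append(run)
    return float(np.mean(pauses)) * hop / sr if pauses else 0.0
```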
Basilakos, Alexandra; Rorden, Chris; Bonilha, Leonardo; Moser, Dana; Fridriksson, Julius
2015-06-01
Acquired apraxia of speech (AOS) is a motor speech disorder caused by brain damage. AOS often co-occurs with aphasia, a language disorder in which patients may also demonstrate speech production errors. The overlap of speech production deficits in both disorders has raised questions on whether AOS emerges from a unique pattern of brain damage or as a subelement of the aphasic syndrome. The purpose of this study was to determine whether speech production errors in AOS and aphasia are associated with distinctive patterns of brain injury. Forty-three patients with history of a single left-hemisphere stroke underwent comprehensive speech and language testing. The AOS Rating Scale was used to rate speech errors specific to AOS versus speech errors that can also be associated with both AOS and aphasia. Localized brain damage was identified using structural magnetic resonance imaging, and voxel-based lesion-impairment mapping was used to evaluate the relationship between speech errors specific to AOS, those that can occur in AOS or aphasia, and brain damage. The pattern of brain damage associated with AOS was most strongly associated with damage to cortical motor regions, with additional involvement of somatosensory areas. Speech production deficits that could be attributed to AOS or aphasia were associated with damage to the temporal lobe and the inferior precentral frontal regions. AOS likely occurs in conjunction with aphasia because of the proximity of the brain areas supporting speech and language, but the neurobiological substrate for each disorder differs. © 2015 American Heart Association, Inc.
Zeremdini, Jihen; Ben Messaoud, Mohamed Anouar; Bouzid, Aicha
2015-09-01
Thanks to their auditory system, humans can easily separate mixed speech and form perceptual representations of the constituent sources in an acoustic mixture. Researchers have long attempted to build computer models of these high-level functions of the auditory system, yet the segregation of mixed speech remains a very challenging problem. Here, we are interested in approaches that address monaural speech segregation. To this end, this paper studies computational auditory scene analysis (CASA) for segregating speech from monaural mixtures. CASA is the reproduction of the source organization achieved by listeners and is based on two main stages: segmentation and grouping. In this work, we present and compare several studies that have used CASA for speech separation and recognition.
Strategies for distant speech recognition in reverberant environments
NASA Astrophysics Data System (ADS)
Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro
2015-12-01
Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.
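As a rough illustration of the long-term linear-prediction idea behind the front-end (not the authors' system, which operates per frequency bin on STFT frames), here is a toy time-domain delayed linear predictor that models late reverberation from samples at least `delay` steps in the past and subtracts the prediction; `order` and `delay` are arbitrary choices:

```python
import numpy as np

def delayed_lp_dereverb(x, order=30, delay=8):
    """Toy dereverberation: predict x[t] from x[t-delay-k], k = 0..order-1,
    and keep the prediction error, which retains the direct sound and
    early reflections while suppressing predictable late reverberation."""
    x = np.asarray(x, dtype=float)
    t0 = delay + order - 1
    T = np.arange(t0, len(x))
    X = np.stack([x[T - delay - k] for k in range(order)], axis=1)
    g, *_ = np.linalg.lstsq(X, x[T], rcond=None)  # prediction coefficients
    e = x.copy()
    e[T] = x[T] - X @ g                           # subtract late-reverb estimate
    return e
```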
Murdoch, B E; Pitt, G; Theodoros, D G; Ward, E C
1999-01-01
The efficacy of traditional and physiological biofeedback methods for modifying abnormal speech breathing patterns was investigated in a child with persistent dysarthria following severe traumatic brain injury (TBI). An A-B-A-B single-subject experimental research design was utilized to provide the subject with two exclusive periods of therapy for speech breathing, based on traditional therapy techniques and physiological biofeedback methods, respectively. Traditional therapy techniques included establishing optimal posture for speech breathing, explanation of the movement of the respiratory muscles, and a hierarchy of non-speech and speech tasks focusing on establishing an appropriate level of sub-glottal air pressure, and improving the subject's control of inhalation and exhalation. The biofeedback phase of therapy utilized variable inductance plethysmography (or Respitrace) to provide real-time, continuous visual biofeedback of ribcage circumference during breathing. As in traditional therapy, a hierarchy of non-speech and speech tasks were devised to improve the subject's control of his respiratory pattern. Throughout the project, the subject's respiratory support for speech was assessed both instrumentally and perceptually. Instrumental assessment included kinematic and spirometric measures, and perceptual assessment included the Frenchay Dysarthria Assessment, Assessment of Intelligibility of Dysarthric Speech, and analysis of a speech sample. The results of the study demonstrated that real-time continuous visual biofeedback techniques for modifying speech breathing patterns were not only effective, but superior to the traditional therapy techniques for modifying abnormal speech breathing patterns in a child with persistent dysarthria following severe TBI. These results show that physiological biofeedback techniques are potentially useful clinical tools for the remediation of speech breathing impairment in the paediatric dysarthric population.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hogden, J.
The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.
Coppens-Hofman, Marjolein C.; Terband, Hayo; Snik, Ad F.M.; Maassen, Ben A.M.
2017-01-01
Purpose Adults with intellectual disabilities (ID) often show reduced speech intelligibility, which affects their social interaction skills. This study aims to establish the main predictors of this reduced intelligibility in order to ultimately optimise management. Method Spontaneous speech and picture naming tasks were recorded in 36 adults with mild or moderate ID. Twenty-five naïve listeners rated the intelligibility of the spontaneous speech samples. Performance on the picture-naming task was analysed by means of a phonological error analysis based on expert transcriptions. Results The transcription analyses showed that the phonemic and syllabic inventories of the speakers were complete. However, multiple errors at the phonemic and syllabic level were found. The frequencies of specific types of errors were related to intelligibility and quality ratings. Conclusions The development of the phonemic and syllabic repertoire appears to be completed in adults with mild-to-moderate ID. The charted speech difficulties can be interpreted to indicate speech motor control and planning difficulties. These findings may aid the development of diagnostic tests and speech therapies aimed at improving speech intelligibility in this specific group. PMID:28118637
Developmental changes in sensitivity to vocal paralanguage
Friend, Margaret
2017-01-01
Developmental changes in children’s sensitivity to the role of acoustic variation in the speech stream in conveying speaker affect (vocal paralanguage) were examined. Four-, 7- and 10-year-olds heard utterances in three formats: low-pass filtered, reiterant, and normal speech. The availability of lexical and paralinguistic information varied across these three formats in a way that required children to base their judgments of speaker affect on different configurations of cues in each format. Across ages, the best performance was obtained when a rich array of acoustic cues was present and when there was no competing lexical information. Four-year-olds performed at chance when judgments had to be based solely on speech prosody in the filtered format and they were unable to selectively attend to paralanguage when discrepant lexical cues were present in normal speech. Seven-year-olds were significantly more sensitive to the paralinguistic role of speech prosody in filtered speech than were 4-year-olds and there was a trend toward greater attention to paralanguage when lexical and paralinguistic cues were inconsistent in normal speech. An integration of the ability to utilize prosodic cues to speaker affect with attention to paralanguage in cases of lexical/paralinguistic discrepancy was observed for 10-year-olds. The results are discussed in terms of the development of a perceptual bias emerging out of selective attention to language. PMID:28713218
ERIC Educational Resources Information Center
Tramontana, G. Michael; Blood, Ingrid M.; Blood, Gordon W.
2013-01-01
The purpose of this study was to determine (a) the general knowledge bases demonstrated by school-based speech-language pathologists (SLPs) in the area of genetics, (b) the confidence levels of SLPs in providing services to children and their families with genetic disorders/syndromes, (c) the attitudes of SLPs regarding genetics and communication…
ERIC Educational Resources Information Center
Wallace, Sarah E.
2015-01-01
Team-based learning (TBL), although found to increase student engagement and higher-level thinking, has not been examined in the field of speech-language pathology. The purpose of this study was to examine the effect of integrating TBL into a capstone course in evidence-based practice (EBP). The researcher evaluated 27 students' understanding of…
Auditory Training Effects on the Listening Skills of Children With Auditory Processing Disorder.
Loo, Jenny Hooi Yin; Rosen, Stuart; Bamiou, Doris-Eva
2016-01-01
Children with auditory processing disorder (APD) typically present with "listening difficulties," including problems understanding speech in noisy environments. The authors examined, in a group of such children, whether a 12-week computer-based auditory training program with speech material improved speech-in-noise test performance and functional listening skills as assessed by parental and teacher listening and communication questionnaires. The authors hypothesized that after the intervention, (1) trained children would show greater improvements in speech-in-noise perception than untrained controls; (2) this improvement would correlate with improvements in observer-rated behaviors; and (3) the improvement would be maintained for at least 3 months after the end of training. This was a prospective randomized controlled trial of 39 children with normal nonverbal intelligence, ages 7 to 11 years, all diagnosed with APD. This diagnosis required a normal pure-tone audiogram and deficits in at least two clinical auditory processing tests. The APD children were randomly assigned to (1) a control group that received only the current standard treatment for children diagnosed with APD, employing various listening/educational strategies at school (N = 19); or (2) an intervention group that undertook a 3-month, 5-day/week computer-based auditory training program at home, consisting of a wide variety of speech-based listening tasks with competing sounds, in addition to the current standard treatment. All 39 children were assessed for language and cognitive skills at baseline and on three outcome measures at baseline and immediately postintervention. Outcome measures were repeated 3 months postintervention in the intervention group only, to assess the sustainability of treatment effects. The outcome measures were (1) the mean speech reception threshold obtained from the four subtests of the Listening in Spatialized Noise test, which assesses sentence perception in various configurations of masking speech and in which the target speakers and test materials were unrelated to the training materials; (2) the Children's Auditory Performance Scale, which assesses listening skills, completed by the children's teachers; and (3) the Clinical Evaluation of Language Fundamentals-4 pragmatic profile, which assesses pragmatic language use, completed by parents. All outcome measures significantly improved immediately postintervention in the intervention group only, with effect sizes ranging from 0.76 to 1.7. Improvements in speech-in-noise performance correlated with improved scores on the Children's Auditory Performance Scale questionnaire in the trained group only. Baseline language and cognitive assessments did not predict better training outcome. Improvements in speech-in-noise performance were sustained 3 months postintervention. Broad speech-based auditory training led to improved auditory processing skills, as reflected in speech-in-noise test performance, and to better functional listening in real life. The observed correlation between improved functional listening and improved speech-in-noise perception in the trained group suggests that improved listening was a direct generalization of the auditory training.
Five-year speech and language outcomes in children with cleft lip-palate.
Prathanee, Benjamas; Pumnum, Tawitree; Seepuaham, Cholada; Jaiyong, Pechcharat
2016-10-01
To investigate 5-year speech and language outcomes in children with cleft lip/palate (CLP). Thirty-eight children aged 4 years to 7 years 8 months were recruited for this study. Speech abilities, including articulation, resonance, voice, and intelligibility, were assessed based on the Thai Universal Parameters of Speech Outcomes. Language ability was assessed with the Language Screening Test. The findings revealed rates of speech and language delay, abnormal understandability, resonance abnormality, voice disturbance, and articulation defects of 8.33 (1.75, 22.47), 50.00 (32.92, 67.08), 36.11 (20.82, 53.78), 30.56 (16.35, 48.11), and 94.44 (81.34, 99.32) percent, respectively. Articulation errors were the most common speech and language defects in children with clefts, followed by abnormal understandability, resonance abnormality, and voice disturbance. These results should be of critical concern. Protocol review and early intervention programs are needed to improve speech outcomes. Copyright © 2016 European Association for Cranio-Maxillo-Facial Surgery. Published by Elsevier Ltd. All rights reserved.
Application of advanced speech technology in manned penetration bombers
NASA Astrophysics Data System (ADS)
North, R.; Lea, W.
1982-03-01
This report documents research on the potential use of speech technology in a manned penetration bomber aircraft (B-52/G and H). The objectives of the project were to analyze the pilot/copilot crewstation tasks over a three-hour-and forty-minute mission and determine the tasks that would benefit the most from conversion to speech recognition/generation, determine the technological feasibility of each of the identified tasks, and prioritize these tasks based on these criteria. Secondary objectives of the program were to enunciate research strategies in the application of speech technologies in airborne environments, and develop guidelines for briefing user commands on the potential of using speech technologies in the cockpit. The results of this study indicated that for the B-52 crewmember, speech recognition would be most beneficial for retrieving chart and procedural data that is contained in the flight manuals. Technological feasibility of these tasks indicated that the checklist and procedural retrieval tasks would be highly feasible for a speech recognition system.
Motivation as a predictor of speech intelligibility after total laryngectomy.
Singer, Susanne; Meyer, Alexandra; Fuchs, Michael; Schock, Juliane; Pabst, Friedemann; Vogel, Hans-Joachim; Oeken, Jens; Sandner, Annett; Koscielny, Sven; Hormes, Karl; Breitenstein, Kerstin; Dietz, Andreas
2013-06-01
It has often been argued that if patients' success with speech rehabilitation after laryngectomy is limited, it is the result of lacking motivation on their part. This project investigated the role of motivation in speech rehabilitation. In a multicenter prospective cohort study, 141 laryngectomees were interviewed at the beginning of rehabilitation and 1 year after laryngectomy. Speech intelligibility was measured with a standardized test, and patients self-assessed their own motivation shortly after the surgery. Logistic regression, adjusted for several theory-based confounding factors, was used to assess the impact of motivation on speech intelligibility. Speech intelligibility 1 year after laryngectomy was not significantly associated with the level of motivation at the beginning of rehabilitation (odds ratio [OR], 1.3; 95% confidence interval [CI], 0.7-2.3; p = .43) after adjusting for the effect of potential confounders (implantation of a voice prosthesis, patient's cognitive abilities, frustration tolerance, physical functioning, and type of rehabilitation). Motivation is not a strong predictor of speech intelligibility 1 year after laryngectomy. Copyright © 2012 Wiley Periodicals, Inc.
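For readers who want the shape of such an adjusted analysis, a hedged statsmodels sketch follows. The dataset is synthetic, the variable names are ours, and only a subset of the paper's confounders is shown:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({                        # hypothetical analysis dataset
    "intelligible": rng.binomial(1, 0.5, 141),
    "motivation":   rng.normal(0.0, 1.0, 141),
    "prosthesis":   rng.binomial(1, 0.6, 141),  # voice prosthesis implanted
    "cognition":    rng.normal(0.0, 1.0, 141),
})

# Logistic regression of intelligibility on motivation, adjusted for
# confounders, reported as odds ratios with 95% confidence intervals.
fit = smf.logit("intelligible ~ motivation + prosthesis + cognition",
                data=df).fit(disp=False)
odds_ratios = np.exp(fit.params)
conf_int = np.exp(fit.conf_int())
```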
Intra-oral pressure-based voicing control of electrolaryngeal speech with intra-oral vibrator.
Takahashi, Hirokazu; Nakao, Masayuki; Kikuchi, Yataro; Kaga, Kimitaka
2008-07-01
In normal speech, coordinated activities of intrinsic laryngeal muscles suspend a glottal sound at utterance of voiceless consonants, automatically realizing a voicing control. In electrolaryngeal speech, however, the lack of voicing control is one of the causes of unclear voice, voiceless consonants tending to be misheard as the corresponding voiced consonants. In the present work, we developed an intra-oral vibrator with an intra-oral pressure sensor that detected utterance of voiceless phonemes during the intra-oral electrolaryngeal speech, and demonstrated that an intra-oral pressure-based voicing control could improve the intelligibility of the speech. The test voices were obtained from one electrolaryngeal speaker and one normal speaker. We first investigated on the speech analysis software how a voice onset time (VOT) and first formant (F1) transition of the test consonant-vowel syllables contributed to voiceless/voiced contrasts, and developed an adequate voicing control strategy. We then compared the intelligibility of consonant-vowel syllables among the intra-oral electrolaryngeal speech with and without online voicing control. The increase of intra-oral pressure, typically with a peak ranging from 10 to 50 gf/cm2, could reliably identify utterance of voiceless consonants. The speech analysis and intelligibility test then demonstrated that a short VOT caused the misidentification of the voiced consonants due to a clear F1 transition. Finally, taking these results together, the online voicing control, which suspended the prosthetic tone while the intra-oral pressure exceeded 2.5 gf/cm2 and during the 35 milliseconds that followed, proved efficient to improve the voiceless/voiced contrast.
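The final control rule is concrete enough to sketch: the prosthetic tone is suspended while intra-oral pressure exceeds 2.5 gf/cm2 and for the 35 ms that follow. A minimal Python illustration (the sampling interface and names are assumed):

```python
PRESSURE_THRESHOLD = 2.5   # gf/cm^2, threshold reported in the paper
HOLD_MS = 35.0             # suspension window after the pressure falls

def voicing_gate(pressure_samples, sample_period_ms):
    """Return one boolean per sample: True = prosthetic tone on,
    False = suspended (a voiceless consonant is being articulated)."""
    gate, hold = [], 0.0
    for p in pressure_samples:
        if p > PRESSURE_THRESHOLD:
            hold = HOLD_MS                        # (re)arm the hold-off timer
        else:
            hold = max(0.0, hold - sample_period_ms)
        gate.append(p <= PRESSURE_THRESHOLD and hold == 0.0)
    return gate
```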
ERIC Educational Resources Information Center
Van Laere, E.; Braak, J.
2017-01-01
Text-to-speech technology can act as an important support tool in computer-based learning environments (CBLEs) as it provides auditory input, next to on-screen text. Particularly for students who use a language at home other than the language of instruction (LOI) applied at school, text-to-speech can be useful. The CBLE E-Validiv offers content in…
ERIC Educational Resources Information Center
Newton, Caroline
2012-01-01
There are some children with speech and/or language difficulties who are significantly more difficult to understand in connected speech than in single words. The study reported here explores the between-word behaviours of three such children, aged 11;8, 12;2 and 12;10. It focuses on whether these patterns could be accounted for by lenition, as…
ERIC Educational Resources Information Center
Wren, Yvonne; Humphries, Kerry; Stock, Nicola Marie; Rumsey, Nichola; Lewis, Sarah; Davies, Amy; Bennett, Rhiannon; Sandy, Jonathan
2018-01-01
Background: Efforts to increase the evidence base in speech and language therapy are often limited by methodological factors that have restricted the strength of the evidence to the lower levels of the evidence hierarchy. Where higher graded studies, such as randomized controlled trials, have been carried out, it has sometimes been difficult to…
The Influence of Socio-Economic Status and Ethnicity on Speech and Language Development
ERIC Educational Resources Information Center
Basit, Tehmina N.; Hughes, Amanda; Iqbal, Zafar; Cooper, Janet
2015-01-01
A number of factors influence the speech and language development of young children. Delays in the development of speech and language can have repercussions for school attainment and life chances. This paper is based on a survey of 3- to 4-year-old children in the city of Stoke-on-Trent in the UK. It analyses the data collected from 255 children…
ERIC Educational Resources Information Center
McCartney, Elspeth; Muir, Margaret
2017-01-01
School-leaving for pupils with long-term speech, language, swallowing or communication difficulties requires careful management. Speech and language therapists (SLTs) support communication, secure assistive technology and manage swallowing difficulties post-school. UK SLTs are employed by health services, with child SLT teams based in schools.…
2008-08-25
"Going, Going, Gone!," Daily Times (Lahore), August 19, 2008. Speech text at [http://www.cfr.org/publication/16999... In a landmark January 2002 speech, President Musharraf vowed to end Pakistan's use as a base for terrorism of any kind, and he banned numerous... organizations under U.S. law. In the wake of the speech, thousands of Muslim extremists were detained, though most of these were later released. In
ERIC Educational Resources Information Center
Schwartz, Lorna T.
2010-01-01
Speech and language pathologists (SLPs) are collaborators in a diagnostic process that reflects an increasing number of referrals of children with autism spectrum disorders (ASD). Also, current practices leading to the remediation of speech and language disorders have come under scrutiny for limitations in effective carryover of targeted goals…
Waring, R; Knight, R
2013-01-01
Children with speech sound disorders (SSD) form a heterogeneous group who differ in terms of the severity of their condition, underlying cause, speech errors, involvement of other aspects of the linguistic system and treatment response. To date there is no universal, agreed-upon classification system. Instead, a number of theoretically differing classification systems have been proposed, based on an aetiological (medical) approach, a descriptive-linguistic approach or a processing approach. The aims were to describe and review the supporting evidence and to provide a critical evaluation of current childhood SSD classification systems. Descriptions of the major specific approaches to classification are reviewed, and research papers supporting the reliability and validity of the systems are evaluated. Three specific paediatric SSD classification systems are identified as potentially useful in classifying children with SSD into homogeneous subgroups: the aetiology-based Speech Disorders Classification System, the descriptive-linguistic Differential Diagnosis system, and the processing-based Psycholinguistic Framework. The Differential Diagnosis system has a growing body of empirical support from clinical population studies, cross-language error pattern studies and treatment efficacy studies. The Speech Disorders Classification System is currently a research tool with eight proposed subgroups. The Psycholinguistic Framework is a potential bridge linking cause and surface-level speech errors. There is a need for a universally agreed-upon classification system that is useful to clinicians and researchers. The resulting classification system needs to be robust, reliable and valid. A universal classification system would allow for improved tailoring of treatments to subgroups of SSD, which may, in turn, lead to improved treatment efficacy. © 2012 Royal College of Speech and Language Therapists.
Yunusova, Yana; Graham, Naida L.; Shellikeri, Sanjana; Phuong, Kent; Kulkarni, Madhura; Rochon, Elizabeth; Tang-Wai, David F.; Chow, Tiffany W.; Black, Sandra E.; Zinman, Lorne H.; Green, Jordan R.
2016-01-01
Objective: This study examines reading aloud in patients with amyotrophic lateral sclerosis (ALS) and those with frontotemporal dementia (FTD) in order to determine whether differences in patterns of speaking and pausing exist between patients with primary motor vs. primary cognitive-linguistic deficits, and in contrast to healthy controls. Design: 136 participants were included in the study: 33 controls, 85 patients with ALS, and 18 patients with either the behavioural variant of FTD (FTD-BV) or progressive nonfluent aphasia (FTD-PNFA). Participants with ALS were further divided into 4 non-overlapping subgroups (mild, respiratory, bulbar with oral-motor deficit, and bulbar-respiratory) based on the presence and severity of motor bulbar or respiratory signs. All participants read a passage aloud. Custom-made software was used to perform speech and pause analyses, providing measures of speaking and articulatory rates, duration of speech, and number and duration of pauses. These measures were statistically compared across subgroups of patients. Results: The results revealed clear differences between patient groups and healthy controls on the passage reading task. A speech-based motor function measure (i.e., articulatory rate) was able to distinguish patients with bulbar ALS or FTD-PNFA from those with respiratory ALS or FTD-BV. Distinguishing the disordered groups proved challenging based on the pausing measures. Conclusions and Relevance: This study demonstrated the use of speech measures in the identification of those with an oral-motor deficit, and showed the usefulness of a relatively simple reading test for assessing speech versus pause behaviors across the ALS-FTD disease continuum. The findings also suggest that motor speech assessment should be performed as part of the diagnostic workup for patients with FTD. PMID:26789001
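Pause measures of the kind used in this study (number and duration of pauses, duration of speech) can be approximated from a recording with a simple energy-threshold segmentation. A rough sketch follows, assuming a mono floating-point waveform; the frame size, silence threshold, and minimum pause length are illustrative, not the parameters of the study's custom software:

import numpy as np

def pause_measures(x, sr, frame_ms=20, silence_db=-40.0, min_pause_ms=300):
    """Crude energy-based pause analysis of a passage reading: returns the
    number of pauses, mean pause duration (s), and total speech time (s)."""
    frame = int(sr * frame_ms / 1000)
    n = len(x) // frame
    frames = x[: n * frame].reshape(n, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    silent = 20 * np.log10(rms / (rms.max() + 1e-12)) < silence_db

    pause_durs = []
    run = 0
    for is_silent in np.append(silent, False):  # sentinel flushes the last run
        if is_silent:
            run += 1
        else:
            if run * frame_ms >= min_pause_ms:
                pause_durs.append(run * frame_ms / 1000.0)
            run = 0
    speech_time = (n - int(silent.sum())) * frame_ms / 1000.0
    mean_pause = float(np.mean(pause_durs)) if pause_durs else 0.0
    return len(pause_durs), mean_pause, speech_time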
ERIC Educational Resources Information Center
Corcoran, Molly A.
2017-01-01
Educator evaluation is of significant interest and concern for all members of the national school community. School-based speech-language pathologists (SLPs) share these sentiments with their classroom counterparts. Because SLPs are frequently included in such evaluation systems, it is of concern to the SLP community that research documenting how school-based…
ERIC Educational Resources Information Center
Plumb, Allison M.; Plexico, Laura W.
2013-01-01
Purpose: To investigate the graduate training experiences of school-based speech-language pathologists (SLPs) working with children with autism spectrum disorders (ASDs). Comparisons were made between recent graduates (post 2006) and pre-2006 graduates to determine if differences existed in their academic and clinical experiences or their…
Evidence-Based Speech-Language Pathology Practices in Schools: Findings from a National Survey
ERIC Educational Resources Information Center
Hoffman, LaVae M.; Ireland, Marie; Hall-Mills, Shannon; Flynn, Perry
2013-01-01
Purpose: This study documented evidence-based practice (EBP) patterns as reported by speech-language pathologists (SLPs) employed in public schools during 2010-2011. Method: Using an online survey, practitioners reported their EBP training experiences, resources available in their workplaces, and the frequency with which they engage in specific EBP…
Dysphagia Management: A Survey of School-Based Speech-Language Pathologists in Vermont
ERIC Educational Resources Information Center
Hutchins, Tiffany L.; Gerety, Katherine W.; Mulligan, Moira
2011-01-01
Purpose: This study (a) gathered information about the kinds of dysphagia management services school-based speech-language pathologists (SLPs) provide, (b) examined the attitudes of SLPs related to dysphagia management, (c) compared the responses of SLPs on the basis of their experience working in a medical setting, and (d) investigated the…
Use of Language Sample Analysis by School-Based SLPs: Results of a Nationwide Survey
ERIC Educational Resources Information Center
Pavelko, Stacey L.; Owens, Robert E., Jr.; Ireland, Marie; Hahs-Vaughn, Debbie L.
2016-01-01
Purpose: This article examines use of language sample analysis (LSA) by school-based speech-language pathologists (SLPs), including characteristics of language samples, methods of transcription and analysis, barriers to LSA use, and factors affecting LSA use, such as American Speech-Language-Hearing Association certification, number of years'…
ERIC Educational Resources Information Center
Lousada, M.; Jesus, Luis M. T.; Hall, A.; Joffe, V.
2014-01-01
Background: The effectiveness of two treatment approaches (phonological therapy and articulation therapy) for treatment of 14 children, aged 4;0-6;7 years, with phonologically based speech-sound disorder (SSD) has been previously analysed with severity outcome measures (percentage of consonants correct score, percentage occurrence of phonological…
Alternative Organization of Speech Perception Deficits in Children
ERIC Educational Resources Information Center
Gosy, Maria
2007-01-01
Children's first-language perception base takes shape gradually from birth onwards. Empirical research has confirmed that children may continue to fall short of age-based expectations in their speech perception. The purpose of this study was to assess the contribution of various perception processes in both reading and learning disabled children.…
ERIC Educational Resources Information Center
Donaldson, Amy L.; Chabon, Shelly; Lee-Wilkerson, Dorian; Kapantzoglou, Maria
2017-01-01
Traditionally speech-language pathology, along with other educational and rehabilitation-based professions, has approached disability from a deficits-based or medical-model perspective with an aim toward normalizing or ameliorating a child's atypical behaviors or performance. However, an alternative perspective rooted in a social model of…
Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study
ERIC Educational Resources Information Center
van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer
2016-01-01
The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…
Vocal Abuse Prevention Practices: A National Survey of School-Based Speech-Language Pathologists.
ERIC Educational Resources Information Center
McNamara, Anne P.; Perry, Cecyle K.
1994-01-01
A national survey of 145 school-based speech-language pathologists found that more than 80% did not have vocal abuse prevention programs, primarily because of time constraints and the low priority assigned to voice problems. Existing programs served primarily elementary-school children, both asymptomatic and symptomatic. Characteristics associated with…
When Do Ladies Talk Like Ladies?
ERIC Educational Resources Information Center
Menzel, Peter; Tyler, Mary
As Labov points out (1971), language is a social phenomenon and therefore must be studied in its social context; sex-based language differences, being part of language, must be studied in the same way. Specifically, sex-based language differences can be studied by modifying the sociolinguists' notion of speech community and speech continuum, and…
NASA Astrophysics Data System (ADS)
Li, Ji; Ren, Fuji
Weblogs have greatly changed the ways people communicate. Affective analysis of blog posts is valuable for many applications, such as text-to-speech synthesis and computer-assisted recommendation. Traditional emotion recognition in text based on single-label classification cannot satisfy the higher requirements of affective computing. In this paper, the automatic identification of sentence emotion in weblogs is modeled as a multi-label text categorization task. Experiments are carried out on 12,273 blog sentences from the Chinese emotion corpus Ren_CECps with 8-dimension emotion annotation. An ensemble algorithm, RAKEL, is used to recognize dominant emotions from the writer's perspective. Our emotion feature, which uses a detailed intensity representation of word emotions, outperforms the other main features such as the word frequency feature and the traditional lexicon-based feature. In order to handle relatively complex sentences, we integrate grammatical characteristics of punctuation, disjunctive connectives, modification relations and negation into the features. This achieves 13.51% and 12.49% increases in Micro-averaged F1 and Macro-averaged F1, respectively, compared to the traditional lexicon-based feature. The results show that a multi-dimension emotion representation with grammatical features can efficiently classify sentence emotion in a multi-label setting.
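The task setup, multi-label categorization over an eight-dimension emotion space scored with micro- and macro-averaged F1, can be reproduced in outline with standard tooling. A hedged sketch follows, with scikit-learn's OneVsRestClassifier standing in for the RAKEL ensemble used in the paper (RAKEL implementations exist elsewhere, e.g. in scikit-multilearn); the sentences, labels, and label names here are placeholders:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Placeholder corpus: each sentence may carry several of eight emotion
# labels at once (label names are illustrative, not Ren_CECps's exact set).
EMOTIONS = ["joy", "love", "expectation", "surprise",
            "anxiety", "sorrow", "anger", "hate"]
sentences = ["placeholder blog sentence one",
             "placeholder blog sentence two"] * 50
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=(len(sentences), len(EMOTIONS)))  # toy labels

X = TfidfVectorizer().fit_transform(sentences)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# One binary classifier per emotion; RAKEL instead trains an ensemble of
# label-powerset classifiers over random k-label subsets.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
pred = clf.predict(X_te)

print("micro-F1:", f1_score(y_te, pred, average="micro", zero_division=0))
print("macro-F1:", f1_score(y_te, pred, average="macro", zero_division=0))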
Communication Disorders in Speakers of Tone Languages: Etiological Bases and Clinical Considerations
Wong, Patrick C. M.; Perrachione, Tyler K.; Gunasekera, Geshri; Chandrasekaran, Bharath
2009-01-01
Lexical tones are a phonetic contrast necessary for conveying meaning in a majority of the world’s languages. Various hearing, speech, and language disorders affect the ability to perceive or produce lexical tones, thereby seriously impairing individuals’ communicative abilities. The number of tone language speakers is increasing, even in otherwise English-speaking nations, yet insufficient emphasis has been placed on clinical assessment and rehabilitation of lexical tone disorders. The similarities and dissimilarities between lexical tones and other speech sounds make a richer scientific understanding of their physiological bases paramount to more effective remediation of speech and language disorders in general. Here we discuss the cognitive and biological bases of lexical tones, emphasizing the neural structures and networks that support their acquisition, perception, and cognitive representation. We present emerging research on lexical tone learning in the context of the clinical disorders of hearing, speech, and language that this body of research will help to address. PMID:19711234
Keeping it simple: the grammatical properties of shared book reading.
Noble, Claire H; Cameron-Faulkner, Thea; Lieven, Elena
2018-05-01
The positive effects of shared book reading on vocabulary and reading development are well attested (e.g., Bus, van Ijzendoorn, & Pellegrini, 1995). However, the role of shared book reading in grammatical development remains unclear. In this study, we conducted a construction-based analysis of caregivers' child-directed speech during shared book reading and toy play and compared the grammatical profile of the child-directed speech generated during the two activities. The findings indicate that (a) the child-directed speech generated by shared book reading contains significantly more grammatically rich constructions than child-directed speech generated by toy play, and (b) the grammatical profile of the book itself affects the grammatical profile of the child-directed speech generated by shared book reading.
Extensions to the Speech Disorders Classification System (SDCS)
Shriberg, Lawrence D.; Fourakis, Marios; Hall, Sheryl D.; Karlsson, Heather B.; Lohmeier, Heather L.; McSweeny, Jane L.; Potter, Nancy L.; Scheer-Cohen, Alison R.; Strand, Edythe A.; Tilkens, Christie M.; Wilson, David L.
2010-01-01
This report describes three extensions to a classification system for pediatric speech sound disorders termed the Speech Disorders Classification System (SDCS). Part I describes a classification extension to the SDCS to differentiate motor speech disorders from speech delay and to differentiate among three subtypes of motor speech disorders. Part II describes the Madison Speech Assessment Protocol (MSAP), an approximately two-hour battery of 25 measures that includes 15 speech tests and tasks. Part III describes the Competence, Precision, and Stability Analytics (CPSA) framework, a current set of approximately 90 perceptual- and acoustic-based indices of speech, prosody, and voice used to quantify and classify subtypes of Speech Sound Disorders (SSD). A companion paper, Shriberg, Fourakis, et al. (2010) provides reliability estimates for the perceptual and acoustic data reduction methods used in the SDCS. The agreement estimates in the companion paper support the reliability of SDCS methods and illustrate the complementary roles of perceptual and acoustic methods in diagnostic analyses of SSD of unknown origin. Examples of research using the extensions to the SDCS described in the present report include diagnostic findings for a sample of youth with motor speech disorders associated with galactosemia (Shriberg, Potter, & Strand, 2010) and a test of the hypothesis of apraxia of speech in a group of children with autism spectrum disorders (Shriberg, Paul, Black, & van Santen, 2010). All SDCS methods and reference databases running in the PEPPER (Programs to Examine Phonetic and Phonologic Evaluation Records; [Shriberg, Allen, McSweeny, & Wilson, 2001]) environment will be disseminated without cost when complete. PMID:20831378
Speech Deficits in Serious Mental Illness: A Cognitive Resource Issue?
Cohen, Alex S.; McGovern, Jessica E.; Dinzeo, Thomas J.; Covington, Michael A.
2014-01-01
Speech deficits, notably those involved in psychomotor retardation, blunted affect, alogia and poverty of content of speech, are pronounced in a wide range of serious mental illnesses (SMI; e.g., schizophrenia, unipolar depression, bipolar disorders). The present project evaluated the degree to which these deficits manifest as a function of cognitive resource limitations. We examined natural speech from 52 patients meeting criteria for serious mental illnesses (i.e., severe functional deficits with a concomitant diagnosis of schizophrenia, unipolar and/or bipolar affective disorders) and 30 non-psychiatric controls using a range of objective, computer-based measures tapping speech production (“alogia”), variability (“blunted vocal affect”) and content (“poverty of content of speech”). Subjects produced natural speech during a baseline condition and while engaging in an experimentally-manipulated cognitively-effortful task. For correlational analysis, cognitive ability was measured using a standardized battery. Generally speaking, speech deficits did not differ as a function of SMI diagnosis. However, every speech production and content measure was significantly abnormal in SMI versus control groups. Speech variability measures generally did not differ between groups. For both patients and controls as a group, speech during the cognitively-effortful task was sparser and less rich in content. Relative to controls, patients were abnormal under cognitive load with respect only to average pause length. Correlations between the speech variables and cognitive ability were only significant for this same variable: average pause length. Results suggest that certain speech deficits, notably involving pause length, may manifest as a function of cognitive resource limitations. Implications for treatment, research and assessment are discussed. PMID:25464920
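A vocal-affect variability measure of the kind described above can be approximated by the standard deviation of fundamental frequency across voiced frames. A rough sketch using librosa's pYIN pitch tracker follows; the input file is hypothetical and the pitch bounds are illustrative (the study's own computer-based measures were custom):

import numpy as np
import librosa

# Hypothetical recording of natural speech; pitch bounds are illustrative.
y, sr = librosa.load("natural_speech.wav", sr=None, mono=True)
f0, voiced_flag, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)
f0_sd = np.nanstd(f0[voiced_flag])  # F0 variability across voiced frames
print("F0 standard deviation (Hz):", f0_sd)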
Dietz, Mathias; Hohmann, Volker; Jürgens, Tim
2015-01-01
For normal-hearing listeners, speech intelligibility improves if speech and noise are spatially separated. While this spatial release from masking has already been quantified in normal-hearing listeners in many studies, it is less clear how spatial release from masking changes in cochlear implant listeners with and without access to low-frequency acoustic hearing. Spatial release from masking depends on differences in access to speech cues due to hearing status and hearing device. To investigate the influence of these factors on speech intelligibility, the present study measured speech reception thresholds in spatially separated speech and noise for 10 different listener types. A vocoder was used to simulate cochlear implant processing and low-frequency filtering was used to simulate residual low-frequency hearing. These forms of processing were combined to simulate cochlear implant listening, listening based on low-frequency residual hearing, and combinations thereof. Simulated cochlear implant users with additional low-frequency acoustic hearing showed better speech intelligibility in noise than simulated cochlear implant users without acoustic hearing and had access to more spatial speech cues (e.g., higher binaural squelch). Cochlear implant listener types showed higher spatial release from masking with bilateral access to low-frequency acoustic hearing than without. A binaural speech intelligibility model with normal binaural processing showed overall good agreement with measured speech reception thresholds, spatial release from masking, and spatial speech cues. This indicates that differences in speech cues available to listener types are sufficient to explain the changes of spatial release from masking across these simulated listener types. PMID:26721918
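The simulated listener types rest on two standard manipulations: a noise vocoder to simulate cochlear implant processing and low-pass filtering to simulate residual acoustic hearing. A bare-bones noise-vocoder sketch follows, with illustrative band edges and envelope cutoff rather than the study's parameters:

import numpy as np
from scipy.signal import butter, filtfilt

def noise_vocode(x, sr, edges=(100, 400, 1000, 2400, 6000), env_cut=50.0):
    """Bare-bones noise vocoder: split the signal into analysis bands,
    extract each band's amplitude envelope, and use it to modulate
    band-limited noise. Band edges and envelope cutoff are illustrative."""
    rng = np.random.default_rng(0)
    out = np.zeros_like(x)
    be, ae = butter(2, env_cut, btype="low", fs=sr)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(3, [lo, hi], btype="band", fs=sr)
        band = filtfilt(b, a, x)                  # analysis band
        env = filtfilt(be, ae, np.abs(band))      # amplitude envelope
        carrier = filtfilt(b, a, rng.standard_normal(len(x)))
        out += np.clip(env, 0.0, None) * carrier  # envelope-modulated noise
    return out / (np.max(np.abs(out)) + 1e-12)

# Residual low-frequency acoustic hearing would be simulated separately by
# low-pass filtering the unprocessed speech.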
Nuttall, Helen E.; Moore, David R.; Barry, Johanna G.; Krumbholz, Katrin
2015-01-01
The speech-evoked auditory brain stem response (speech ABR) is widely considered to provide an index of the quality of neural temporal encoding in the central auditory pathway. The aim of the present study was to evaluate the extent to which the speech ABR is shaped by spectral processing in the cochlea. High-pass noise masking was used to record speech ABRs from delimited octave-wide frequency bands between 0.5 and 8 kHz in normal-hearing young adults. The latency of the frequency-delimited responses decreased from the lowest to the highest frequency band by up to 3.6 ms. The observed frequency-latency function was compatible with model predictions based on wave V of the click ABR. The frequency-delimited speech ABR amplitude was largest in the 2- to 4-kHz frequency band and decreased toward both higher and lower frequency bands despite the predominance of low-frequency energy in the speech stimulus. We argue that the frequency dependence of speech ABR latency and amplitude results from the decrease in cochlear filter width with decreasing frequency. The results suggest that the amplitude and latency of the speech ABR may reflect interindividual differences in cochlear, as well as central, processing. The high-pass noise-masking technique provides a useful tool for differentiating between peripheral and central effects on the speech ABR. It can be used for further elucidating the neural basis of the perceptual speech deficits that have been associated with individual differences in speech ABR characteristics. PMID:25787954
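The frequency-delimited responses rest on the classic derived-band logic: high-pass noise masks cochlear regions above its cutoff, so each masked recording reflects only the regions below the cutoff, and subtracting responses recorded at adjacent cutoffs isolates the octave band between them. In outline, with placeholder waveforms standing in for averaged evoked responses:

import numpy as np

# Averaged speech-ABR waveforms recorded under high-pass masking noise,
# keyed by masker cutoff in kHz (placeholder arrays; real use would load
# averaged evoked responses). np.inf stands for the unmasked recording.
cutoffs = [0.5, 1.0, 2.0, 4.0, 8.0, np.inf]
responses = {c: np.zeros(1024) for c in cutoffs}  # placeholder waveforms

# Each masked recording reflects cochlear regions below the cutoff, so the
# difference between adjacent cutoffs isolates one octave-wide band.
derived = {
    (lo, hi): responses[hi] - responses[lo]
    for lo, hi in zip(cutoffs[:-1], cutoffs[1:])
}
# e.g., derived[(2.0, 4.0)] approximates the 2-4 kHz band response, where
# the study found the largest speech-ABR amplitude.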
Nonverbal oral apraxia in primary progressive aphasia and apraxia of speech.
Botha, Hugo; Duffy, Joseph R; Strand, Edythe A; Machulda, Mary M; Whitwell, Jennifer L; Josephs, Keith A
2014-05-13
The goal of this study was to explore the prevalence of nonverbal oral apraxia (NVOA), its association with other forms of apraxia, and associated imaging findings in patients with primary progressive aphasia (PPA) and progressive apraxia of speech (PAOS). Patients with a degenerative speech or language disorder were prospectively recruited and diagnosed with a subtype of PPA or with PAOS. All patients had comprehensive speech and language examinations. Voxel-based morphometry was performed to determine whether atrophy of a specific region correlated with the presence of NVOA. Eighty-nine patients were identified, of whom 34 had PAOS, 9 had agrammatic PPA, 41 had logopenic aphasia, and 5 had semantic dementia. NVOA was very common among patients with PAOS but was found in patients with PPA as well. Several patients exhibited either NVOA or apraxia of speech, but not both. Among patients with apraxia of speech, the severity of the apraxia of speech was predictive of NVOA, whereas ideomotor apraxia severity was predictive of the presence of NVOA in those without apraxia of speech. Bilateral atrophy of the prefrontal cortex anterior to the premotor area and supplementary motor area was associated with NVOA. Apraxia of speech, NVOA, and ideomotor apraxia are therefore at least partially separable disorders. The association between NVOA and apraxia of speech likely results from the proximity of the area reported here to the premotor area, which has been implicated in apraxia of speech. The association between ideomotor apraxia and NVOA among patients without apraxia of speech could represent disruption of modules shared by nonverbal oral movements and limb movements.