Note: This page contains sample records for the topic speech from Science.gov.
While these samples are representative of the content of Science.gov,
they are neither comprehensive nor the most current set.
We encourage you to perform a real-time search of Science.gov
to obtain the most current and comprehensive results. Last update: November 12, 2013.
Contents: Manuscripts and Extended Reports--Speech Synthesis as a Tool for the Study of Speech Production; The Study of Articulatory Organization: Some Negative Progress, Phonetic Perception; Cardiac Indices of Infant Speech Perception; Cinefluorographic ...
A. S. Abramson; T. Baer; P. Bailey; F. Bell-Berti; G. J. Borden
This report is one of a series on the progress of studies on the nature of speech, instrumentation for its investigation, and practical applications. Manuscripts cover the following topics: Identification of Sine-wave Analogues of Speech Sounds; Prosodic ...
A. S. Abramson; T. Baer; F. Bell-Berti; G. J. Borden; G. Carden
... their lives. Over half of them will require speech therapy at some point during childhood. However, many children ... child’s cleft team will help you decide if speech therapy services or other types of interventions are needed. ...
Presented in this book is a view of speech communication which enables an individual to become fully aware of his or her role as both initiator and recipient of messages. Communication is treated broadly with emphasis on the understanding and skills relating to various types of speech communication across the broad spectrum of human…
The report (1 January-30 June 1974) is one of a regular series on the status and progress of studies on the nature of speech, instrumentation for its investigation, and practical applications. (Modified author abstract)
This report (1 January - 31 March 1977) is one of a regular series on the status and progress of studies on the nature of speech, instrumentation for its investigation, and practical applications. Manuscripts cover the following topics: - Dissociation Spec...
Research at the Speech Transmission Laboratory (Dept. of Speech Communication) during the calendar year 1967 is reviewed. The general descriptive theory of speech analysis is discussed. Speech waveforms were studied with special attention to irregularitie...
Watch the video presentations of each of these speeches: the Gettysburg Address; Martin Luther King, "I Have a Dream"; "Freedom of Speech" by Mario Savio; Mario Savio speech; "New worker plan" speech by FDR. For manuscripts, audio and video of many other modern and past speeches follow the link below: American Speech Bank ...
Several articles addressing topics in speech research are presented. The topics include: exploring the functional significance of physiological tremor: a biospectroscopic approach; differences between experienced and inexperienced listeners to deaf speech; a language-oriented view of reading and its disabilities; phonetic factors in letter detection; categorical perception; short-term recall by deaf signers of American Sign Language; a common basis for auditory sensory storage in perception and immediate memory; phonological awareness and verbal short-term memory; initiation versus execution time during manual and oral counting by stutterers; trading relations in the perception of speech by five-year-old children; the role of the strap muscles in pitch lowering; phonetic validation of distinctive features; consonants and syllable boundaries; and vowel information in postvocalic frictions.
... an accident, stroke, or birth defect may have speech and language problems. Apraxia is thought to be due to a brain impairment that may or may not show up on brain magnetic resonance imaging (MRI) ... problems, particularly articulation disorders, may have hearing problems. ...
This book serves as a guide for the native and non-native speaker of English in overcoming various problems in articulation, rhythm, and intonation. It is also useful in group therapy speech programs. Forty-five practice chapters offer drill materials for all the vowels, diphthongs, and consonants of American English plus English stress and…
This paper briefly reviews state of the art related to the topic of speech variability sources in automatic speech recognition systems. It focuses on some variations within the speech signal that make the ASR task difficult. The variations detailed in the paper are intrinsic to the speech and affect the different levels of the ASR processing chain. For different sources
M. Benzeguiba; R. De Mori; O. Deroo; S. Dupont; T. Erbes; D. Jouvet; L. Fissore; P. Laface; A. Mertins; C. Ris; R. Rose; V. Tyagi; C. Wellekens
The paper describes a database called multi level speech database for spontaneous speech processing. We designed the database to cover textual and acoustic variations from declarative speech to spontaneous speech. The database is composed of 5 categories which are, in the order of decreasing spontaneity, spontaneous speech, interview, simulated interview, declarative speech with context, and declarative speech without context. We
Minsoo Hahn; Sanghun Kim; Jung-Chul Lee; Yong-Ju Lee
Similar to other speech- and language-processing disciplines such as speech recognition or machine translation, speech synthesis, the artificial production of human-like speech, has become very powerful over the last 10 years.
Brief discussions in this pamphlet suggest educational and career opportunities in the following fields of speech communication: rhetoric, public address, and communication; theatre, drama, and oral interpretation; radio, television, and film; speech pathology and audiology; speech science, phonetics, and linguistics; and speech education.…
The long-range objectives of the Packet Speech Systems Technology Program are to develop and demonstrate techniques for efficient digital speech communications on networks suitable for both voice and data, and to investigate and develop techniques for int...
... Speech problems like stuttering Developmental disabilities Learning disorders Autism Brain injury Stroke Some speech and communication problems may be genetic. Often, no one knows the causes. By first grade, about 5 percent of ...
... stroke, head injury, tumor, or other illness affecting the brain. Acquired apraxia of speech may occur together with muscle weakness affecting speech production ( dysarthria ) or language difficulties caused by damage to the nervous system ( ...
The report is a review of the current state-of-the-art in speech bandwidth compression. Speech bandwidth compression techniques are means of reducing the bandwidth needed to represent the human voice waveform. For bandwidth conservation, the voice bandwid...
Background: Oral tumor resections cause articulation deficiencies, depending on the site, extent of resection, type of reconstruction, and tongue stump mobility. Objectives: To evaluate the speech intelligibility of patients undergoing total, subtotal, or partial glossectomy, before and after speech therapy. Patients and Methods: Twenty-seven patients (24 men and 3 women), aged 34 to 77 years (mean age, 56.5
Cristina L. B. Furia; Luiz P. Kowalski; Maria R. D. O. Latorre; Elisabete C. Angelis; Nivia M. S. Martins; Ana P. B. Barros; Karina C. B. Ribeiro
Context dependent modelling is known to improve recognition performance for automatic speech recognition. One of the major limitations, especially of approaches based on Decision Trees, is that the questions that guide the search for effective contexts must be known in advance. However, the variation in the speech signals is caused by multiple factors, not all of which may be known during the training procedure. State tying
Large vocabulary speech recognition systems traditionally represent words in terms of subword units, usually phonemes. This paper investigates the potential of graphemes acting as subunits. In order to develop context dependent grapheme based speech recognizers several decision tree based clustering procedures are performed and compared to each other. Grapheme based speech recognizers in three languages - English,
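To make the notion of a context-dependent grapheme unit concrete, the sketch below expands a word into triphone-style "tri-grapheme" labels. The label format and the `#` boundary marker are illustrative assumptions, not taken from the paper; real systems additionally cluster such units with decision trees.

```python
def trigrapheme_units(word):
    """Split a word into context-dependent grapheme units of the form
    left-center+right, analogous to triphones but over letters.
    '#' marks the word boundary (an assumed convention)."""
    padded = "#" + word + "#"
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

units = trigrapheme_units("speech")
# the first unit is '#-s+p': 's' at the word start, followed by 'p'
```

Each surface letter thus gets one model per left/right context, which is what the decision-tree clustering procedures in the paper then merge into a manageable number of tied states.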
This book is designed primarily for students who are being trained to work with speech handicapped school children, either as speech correctionists or as classroom teachers. The book deals with four major questions--(1) what kinds of speech disorders are found among school children, (2) what are the physical, psychological and social conditions,…
The 17 articles in this collection deal with theoretical and practical freedom of speech issues. The topics include: freedom of speech in Marquette Park, Illinois; Nazis in Skokie, Illinois; freedom of expression in the Confederate States of America; Robert M. LaFollette's arguments for free speech and the rights of Congress; the United States…
Interpretation from a first language to a second language via one or more communication devices is performed through a communication network (e.g. phone network or the internet) using a server for performing recognition and interpretation tasks, comprising the steps of: receiving an input speech utterance in a first language on a first mobile communication device; conditioning said input speech utterance; first transmitting said conditioned input speech utterance to a server; recognizing said first transmitted speech utterance to generate one or more recognition results; interpreting said recognition results to generate one or more interpretation results in an interlingua; mapping the interlingua to a second language in a first selected format; second transmitting said interpretation results in the first selected format to a second mobile communication device; and presenting said interpretation results in a second selected format on said second communication device.
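The claimed steps can be sketched as a pipeline. Every function body below is a hypothetical stub (the patent specifies the data flow, not implementations), shown only to make the sequence of stages concrete; all names and the placeholder outputs are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    audio: bytes
    language: str

def condition(utt):
    # Conditioning stage (e.g. noise reduction, normalization) - stub.
    return utt

def recognize(utt):
    # Recognition stage: audio -> one or more text hypotheses - stub.
    return ["hello world"]

def interpret(hypotheses):
    # Interpretation stage: best hypothesis -> interlingua representation - stub.
    return {"concept": "greeting", "args": {}}

def map_to_language(interlingua, target_lang):
    # Mapping stage: interlingua -> target-language text - stub.
    templates = {"es": {"greeting": "hola mundo"}}
    return templates[target_lang][interlingua["concept"]]

def translate_utterance(utt, target_lang):
    """Chain the claimed steps: condition -> recognize -> interpret -> map."""
    conditioned = condition(utt)
    hypotheses = recognize(conditioned)
    interlingua = interpret(hypotheses)
    return map_to_language(interlingua, target_lang)

print(translate_utterance(Utterance(b"", "en"), "es"))  # hola mundo
```

The interlingua step is what distinguishes this design from direct pairwise translation: adding a language means adding one mapping to and from the interlingua rather than one per language pair.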
Freud's first book published in 1891 was a monograph entitled On Aphasia. In it he challenges the main authorities of the time by asserting that their manner of understanding aphasias is no longer tenable. Freud proves their theories wrong and presents his own conception of a speech apparatus. The apparatus is the foundation of his clinical and theoretical explanations about the speech function and its pathological manifestations. He built the model of an apparatus capable of explaining spontaneous speech, a function that the competing models of his contemporaries could not fully integrate. Freud's speech apparatus is the first of several models he created to facilitate the understanding of psychic functions. This paper is part of a series dedicated to an in-depth study of the earliest model which would provide the foundations for the understanding of its reappearance in Freud's later theories, and in particular in his analytic technique. PMID:8454394
Speech recognition is one of five main areas in the field of speech processing. Difficulties in speech recognition include variability in sound within and across speakers, in channel, in background noise, and of speech production. Speech recognition can be used in a variety of situations: to perform query operations and phone call transfers; for…
This paper addresses the relationship between speech and critique by juxtaposing the ideas of free speech and fearless speech. It is written out of a strong personal belief that the inadequacy of the former mode of speech might well be superseded by the latter as a vehicle for critique. It argues that progressive notions of free speech as the basis
Classic research on the perception of speech sought to identify minimal acoustic correlates of each consonant and vowel. In explaining perception, this view designated momentary components of an acoustic spectrum as cues to the recognition of elementary phonemes. This conceptualization of speech perception is untenable given the findings of phonetic sensitivity to modulation independent of the acoustic and auditory form of the carrier. The empirical key is provided by studies of the perceptual organization of speech, a low-level integrative function that finds and follows the sensory effects of speech amid concurrent events. These projects have shown that the perceptual organization of speech is keyed to modulation; fast; unlearned; nonsymbolic; indifferent to short-term auditory properties; and reliant on attention. The ineluctably multisensory nature of speech perception also imposes conditions that distinguish language among cognitive systems. WIREs Cogn Sci 2013, 4:213–223. doi: 10.1002/wcs.1213
This dissertation addresses the problem of speech synthesis and speech production modelling based on the fundamental principles of human speech production. Unlike the conventional source-filter model, which assumes the independence of the excitation and the acoustic filter, we treat the entire vocal apparatus as one system consisting of a fluid dynamic aspect and a mechanical part. We model the vocal tract by a three-dimensional moving geometry. We also model the sound propagation inside the vocal apparatus as a three-dimensional nonplane-wave propagation inside a viscous fluid described by Navier-Stokes equations. In our work, we first propose a combined minimum energy and minimum jerk criterion to estimate the dynamic vocal tract movements during speech production. Both theoretical error bound analysis and experimental results show that this method can achieve very close match at the target points and avoid the abrupt change in articulatory trajectory at the same time. Second, a mechanical vocal fold model is used to compute the excitation signal of the vocal tract. The advantage of this model is that it is closely coupled with the vocal tract system based on fundamental aerodynamics. As a result, we can obtain an excitation signal with much more detail than the conventional parametric vocal fold excitation model. Furthermore, strong evidence of source-tract interaction is observed. Finally, we propose a computational model of the fricative and stop types of sounds based on the physical principles of speech production. The advantage of this model is that it uses an exogenous process to model the additional nonsteady and nonlinear effects due to the flow mode, which are ignored by the conventional source- filter speech production model. A recursive algorithm is used to estimate the model parameters. Experimental results show that this model is able to synthesize good quality fricative and stop types of sounds. 
Based on our dissertation work, we carefully argue that the articulatory speech production model has the potential to flexibly synthesize natural-quality speech sounds and to provide a compact computational model for speech production that can be beneficial to a wide range of areas in speech signal processing.
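One ingredient of the trajectory estimation above, the minimum-jerk criterion, has a well-known closed form for a point-to-point movement between rest positions. The sketch below shows only that component; the dissertation combines it with a minimum-energy term, which is omitted here, and the numeric values are arbitrary illustrations.

```python
def min_jerk(x0, xf, T, t):
    """Closed-form minimum-jerk trajectory between rest positions:
    x(t) = x0 + (xf - x0) * (10*tau^3 - 15*tau^4 + 6*tau^5), tau = t/T.
    Velocity and acceleration are zero at both endpoints, which is what
    avoids abrupt changes in the articulatory trajectory."""
    tau = t / T
    return x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)

# An articulator moving from position 0.2 to 0.8 (arbitrary units) in 100 ms,
# sampled every 10 ms:
samples = [min_jerk(0.2, 0.8, 0.1, k * 0.01) for k in range(11)]
```

By symmetry the trajectory passes through the midpoint (0.5 here) at half the movement time, and the smooth start and stop are what make such criteria attractive for articulatory modelling.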
Possibilities for acoustical dialogs with electronic data processing equipment were investigated. Speech recognition is posed as recognizing word groups. An economical, multistage classifier for word string segmentation is presented and its reliability in dealing with continuous speech (problems of temporal normalization and context) is discussed. Speech synthesis is considered in terms of German linguistics and phonetics. Preprocessing algorithms for total synthesis of written texts were developed. A macrolanguage, MUSTER, is used to implement this processing in an acoustic data information system (ADES).
Speech perception (SP) most commonly refers to the perceptual mapping from the highly variable acoustic speech signal to a linguistic representation, whether it be phonemes, diphones, syllables, or words. This is an example of categorization, in that potentially discriminable speech sounds are assigned to functionally equivalent classes. In this tutorial, we present some of the main challenges to our understanding
A non-mathematical introduction is provided to the speech signal. The production of speech is first described, including a survey of the categories into which speech sounds are grouped. This is followed by an account of some properties of human perception of sounds in general and of speech in particular. Speech is then compared with other signals. It is argued that it is more complex than artificial message bearing signals, and that unlike such signals speech contains no easily identified context-independent units that can be used in bottom-up decoding. Words and phonemes are examined, and phonemes are shown to have no simple manifestation in the acoustic signal. Speech communication is presented as an interactive process, in which the listener actively reconstructs the message from a combination of acoustic cues and prior knowledge, and the speaker takes the listener's capacities into account in deciding how much acoustic information to provide. The final section compares speech and text, arguing that cultural emphasis on written communication causes projection of the properties of text onto speech and that there are large differences between the styles of language appropriate for the two modes of communication. These differences are often ignored, with unfortunate results.
Speech recognition performance was measured in normal-hearing and cochlear-implant listeners with maskers consisting of either steady-state speech-spectrum-shaped noise or a competing sentence. Target sentences from a male talker were presented in the presence of one of three competing talkers (same male, different male, or female) or speech-spectrum-shaped noise generated from this talker at several target-to-masker ratios. For the normal-hearing listeners, target-masker combinations were processed through a noise-excited vocoder designed to simulate a cochlear implant. With unprocessed stimuli, a normal-hearing control group maintained high levels of intelligibility down to target-to-masker ratios as low as 0 dB and showed a release from masking, producing better performance with single-talker maskers than with steady-state noise. In contrast, no masking release was observed in either implant or normal-hearing subjects listening through an implant simulation. The performance of the simulation and implant groups did not improve when the single-talker masker was a different talker compared to the same talker as the target speech, as was found in the normal-hearing control. These results are interpreted as evidence for a significant role of informational masking and modulation interference in cochlear implant speech recognition with fluctuating maskers. This informational masking may originate from increased target-masker similarity when spectral resolution is reduced.
Stickney, Ginger S.; Zeng, Fan-Gang; Litovsky, Ruth; Assmann, Peter
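The target-to-masker ratio varied in such experiments is simply the level difference in dB between the target speech and the masker; 0 dB means the two are equally intense. A minimal sketch (the function names are ours, not the paper's):

```python
import math

def rms(samples):
    """Root-mean-square level of a signal segment."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def target_to_masker_ratio_db(target_rms, masker_rms):
    """Target-to-masker ratio in dB, computed from RMS amplitudes.
    Positive values mean the target is more intense than the masker."""
    return 20 * math.log10(target_rms / masker_rms)

# Equal levels give 0 dB; doubling the target amplitude adds about 6 dB.
print(target_to_masker_ratio_db(1.0, 1.0))  # 0.0
```

In practice the masker is rescaled relative to a fixed target level to sweep the ratio across conditions.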
In spite of the diversity of subjects subsumed under the generic term speech, all areas of this discipline are based on oral communication with its essential elements--voice, action, thought, and language. Speech may be viewed as a community of persons with a common tradition participating in a common dialog, described in part by the memberships…
Authoritarian teaching practices in ballet inhibit the use of private speech. This paper highlights the critical importance of private speech in the cognitive development of young ballet students, within what is largely a non-verbal art form. It draws upon research by Russian psychologist Lev Vygotsky and contemporary socioculturalists, to…
Written for students in the fields of speech correction and audiology, the text deals with the following: structures involved in respiration; the skeleton and the processes of inhalation and exhalation; phonation and pitch, the larynx, and esophageal speech; muscles involved in articulation; muscles involved in resonance; and the anatomy of the…
This paper presents the use of the wavelet transform for noise reduction in noisy speech signals. The use of different wavelets and different orders have been evaluated for their suitability as a transform for speech noise removal. The wavelets evaluated are the biorthogonal wavelets, Daubechies wavelets, coiflets as well as symlets. Also two different means of filtering the noise in
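The paper evaluates biorthogonal, Daubechies, coiflet, and symlet wavelets; as a minimal illustration of the underlying idea, the sketch below applies a single-level Haar transform with soft thresholding of the detail coefficients. This is the simplest possible case, chosen for self-containment, not one of the wavelets the paper tests.

```python
def haar_decompose(x):
    """Single-level orthonormal Haar transform of an even-length signal:
    pairwise averages (approximation) and differences (detail)."""
    s = 2 ** 0.5
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def soft_threshold(coeffs, t):
    """Shrink coefficients toward zero by t; small (noise-dominated)
    detail coefficients vanish entirely."""
    return [max(abs(c) - t, 0.0) * (1 if c >= 0 else -1) for c in coeffs]

def haar_reconstruct(approx, detail):
    """Inverse of haar_decompose."""
    s = 2 ** 0.5
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def denoise(signal, threshold):
    """Decompose, threshold the detail band, reconstruct."""
    a, d = haar_decompose(signal)
    return haar_reconstruct(a, soft_threshold(d, threshold))
```

With a zero threshold the transform round-trips the signal exactly; with a large threshold the high-frequency detail is discarded and each sample pair collapses to its average, which is the smoothing effect the thresholding exploits.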
This article describes a procedure to aid in the clinical appraisal of child speech. The approach, based on the work by Dinnsen, Chin, Elbert, and Powell (1990; Some constraints on functionally disordered phonologies: Phonetic inventories and phonotactics. "Journal of Speech and Hearing Research", 33, 28-37), uses a railway idiom to track gains…
The 11 articles in this collection deal with theoretical and practical freedom of speech issues. The topics covered are (1) the United States Supreme Court and communication theory; (2) truth, knowledge, and a democratic respect for diversity; (3) denial of freedom of speech in Jock Yablonski's campaign for the presidency of the United Mine…
The nine articles in this collection deal with theoretical and practical freedom of speech issues. Topics discussed include the following: (1) freedom of expression in Thailand and India; (2) metaphors and analogues in several landmark free speech cases; (3) Supreme Court Justice William O. Douglas's views of the First Amendment; (4) the San…
This issue of the "Free Speech Yearbook" contains the following: "Between Rhetoric and Disloyalty: Free Speech Standards for the Sunshine Soldier" by Richard A. Parker; "William A. Rehnquist: Ideologist on the Bench" by Peter E. Kane; "The First Amendment's Weakest Link: Government Regulation of Controversial Advertising" by Patricia Goss;…
The articles collected in this book originated at a conference at which legal and economic scholars discussed the issue of First Amendment protection for commercial speech. The first article, in arguing for freedom for commercial speech, finds inconsistent and untenable the arguments of those who advocate freedom from regulation for political…
This publication is the first of quarterly bulletins to be published by the Association of Departments and Administrators in Speech Communication (ADASC). Featured articles in this issue concern: non-academic careers for communications majors; the current employment situation facing those with both undergraduate and graduate degrees in speech;…
The purpose of the study was to examine the relationship of the limitation and outcome of simultaneous speech to those dimensions of personality indexed by Cattell's 16PF Questionnaire. More than 500 conversations of 24 female college students were computer-analyzed for instances of simultaneous speech, and the frequencies with which they…
This issue of "Free Speech" contains the following articles: "Daniel Schorr Relieved of Reporting Duties" by Laurence Stern, "The Sellout at CBS" by Michael Harrington, "Defending Dan Schorr" by Tom Wicker, "Speech to the Washington Press Club, February 25, 1976" by Daniel Schorr, "Funds Voted For Schorr Inquiry" by Richard Lyons, "Erosion of…
Jibbigo is a speech-to-speech translation application for iPhone, iPod touch, and iPad devices. Jibbigo allows the user to simply speak a sentence, and it speaks the sentence aloud in the other language, much like a personal human interpreter would. The speech-to-speech translation is bidirectional for a two way dialog between participants.
Multimodal speech and speaker modeling and recognition are widely accepted as vital aspects of state of the art human-machine interaction systems. While correlations between speech and lip motion as well as speech and facial expressions are widely studied, relatively little work has been done to investigate the correlations between speech and gesture. Detection and modeling of head,
Mehmet Emre Sargin; Oya Aran; Alexey Karpov; Ferda Ofli; Yelena Yasinnik; Stephen Wilson; Engin Erzin; Yücel Yemez; A. Murat Tekalp
Objectives: A common complaint of many older adults is difficulty communicating in situations where they must focus on one talker in the presence of other people speaking. In listening environments containing multiple talkers, age-related changes may be caused by increased sensitivity to energetic masking, increased susceptibility to informational masking (e.g., confusion between the target voice and masking voices), and/or cognitive deficits. The purpose of the present study was to tease out these contributions to the difficulties that older adults experience in speech-on-speech masking situations.
Design: Groups of younger, normal-hearing individuals and older adults with varying degrees of hearing sensitivity (n = 12 per group) participated in a study of sentence recognition in the presence of four types of maskers: a two-talker masker consisting of voices of the same sex as the target voice, a two-talker masker of voices of the opposite sex as the target, a signal-envelope-modulated noise derived from the two-talker complex, and a speech-shaped steady noise. Subjects also completed a voice discrimination task to determine the extent to which they were able to incidentally learn to tell apart the target voice from the same-sex masking voices and to examine whether this ability influenced speech-on-speech masking.
Results: Older adults had significantly poorer performance in the presence of all four types of maskers, with the largest absolute difference for the same-sex masking condition. When the data were analyzed in terms of relative group differences (i.e., adjusting for absolute performance), the greatest effect was found for the opposite-sex masker. Degree of hearing loss was significantly related to performance in several listening conditions. Some older subjects demonstrated a reduced ability to discriminate between the masking and target voices; performance on this task was not related to speech recognition ability.
Conclusions: The overall pattern of results suggests that although amount of informational masking does not seem to differ between older and younger listeners, older adults (particularly those with hearing loss) evidence a deficit in the ability to selectively attend to a target voice, even when the masking voices are from talkers of the opposite sex. Possible explanations for these findings include problems understanding speech in the presence of a masker with temporal and spectral fluctuations and/or age-related changes in cognitive function.
This paper introduces an informatics application for speech therapy in the Spanish language based on the use of Speech Technology. The objective of this work is to help children with different speech impairments to improve their communication skills. Speech technology provides methods which can help children who suffer from speech disorders to develop pre-Language and Language. For pre-Language development the informatic
William Ricardo Rodríguez Dueñas; Carlos Vaquero; Oscar Saz; Eduardo Lleida
The aim of the investigation is to compare voice and speech quality in alaryngeal patients using esophageal speech (ESOP, eight subjects), electroacoustical speech aid (EACA, six subjects) and tracheoesophageal voice prosthesis (TEVP, three subjects). The subjects reading a short story were recorded in the sound-proof booth and the speech samples…
In this paper, we present approaches used in text summarization, showing how they can be adapted for speech summarization and where they fall short. Informal style and apparent lack of structure in speech mean that the typical approaches used for text summarization must be extended for use with speech. We illustrate how features derived from speech can help determine summary
Kathleen McKeown; Julia Hirschberg; Michel Galley; Sameer Maskey
We report on the case of a woman with jargon aphasic seizures who provided a careful written report of inner speech jargon occurring during her seizures. This description of inner speech jargon is an unusual finding, since in most aphasic disorders patients also suffer from anosognosia. This case report may suggest that jargon can also involve inner speech and can be internally detected as such. It provides an argument supporting the idea that common mechanisms may underlie both "overt" and "covert" production of jargon during aphasia. PMID:23523813
The effects of SpeechEasy on stuttering frequency, stuttering severity self-ratings, speech rate, and speech naturalness for 31 adults who stutter were examined. Speech measures were compared for samples obtained with and without the device in place in a dispensing setting. Mean stuttering frequencies were reduced by 79% and 61% for the device…
... patients. Work Environment Most speech-language pathologists work in schools or healthcare facilities. Some work in patients' homes. ... on physicians and surgeons , social workers , and psychologists . In schools, they work with teachers, special educators, other school ...
This new companion site from PBS offers an excellent collection of speeches, some with audio and video clips, from many of the nation's "most influential and poignant speakers of the recorded age." In the Speech Archives, users will find a timeline of significant 20th-century events interspersed with the texts of over 90 speeches, some of which also offer background and audio or video clips. Additional sections of the site include numerous activities for students: two quizzes in the American History Challenge, Pop-Up Trivia, A Wordsmith Challenge, Critics' Corner and Could You be a Politician? which allows visitors to try their hand at reading a speech off of a teleprompter.
The report describes a technique to automatically evaluate the intelligibility of speech transmitted over a communication channel. The technique is called CORODIM (Correlation Of the Recognition Of Degradation with Intelligibility Measurements). It transm...
A nonlinear transmission line model of the cochlea (Zweig 1988) is proposed as the basis for a novel speech preprocessor. Sounds of different intensities, such as voiced and unvoiced speech, are preprocessed in radically different ways. The Q's of the preprocessor's nonlinear filters vary with input amplitude, higher Q's (longer integration times) corresponding to quieter sounds. Like the cochlea, the preprocessor acts as a ''subthreshold laser'' that traps and amplifies low level signals, thereby aiding in their detection and analysis. 17 refs.
Until recently, research in speech perception and speech production has largely focused on the search for psychological and phonetic evidence of discrete, abstract, context-free symbolic units corresponding to phonological segments or phonemes. Despite this common conceptual goal and intimately related objects of study, however, research in these two domains of speech communication has progressed more or less independently for more than 60 years. In this article, we present an overview of the foundational works and current trends in the two fields, specifically discussing the progress made in both lines of inquiry as well as the basic fundamental issues that neither has been able to resolve satisfactorily so far. We then discuss theoretical models and recent experimental evidence that point to the deep, pervasive connections between speech perception and production. We conclude that although research focusing on each domain individually has been vital in increasing our basic understanding of spoken language processing, the human capacity for speech communication is so complex that gaining a full understanding will not be possible until speech perception and production are conceptually reunited in a joint approach to problems shared by both modes.
A speech-processing method is described that, given a probabilistic mapping between static speech sounds and pseudo-articulator positions, maps sequences of speech sounds to smooth sequences of pseudo-articulator positions. In addition, a method for learning such a probabilistic mapping is described; it uses a set of training data composed only of speech sounds. This speech processing can be applied to various speech analysis tasks, including speech recognition, speaker recognition, speech coding, speech synthesis, and voice mimicry.
Listening to a speech recording is much more difficult than visually scanning a document because of the transient and temporal nature of audio. Audio recordings capture the richness of speech, yet it is difficult to directly browse the stored information. This article describes techniques for structuring, filtering, and presenting recorded speech, allowing a user to navigate and interactively find information.
This paper describes an emotional speech synthesis system based on HMMs and related modeling techniques. For concatenative speech synthesis, we require all of the concatenation units that will be used to be recorded beforehand and made available at synthesis time. To adopt this approach for synthesizing the wide variety of human emotions possible in speech implies that
Various aspects of canned speech generation were examined. In this approach, brief tactical messages are generated by concatenating the speech waveforms corresponding to the individual words. According to the tests conducted, listeners unanimously preferred canned speech over synthetic speech generated by a text-to-speech converter. They selected canned speech not only for its higher intelligibility, but they also felt that canned
Common rationales for free speech are offered in legal writing across many countries, even though their laws regulating speech differ markedly. This article suggests another way of thinking about speech, based on particular qualities of speech which help to explain why public speech – or at least public speech perceived as valuable for cultural, political or other purposes – is
The Latin American Network Information Center at the University of Texas provides access to a searchable and browsable database of speeches by Cuban Leader Fidel Castro. It contains "full text of English translations of speeches, interviews, and press conferences by Castro, based upon the records of the Foreign Broadcast Information Service (FBIS), a US government agency responsible for monitoring broadcast and print media in countries throughout the world." Users should note that the search interface, while allowing searching on any of nine types of documents, as well as keyword and date, lacks user guidance. Documents are organized by date. While this is not a repository of all of Castro's speeches, the amount of material at the site makes it valuable to researchers.
Patty Amick, Cheryl Hawkins, and Lori Trumbo of Greenville Technical College created this resource to connect the art of public speaking with the task of demographic data collection. This course will help students create and interpret charts and graphs using mean, median, mode, and percentages. It will also allow students to recognize flawed surveys and design their own in order to produce valid data, all while writing a persuasive speech to incorporate their findings. This is a great website for educators looking to combine speech communication and math in a very hands-on way.
Twenty-six patients with the speech disorder of Parkinson's disease received daily speech therapy (prosodic exercises) at home for 2 to 3 weeks. There were significant improvements in speech as assessed by scores for prosodic abnormality and intelligibility, and these were maintained in part for up to 3 months. The degree of improvement was clinically and psychologically important, and relatives commented
Talking to oneself can be silent (inner speech) or vocalized for others to hear (private speech, or soliloquy). We investigated these two types of self-communication in 28 deaf signers and 28 hearing adults. With a questionnaire specifically developed for this study, we established the visible analog of vocalized private speech in deaf signers.…
In this paper, I shall analyze US President Barack Obama's South Carolina victory speech from the perspective of pragmemes. In particular, I shall explore the idea that this speech is constituted by many voices (in other words, it displays polyphony, to use an idea due to Bakhtin, 1981, 1986) and that the audience is part of this speech event, adding
This paper explores the expression of emotion in synthesized speech for an anthropomorphic robot. We have adapted several key emotional correlates of human speech to the robot's speech synthesizer to allow the robot to speak in either an angry, calm, disg...
The long-range objectives of the Packet Speech Systems Technology Program are to develop and demonstrate techniques for efficient digital speech communications on networks suitable for both voice and data, and to investigate and develop techniques for integrated voice and data communication in packetized networks, including wideband common-user satellite links. Specific areas of concern are: the concentration of statistically fluctuating volumes of voice traffic, the adaptation of communication strategies to varying conditions of network links and traffic volume, and the interconnection of wideband satellite networks to terrestrial systems. Previous efforts in this area have led to new vocoder structures for improved narrowband voice performance and multiple-rate transmission, and to demonstrations of conversational speech and conferencing on the ARPANET and the Atlantic Packet Satellite Network. The current program has two major thrusts: i.e., the development and refinement of practical low-cost, robust, narrowband, and variable-rate speech algorithms and voice terminal structures; and the establishment of an experimental wideband satellite network to serve as a unique facility for the realistic investigation of voice/data networking strategies.
In a speech, "Looking Ahead in Vocational Education," to a group of Hamilton educators, D.O. Davis, Vice-President, Engineering, Dominion Foundries and Steel Limited, Hamilton, Ontario, spoke of the challenge of change and what educators and industry must do to help the future of vocational education. (Editor)
... bilingual home affect my child's language and speech? The brain has to work harder to interpret and use 2 languages, so it may take longer for children to start using either one or both of the languages they're learning. It's not unusual ...
Objective: To integrate speaking practice with rhetorical theory. Type of speech: Persuasive. Point value: 100 points (i.e., 30 points based on peer evaluations, 30 points based on individual performance, 40 points based on the group presentation), which is 25% of course grade. Requirements: (a) References: 7-10; (b) Length: 20-30 minutes; (c)…
This instructor's manual discusses the use of three videotapes and three slide sets that were produced for the purpose of triggering, or stimulating, students to say and use particular sounds frequently in conversational speech. The manual provides statements of the purpose and objective of the media, a list of components, student performance…
Even though we are discussing a case that was not decided on the merits, Nike v. Kasky is an important case because it crystallizes two of the essential critiques about the commercial speech doctrine, critiques that have run through this doctrine from before its advent in 1976 to today. The fundamental debate Nike triggered over what constitutes ...
Ronald Collins; Mark Lopez; Tamara Piety; David C. Vladeck
Prof. C. A. Angell from Arizona State University read the following short and simple speech, saying the sentences in italics in the best Japanese he could manage (after earnest coaching from a Japanese colleague). The rest was translated on the bus ride, and then spoken, as I spoke, by Ms. Yukako Endo, to whom the author is very grateful.
This article considers the use of filtering software (“censor-ware”) and the World Wide Web. It argues that the United States government, religious groups, and corporations should not deter freedom of speech and access to information in libraries. The article describes how governments and religious groups try to prevent Web sites from reaching Internet users, and explains how corporations sell and
While the auditory-only aspects of Mandarin speech are heavily-researched and well-known in the field, this dissertation addresses its lesser-known aspects: The visual and audio-visual perception of Mandarin segmental information and lexical-tone information. Chapter II of this dissertation focuses on the audiovisual perception of Mandarin…
Velopharyngeal defects are a common problem encountered in dentistry; different prosthetic designs can be used, but the prosthesis should be comfortable and function properly. Surgical treatment is the treatment of choice, but some patients are not able to receive surgical intervention; in such cases prosthetic management is considered, using a speech bulb prosthesis to overcome nasal twang and nasal regurgitation. PMID:23861283
The Dutton Award speech consisted of Mr. Adlof's research history: the analysis/characterization of nylon-9 precursors (1968-1970), using deuterium-labeled fats to study their metabolism in humans (with Ed Emken; 1975-1998), and relating triglyceride structure(s) to a food's sensory properties and ...
This research guide provides an overview of the basic concepts as well as a comprehensive list of references that can help both the beginning and advanced speaker. Emphasis in this document is placed on researching, writing, and delivering a speech. In addition to information on finding and developing a topic, the book contains a chart of the…
In its 1942 ruling in the "Valentine vs. Chrestensen" case, the Supreme Court established the doctrine that commercial speech is not protected by the First Amendment. In 1975, in the "Bigelow vs. Virginia" case, the Supreme Court took a decisive step toward abrogating that doctrine, by ruling that advertising is not stripped of all First Amendment…
Common evaluation standards are critical to making progress in any field, but they can also distort research by shifting all the attention to a limited subset of the problem. Here, we consider the problem of evaluating algorithms for speech separation and acoustic scene analysis, noting some weaknesses of existing measures, and making some suggestions for future evaluations. We take
In this paper I will advance a simple argument against anti-obscenity legislation. This argument has both moral and legal force, as it appeals to the moral basis underlying the First Amendment protection of speech. In addition to defending this argument, I am concerned to show how it (1) renders moot a number of considerations commonly urged in favor of antiobscenity
This report concludes our work for the past two years on speech compression and synthesis. A real-time variable-frame-rate LPC vocoder was implemented operating at an average rate of 2000 bits/s. We also tested our mixed-source model as part of the vocoder. To improve the reliability of the extraction of LPC parameters, we implemented and tested a range of adaptive lattice and autocorrelation algorithms. For data rates above 5000 bits/s, we developed and tested a new high-frequency regeneration technique, spectral duplication, which reduces the roughness in the synthesized speech. As the first part of our effort towards a very-low-rate (VLR) vocoder, we implemented a phonetic synthesis program that would be compatible with our initial design for a phonetic recognition program. We also recorded and partially labeled a large database of diphone templates. During the second year we continued our work toward a VLR vocoder and also developed a multirate embedded-coding speech compression program that could transmit speech at rates varying from 9600 to 2400 b/s. The phonetic synthesis program and the labeling of the diphone template network were completed. There are currently 2845 diphone templates. We also implemented an initial version of a phonetic recognizer based on a network representation of diphone templates. The recognizer allows for incremental training of the network by modification of existing templates or addition of new templates.
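For readers unfamiliar with the LPC analysis at the heart of such vocoders, the standard autocorrelation method with the Levinson-Durbin recursion can be sketched as follows (a generic textbook illustration, not the report's implementation; the function name is ours):

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPC coefficients of one windowed speech frame via the
    autocorrelation method and the Levinson-Durbin recursion.
    Returns the prediction-error filter A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p
    and the residual (prediction error) energy."""
    n = len(frame)
    # Biased autocorrelation sequence r[0..order]
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err
```

Applied frame-by-frame with a variable frame rate, such coefficients (plus excitation parameters) are what a 2000 bits/s vocoder quantizes and transmits.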
Berouti, M.; Makhoul, J.; Schwartz, R.; Sorenson, J.
This is the first Semiannual Technical Summary Report on the Network Speech Processing Program to be submitted to the Defense Communications Agency. It covers the period 1 October 1976 through 31 March 1977 and reports on the following topics: Secure Voic...
|The seven articles in this collection deal with theoretical and practical freedom of speech issues. Topics covered are: the United States Supreme Court, motion picture censorship, and the color line; judicial decision making; the established scientific community's suppression of the ideas of Immanuel Velikovsky; the problems of avant-garde jazz,…
Effective communication between staff members is key to patient safety in hospitals. A variety of patient care activities including admittance, evaluation, and treatment rely on oral communication. Surprisingly, published information on speech intelligibility in hospitals is extremely limited. In this study, speech intelligibility measurements and occupant evaluations were conducted in 20 units of five different U.S. hospitals. A variety of unit types and locations were studied. Results show that overall, no unit had "good" intelligibility based on the speech intelligibility index (SII > 0.75), and several locations were found to have "poor" intelligibility (SII < 0.45). Further, occupied spaces were found to have 10%-15% lower SII than unoccupied spaces on average. Additionally, staff perception of communication problems at nurse stations was significantly correlated with SII ratings. In a targeted second phase, a unit treated with sound absorption had higher SII ratings for a larger percentage of time as compared to an identical untreated unit. Taken as a whole, the study provides an extensive baseline evaluation of speech intelligibility across a variety of hospitals and unit types, offers some evidence of the positive impact of absorption on intelligibility, and identifies areas for future research. PMID:23862833
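As context for the SII figures quoted above: in simplified form the index is an importance-weighted average of per-band audibility. A minimal sketch of that idea (not the full ANSI S3.5 procedure; the band count and importance weights below are illustrative, and the function name is ours):

```python
def simplified_sii(snr_db, importance):
    """Simplified speech intelligibility index:
    per-band audibility = (SNR + 15) / 30, clipped to [0, 1],
    weighted by band-importance values that sum to 1."""
    assert abs(sum(importance) - 1.0) < 1e-9
    sii = 0.0
    for snr, w in zip(snr_db, importance):
        audibility = min(max((snr + 15.0) / 30.0, 0.0), 1.0)
        sii += w * audibility
    return sii
```

Under this simplification, SII > 0.75 roughly requires speech to exceed the noise in the important bands, which is why occupied (noisier) spaces score lower.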
With the success of the Uni-Speech chip designed for advanced speech recognition systems, there is an effort to develop a cost-efficient version of the Uni-Speech chip targeted at mass consumer market speech-recognition-related applications, with special focus on improving power consumption. In this paper, a new low-cost speech device called UniSpeech-Lite is described. The UniSpeech-Lite is an integrated SoC
This new experimental search engine from Compaq indexes over 2,500 hours of content from 20 popular American radio shows. Using its speech recognition software, Compaq creates "a time-aligned 'transcript' of the program and build[s] an index of the words spoken during the program." Users can then search the index by keyword or advanced search. Search returns include the text of the clip, a link to a longer transcript, the relevant audio clip in RealPlayer format, the entire program in RealPlayer format, and a link to the radio show's Website. The index is updated daily. Please note that, while SpeechBot worked fine on Windows/NT machines, the Scout Project was unable to access the audio clips using Macs.
This research investigates the design and performance of the Speech Graffiti interface for spoken interaction with simple machines. Speech Graffiti is a standardized interface designed to address issues inherent in the current state-of-the-art in spoken dialog systems such as high word-error rates and the difficulty of developing natural language systems. This article describes the general characteristics of Speech Graffiti, provides
Stefanie Tomko; Thomas K. Harris; Arthur Toth; James Sanders; Alexander I. Rudnicky; Roni Rosenfeld
In this paper, we describe the IBM MASTOR systems which handle spontaneous free-form speech-to-speech translation on both laptop and hand-held PDAs. Challenges include speech recognition and machine translation in adverse environments, lack of data and linguistic resources for under-studied languages, and the need to rapidly develop capabilities for new languages. Importantly, the code and models must fit within the limited
THIS GUIDE TO HIGH SCHOOL SPEECH FOCUSES ON SPEECH AS ORAL COMPOSITION, STRESSING THE IMPORTANCE OF CLEAR THINKING AND COMMUNICATION. THE PROPOSED 1-SEMESTER BASIC COURSE IN SPEECH ATTEMPTS TO IMPROVE THE STUDENT'S ABILITY TO COMPOSE AND DELIVER SPEECHES, TO THINK AND LISTEN CRITICALLY, AND TO UNDERSTAND THE SOCIAL FUNCTION OF SPEECH. IN ADDITION…
The speech produced by the human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…
Recent interest in nonlinear modeling of speech has brought up the need to re-assess the performance limitations of linear speech models. While nonlinearity is essential in the production mechanism of speech, it need not be reflected in a speech-signal model. This paper addresses the question of what perceptual quality can be achieved for unvoiced speech by a linear model with white
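The linear model in question can be illustrated by exciting an all-pole (LPC) synthesis filter with white Gaussian noise; a minimal sketch, with the function name and parameter values our own assumptions:

```python
import numpy as np

def synth_unvoiced(a, gain, n, seed=0):
    """Generate unvoiced-speech-like noise by passing white Gaussian
    excitation through the all-pole synthesis filter 1/A(z), where
    A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p."""
    rng = np.random.default_rng(seed)
    e = gain * rng.standard_normal(n)
    p = len(a) - 1
    y = np.zeros(n)
    for t in range(n):
        acc = e[t]
        # Recursive (autoregressive) part of the synthesis filter
        for j in range(1, p + 1):
            if t - j >= 0:
                acc -= a[j] * y[t - j]
        y[t] = acc
    return y
```

The paper's question is, in effect, how perceptually close such white-noise-excited linear synthesis can come to natural unvoiced speech.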
Speech production, like other sensorimotor behaviors, relies on multiple sensory inputs—audition, proprioceptive inputs from muscle spindles and cutaneous inputs from mechanoreceptors in the skin and soft tissues of the vocal tract. However, the capacity for intelligible speech by deaf speakers suggests that somatosensory input alone may contribute to speech motor control and perhaps even to speech learning. We assessed speech
In the internet telephony, loss of IP packets causes instantaneous discontinuities in the received speech. In this paper, we have focused on finding an error resilient method for this problem. Our proposed method creates artificial correlation between speech samples that pre-distorts the speech signal. The receiver uses this correlation to reconstruct the lost speech pack- ets. An appropriate speech enhancement
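One simple way to create such cross-packet correlation is sample interleaving, so that a lost packet removes every n-th sample rather than a contiguous block, and the gaps can be interpolated from surviving neighbours. A toy sketch of this general idea (our illustration, not the paper's specific pre-distortion method):

```python
def interleave(samples, n):
    """Packet i carries samples i, i+n, i+2n, ... of the stream."""
    return [samples[i::n] for i in range(n)]

def reconstruct(packets, lost):
    """Reassemble the stream, linearly interpolating each sample of
    the lost packet from its surviving neighbours."""
    n = len(packets)
    length = n * len(packets[(lost + 1) % n])  # assumes equal-length packets
    out = [0.0] * length
    for i, p in enumerate(packets):
        if i != lost:
            out[i::n] = p
    for j in range(lost, length, n):
        left = out[j - 1] if j > 0 else out[j + 1]
        right = out[j + 1] if j + 1 < length else out[j - 1]
        out[j] = 0.5 * (left + right)
    return out
```

Because speech is locally smooth, the interpolated samples are close to the originals, which is the property such error-resilient schemes exploit.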
Information about the curricular and co-curricular speech programs in 90% of the Pennsylvania high schools was secured through two questionnaires, one to teachers and one to principals. Most of the people teaching speech are English teachers who are not certified in speech, who do not belong to any professional speech associations, who do not regularly read any speech journal, and who
Efficient, realistic face animation is still a challenge. A system is proposed that yields realistic animations for speech. It starts from real 3D face dynamics, observed at a frame rate of 25 fps for thousands of points on the faces of speaking actors. When asked to animate a face, it replicates the visemes it has learned, and adds the necessary coarticulation effects. The speech animation could be based on as few as 16 modes, extracted through Independent Component Analysis from the observed face dynamics. Faces for which only a static, neutral 3D model is available can also be animated. Rather than animating via verbatim copying of other faces' deformation fields, the visemes are adapted to the shape of the new face. By localising this face in a 'Face Space', where the locations of the example faces are also known, visemes are adapted automatically according to the relative distance with respect to these examples. The animation tool proposes a good speech-based face animation as a point of departure for animators, who also get support from the system to then make further changes as desired.
Kalberer, Gregor A.; Mueller, Pascal; Van Gool, Luc J.
Talking silently to ourselves occupies much of our mental lives, yet the mechanisms underlying this experience remain unclear. The following experiments provide behavioral evidence that the auditory content of inner speech is provided by corollary discharge. Corollary discharge is the motor system's prediction of the sensory consequences of its actions. This prediction can bias perception of other sensations, pushing percepts to match with prediction. The two experiments below show this bias induced by inner speech, demonstrating that inner speech causes external sounds to be heard as similar to the imagined speech, and that this bias operates on subphonemic content. PMID:23556693
Scott, Mark; Yeung, H Henny; Gick, Bryan; Werker, Janet F
Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known a priori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detecting certain types of NLSS, such as breaths and clicks, while detection of other types of NLSS such as filled pauses will require future research.
The effects of SpeechEasy on stuttering frequency, stuttering severity self-ratings, speech rate, and speech naturalness for 31 adults who stutter were examined. Speech measures were compared for samples obtained with and without the device in place in a dispensing setting. Mean stuttering frequencies were reduced by 79% and 61% for the device compared to the control conditions on reading and monologue tasks, respectively. Mean severity self-ratings decreased by 3.5 points for oral reading and 2.7 for monologue on a 9-point scale. Despite dramatic reductions in stuttering frequency, mean global speech rates in the device condition increased by only 8% in the reading task and 15% for the monologue task, and were well below normal. Further, complete elimination of stuttering was not associated with normalized speech rates. Nevertheless, mean ratings of speech naturalness improved markedly in the device compared to the control condition and, at 3.3 and 3.2 for reading and monologue, respectively, were only slightly outside the normal range. These results show that SpeechEasy produced improved speech outcomes in an assessment setting. However, findings raise the issue of a possible contribution of slowed speech rate to the stuttering reduction effect, especially given participants' instructions to speak chorally with the delayed signal as part of the active listening instructions of the device protocol. Study of device effects in situations of daily living over the long term is necessary to fully explore its treatment potential, especially with respect to long-term stability. Educational objectives: The reader will be able to discuss and evaluate: (1) issues pertinent to evaluating treatment benefits of fluency aids and (2) the effects of SpeechEasy on stuttering frequency, speech rate, and speech naturalness during testing in a dispensing setting for a relatively large sample of adults who stutter. PMID:18617052
The first working draft of the World Wide Web Consortium's (W3C) Semantic Interpretation for Speech Recognition is now available. The document "defines the process of Semantic Interpretation for Speech Recognition and the syntax and semantics of semantic interpretation tags that can be added to speech recognition grammars." The document is a draft, open for suggestions from W3C members and other interested users.
Lernout & Hauspie Speech Products; Tichelen, Luc V.
The current generation of continuous speech recognition systems claims to offer high accuracy (greater than 95 percent) speech recognition at natural speech rates (150 words per minute) on low-cost (under $2000) platforms. This paper presents a state-of-the-technology summary, along with insights the authors have gained through testing one such product extensively and other products superficially. The authors have identified a number
This final report describes research on real-time speech recognition. The authors developed, under other DARPA-funded contracts, a system for continuous speech recognition. BYBLOS, the BBN Continuous Speech Recognition System, consists of a general paradi...
Our long-term research goal is the development and implementation of speaker-independent continuous speech recognition systems. It is our conviction that proper utilization of speech-specific knowledge is essential for advanced speech recognition systems....
This article provides an overview of Web sites of historical audio recordings. Selected Web sites are reviewed from the standpoint of providing a library of historical speeches. Major sites reviewed include American Memory/The Nation's Forum Collection, the History Channel "Great Speeches," Oyez, Oyez, Oyez, US History Out Loud, and Douglass: Archives of American Public Address.
This study aims to identify aspects of speech-in-noise recognition that are susceptible to training, focusing on whether listeners can learn to adapt to target talkers ("tune in") and learn to better cope with various maskers ("tune out") after short-term training. Listeners received training on English sentence recognition in speech-shaped noise…
Noise degrades the performance of Automatic Speech Recognition systems mainly due to the mismatch it introduces between the training and recognition conditions. The noise causes a distortion of the feature space which usually presents a non-linear behavior. In order to reduce this mismatch, the methods proposed for robust speech recognition try to compensate for the noise effect either by obtaining
Angel de la Torre; Antonio M. Peinado; Jose C. Segura
This paper describes a method of compensating for nonlinear distortions in speech representation caused by noise. The method described here is based on the histogram equalization method often used in digital image processing. Histogram equalization is applied to each component of the feature vector in order to improve the robustness of speech recognition systems. The paper describes how the proposed
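The core operation, sketched below for a single feature component, maps each noisy value through its empirical CDF into the quantiles of a clean reference distribution. This is a rank-based sketch of the general histogram-equalization technique, not the paper's exact procedure; the function name is ours:

```python
import numpy as np

def histogram_equalize(x, ref):
    """Transform feature component x so its empirical distribution
    matches that of the clean reference samples `ref`."""
    ref_sorted = np.sort(ref)
    ranks = np.argsort(np.argsort(x))      # rank of each sample of x
    u = (ranks + 0.5) / len(x)             # empirical CDF values in (0, 1)
    # Inverse reference CDF, read off the sorted reference samples
    idx = np.minimum((u * len(ref_sorted)).astype(int), len(ref_sorted) - 1)
    return ref_sorted[idx]
```

Because the mapping is monotone, it removes the (possibly nonlinear) distortion of each component's distribution while preserving the ordering of the feature values.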
Ángel De La Torre; Antonio M. Peinado; José C. Segura; José L. Pérez-córdoba; M. Carmen Benítez; Antonio J. Rubio
Mainstream automatic speech recognition has focused almost exclusively on the acoustic signal. The performance of these systems degrades considerably in the real world in the presence of noise. On the other hand, most human listeners, both hearing-impaired and normal hearing, make use of visual information to improve speech perception in acoustically hostile environments. Motivated by humans' ability to lipread, the
X. Zhang; R. M. Mersereau; M. Clements; C. C. Broun
In the typical public speaking course, instructors or assistants videotape or digitally record at least one of the term's speeches in class or lab to offer students additional presentation feedback. Students often watch and self-critique their speeches on their own. Peers often give only written feedback on classroom presentations or completed…
This final report provides research results in the areas of automatic speech recognition (ASR), speech processing, machine translation (MT), natural language processing (NLP), and information retrieval (IR).
Instant messaging (IM) is commonly viewed as a "spoken" medium, in light of its reputation for informality, non-standard spelling and punctuation, and use of lexical shortenings and emoticons. However, the actual nature of IM is an empirical issue that bears linguistic analysis. To understand the linguistic character of IM, this article begins by considering differences between face-to-face speech and conventional
Human vocalizations are sounds made exclusively by a human vocal tract. Among other vocalizations, for example laughs or screams, speech is the most important. Speech is the primary medium of that supremely human symbolic communication system called language. One of the functions of a voice, perhaps the main one, is to realize language, by conveying some of the speaker's thoughts in linguistic form. Speech is language made audible. Moreover, when phoneticians compare and describe voices, they usually do so with respect to linguistic units, especially speech sounds, like vowels or consonants. It is therefore necessary to understand the structure as well as the nature of speech sounds and how they are described. In order to understand and evaluate speech, it is important to have at least a basic understanding of the science of speech acoustics: how the acoustics of speech are produced, how they are described, and how differences, both between speakers and within speakers, arise in an acoustic output. One of the aims of this article is to try to facilitate this understanding.
Purpose: This study investigates the ability to understand degraded speech signals and explores the correlation between this capacity and the functional characteristics of the peripheral auditory system. Method: The authors evaluated the capability of 50 normal-hearing native French speakers to restore time-reversed speech. The task required them…
This paper presents two new algorithms for robust speech pause detection (SPD) in noise. Our approach was to formulate SPD into a statistical decision theory problem for the optimal detection of noise-only segments, using the framework of model-based speech enhancement (MBSE). The advantages of this approach are that it performs well in high noise conditions, all necessary information is available
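A minimal statistical-decision sketch of the idea: model the log-energy of noise-only frames as Gaussian and flag a frame as a pause when its log-energy is not significantly above that model. This toy illustration is far simpler than the MBSE framework the paper uses, and all names and thresholds are our own assumptions:

```python
import numpy as np

def detect_pauses(frames, noise_mean, noise_std, threshold=3.0):
    """Flag a frame as a speech pause when its log-energy lies within
    `threshold` standard deviations of the noise-only model."""
    flags = []
    for frame in frames:
        log_e = np.log(np.mean(np.square(frame)) + 1e-12)
        z = (log_e - noise_mean) / noise_std
        flags.append(z < threshold)
    return flags
```

Casting pause detection as hypothesis testing in this way is what lets the decision threshold adapt gracefully as noise levels rise.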
We present a facial model designed primarily to support animated speech. Our facial model takes facial geometry as input and transforms it into a parametric deformable model. The facial model uses a muscle-based parameterization, allowing for easier integration between speech synchrony and facial expressions. Our facial model has a highly deformable lip model that is grafted onto the input facial geometry to provide the necessary geometric complexity needed for creating lip shapes and high-quality renderings. Our facial model also includes a highly deformable tongue model that can represent the shapes the tongue undergoes during speech. We add teeth, gums, and upper palate geometry to complete the inner mouth. To decrease the processing time, we hierarchically deform the facial surface. We also present a method to animate the facial model over time to create animated speech using a model of coarticulation that blends visemes together using dominance functions. We treat visemes as a dynamic shaping of the vocal tract by describing visemes as curves instead of keyframes. We show the utility of the techniques described in this paper by implementing them in a text-to-audiovisual-speech system that creates animation of speech from unrestricted text. The facial and coarticulation models must first be interactively initialized. The system then automatically creates accurate real-time animated speech from the input text. It is capable of cheaply producing tremendous amounts of animated speech with very low resource requirements. PMID:15868833
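The dominance-function blending of visemes described above follows the general spirit of the Cohen–Massaro coarticulation model; below is a toy numeric sketch with invented function and parameter names, not the authors' implementation.

```python
import numpy as np

def blend_visemes(t, targets, centers, alpha=1.0, theta=4.0):
    """Blend viseme targets with negative-exponential dominance functions.

    t: time axis; targets: per-viseme parameter values (e.g. a lip-shape
    parameter); centers: viseme peak times. alpha scales dominance, theta
    controls how fast it decays. All values are illustrative.
    """
    t = np.asarray(t, dtype=float)[:, None]
    # Dominance of each viseme decays exponentially away from its center
    dom = alpha * np.exp(-theta * np.abs(t - np.asarray(centers, dtype=float)[None, :]))
    # Dominance-weighted average of targets -> smooth parameter trajectory
    return (dom * np.asarray(targets, dtype=float)[None, :]).sum(axis=1) / dom.sum(axis=1)
```

Midway between two equally dominant visemes the trajectory passes through the mean of their targets, which is what produces the smooth curve-like (rather than keyframed) behavior the abstract describes.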
In order to examine whether children adjust their phonetic speech categories, children of two age groups, five-year-olds and eight-year-olds, were exposed to a video of a face saying /aba/ or /ada/ accompanied by an auditory ambiguous speech sound halfway between /b/ and /d/. The effect of exposure to these audiovisual stimuli was measured on…
This article reviews major approaches to the transcription of disordered speech using the International Alphabet (IPA). Application of selected symbols for transcribing non-English sounds is highlighted in clinical examples, as are commonly used diacritic symbols. Included is an overview of the IPA extensions for transcription of atypical speech,…
Indicates that (1) males with low interpersonal orientation (IO) were least vocally active and expressive and least consistent in their speech performances, and (2) high IO males and low IO females tended to demonstrate greater speech convergence than either low IO males or high IO females. (JD)
American slang reflects the diversity, imagination, self-confidence, and optimism of the American people. Its vitality is due in part to the guarantee of free speech and the lack of a national academy of language or of any official attempt to purify American speech, and in part to Americans' historic geographic mobility. Such "folksay" includes riddles and…
Efforts were made in this study to (1) relate the amount of silent speech during silent reading to level of reading proficiency, intelligence, age, and grade placement of subjects, and (2) determine whether the amount of silent speech during silent reading is affected by the level of difficulty of prose read and by the reading of a foreign…
Relations between a child's spontaneous social speech and pre-sleep monologue speech were investigated, along with the role of pre-sleep monologues in the child's acquisition of his mother tongue. Analysis to date suggests that chil...
Explores the complex issues inherent in the tension between hate speech and free speech, focusing on the phenomenon of hate speech on college campuses. Describes the challenges to hate speech made by critical race theorists and explains how a feminist critique can reorient the parameters of hate speech. (SLD)
Reconstruction of normal speech from Chinese whispered speech based on a radial basis function neural network (RBF NN) is proposed in this paper. First, the nonlinear mapping of the spectral envelope between whispered and normal speech is captured by the RBF NN; second, the spectral envelope of the whispered speech is modified using the trained network; finally, the whispered speech is converted into normal
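The envelope-mapping step can be sketched with a generic Gaussian-RBF network fitted by least squares. This is a stand-in for the paper's trained network: the feature layout, centers, and kernel width below are assumptions, not the authors' configuration.

```python
import numpy as np

def fit_rbf(X, Y, centers, width=1.0):
    """Fit RBF-network output weights by least squares.

    X: input feature vectors (e.g. whispered-speech envelope features),
    Y: target vectors (normal-speech envelope features), centers: RBF
    center vectors. All shapes/values here are illustrative.
    """
    G = np.exp(-np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
               / (2 * width ** 2))
    W, *_ = np.linalg.lstsq(G, Y, rcond=None)
    return W

def rbf_map(X, centers, W, width=1.0):
    """Map new envelope features through the fitted RBF network."""
    G = np.exp(-np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
               / (2 * width ** 2))
    return G @ W
```

When the centers coincide with the training inputs, the Gaussian kernel matrix is nonsingular and the network interpolates the training targets exactly, which is the usual sanity check for such a fit.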
Language is a catalyst of cognitive change during early childhood, and identification and assessment are crucial in speech therapy, especially in the early years. It is very important to identify and assess children with speech problems as early as possible. However, Estonian speech therapists do not have any scientifically validated standardised tests for determining the level of speech and language
Richard Nixon's 1952 "Checkers" speech was an innovative use of television for political communication. Like television news itself, the campaign fund crisis behind the speech can be thought of in the same terms as other television melodrama, with the speech serving as its climactic episode. The speech adapted well to television because it was…
Purpose: Speech perception in participants with auditory neuropathy (AN) was systematically studied to answer the following 2 questions: Does noise present a particular problem for people with AN? Can clear speech and cochlear implants alleviate this problem? Method: The researchers evaluated the advantage in intelligibility of clear speech over conversational speech in 13 participants with AN. Of these participants, 7
There are reasons to believe that infant-directed (ID) speech may make language acquisition easier for infants. However, the effects of ID speech on infants' learning remain poorly understood. The experiments reported here assess whether ID speech facilitates word segmentation from fluent speech. One group of infants heard a set of nonsense sentences spoken with intonation contours characteristic
Despite decades of research, the functional neuroanatomy of speech processing has been difficult to characterize. A major impediment to progress may have been the failure to consider task effects when mapping speech-related processing systems. We outline a dual-stream model of speech processing that remedies this situation. In this model, a ventral stream processes speech signals for comprehension, and a dorsal
This paper introduces a novel speech enhancement system based on a wavelet denoising framework. In this system, the noisy speech is first preprocessed using a generalized spectral subtraction method to initially lower the noise level with negligible speech distortion. A perceptual wavelet transform is then used to decompose the resulting speech signal into critical bands. Threshold estimation is implemented that is...
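The generalized spectral subtraction preprocessing step operates on magnitude spectra roughly as follows; the exponent, oversubtraction factor, and spectral floor below are common illustrative defaults, not the paper's settings.

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, over=1.0, floor=0.01):
    """Generalized spectral subtraction on magnitude spectra.

    noisy_mag / noise_mag: magnitude spectra of a noisy frame and of the
    noise estimate. alpha=2 gives power subtraction, alpha=1 magnitude
    subtraction; `over` oversubtracts the noise; `floor` keeps a small
    spectral floor to limit musical noise. Values are illustrative.
    """
    clean_pow = noisy_mag ** alpha - over * noise_mag ** alpha
    # Clamp to the spectral floor wherever subtraction went negative
    clean_pow = np.maximum(clean_pow, (floor * noise_mag) ** alpha)
    return clean_pow ** (1.0 / alpha)
```

For a bin with noisy magnitude 2 and noise estimate 1, power subtraction yields sqrt(4 − 1) ≈ 1.73; a bin dominated by noise is clamped to the floor instead of going negative.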
The ability to decode atypical and degraded speech signals as intelligible is a hallmark of speech perception. Human adults can perceive sounds as speech even when they are generated by a variety of nonhuman sources including computers and parrots. We examined how infants perceive the speech-like vocalizations of a parrot. Further, we examined…
Upon hearing an ambiguous speech sound dubbed onto lipread speech, listeners adjust their phonetic categories in accordance with the lipread information (recalibration) that tells what the phoneme should be. Here we used sine wave speech (SWS) to show that this tuning effect occurs if the SWS sounds are perceived as speech, but not if the sounds…
Multi-modal speech and speaker modelling and recognition are widely accepted as vital aspects of state of the art human-machine interaction systems. While correlations between speech and lip motion as well as speech and facial expressions are widely studied, relatively little work has been done to investigate the correlations between speech and gesture. Detection and modelling of head, hand and arm
Mehmet Emre Sargin; Ferda Ofli; Yelena Yasinnik; Oya Aran; Alexey Karpov; Stephen Wilson; Yucel Yemez; Engin Erzin; A. Murat Tekalp
In this article the nature of the discipline of speech science is considered and the various basic and applied areas of the discipline are discussed. The basic areas encompass the various processes of the physiology of speech production, the acoustical characteristics of speech, including the speech wave types and the information-bearing acoustic…
Normal hearing listeners are able to understand speech with different types of degradation, because speech has redundancy in the spectro-temporal domains. On the other hand, hearing impaired listeners have less such capability. Because of this, speech signal processing for hearing impairment needs to preserve important landmarks when enhancing a speech signal. The hearing impairments are characterized by high-frequency hearing loss,
The paper describes the development of a trainable speech synthesis system, based on hidden Markov models. An approach to speech signal generation using a source-filter model is presented. Inputs into the synthesis system are speech utterances and their phone level transcriptions. A method using context dependent acoustic models and Croatian phonetic rules for speech synthesis is proposed. Croatian HMM based
It is proposed that the acquisition and maintenance of fluent speech depend on the rapid temporal integration of motor feedforward and polysensory (auditory and somatosensory) feedback signals. In a functional magnetic resonance imaging study on 21 healthy right-handed, English-speaking volunteers, we investigated activity within these motor and sensory pathways and their integration during speech. Four motor conditions were studied: two speech conditions (propositional and nonpropositional speech) and two silent conditions requiring repetitive movement of the principal articulators (jaw and tongue movements). The scanning technique was adapted to minimize artifact associated with overt speech production. Our result indicates that this multimodal convergence occurs within the left and right supratemporal planes (STPs), with peaks of activity at their posteromedial extents, in regions classically considered as unimodal auditory association cortex. This cortical specialization contrasted sharply with the response of somatosensory association cortex (SII), in which activity was suppressed during speech but not during the silent repetitive movement of the principal articulators. It was also clearly distinct from the response of lateral auditory association cortex, which responded to auditory feedback alone, and from that within a left lateralized ventrolateral temporal and inferior frontal system, which served lexical- and sentence-level language retrieval. This response of cortical regions related to speech production is not predicted by the classical model of hierarchical cortical processing, providing new insights into the role of the STP in polysensory integration and into the modulation of activity in SII during normal speech production. These findings have novel implications for the acquisition and maintenance of fluent speech. PMID:18829954
Dhanjal, Novraj S; Handunnetthi, Lahiru; Patel, Maneesh C; Wise, Richard J S
There is provided an apparatus and method for assisting speech recovery in people with inability to speak due to aphasia, apraxia or another condition with similar effect. A hollow, rigid, thin-walled tube with semi-circular or semi-elliptical cut out shapes at each open end is positioned such that one end mates with the throat/voice box area of the neck of the assistor and the other end mates with the throat/voice box area of the assisted. The speaking person (assistor) makes sounds that produce standing wave vibrations at the same frequency in the vocal cords of the assisted person. Driving the assisted person's vocal cords with the assisted person being able to hear the correct tone enables the assisted person to speak by simply amplifying the vibration of membranes in their throat.
Silog is a biometrie authentication system that extends the conventional PC logon process using voice verification. Users enter their ID and password using a conventional Windows logon procedure but then the biometrie authentication stage makes a Voice over IP (VoIP) call to a VoiceXML (VXML) server. User interaction with this speech-enabled component then allows the user's voice characteristics to be extracted as part of a simple user/system spoken dialogue. If the captured voice characteristics match those of a previously registered voice profile, then network access is granted. If no match is possible, then a potential unauthorised system access has been detected and the logon process is aborted.
The Center for Spoken Language Understanding (CSLU) provides free language resources to researchers and educators in all areas of speech and hearing science. These resources are of great potential value to speech scientists for analyzing speech, for diagnosing and treating speech and language problems, for researching and evaluating language technologies, and for training students in the theory and practice of speech science. This article describes language resources
In this paper, we explore the concept of dual-purpose speech: speech that is socially appropriate in the context of a human-to-human conversation which also provides meaningful input to a computer. We motivate the use of dual-purpose speech and explore issues of privacy and technological challenges related to mobile speech recognition. We present three applications that utilize dual-purpose speech to assist
Kent Lyons; Christopher Skeels; Thad Starner; Cornelis M. Snoeck; Benjamin A. Wong; Daniel Ashbrook
Is computer-synthesized speech as persuasive as the human voice when presenting an argument? After completing an attitude pretest, 193 participants were randomly assigned to listen to a persuasive appeal under three conditions: a high-quality synthesized speech system (DECtalk Express), a low-quality synthesized speech system (Monologue), and a tape recording of a human voice. Following the appeal, participants completed a posttest attitude survey and a series of questionnaires designed to assess perceptions of speech qualities, perceptions of the speaker, and perceptions of the message. The human voice was generally perceived more favorably than the computer-synthesized voice, and the speaker was perceived more favorably when the voice was a human voice than when it was computer synthesized. There was, however, no evidence that computerized speech, as compared with the human voice, affected persuasion or perceptions of the message. Actual or potential applications of this research include issues that should be considered when designing synthetic speech systems. PMID:10774129
A new speech discrimination test was formulated by extracting selected words from recordings of the Harvard PB-50 lists. Fifty words were chosen which had been found to be significantly more difficult for patients with sensorineural hearing losses to reco...
As researchers continue to improve speech in noisy environments, more interest is being placed on sensors with modalities that can be fused with traditional acoustic sensors. The standard literature has shown that electromagnetic sensors can b...
E. F. Greneker; I. Chuckpaiwong; J. L. Geisheimer; S. A. Billington
A model of the speech recognition process was presented in an attempt to find adequate measures of intelligibility and intelligence. An experiment was designed to test the model and the validity of the measures. Four trained subjects responded to distorte...
The research identified the overall requirements of an end to end phonetic speech recognition process for a speech recognition software product (SPSR). The effort has focused on expanding the existing Standard Objects for Phonetic Speech Recognition techn...
In this research we aim to detect subjective sentences in spontaneous speech and label them for polarity. We introduce a novel technique wherein subjective patterns are learned from both labeled and unlabeled data, using n-grams with varying levels of lexical instantiation. Applying this technique to meeting speech, we gain significant improvement over state-of-the-art approaches and
Automatic generation of punctuation is an essential feature for many speech-to-text transcription tasks. This paper describes a maximum a-posteriori (MAP) approach for inserting punctuation marks into raw word sequences obtained from automatic speech recognition (ASR). The system consists of an "acoustic model" (AM) for prosodic features (actually pause duration) and a "language model" (LM) for text-only features. The LM combines
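The MAP combination of the pause-duration model and the language model amounts to an argmax over punctuation hypotheses. The probabilities below are invented purely for illustration; a real system estimates them from transcribed data.

```python
import math

# Toy MAP combination of a pause-duration "acoustic model" (AM) and a
# text-context "language model" (LM) for punctuation insertion.
AM = {  # P(pause-duration bucket | punctuation) -- illustrative values
    ("long", "."): 0.7, ("long", ","): 0.2, ("long", ""): 0.1,
    ("short", "."): 0.1, ("short", ","): 0.3, ("short", ""): 0.6,
}
LM = {".": 0.2, ",": 0.3, "": 0.5}  # P(punctuation | word context)

def map_punct(pause_bucket):
    """Pick argmax_p P(pause | p) * P(p | context), the MAP punctuation."""
    return max(LM, key=lambda p: math.log(AM[(pause_bucket, p)]) + math.log(LM[p]))
```

With these toy numbers a long pause is decoded as a period (0.7 × 0.2 beats the alternatives) while a short pause yields no punctuation.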
This study investigated how native language background influences the intelligibility of speech by non-native talkers for non-native listeners from either the same or a different native language background as the talker. Native talkers of Chinese (n=2), Korean (n=2), and English (n=1) were recorded reading simple English sentences. Native listeners of English (n=21), Chinese (n=21), Korean (n=10), and a mixed group from various native language backgrounds (n=12) then performed a sentence recognition task with the recordings from the five talkers. Results showed that for native English listeners, the native English talker was most intelligible. However, for non-native listeners, speech from a relatively high proficiency non-native talker from the same native language background was as intelligible as speech from a native talker, giving rise to the ``matched interlanguage speech intelligibility benefit.'' Furthermore, this interlanguage intelligibility benefit extended to the situation where the non-native talker and listeners came from different language backgrounds, giving rise to the ``mismatched interlanguage speech intelligibility benefit.'' These findings shed light on the nature of the talker-listener interaction during speech communication.
Three types of exaggerated speech are thought to be systematic responses to accommodate the needs of the listener: child-directed speech (CDS), hyperspeech, and the Lombard response. CDS (e.g., Kuhl et al., 1997) occurs in interactions with young children and infants. Hyperspeech (Johnson et al., 1993) is a modification in response to listeners' difficulties in recovering the intended message. The Lombard response (e.g., Lane et al., 1970) is a compensation for increased noise in the signal. While all three result from adaptations to accommodate the needs of the listener, and therefore should share some features, their triggering conditions are quite different, and they should therefore exhibit differences in their phonetic outcomes. While CDS has been the subject of a variety of acoustic studies, it has never been studied in the broader context of the other "exaggerated" speech styles. A large crosslinguistic study was undertaken that compares speech produced under four conditions: spontaneous conversations, CDS aimed at 6-9-month-old infants, hyperarticulated speech, and speech in noise. This talk will present some findings for North American English as spoken in the Pacific Northwest. The measures include f0, vowel duration, F1 and F2 at vowel midpoint, and intensity.
Wright, Richard; Carmichael, Lesley; Beckford Wassink, Alicia; Galvin, Lisa
It has been reported that listeners can benefit from a release in masking when the masker speech is spoken in a language that differs from the target speech compared to when the target and masker speech are spoken in the same language [Freyman, R. L. et al. (1999). J. Acoust. Soc. Am. 106, 3578–3588; Van Engen, K., and Bradlow, A. (2007), J. Acoust. Soc. Am. 121, 519–526]. It is unclear whether listeners benefit from this release in masking due to the lack of linguistic interference of the masker speech, from acoustic and phonetic differences between the target and masker languages, or a combination of these differences. In the following series of experiments, listeners’ sentence recognition was evaluated using speech and noise maskers that varied in the amount of linguistic content, including native-English, Mandarin-accented English, and Mandarin speech. Results from three experiments indicated that the majority of differences observed between the linguistic maskers could be explained by spectral differences between the masker conditions. However, when the recognition task increased in difficulty, i.e., at a more challenging signal-to-noise ratio, a greater decrease in performance was observed for the maskers with more linguistically relevant information than what could be explained by spectral differences alone.
Calandruccio, Lauren; Dhar, Sumitrajit; Bradlow, Ann R.
Previous researchers interested in physical assessment of speech intelligibility have largely based their predictions on preservation of spectral shape. A new approach is presented in which intelligibility is predicted to be preserved only if a transformation modifies relevant speech parameters in a consistent manner. In particular, the parameters from each short-time interval are described by one of a finite number of symbols formed by quantizing the output of an auditory model, and preservation of intelligibility is modeled as the extent to which a one-to-one correspondence exists between the symbols of the input to the transformation, and those of the output. In this paper, a consistency-measurement system is designed and applied to prediction of intelligibility of linearly filtered speech and speech degraded by additive noise. Results were obtained for two parameter sets: one consisting of band-energy values, and the other based on the ensemble interval histogram (EIH) model. Predictions within a class of transformation varied monotonically with the amount of degradation. Across classes of transformation, the predicted effect of additive-noise transformations was more severe than typical perceptual effects. With respect to the goal of achieving predictions that varied monotonically with human speech-perception scores, performance was slightly better with the EIH parameter set. PMID:8969488
The present study sought an acoustic signature for the speech disturbance recognized in cerebellar degeneration. Magnetic resonance imaging was used for a radiological rating of cerebellar involvement in six cerebellar ataxic dysarthric speakers. Acoustic measures of the [pap] syllables in contrastive prosodic conditions and of normal vs. brain-damaged patients were used to further our understanding both of the speech degeneration that accompanies cerebellar pathology and of speech motor control and movement in general. Pair-wise comparisons of the prosodic conditions within the normal group showed statistically significant differences for four prosodic contrasts. For three of the four contrasts analyzed, the normal speakers showed both longer durations and higher formant and fundamental frequency values in the more prominent first condition of the contrast. The acoustic measures of the normal prosodic contrast values were then used as a model to measure the degree of speech deterioration for individual cerebellar subjects. This estimate of speech deterioration as determined by individual differences between cerebellar and normal subjects' acoustic values of the four prosodic contrasts was used in correlation analyses with MRI ratings. Moderate correlations between speech deterioration and cerebellar atrophy were found in the measures of syllable duration and f0. A strong negative correlation was found for F1. Moreover, the normal model presented by these acoustic data allows for a description of the flexibility of task-oriented behavior in normal speech motor control. These data challenge spatio-temporal theory which explains movement as an artifact of time wherein longer durations predict more extreme movements and give further evidence for gestural internal dynamics of movement in which time emerges from articulatory events rather than dictating those events. This model provides a sensitive index of cerebellar pathology with quantitative acoustic analyses.
Defines speech melody, with special attention to the distinction between its prosodic and paralinguistic domains. Discusses the role of the prosodic characteristics (stress, center, juncture, pitch direction, pitch height, utterance unit, and utterance group) in producing meaning in speech. (JMF)
Presented herein are systems and methods for generating an adaptive noise codebook for use with electronic speech systems. The noise codebook includes a plurality of entries which may be updated based on environmental noise sounds. The speech system inclu...
This report presents the results and conclusions of a 1972 study performed by the Task Force on Speech Pathology and Audiology. Thirteen educational institutions offering degrees in speech pathology and audiology in Louisiana were surveyed, and completed ...
This paper examines how people communicate with computers using speech. Automatic speech recognition (ASR) transforms speech into text, while automatic speech synthesis [or text-to-speech (TTS)] performs the reverse task. ASR has been largely developed based on speech coding theory, while simulating certain spectral analyses performed by the ear. Typically, a Fourier transform is employed, but following the auditory Bark scale
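The Bark-scale analysis mentioned in this record is commonly computed with the Zwicker–Terhardt closed-form approximation of the auditory frequency scale:

```python
import math

def hz_to_bark(f):
    """Zwicker & Terhardt approximation of the auditory Bark scale.

    f: frequency in Hz; returns the critical-band rate in Bark.
    """
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)
```

By this approximation 1 kHz maps to roughly 8.5 Bark and 100 Hz to roughly 1 Bark, matching the usual tabulated critical-band values.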
A neural network based model is developed to quantify speech intelligibility by blind-estimating speech transmission index, an objective rating index for speech intelligibility of transmission channels, from transmitted speech signals without resort to knowledge of original speech signals. It consists of a Hilbert transform processor for speech envelope detection, a Welch average periodogram algorithm for envelope spectrum estimation, a principal
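The front end this record describes (Hilbert-transform envelope detection followed by a Welch-style averaged periodogram of the envelope) can be sketched in a few lines of numpy. The segmentation choices below (non-overlapping Hann windows) are simplifications, not the paper's configuration.

```python
import numpy as np

def envelope(x):
    """Speech envelope via the analytic signal (FFT-based Hilbert transform)."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    # Zero negative frequencies, double positives -> analytic signal
    return np.abs(np.fft.ifft(X * h))

def envelope_spectrum(x, fs, seg=256):
    """Averaged periodogram of the envelope (Welch-style, simplified to
    non-overlapping Hann segments; parameter values are illustrative)."""
    e = envelope(x)
    e = e - e.mean()
    win = np.hanning(seg)
    segs = [e[i:i + seg] * win for i in range(0, len(e) - seg + 1, seg)]
    psd = np.mean([np.abs(np.fft.rfft(s)) ** 2 for s in segs], axis=0)
    freqs = np.fft.rfftfreq(seg, 1.0 / fs)
    return freqs, psd
```

For a pure tone occupying a whole number of cycles in the analysis window, the recovered envelope is flat, which is the standard sanity check for an FFT-based Hilbert implementation.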
Information extraction from speech is a crucial step on the way from speech recognition to speech understanding. A preliminary step toward speech understanding is the detection of topic boundaries, sentence boundaries, and proper names in speech recognizer output. This is important since speech recognizer output lacks the usual textual cues to these entities (such as headers, paragraphs, sentence punctuation, and...
Dilek Z. Hakkani-Tür; Gökhan Tür; Andreas Stolcke; Elizabeth Shriberg
This guide for the elementary school classroom teacher discusses her role in a program of speech therapy or speech improvement, whether in cooperation with a speech therapist or alone. Good speech and defective speech are defined, and activities to encourage speech in the classroom are listed. Specific diagnostic techniques and therapeutic…
Speech quality and intelligibility might significantly deteriorate in the presence of background noise, especially when the speech signal is subject to subsequent processing. In particular, speech coders and automatic speech recognition (ASR) systems that were designed or trained to act on clean speech signals might be rendered useless in the presence of background noise. Speech enhancement algorithms have therefore attracted
This paper presents the Croatian context-dependent acoustic modelling used in speech recognition and in speech synthesis. The proposed acoustic model is based on context-dependent triphone hidden Markov models and Croatian phonetic rules. For speech recognition and speech synthesis system modelling and testing the Croatian speech corpus VEPRAD was used. The ex- periments have shown that Croatian speech corpus, Croatian phonetic
When we speak, we spontaneously produce gestures (co-speech gestures). Co-speech gestures and speech production are closely interlinked. However, the exact nature of the link is still under debate. To address the question of whether co-speech gestures originate from the speech production system or from a system independent of speech production, the present study examined the relationship between co-speech
The purpose of this review was to examine the different treatment approaches for persons with Parkinson's Disease (PD) and to examine the effects of these treatments on speech. Treatment methods reviewed include speech therapy, pharmacological, and surgical. Research from the 1950s through the 1970s had not demonstrated significant improvements following speech therapy. Recent research has shown that speech therapy (when
A multimedia speech therapy system should be usable for customized speech therapy for different problems and for different ages. The speech recognition must be designed to work with high inter- and intra-speaker variability. In addition to displaying text on a screen, recording the voice reading the text, analyzing the recorded spoken signal, and performing speech recognition which
• The speech language pathologist (SLP) and otolaryngologist may collaborate effectively in the assessment and treatment of children with speech, resonance, and feeding/swallowing disorders.
• Assessment protocols may involve multidisciplinary efforts, such as a feeding or airway team.
• Knowledge of speech and feeding developmental milestones can help guide the practitioner in identification of problems and the referral process to
Contemporary approaches to speech and speaker recognition decompose the problem into four components: feature extraction, acoustic modeling, language modeling and search. Statistical signal processing is an integral part of each of these components, and Bayes Rule is used to merge these components into a single optimal choice. Acoustic models typically use hidden Markov models based on Gaussian mixture models for state output probabilities. This popular approach suffers from an inherent assumption of linearity in speech signal dynamics. Language models often employ a variety of maximum entropy techniques, but can employ many of the same statistical techniques used for acoustic models. In this paper, we focus on introducing nonlinear statistical models to the feature extraction and acoustic modeling problems as a first step towards speech and speaker recognition systems based on notions of chaos and strange attractors. Our goal in this work is to improve the generalization and robustness properties of a speech recognition system. Three nonlinear invariants are proposed for feature extraction: Lyapunov exponents, correlation fractal dimension, and correlation entropy. We demonstrate an 11% relative improvement on speech recorded under noise-free conditions, but show a comparable degradation occurs for mismatched training conditions on noisy speech. We conjecture that the degradation is due to difficulties in estimating invariants reliably from noisy data. To circumvent these problems, we introduce two dynamic models to the acoustic modeling problem: (1) a linear dynamic model (LDM) that uses a state space-like formulation to explicitly model the evolution of hidden states using an autoregressive process, and (2) a data-dependent mixture of autoregressive (MixAR) models. Results show that LDM and MixAR models can achieve comparable performance with HMM systems while using significantly fewer parameters. 
Currently we are developing Bayesian parameter estimation and discriminative training algorithms for these new models to improve noise robustness.
Srinivasan, S.; Ma, T.; May, D.; Lazarou, G.; Picone, J.
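The four-component decomposition described above merges acoustic and language model scores through Bayes' rule. A minimal sketch of that fusion step, with hypothetical hypotheses and log-scores rather than the authors' system:

```python
def best_hypothesis(acoustic_logp, lm_logp, lm_weight=1.0):
    """Pick the hypothesis maximizing log P(O|W) + lm_weight * log P(W).

    By Bayes' rule P(W|O) is proportional to P(O|W) P(W); P(O) is constant
    across hypotheses, so the argmax needs only the two model scores."""
    scores = {w: acoustic_logp[w] + lm_weight * lm_logp[w] for w in acoustic_logp}
    return max(scores, key=scores.get)

# Hypothetical log-scores for three candidate transcriptions.
acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5, "rec on speech": -15.0}
lm = {"recognize speech": -3.0, "wreck a nice beach": -7.0, "rec on speech": -9.0}
print(best_hypothesis(acoustic, lm))  # language model overrides the acoustic favorite
```

With `lm_weight=0.0` the acoustically best but implausible hypothesis wins instead, which is why the language model weight matters in practice.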
This paper reviews the psychophysical basis for auditory models and discusses their application to automatic speech recognition. First, an overview of the human auditory system is presented, followed by a review of current knowledge gleaned from neurological and psychoacoustic experimentation. Next, a general framework describes established peripheral auditory models, which are based on well-understood properties of the peripheral auditory system. This is followed by a discussion of current enhancements to those models to include nonlinearities and synchrony information as well as other higher auditory functions. Finally, the initial performance of auditory models in the task of speech recognition is examined and additional applications are mentioned.
Written for those interested in speech pathology and audiology, the text presents the anatomical, physiological, and neurological bases for speech and hearing. Anatomical nomenclature used in the speech and hearing sciences is introduced and the breathing mechanism is defined and discussed in terms of the respiratory passage, the framework and…
Almost all theories of child speech development assume that an infant learns speech sounds by direct imitation, performing an acoustic matching of adult output to his own speech. Some theories also postulate an innate link between perception and production. We present a computer model which has no requirement for acoustic matching on the part of the infant and which treats
With detailed discussion and invaluable video footage of 23 treatment interventions for speech sound disorders (SSDs) in children, this textbook and DVD set should be part of every speech-language pathologist's professional preparation. Focusing on children with functional or motor-based speech disorders from early childhood through the early…
Pulsed ultrasound was used to study tongue movements in the speech of children from 3 to 11 years of age. Speech data obtained were characteristic of systems that can be described by second-order differential equations. Relationships observed in these systems may indicate that speech control involves tonic and phasic muscle inputs. (Author/RH)
Combining information from the visual and auditory senses can greatly enhance intelligibility of natural speech. Integration of audiovisual speech signals is robust even when temporal offsets are present between the component signals. In the present study, we characterized the temporal integration window for speech and nonspeech stimuli with…
Maier, Joost X.; Di Luca, Massimiliano; Noppeney, Uta
In this paper we present an overview of the state-of-the-art approaches for speech recognition of the Russian language. Since Russian is a highly inflective language with a complex mechanism of word formation, the main approaches for English speech recognition are not optimally applicable for Russian speech, at least not directly and without appropriate adaptation. Here we give an overview of some
Sergey Zablotskiy; Kseniya Zablotskaya; Wolfgang Minker
The purpose of this report is to compare speech which is loud as a consequence of noise exposure (Lombard speech) with speech which is loud deliberately. One male speaker was recorded in six speaking conditions: (1) ambient noise, (2) high noise, and (3) ...
... of speech. Commercial products, programs, apps or kits can be […] Read More What To Look for In an SLP for Your Child In the United States, speech-language pathologists (SLP) are certified by the American Speech Language and Hearing Association (ASHA). Once the ...
Most college and university speech codes would not survive a legal challenge, according to a report released in December by the Foundation for Individual Rights in Education, a watchdog group for free speech on campuses. The report labeled many speech codes as overly broad or vague, and cited examples such as Furman University's prohibition of…
This article examines caregiver speech to young children. The authors obtained several measures of the speech used to children during early language development (14–30 months). For all measures, they found substantial variation across individuals and subgroups. Speech patterns vary with caregiver education, and the differences are maintained over time. While there are distinct levels of complexity for different caregivers, there
Janellen Huttenlocher; Marina Vasilyeva; Heidi R. Waterfall; Jack L. Vevea; Larry V. Hedges
As part of a pilot data collection for DARPA's Continuous Speech Recognition (CSR) speech corpus, SRI International experimented with the collection of spontaneous speech material. The bulk of the CSR pilot data was read versions of news articles from the Wall Street Journal (WSJ), and the spontaneous sentences were to be similar material, but spontaneously dictated. In the first pilot portion of the
In this paper we describe a generic architecture for single channel speech enhancement. We assume processing in frequency domain and suppression based speech enhancement methods. The framework consists of a two stage voice activity detector, noise variance estimator, a suppression rule, and an uncertain presence of the speech signal modifier. The evaluation corpus is a synthetic mixture of a clean
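The suppression-rule stage of such a framework can be sketched with a Wiener-style gain per frequency bin; the power-subtraction SNR estimate and the gain floor below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def suppress(noisy_power, noise_var, gain_floor=0.1):
    """Compute a Wiener-style suppression gain per frequency bin.

    snr_prio is a crude a priori SNR estimate (power subtraction);
    the gain floor keeps residual noise from sounding 'musical'."""
    snr_prio = np.maximum(noisy_power / noise_var - 1.0, 0.0)
    gain = np.maximum(snr_prio / (1.0 + snr_prio), gain_floor)
    return gain

noisy_power = np.array([1.0, 4.0, 16.0])  # |Y(f)|^2 per bin (hypothetical)
noise_var = np.array([1.0, 1.0, 1.0])     # estimated noise variance per bin
print(suppress(noisy_power, noise_var))   # gain grows with estimated SNR
```

In a full system this gain would multiply the noisy spectrum before the inverse transform, with the voice activity detector gating the noise variance update.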
According to K. S. Lashley (1951) and U. Neisser (1967), models of speech cognition frequently stress a rhythmic organizing principle, indicating that speech processing is intimately related to the processing of rhythmic patterns. However, in agreement with clinical data, dichotic listening studies establish that while speech stimuli are processed by the left hemisphere, other nonspeech auditory stimuli are processed by
This paper considers the estimation of speech parameters in an all-pole model when the speech has been degraded by additive background noise. The procedure, based on maximum a posteriori (MAP) estimation techniques is first developed in the absence of noise and related to linear prediction analysis of speech. The modification in the presence of background noise is shown to be
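In the noise-free case the abstract relates the MAP procedure to linear prediction analysis; the classical Levinson-Durbin recursion for the all-pole coefficients can be sketched as follows (this is the standard noise-free baseline, not the paper's MAP estimator):

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the linear-prediction normal equations for an all-pole model
    from autocorrelation lags r[0..order] by Levinson-Durbin recursion."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = float(r[0])                 # prediction error power
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# Autocorrelation of an AR(1)-like signal: r[k] = 0.5**k (hypothetical)
r = np.array([1.0, 0.5, 0.25])
a, err = levinson_durbin(r, 2)
print(np.round(a, 3), round(err, 3))  # predictor ~ [1, -0.5, 0], error 0.75
```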
Speech has recently made headway towards becoming a more mainstream interface modality. For example, there is an increasing number of call center applications, especially in the airline and banking industries. However, speech still has many properties that cause its use to be problematic, such as its inappropriateness in both very quiet and very noisy environments, and the tendency of speech
Frankie James; Jennifer Lai; Bernhard Suhm; Bruce Balentine; John Makhoul; Clifford Nass; Ben Shneiderman
The classification of schizophrenic patients into thought disordered (TD) and non-thought disordered (NTD) is questioned. It is suggested that the terms speech disordered and non-speech disordered are more exact, and that viewing deviant speech structurally with no assumptions as to the thought behind it may prove valuable.
This book is written as a guide to the understanding of the processes involved in human speech communication. Ten authorities contributed material to provide an introduction to the physiological aspects of speech production and reception, the acoustical aspects of speech production and transmission, the psychophysics of sound reception, the…
Pitch is a psychoacoustic construct crucial in the production and perception of speech and songs. This article is an exploration of the interface of speech and song performance of Chinese speakers. Although parallels might be drawn from the prosodic and sound structures of the linguistic and musical systems, perceiving and producing speech and…
Speech recognition is usually regarded as a problem in the field of pattern recognition, where one first estimates the probability density function of each pattern to be recognized and then uses Bayes' theorem to identify the pattern which provides the highest likelihood for the observed speech data. In this paper, we take a different approach to this problem. In speech
Since January 1993, the authors have been working to refine and extend Sphinx-II technologies in order to develop practical speech recognition at Microsoft. The result of that work has been the Whisper (Windows Highly Intelligent Speech Recognizer). Whisper represents significantly improved recognition efficiency, usability, and accuracy, when compared with the Sphinx-II system. In addition Whisper offers speech input capabilities for
Xuedong Huang; Alex Acero; Fil Alleva; Mei-Yuh Hwang; Li Jiang; Milind Mahajan
Listening to speech recruits a network of fronto-temporo-parietal cortical areas. Classical models consider anterior (motor) sites to be involved in speech production whereas posterior sites are considered to be involved in comprehension. This functional segregation is challenged by action-perception theories suggesting that brain circuits for speech articulation and speech perception are functionally dependent. Although recent data show that speech listening elicits motor activities analogous to production, it is still debated whether motor circuits make a causal contribution to the perception of speech. Here we administered transcranial magnetic stimulation (TMS) to motor cortex controlling lips and tongue during the discrimination of lip- and tongue-articulated phonemes. We found a neurofunctional double dissociation in speech sound discrimination, supporting the idea that motor structures provide a specific functional contribution to the perception of speech sounds. Moreover, our findings show a fine-grained motor somatotopy for speech comprehension. We discuss our results in light of a modified "motor theory of speech perception" according to which speech comprehension is grounded in motor circuits not exclusively involved in speech production. PMID:19217297
Describes a new multiparametric approach based on techniques for simultaneous analysis of oromandibular EMG signals and speech in various phonatory reaction tasks, to investigate the speech motor system. Data on latencies of EMG signals (EMG reaction time (RT)) and speech (voice RT) measured during phonatory reaction tasks from both normal subjects of various ages and pathological Parkinsonian subjects are presented.
F. Grandori; P. Pinelli; P. Ravazzani; F. Ceriani; G. Miscio; F. Pisano; R. Colombo; S. Insalaco; G. Tognola
Three experiments elicited phonological speech errors using the SLIP procedure to investigate whether there is a tendency for speech errors on specific words to reoccur, and whether this effect can be attributed to implicit learning of an incorrect mapping from lemma to phonology for that word. In Experiment 1, when speakers made a phonological speech error in the study phase
Karin R. Humphreys; Heather Menzies; Johanna K. Lake
This study investigated the overall intelligibility of speech produced during simultaneous communication (SC). Four hearing, experienced sign language users were recorded under SC and speech alone (SA) conditions speaking Boothroyd's (1985) forced-choice phonetic contrast material designed for measurement of speech intelligibility. Twelve…
Whitehead, Robert L.; Schiavetti, Nicholas; MacKenzie, Douglas J.; Metz, Dale Evan
In this work, the RWTH automatic speech recognition systems developed for the second TC-STAR evaluation campaign 2006 are presented. The systems were designed to transcribe parliamentary speeches taken from the European Parliament Plenary Sessions (EPPS) in European English and Spanish, as well as speeches from the Spanish Parliament. The RWTH systems apply a two pass search strategy with
Jonas Lööf; Maximilian Bisani; Christian Gollan; Georg Heigold; Björn Hoffmeister; Christian Plahl; Ralf Schlüter; Hermann Ney
The creation of the network TIMIT (NTIMIT) database, which is the result of transmitting the TIMIT database over the telephone network, is described. A brief description of the TIMIT database is given, including characteristics useful for speech analysis and recognition. The hardware and software required for the transmission of the database is described. The geographic distribution of the TIMIT utterances
Malay speech therapy assistance tools (MSTAT) is a system which assists the therapist to diagnose children for language disorder and to train children with stuttering problems. The main engine behind it is speech technology, consisting of a speech recognition system, a Malay text-to-speech system, and a Malay talking head by Tan, T.S. (2003). In this project, the speech recognition system utilizes the hidden
Tian-Swee Tan; Helbin-Liboh; A. K. Ariff; Chee-Ming Ting; S.-H. Salleh
A distinguishing feature of Broca's aphasia is non-fluent halting speech typically involving one to three words per utterance. Yet, despite such profound impairments, some patients can mimic audio-visual speech stimuli enabling them to produce fluent speech in real time. We call this effect 'speech entrainment' and reveal its neural mechanism as well as explore its usefulness as a treatment for speech production in Broca's aphasia. In Experiment 1, 13 patients with Broca's aphasia were tested in three conditions: (i) speech entrainment with audio-visual feedback where they attempted to mimic a speaker whose mouth was seen on an iPod screen; (ii) speech entrainment with audio-only feedback where patients mimicked heard speech; and (iii) spontaneous speech where patients spoke freely about assigned topics. The patients produced a greater variety of words using audio-visual feedback compared with audio-only feedback and spontaneous speech. No difference was found between audio-only feedback and spontaneous speech. In Experiment 2, 10 of the 13 patients included in Experiment 1 and 20 control subjects underwent functional magnetic resonance imaging to determine the neural mechanism that supports speech entrainment. Group results with patients and controls revealed greater bilateral cortical activation for speech produced during speech entrainment compared with spontaneous speech at the junction of the anterior insula and Brodmann area 47, in Brodmann area 37, and unilaterally in the left middle temporal gyrus and the dorsal portion of Broca's area. Probabilistic white matter tracts constructed for these regions in the normal subjects revealed a structural network connected via the corpus callosum and ventral fibres through the extreme capsule. Unilateral areas were connected via the arcuate fasciculus. In Experiment 3, all patients included in Experiment 1 participated in a 6-week treatment phase using speech entrainment to improve speech production. 
Behavioural and functional magnetic resonance imaging data were collected before and after the treatment phase. Patients were able to produce a greater variety of words with and without speech entrainment at 1 and 6 weeks after training. Treatment-related decrease in cortical activation associated with speech entrainment was found in areas of the left posterior-inferior parietal lobe. We conclude that speech entrainment allows patients with Broca's aphasia to double their speech output compared with spontaneous speech. Neuroimaging results suggest that speech entrainment allows patients to produce fluent speech by providing an external gating mechanism that yokes a ventral language network that encodes conceptual aspects of speech. Preliminary results suggest that training with speech entrainment improves speech production in Broca's aphasia providing a potential therapeutic method for a disorder that has been shown to be particularly resistant to treatment. PMID:23250889
Fridriksson, Julius; Hubbard, H Isabel; Hudspeth, Sarah Grace; Holland, Audrey L; Bonilha, Leonardo; Fromm, Davida; Rorden, Chris
This paper describes aspects of the semantic component of the speech understanding system currently being developed jointly by SRI and SDC. The semantic component consists of two major parts: a semantic network coding a model of the task domain and a batt...
Speech recognition is a classic example of a human/machine interface, typifying many of the difficulties and opportunities of human/machine interaction. In this paper, speech recognition is used as an example of applying turbo processing principles to the general problem of human/machine interface. Speech recognizers frequently involve a model representing phonemic information at a local level, followed by a language model representing information at a nonlocal level. This structure is analogous to the local (e.g., equalizer) and nonlocal (e.g., error correction decoding) elements common in digital communications. Drawing from the analogy of turbo processing for digital communications, turbo speech processing iteratively feeds back the output of the language model to be used as prior probabilities for the phonemic model. This analogy is developed here, and the performance of this turbo model is characterized by using an artificial language model. Using turbo processing, the relative error rate improves significantly, especially in high-noise settings. PMID:23757535
Moon, Todd K; Gunther, Jacob H; Broadus, Cortnie; Hou, Wendy; Nelson, Nils
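The iterative feedback the paper describes, where the language model's output is reused as the prior for the phonemic model, can be caricatured in a few lines; the single-symbol setting and the transition matrix below are hypothetical simplifications, not the authors' implementation:

```python
import numpy as np

def turbo_decode(likelihood, lm_transition, iters=3):
    """Iteratively refine a word posterior: the language model's output is
    fed back as the prior of the local (phonemic) model, loosely analogous
    to extrinsic-information exchange in turbo decoding."""
    prior = np.full(len(likelihood), 1.0 / len(likelihood))  # start flat
    for _ in range(iters):
        post = likelihood * prior
        post = post / post.sum()           # normalized posterior
        prior = lm_transition @ post       # LM redistributes belief -> new prior
    return post

likelihood = np.array([0.6, 0.4])          # local acoustic evidence
lm_transition = np.array([[0.1, 0.1],      # hypothetical LM: the second word
                          [0.9, 0.9]])     # is far more probable in context
print(turbo_decode(likelihood, lm_transition))  # belief shifts to word 2
```

Note how the acoustic evidence alone favors the first word, but after one round of feedback the language model prior dominates, mirroring the paper's claim that the gain is largest when local evidence is unreliable (high noise).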
This research was directed toward the correlation of minute variations in the pitch periodicity of human speech (pitch perturbations) with several types of physical lesion of the larynx. A sample of 144 laryngeal patients and 35 normal controls, all of wh...
This paper deals with forenames in Cameroon English speech. It examines the structure of people's names, the sources and uses of forenames, the formation of pet-names, the orthographic representations of forenames and their phonological realisations. The graphic data come from the birth certificates of thousands of undergraduate students aged seventeen and above, and the phonic data are tape-recordings of
State-of-the-art automatic speech recognition (ASR) systems are usually based on hidden Markov models (HMMs) that emit cepstral-based features which are assumed to be piecewise stationary. While not really robust to noise, these features are also known to be very sensitive to
Todd A. Stephenson; Mathew Magimai-doss; Hervé Bourlard
The study summarized in this paper deals with the grammatical analysis of the spontaneous speech of approximately 150 children who are classified as mentally disabled, educable (I.Q. range 50-80). The performance of these mentally disadvantaged children is compared with the performance of 200 normally developing children by using a clinical…
In this paper, we will describe work in progress at the MITRE Corporation on embedding speech-enabled interfaces in Web browsers. This research is part of our work to establish the infrastructure to create Web-hosted versions of prototype multimodal interfaces, both intelligent and otherwise. Like many others, we believe that the Web is the best potential delivery and distribution vehicle
This study compares highly committed members of the Free Speech Movement at Berkeley with the student population at large on three sociopsychological foci: general biographical data, religious orientation, and rigidity-flexibility. Questionnaires were administered to 172 FSM members selected by chance from the ten to twelve hundred who entered and
Discusses the increased popularity of communication as an academic major and career choice. Suggests that the change reflects the shift in the United States workplace to an information orientation. Reports results of a study of the jobs held by recipients of communication degrees. Concludes that speech communication as a major offers flexibility…
Prosodic features in spontaneous speech help disambiguate implied meaning not explicit in linguistic surface structure, but little research has examined how these signals manifest themselves in real conversations. Spontaneously produced verbal irony utterances generated between familiar speakers in conversational dyads were acoustically analyzed…
Reports that K. Lashley’s (1951) article on serial order in behavior inspired several theoretical accounts of rhythm as an organizing feature in speech processing. However, experimental evidence supporting this position has been sparse and indirect. Recent work by B. Hamill (1976) suggests that the placement of monosyllabic content words in grammatical strings is correlated with the rhythmic pattern of the
Recent work by Hamill (1976) suggests that placement of monosyllabic content words in grammatical strings is correlated with the rhythmic pattern of the string. Using classic paradigms, this research shows that rhythmic organization is also evident in speech material without grammatical structure or the content word/function word distinction.…
The properties and interrelationships among four measures of distance in speech processing are theoretically and experimentally discussed. The root mean square (rms) log spectral distance, cepstral distance, likelihood ratio (minimum residual principle or delta coding (DELCO) algorithm), and a cosh measure (based upon two nonsymmetrical likelihood ratios) are considered. It is shown that the cepstral measure bounds the rms log
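The cepstral distance mentioned above approximates the rms log spectral distance via Parseval's relation; a minimal sketch over truncated cepstra, with illustrative values:

```python
import numpy as np

def cepstral_distance(c1, c2):
    """Truncated cepstral distance between two cepstral vectors.

    By Parseval's relation this approximates the rms log spectral
    distance: d = sqrt((c1[0]-c2[0])**2 + 2*sum_{k>=1}(c1[k]-c2[k])**2),
    since each cepstral coefficient for k >= 1 appears twice in the
    symmetric log spectrum."""
    d = np.asarray(c1, dtype=float) - np.asarray(c2, dtype=float)
    return float(np.sqrt(d[0] ** 2 + 2.0 * np.sum(d[1:] ** 2)))

# Hypothetical cepstra of two short analysis frames.
print(round(cepstral_distance([1.0, 0.5, 0.25], [1.0, 0.0, 0.25]), 4))  # 0.7071
```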
The technique uses the auditory masking threshold to extract information for the audible noise components. Those components are then removed using adaptive nonlinear spectral modification. The main advantage of such an approach is that the speech signal is not affected by processing. In addition, very little information on the features of the noise is required. The proposed method was found
Attempts to account for many of the differences between vowels and consonants and between speech and nonspeech, in terms of the range of the contexts in which they are set. Explanations of the different overall levels of discriminability in vowels and consonants in various tasks in terms of their "encodedness" are replaced by a general model…
Haptic use of infotainment (information and entertainment) systems in cars deflects the driver from his primary task, driving. Therefore we need new Human-Machine Interfaces (HMI) that do not require the driver's full attention for controlling infotainment systems in cars. Gestures, speech, and sounds provide an intuitive addition to existing haptic controls in automotive environments. While sounds are
The Mathematical Intelligence (MI) Development System was implemented on a DEC PDP 11/70 computer, an FPS 5210 array processor, and a Ramtek 9465 color graphics display. The two major components of the system are the Digital Speech Processing Tool and the...
The authors review converging lines of evidence from behavioral, kinematic, and neuroimaging data that point to limitations in speech motor skills in people who stutter (PWS). From their review, they conclude that PWS differ from those who do not in terms of their ability to improve with practice and retain practiced changes in the long term, and that they are
According to the U.S. National Institutes of Health, approximately 500,000 Americans have Parkinson's disease (PD), with roughly another 50,000 receiving new diagnoses each year. 70%–90% of these people also have the hypokinetic dysarthria associated with PD. Deep brain stimulation (DBS) substantially relieves motor symptoms in advanced-stage patients for whom medication produces disabling dyskinesias. This study investigated speech changes as a result of DBS settings chosen to maximize motor performance. The speech of 10 PD patients and 12 normal controls was analyzed for syllable rate and variability, syllable length patterning, vowel fraction, voice-onset time variability, and spirantization. These were normalized by the controls' standard deviation to represent distance from normal and combined into a composite measure. Results show that DBS settings relieving motor symptoms can improve speech, making it up to three standard deviations closer to normal. However, the clinically motivated settings evaluated here show greater capacity to impair, rather than improve, speech. A feedback device developed from these findings could be useful to clinicians adjusting DBS parameters, as a means for ensuring they do not unwittingly choose DBS settings which impair patients' communication.
Chenausky, Karen; MacAuslan, Joel; Goldhor, Richard
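The composite measure described, per-feature distances normalized by the controls' standard deviation and combined, can be sketched as a mean absolute z-score; the measures, values, and combination rule below are hypothetical illustrations, not the study's data:

```python
import numpy as np

def composite_distance(patient, control_mean, control_sd):
    """Normalize each speech measure by the controls' standard deviation
    and combine into a single 'distance from normal' score (mean absolute
    z-score; the combination rule is an assumption for illustration)."""
    z = (np.asarray(patient) - np.asarray(control_mean)) / np.asarray(control_sd)
    return float(np.mean(np.abs(z)))

# Hypothetical measures: syllable rate, vowel fraction, VOT variability.
patient = [3.0, 0.55, 25.0]
control_mean = [5.0, 0.45, 15.0]
control_sd = [1.0, 0.05, 5.0]
print(composite_distance(patient, control_mean, control_sd))  # 2.0 SDs from normal
```

A score near zero means the patient's speech measures sit close to the control distribution; comparing scores across DBS settings would indicate which setting moves speech closer to normal.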
Background: Three experiments investigated the role of inner speech deficit in cognitive performances of children with autism. Methods: Experiment 1 compared children with autism with ability-matched controls on a verbal recall task presenting pictures and words. Experiment 2 used pictures for which the typical names were either single syllable or…
Whitehouse, Andrew J. O.; Maybery, Murray T.; Durkin, Kevin
In this study, we compare regional cerebral blood flow (rCBF) while French monolingual subjects listen to continuous speech in an unknown language, to lists of French words, or to meaningful and distorted stories in French. Our results show that, in addition to regions devoted to single-word comprehension, processing of meaningful stories activates the left middle temporal gyrus, the left and
B. M. Mazoyer; N. Tzourio; V. Frak; A. Syrota; N. Murayama; O. Levrier; G. Salamon; S. Dehaene; L. Cohen; J. Mehler
A database of spoken Japanese sound has been collected for use in designing and evaluating algorithms for automatic speech recognition. The database is composed of 323 words. It is a special feature of this database that all samples are uttered four times by each speaker (i.e., four tokens per word). Data from 75 male and 75 female speakers are collected at
Dell, Burger, and Svec (1997) proposed that the proportion of speech errors classified as anticipations (e.g., "moot and mouth") can be predicted solely from the overall error rate, such that the greater the error rate, the lower the anticipatory proportion (AP) of errors. We report a study examining whether this effect applies to changes in…
Galantucci, Fowler, and Turvey (2006) have claimed that perceiving speech is perceiving gestures and that the motor system is recruited for perceiving speech. We make the counterargument that perceiving speech is not perceiving gestures, that the motor system is not recruited for perceiving speech, and that speech perception can be adequately described by a prototypical pattern recognition model, the fuzzy logical model of perception (FLMP). Empirical evidence taken as support for gesture and motor theory is reconsidered in more detail and in the framework of the FLMP. Additional theoretical and logical arguments are made to challenge gesture and motor theory. PMID:18488668
The aim of the study was to discuss the physiology and pathology of speech and review the literature on speech disorders in Parkinson disease. Additionally, the most effective methods to diagnose speech disorders in Parkinson disease were also stressed. Afterward, articulatory, respiratory, acoustic, and pragmatic factors contributing to the exacerbation of the speech disorders were discussed. Furthermore, the study dealt with the most important types of speech treatment techniques available (pharmacological and behavioral), and the significance of Lee Silverman Voice Treatment was highlighted. PMID:23821424
Amplitude modulation (AM) and frequency modulation (FM) are commonly used in communication, but their relative contributions to speech recognition have not been fully explored. To bridge this gap, we derived slowly varying AM and FM from speech sounds and conducted listening tests using stimuli with different modulations in normal-hearing and cochlear-implant subjects. We found that although AM from a limited number of spectral bands may be sufficient for speech recognition in quiet, FM significantly enhances speech recognition in noise, as well as speaker and tone recognition. Additional speech reception threshold measures revealed that FM is particularly critical for speech recognition with a competing voice and is independent of spectral resolution and similarity. These results suggest that AM and FM provide independent yet complementary contributions to support robust speech recognition under realistic listening situations. Encoding FM may improve auditory scene analysis, cochlear-implant, and audiocoding performance. auditory analysis | cochlear implant | neural code | phase | scene analysis
Zeng, Fan-Gang; Nie, Kaibao; Stickney, Ginger S.; Kong, Ying-Yee; Vongphoe, Michael; Bhargave, Ashish; Wei, Chaogang; Cao, Keli
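The slowly varying AM and FM components described above can be derived from a narrowband signal with the analytic-signal method; below is a sketch using an FFT-based Hilbert transform on a synthetic amplitude-modulated tone (the parameters are illustrative, not the study's stimuli):

```python
import numpy as np

def am_fm(x, fs):
    """Split a narrowband signal into an AM envelope and an FM
    (instantaneous frequency) track via the analytic signal,
    computed here with an FFT-based Hilbert transform."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)                          # one-sided spectrum weights
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    z = np.fft.ifft(spec * h)                # analytic signal
    am = np.abs(z)                           # envelope (AM)
    phase = np.unwrap(np.angle(z))
    fm = np.diff(phase) * fs / (2.0 * np.pi) # instantaneous frequency (FM), Hz
    return am, fm

fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
carrier, mod = 440.0, 10.0                   # hypothetical tone: 440 Hz, 10 Hz AM
x = (1.0 + 0.5 * np.sin(2 * np.pi * mod * t)) * np.cos(2 * np.pi * carrier * t)
am, fm = am_fm(x, fs)
print(round(float(np.median(fm))))           # recovers the 440 Hz carrier
```

In the study's framework, the `am` track alone would drive a noise- or tone-excited vocoder, while adding the `fm` track restores the fine frequency cues shown to matter in noise.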
Speech production, like other sensorimotor behaviors, relies on multiple sensory inputs--audition, proprioceptive inputs from muscle spindles and cutaneous inputs from mechanoreceptors in the skin and soft tissues of the vocal tract. However, the capacity for intelligible speech by deaf speakers suggests that somatosensory input alone may contribute to speech motor control and perhaps even to speech learning. We assessed speech motor learning in cochlear implant recipients who were tested with their implants turned off. A robotic device was used to alter somatosensory feedback by displacing the jaw during speech. We found that implant subjects progressively adapted to the mechanical perturbation with training. Moreover, the corrections that we observed were for movement deviations that were exceedingly small, on the order of millimeters, indicating that speakers have precise somatosensory expectations. Speech motor learning is substantially dependent on somatosensory input. PMID:18794839
An adaptive redundant speech transmission (ARST) approach to improve the perceived speech quality (PSQ) of speech streaming applications over wireless multimedia sensor networks (WMSNs) is proposed in this paper. The proposed approach estimates the PSQ as well as the packet loss rate (PLR) from the received speech data. Subsequently, it decides whether the transmission of redundant speech data (RSD) is required in order to assist a speech decoder to reconstruct lost speech signals for high PLRs. According to the decision, the proposed ARST approach controls the RSD transmission, then it optimizes the bitrate of speech coding to encode the current speech data (CSD) and RSD bitstream in order to maintain the speech quality under packet loss conditions. The effectiveness of the proposed ARST approach is then demonstrated using the adaptive multirate-narrowband (AMR-NB) speech codec and ITU-T Recommendation P.563 as a scalable speech codec and the PSQ estimation, respectively. It is shown from the experiments that a speech streaming application employing the proposed ARST approach significantly improves speech quality under packet loss conditions in WMSNs. PMID:22164086
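The decision logic of such an ARST scheme can be sketched as follows; the thresholds and the AMR-NB mode split are illustrative assumptions, not the values used in the paper:

```python
def choose_transmission(plr, psq, plr_threshold=0.05, psq_threshold=3.0):
    """Decide whether to send redundant speech data (RSD) and split the
    coding budget. Thresholds and rates are hypothetical; AMR-NB modes
    span 4.75-12.2 kbit/s, so redundancy is paid for with a lower-rate
    mode for the current speech data (CSD)."""
    send_rsd = plr > plr_threshold or psq < psq_threshold
    if send_rsd:
        csd_rate, rsd_rate = 6.7, 4.75   # split budget: current + redundant data
    else:
        csd_rate, rsd_rate = 12.2, 0.0   # full budget to current speech data
    return send_rsd, csd_rate, rsd_rate

print(choose_transmission(plr=0.10, psq=3.5))  # high loss -> add redundancy
```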
The hearing impaired have always had difficulties learning to speak because their auditory feedback is either damaged or missing. The SpeechMaster software package provides real-time visual feedback as a substitute for this. Within the package the forms of the feedback are clear and simple. For instance, in the first phase of vowel learning the software uses an effective phoneme recognizer
S-MINDS is a speech translation engine, which allows an English speaker to communicate with a non-English speaker easily within a question-and-answer, interview-style format. It can handle limited dialogs such as medical triage or hospital admissions. We have built and tested an English-Korean system for doing medical triage with a translation accuracy of 79.8% (for English) and 78.3% (for
Farzad Ehsani; Jim Kimzey; Demitrios Master; Karen Sudre
The Gifts of Speech site brings together speeches given by women from all around the world. The site is under the direction of Liz Linton Kent Leon, who is the electronic resources librarian at Sweet Briar College. First-time users may wish to click on the How To… area to learn how to navigate the site. Of course, the FAQ area is a great way to learn about the site as well, and it should not be missed as it tells the origin story for the site. In the Collections area, visitors can listen in to all of the Nobel Lectures delivered by female recipients and look at a list of the top 100 speeches in American history as determined by a group of researchers at the University of Wisconsin-Madison and Texas A & M University. Users will also want to use the Browse area to look over talks by women from Robin Abrams to Begum Khaleda Zia, the former prime minister of the People's Republic of Bangladesh.
Speech and language delay in children is associated with increased difficulty with reading, writing, attention, and socialization. Although physicians should be alert to parental concerns and to whether children are meeting expected developmental milestones, there currently is insufficient evidence to recommend for or against routine use of formal screening instruments in primary care to detect speech and language delay. In children not meeting the expected milestones for speech and language, a comprehensive developmental evaluation is essential, because atypical language development can be a secondary characteristic of other physical and developmental problems that may first manifest as language problems. Types of primary speech and language delay include developmental speech and language delay, expressive language disorder, and receptive language disorder. Secondary speech and language delays are attributable to another condition such as hearing loss, intellectual disability, autism spectrum disorder, physical speech problems, or selective mutism. When speech and language delay is suspected, the primary care physician should discuss this concern with the parents and recommend referral to a speech-language pathologist and an audiologist. There is good evidence that speech-language therapy is helpful, particularly for children with expressive language disorder. PMID:21568252
Differences in speech articulation among four emotion types (neutral, anger, sadness, and happiness) are investigated by analyzing tongue tip, jaw, and lip movement data collected from one male and one female speaker of American English. The data were collected using an electromagnetic articulography (EMA) system while the subjects produced simulated emotional speech. Pitch, root-mean-square (rms) energy, and the first three formants were estimated for vowel segments. For both speakers, angry speech exhibited the largest rms energy and the largest articulatory activity in terms of displacement range and movement speed. Happy speech is characterized by the largest pitch variability. It has higher rms energy than neutral speech, but its articulatory activity is comparable to, or less than, that of neutral speech. That is, happy speech is more prominent in voicing activity than in articulation. Sad speech exhibits the longest sentence duration and lower rms energy; however, its articulatory activity is no less than that of neutral speech. Interestingly, for the male speaker, articulation for vowels in sad speech is consistently more peripheral (i.e., more forwarded displacements) compared to the other emotions. However, this does not hold for the female subject. These and other results will be discussed in detail with associated acoustics and perceived emotional qualities. [Work supported by NIH.]
Perception of intersensory temporal order is particularly difficult for (continuous) audiovisual speech, as perceivers may find it difficult to notice substantial timing differences between speech sounds and lip movements. Here we tested whether this occurs because audiovisual speech is strongly paired ("unity assumption"). Participants made temporal order judgments (TOJ) and simultaneity judgments (SJ) about sine-wave speech (SWS) replicas of pseudowords and the corresponding video of the face. Listeners in speech and non-speech mode were equally sensitive judging audiovisual temporal order. Yet, using the McGurk effect, we could demonstrate that the sound was more likely integrated with lipread speech if heard as speech than non-speech. Judging temporal order in audiovisual speech is thus unaffected by whether the auditory and visual streams are paired. Conceivably, previously found differences between speech and non-speech stimuli are not due to the putative "special" nature of speech, but rather reflect low-level stimulus differences. PMID:21035795
This report considers language understanding techniques and control strategies that can be applied to provide higher-level support to aid in the understanding of spoken utterances. The discussion is illustrated with concepts and examples from the BBN speech understanding system, HWIM (Hear What I Mean). The HWIM system was conceived as an assistant to a travel budget manager, a system that would store information about planned and taken trips, travel budgets and their planning. The system was able to respond to commands and answer questions spoken into a microphone, and was able to synthesize spoken responses as output. HWIM was a prototype system used to drive speech understanding research. It used a phonetic-based approach, with no speaker training, a large vocabulary, and a relatively unconstraining English grammar. Discussed here is the control structure of the HWIM and the parsing algorithm used to parse sentences from the middle-out, using an ATN grammar.
Schizophrenic speech has been studied both at the clinical and linguistic level. Nevertheless, the statistical methods used in these studies do not specifically take into account the dynamical aspects of language. In the present study, we quantify the dynamical properties of linguistic production in schizophrenic and control subjects. Subjects' recall of a short story was encoded according to the succession of macro- and micro-propositions, and symbolic dynamical methods were used to analyze these data. Our results show the presence of a significant temporal organization in subjects' speech. Taking this structure into account, we show that schizophrenics connect micro-propositions significantly more often than controls. This impairment in accessing language at the highest level supports the hypothesis of a deficit in maintaining a discourse plan in schizophrenia. PMID:15740992
Leroy, Fabrice; Pezard, Laurent; Nandrino, Jean-Louis; Beaune, Daniel
In 1995, AT&T Research (then within Bell Labs) began work on a software-only automated speech recognition system named Watson(TM). The goal was ambitious; Watson was to serve as a single code base supporting applications ranging from PC-desktop command and control through to scaleable telephony interactive voice services. Furthermore, the software was to be the new code base for the research
R. Douglas Sharp; Enrico Bocchieri; Cecilia Castillo; S. Parthasarathy; Chris Rath; Michael Riley; James Rowland
This report describes three extensions to a classification system for paediatric speech sound disorders termed the Speech Disorders Classification System (SDCS). Part I describes a classification extension to the SDCS to differentiate motor speech disorders from speech delay and to differentiate among three sub-types of motor speech disorders.…
Shriberg, Lawrence D.; Fourakis, Marios; Hall, Sheryl D.; Karlsson, Heather B.; Lohmeier, Heather L.; McSweeny, Jane L.; Potter, Nancy L.; Scheer-Cohen, Alison R.; Strand, Edythe A.; Tilkens, Christie M.; Wilson, David L.
Individuals on the autism spectrum often have difficulties producing intelligible speech with either high or low speech rate, and atypical pitch and/or amplitude affect. In this study, we present a novel intervention towards customizing speech enabled games to help them produce intelligible speech. In this approach, we clinically and computationally identify the areas of speech production difficulties of our participants.
Mohammed E. Hoque; Rana El Kaliouby; Matthew S. Goodwin; Rosalind W. Picard
Most computational models of word segmentation are trained and tested on transcripts of speech, rather than the speech itself, and assume that speech is converted into a sequence of symbols prior to word segmentation. We present a way of representing speech corpora that avoids this assumption, and preserves acoustic variation present in speech.…
Rytting, C. Anton; Brew, Chris; Fosler-Lussier, Eric
Pitch determination and speech segmentation are two important parts of speech recognition and speech processing in general. This paper proposes a time-based event detection method for finding the pitch period of a speech signal. Based on the discrete wavelet transform, it detects voiced speech, which is local in frequency, and determines the pitch period. This method is computationally inexpensive and
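The time-based pitch idea in this record can be sketched with a plain Haar approximation (averaging adjacent samples) standing in for the discrete wavelet transform. The peak-picking rule and threshold below are illustrative assumptions, and no voiced/unvoiced classification is attempted.

```python
import numpy as np

def haar_approx(x, levels=2):
    """Approximation branch of a Haar DWT: average adjacent pairs `levels` times."""
    for _ in range(levels):
        x = x[: len(x) // 2 * 2].reshape(-1, 2).mean(axis=1)
    return x

def pitch_period(x, levels=2, rel_thresh=0.5):
    """Estimate the pitch period (in input samples) from the spacing of
    energy peaks in the coarse Haar approximation. A minimal sketch of the
    time-based event-detection idea; real detectors also decide voicing."""
    a = np.abs(haar_approx(np.asarray(x, float), levels))
    thresh = rel_thresh * a.max()
    # Local maxima above the threshold count as glottal events.
    peaks = [i for i in range(1, len(a) - 1)
             if a[i] >= thresh and a[i] >= a[i - 1] and a[i] > a[i + 1]]
    if len(peaks) < 2:
        return None
    # Median inter-event spacing, rescaled to the input sample rate.
    return float(np.median(np.diff(peaks))) * 2 ** levels
```

On a synthetic pulse train with one pulse every 80 samples, the estimator recovers the 80-sample period from the spacing of events in the downsampled approximation.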
This paper describes an HMM-based speech synthesis system (HTS), in which the speech waveform is generated from HMM themselves, and applies it to English speech synthesis using the general speech synthesis architecture of Festival. Similarly to other data-driven speech synthesis approaches, HTS has a compact language dependent module: a list of contextual factors. Thus, it could easily be extended to
A review of the research on the comprehension of rapid speech by the blind identifies five methods of speech compression: speech changing, electromechanical sampling, computer sampling, speech synthesis, and frequency dividing with the harmonic compressor. The speech changing and electromechanical sampling methods and the necessary apparatus have…
The objective of the program is the analysis of design parameters required for the simulation and evaluation of phonetic speech recognition methods. An approach to the segmentation of continuous speech into phonemes is described, and the results of segmen...
The primary focus of speech production research is directed towards obtaining improved understanding and quantitative characterization of the articulatory dynamics, acoustics, and cognition of both normal and pathological human speech. Such efforts are, however, frequently challenged by the lack of appropriate physical and physiological data. Hence, a great deal of attention is given to the development of novel measurement and instrumentation techniques that are noninvasive, safe, and do not interfere with normal speech production. Several imaging techniques have been successfully employed for studying speech production. In the first part of this paper, an overview of the various imaging techniques used in speech research, such as x-rays, ultrasound, structural and functional magnetic resonance imaging, glossometry, palatography, video fibroscopy and imaging, is presented. In the second part of the paper, we describe the results of our efforts to understand and model the speech production mechanisms of vowels, fricatives, and lateral and rhotic consonants based on MRI data.
Background: The term "speech usage" refers to what people want or need to do with their speech to fulfil the communication demands in their life roles. Speech-language pathologists (SLPs) need to know about clients' speech usage to plan appropriate interventions to meet their life participation goals. The Levels of Speech Usage is a categorical…
Viewing hand gestures during face-to-face communication affects speech perception and comprehension. Despite the visible role played by gesture in social interactions, relatively little is known about how the brain integrates hand gestures with co-occurring speech. Here we used functional magnetic resonance imaging (fMRI) and an ecologically valid paradigm to investigate how beat gesture – a fundamental type of hand gesture that marks speech prosody – might impact speech perception at the neural level. Subjects underwent fMRI while listening to spontaneously-produced speech accompanied by beat gesture, nonsense hand movement, or a still body; as additional control conditions, subjects also viewed beat gesture, nonsense hand movement, or a still body all presented without speech. Validating behavioral evidence that gesture affects speech perception, bilateral nonprimary auditory cortex showed greater activity when speech was accompanied by beat gesture than when speech was presented alone. Further, the left superior temporal gyrus/sulcus showed stronger activity when speech was accompanied by beat gesture than when speech was accompanied by nonsense hand movement. Finally, the right planum temporale was identified as a putative multisensory integration site for beat gesture and speech (i.e., here activity in response to speech accompanied by beat gesture was greater than the summed responses to speech alone and beat gesture alone), indicating that this area may be pivotally involved in synthesizing the rhythmic aspects of both speech and gesture. Taken together, these findings suggest a common neural substrate for processing speech and gesture, likely reflecting their joint communicative role in social interactions.
Hubbard, Amy L.; Wilson, Stephen M.; Callan, Daniel E.; Dapretto, Mirella
The purpose of this experiment was to test the effectiveness of including speech production into naturalistic conversation training for 2 children with speech production disabilities. A multiple baseline design across behaviors (target phonemes) and across subjects (for the same phoneme) indicated that naturalistic conversation training resulted in improved spontaneous speech production. The implications of these findings are discussed relative to existing models of speech production training and other aspects of communication disorders.
This paper presents a novel agent-based design for Arabic speech recognition. We define the Arabic speech recognition as a multi-agent-system where each agent has a specific goal and deals with that goal only. Once all the small tasks are accomplished the big task is too. A number of agents are required in order to recast Arabic speech recognition, namely the
As speech synthesis techniques become more advanced, we are able to consider building high-quality voices from data collected outside the usual highly-controlled recording studio environment. This presents new challenges that are not present in conventional text-to-speech synthesis: the available speech data are not perfectly clean, the recording conditions are not consistent, and/or the phonetic balance of the material
To investigate the neural network of overt speech production, event-related fMRI was performed in 9 young healthy adult volunteers. A clustered image acquisition technique was chosen to minimize speech-related movement artifacts. Functional images were acquired during the production of oral movements and of speech of increasing complexity (isolated vowel as well as monosyllabic and trisyllabic utterances). This imaging technique and
Peter Sörös; Lisa Guttman Sokoloff; Arpita Bose; Anthony R. McIntosh; Simon J. Graham; Donald T. Stuss
This paper presents an integrated speech enhancement (SE) method for the noisy MRI environment. We show that the performance of SE system improves considerably when the speech signal dominated by MRI acoustic noise at very low SNR is enhanced in two successive stages using two-channel SE methods followed by a single-channel post processing SE algorithm. Actual MRI noisy speech data are used in our experiments showing the improved performance of the proposed SE method. PMID:19964964
Pathak, Nishank; Milani, Ali A; Panahi, Issa; Briggs, Richard
A multi-language, portable text-to-speech system has been developed at the Royal Institute of Technology in Stockholm. The system contains a formant speech synthesizer on a signal processing chip, a powerful microcomputer and a variety of text input equipment. A special attachment is a 500-symbol Bliss Board. Swedish and English Bliss-to-speech programs transform a symbol string to the corresponding well-formed sentence.
Cepstral features derived from the power spectrum are widely used for automatic speech recognition. Very little work, if any, has been done in speech research to explore phase-based representations. In this paper, an attempt is made to investigate the use of the phase function in the analytic signal of critical-band filtered speech for deriving a representation of the frequencies present in the speech signal. Results are presented which show the
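The phase-based representation this record points at can be illustrated minimally: take the analytic signal, unwrap its phase, and differentiate to obtain instantaneous frequency. In practice this would be applied per critical-band channel; the single-band helper below, including its name, is an assumed sketch.

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_frequency(x, fs):
    """Instantaneous frequency (Hz) from the phase of the analytic signal.
    A minimal single-band sketch; the critical-band filterbank that would
    normally precede this step is omitted."""
    z = hilbert(x)                    # analytic signal x + j*H{x}
    phase = np.unwrap(np.angle(z))    # continuous phase function
    return np.diff(phase) * fs / (2 * np.pi)

# For a pure tone, the instantaneous frequency tracks the tone frequency.
fs, f0 = 8000, 440.0
t = np.arange(fs) / fs
f_inst = instantaneous_frequency(np.sin(2 * np.pi * f0 * t), fs)
```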
The basic concepts of speech production and analysis have been described. Speech is an acoustic signal produced by air pressure changes originating from the vocal production system. The anatomical, physiological, and functional aspects of this process have been discussed from a quantitative perspective. With a description of various models of speech production, we have provided background information with which
Carlos Avendaño; Li Deng; Hynek Hermansky; Ben Gold
This study proposes a new set of feature parameters based on subband analysis of the speech signal for classification of speech under stress. The new speech features are scale energy (SE), autocorrelation-scale-energy (ACSE), subband based cepstral parameters (SC), and autocorrelation-SC (ACSC). The parameters' ability to capture different stress types is compared to widely used mel-scale cepstrum based representations: mel-frequency cepstral
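A scale/subband energy feature of the kind this record describes can be sketched as log energies over dyadically spaced frequency bands. The band edges, band count, and windowing below are illustrative assumptions, not the paper's exact scale energy (SE) definition.

```python
import numpy as np

def subband_log_energies(frame, fs, n_bands=8):
    """Log energy in dyadically spaced subbands of one windowed frame.
    Band edges and count are illustrative, not the paper's definition."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.linspace(0, fs / 2, len(spec))
    edges = fs / 2 * 2.0 ** np.arange(-n_bands, 1)   # dyadic upper edges
    feats, lo = [], 0.0
    for hi in edges:
        band = spec[(freqs >= lo) & (freqs < hi)]
        feats.append(np.log(band.sum() + 1e-12))     # floor avoids log(0)
        lo = hi
    return np.array(feats)

# A 1.1 kHz tone should put most of its energy in the 1-2 kHz band.
fs = 8000
t = np.arange(1024) / fs
feats = subband_log_energies(np.hanning(1024) * np.sin(2 * np.pi * 1100 * t), fs)
```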
The aim of this paper is to help the communication of two people, one hearing impaired and one visually impaired, by converting speech to fingerspelling and fingerspelling to speech. Fingerspelling is a subset of sign language that uses finger signs to spell letters of the spoken or written language. We aim to convert fingerspelled words to speech and vice versa.
Marek Hrúz; Pavel Campr; Erinç Dikici; Ahmet Alp Kındıroğlu; Zdeněk Krňoul; Alexander Ronzhin; Haşim Sak; Daniel Schorno; Hülya Yalçın; Lale Akarun; Oya Aran; Alexey Karpov; Murat Saraçlar; Miloš Železný
The motor regions that control movements of the articulators activate during listening to speech and contribute to performance in demanding speech recognition and discrimination tasks. Whether the articulatory motor cortex modulates auditory processing of speech sounds is unknown. Here, we aimed to determine whether the articulatory motor cortex affects the auditory mechanisms underlying discrimination of speech sounds in the absence of demanding speech tasks. Using electroencephalography, we recorded responses to changes in sound sequences, while participants watched a silent video. We also disrupted the lip or the hand representation in left motor cortex using transcranial magnetic stimulation. Disruption of the lip representation suppressed responses to changes in speech sounds, but not piano tones. In contrast, disruption of the hand representation had no effect on responses to changes in speech sounds. These findings show that disruptions within, but not outside, the articulatory motor cortex impair automatic auditory discrimination of speech sounds. The findings provide evidence for the importance of auditory-motor processes in efficient neural analysis of speech sounds. PMID:22581846
Möttönen, Riikka; Dutton, Rebekah; Watkins, Kate E
Study of the speech disorders of Parkinsonism provides a paradigm of the integration of phonation, articulation and language in the production of speech. The initial defect in the untreated patient is a failure to control respiration for the purpose of speech, and there follows a forward progression of articulatory symptoms involving the larynx, pharynx, tongue and finally the lips. There is evidence that the integration of speech production is organised asymmetrically at the thalamic level. Experimental or therapeutic lesions in the region of the inferior medial portion of the ventro-lateral thalamus may influence the initiation, respiratory control, rate and prosody of speech. Higher language functions may also be involved in thalamic integration: different forms of anomia are reported with pulvinar and ventrolateral thalamic lesions, and transient aphasia may follow stereotaxis. The results of treatment with levodopa indicate that neurotransmitter substances enhance the clarity, volume and persistence of phonation and the latency and smoothness of articulation. The improvement of speech performance is not necessarily in phase with locomotor changes. The dose-related dyskinetic effects of levodopa, which appear to have a physiological basis in observations previously made in post-encephalitic Parkinsonism, not only influence the prosody of speech with near-mutism, hesitancy and dysfluency but may affect word-finding ability and, in instances of excitement (erethism), even involve the association of long-term memory with speech. In future, neurologists will need to examine more closely the role of neurotransmitters in speech production and formulation.
The ability to accurately synthesize electrolarynx (EL) speech may provide a basis for better understanding the acoustic deficits that contribute to its poor quality. Such information could also lead to the development of acoustic enhancement methods that would improve EL speech quality. This effort was initiated with an analysis-by-synthesis approach that used the Klatt formant synthesizer to study vowels at the end of utterances spoken by the same subjects both before and after laryngectomy (normal versus EL speech). The temporal and spectral features of the original speech waveforms were analyzed and the results were used to guide synthesis and to identify parameters for modification. EL speech consistently displayed utterance-final fixed (mono) pitch and normal-like falling amplitude across different vowels and subjects. Subsequent experiments demonstrated that it was possible to closely match the acoustic characteristics and perceptual quality of both normal and EL speech with synthesized replicas. It was also shown that the perceived quality of the synthesized EL speech could be improved by modification of pitch parameters to more closely resemble normal speech. Some potential approaches for modifying the pitch of EL speech in real time will be discussed. [Funded by NIH Grant R01 DC006449.]
We evaluated the long-term speech intelligibility of young deaf children after cochlear implantation (CI). A prospective study of 47 consecutively implanted deaf children with up to 5 years of cochlear implant use was performed at a pediatric tertiary referral center for CI. All children in the study were prelingually deaf, and each received the implant before beginning a program of auditory verbal therapy. A speech intelligibility rating scale was used to evaluate the spontaneous speech of each child before implantation and at frequent intervals for 5 years afterward. After cochlear implantation, speech intelligibility ratings increased significantly each year for 3 years (P < 0.05). For the first year, the average rating remained at "prerecognizable words" or "unintelligible speech". Two years after implantation, the children had speech that was intelligible if the listener concentrated and lip-read (category 3). At the 4- and 5-year intervals, 71.5% and 78% of children, respectively, had speech intelligible to all listeners (category 5). Thus, 5 years after rehabilitation, both the mode and the median of the speech intelligibility rating were 5. Congenital and prelingually deaf children gradually develop intelligible speech that has not plateaued even 5 years after implantation. PMID:17639444
Bakhshaee, Mehdi; Ghasemi, Mohammad Mahdi; Shakeri, Mohammad Taghi; Razmara, Narjes; Tayarani, Hamid; Tale, Mohammad Reza
Since 1990 the DRA Speech Research Unit has conducted research into applications of speech recognition technology to speech and language development for young children. This has been done in collaboration with Hereford and Worcester County Council Education Department (HWCC) and, more recently, with Sherston Software Limited, one of the UK's leading independent educational software publishers. An initial project, known as
Martin Russell; Catherine Brown; Adrian Skilling; Julie L. Wallace; Bill Bohnam; Paul Barker
Background: The previous literature has largely focused on speech analysis systems and ignored process issues, such as the nature of adequate speech samples, data acquisition, recording and playback. Although there has been recognition of the need for training on tools used in speech analysis associated with cleft palate, little attention has…
Stimuli used in research on the perception of the speech signal have often been obtained from simple filtering and distortion of the speech waveform, sometimes accompanied by noise. However, for more complex stimulus generation, the parameters of speech can be manipulated, after analysis and before synthesis, using various types of algorithms to…
The aim of this study was to investigate the consistency and composition of functional synergies for speech movements in children with developmental speech disorders. Kinematic data were collected on the reiterated productions of syllables spa(/spa[image omitted]/) and paas(/pa[image omitted]s/) by 10 6- to 9-year-olds with developmental speech…
Terband, H.; Maassen, B.; van Lieshout, P.; Nijland, L.
Eleven patients with Parkinson's disease were tested for prosodic abnormality, on three tests of speech production (of angry, questioning, and neutral statement forms), and four tests of appreciation of the prosodic features of speech and facial expression. The tests were repeated after a control period of two weeks without speech therapy and were not substantially different. After two weeks of
To further investigate the possible regulatory role of private and inner speech in the context of referential social speech communications, a set of clear and systematically applied measures is needed. This study addresses this need by introducing a rigorous method for identifying private speech and certain sharply defined instances of inaudible…
San Martin Martinez, Conchi; Boada i Calbet, Humbert; Feigenbaum, Peter
In this paper a time-frequency estimator for enhancement of noisy speech signals in the DFT domain is introduced. This estimator is based on modeling the time-varying correlation of the temporal trajectories of the short time (ST) DFT components of the noisy speech signal using autoregressive (AR) models. The time- varying trajectory of the DFT components of speech in each channel
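The idea of modeling the temporal trajectory of each DFT bin can be shown with the simplest case: a lag-1 (AR(1)) coefficient per bin. The paper's estimator is time-varying and higher-order, so treat this as an assumed sketch; its useful property is that bins dominated by slowly varying speech give a coefficient near 1, while white-noise bins give a coefficient near 0, which is what an enhancement gain can exploit.

```python
import numpy as np

def ar1_coeffs(stft_frames):
    """Per-bin lag-1 AR coefficient of the complex STFT trajectory:
    a_k = sum_t X[t,k] X*[t-1,k] / sum_t |X[t-1,k]|^2.
    A minimal sketch; the paper fits higher-order, time-varying AR models."""
    X = np.asarray(stft_frames)                      # shape (frames, bins)
    num = (X[1:] * np.conj(X[:-1])).sum(axis=0)
    den = (np.abs(X[:-1]) ** 2).sum(axis=0) + 1e-12  # avoid divide-by-zero
    return num / den
```

A constant (perfectly correlated) trajectory yields a coefficient of 1; an uncorrelated complex-noise trajectory yields a coefficient near 0.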
Purpose: This study compared parents with histories of speech sound disorders (SSD) to parents without known histories on measures of speech sound production, phonological processing, language, reading, and spelling. Familial aggregation for speech and language disorders was also examined. Method: The participants were 147 parents of children…
Lewis, Barbara A.; Freebairn, Lisa A.; Hansen, Amy J.; Miscimarra, Lara; Iyengar, Sudha K.; Taylor, H. Gerry
This article describes a neural network model of speech motor skill acquisition and speech production that explains a wide range of data on contextual variability, motor equivalence, coarticulation, and speaking rate effects. Model parameters are learned during a babbling phase. To explain how infants learn phoneme- specific and language-specific limits on acceptable articulatory variability, the learned speech sound targets take
Childhood apraxia of speech (CAS) is a highly controversial clinical entity, with respect to both clinical signs and underlying neuromotor deficit. In the current paper, we advocate a modeling approach in which a computational neural model of speech acquisition and production is utilized in order to find the neuromotor deficits that underlie the diversity of phonological and speech-motor symptoms of
Evidence from the study of human language understanding is presented suggesting that our ability to perceive visible speech can greatly influence our ability to understand and remember spoken language. A view of the speaker's face can greatly aid in the perception of ambiguous or noisy speech and can aid cognitive processing of speech leading to better understanding and recall. Some
As Lecturer of Speech in the Institute for Writing and Rhetoric at Dartmouth College, I have joined an ongoing conversation about speech that spans disciplines. This article takes a step back from looking at communication across the curriculum as a program and instead looks at one of the earliest stages of the process--conversations about speech…
Last year, the Foundation for Individual Rights in Education (FIRE) conducted its first-ever comprehensive study of restrictions on speech at America's colleges and universities, "Spotlight on Speech Codes 2006: The State of Free Speech on our Nation's Campuses." In light of the essentiality of free expression to a truly liberal education, its…
Foundation for Individual Rights in Education (NJ1), 2007
Studies on infant speech perception have shown that infant-directed speech (motherese) exhibits exaggerated acoustic properties, which are assumed to guide infants in the acquisition of phonemic categories. Training an automatic speech recognizer on such data might similarly lead to improved performance since classes can be expected to be more clearly separated in the training material. This claim was tested by training automatic speech recognizers on adult-directed (AD) versus infant-directed (ID) speech and testing them under identical versus mismatched conditions. 32 mother-infant conversations and 32 mother-adult conversations were used as training and test data. Both sets of conversations included a set of cue words containing unreduced vowels (e.g., sheep, boot, top, etc.), which mothers were encouraged to use repeatedly. Experiments on continuous speech recognition of the entire data set showed that recognizers trained on infant-directed speech did perform significantly better than those trained on adult-directed speech. However, isolated word recognition experiments focusing on the above-mentioned cue words showed that the drop in performance of the ID-trained speech recognizer on AD test speech was significantly smaller than vice versa, suggesting that speech with over-emphasized phonetic contrasts may indeed constitute better training material for speech recognition. [Work supported by CMBL, University of Washington.]
Children with residual speech sound errors are often underserved clinically, yet there has been a lack of recent research elucidating the specific deficits in this population. Adolescents aged 10-14 with residual speech sound errors (RE) that included rhotics were compared to normally speaking peers on tasks assessing speed and accuracy of speech…
We describe a novel approach to collecting orthographically transcribed continuous speech data through the use of an online educational game called Voice Scatter, in which players study flashcards by using speech to match terms with their definitions. We analyze a corpus of 30,938 utterances, totaling 27.63 hours of speech, collected during the first 22 days that Voice Scatter was
Alexander Gruenstein; Ian McGraw; Andrew Sutherland
The objective of this research was to develop accurate mathematical models of speech sounds for the purpose of large-vocabulary continuous speech recognition. The research focussed on three areas: developing better speech models to improve recognition acc...
C. Barry F. Kubala J. Makhoul R. Schwartz S. Austin
A central difficulty with automatic speech recognition is the temporally inaccurate nature of the speech signal. Despite this, speech has been traditionally modeled as a purely sequential (albeit probabilistic) process. The usefulness of accurate sequence...
D. C. LeCompte (1994) showed that the irrelevant speech effect--that is, the impairment of performance by the presentation of irrelevant background speech--extends to free recall, recognition, and cued recall. The present experiments extended the irrelevant speech effect to the missing-item task (Experiments 1 and 2), thereby contradicting a key prediction of the changing state hypothesis, which states that tasks that do not involve serial rehearsal should not be affected by the presence of irrelevant speech. Temporal distinctiveness theory provides an alternative explanation of the irrelevant speech effect. Experiment 3 tested and confirmed a unique prediction of this theory. PMID:8805820
The speech enhancement problem is approached by the Fourier-Bessel (FB) expansion of the speech and noise. The application of this expansion to the enhancement of speech signals degraded with additive noise is described. The method is based on the subtraction of the FB coefficients of the estimated noise from the coefficients of the noisy speech. The difference between the two sets of coefficients is then used to synthesize the enhanced speech. The inherent lowpass filtering property of the FB reconstruction with a finite number of coefficients is revealed.
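The coefficient-domain subtraction this record describes can be sketched in a few lines. The frame length, number of coefficients, and toy signals below are illustrative assumptions, not values from the paper; the expansion uses the first-order Bessel function J1 with its positive roots, as in common Fourier-Bessel formulations:

```python
import numpy as np
from scipy.special import j1, jv, jn_zeros

def fb_coeffs(x, alphas, t):
    # C_m = (2 / J2(alpha_m)^2) * integral_0^1 t * x(t) * J1(alpha_m * t) dt
    dt = t[1] - t[0]
    return np.array([2.0 * np.sum(t * x * j1(a * t)) * dt / jv(2, a) ** 2
                     for a in alphas])

def fb_synth(C, alphas, t):
    # Resynthesize a frame from its Fourier-Bessel coefficients.
    return sum(c * j1(a * t) for c, a in zip(C, alphas))

M = 64                       # number of FB coefficients (assumed)
alphas = jn_zeros(1, M)      # ascending positive roots of J1
t = np.linspace(0.0, 1.0, 512)

clean = np.sin(2 * np.pi * 5 * t)          # stand-in for a clean speech frame
noise = 0.3 * np.sin(2 * np.pi * 11 * t)   # stand-in for the estimated noise
noisy = clean + noise

# Enhancement: subtract the noise-estimate coefficients from the noisy-frame
# coefficients, then synthesize the enhanced frame from the difference.
C_enh = fb_coeffs(noisy, alphas, t) - fb_coeffs(noise, alphas, t)
enhanced = fb_synth(C_enh, alphas, t)
```

Because the expansion is linear, a perfect noise estimate makes the subtraction recover exactly the coefficients of the clean frame; truncating to M coefficients also exhibits the low-pass reconstruction property the abstract mentions.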
The present experiment was designed to investigate and understand the causes of failures of the Articulation Index as a predictive tool. An electroacoustic system was used in which: (1) The frequency response was optimally flattened at the listener's ear. (2) An ear-insert earphone was designed to give close electroacoustic control. (3) An infinite-impulse-response digital filter was used to filter the speech signal from a pre-recorded nonsense syllable test. (4) Four formant regions were filtered in fourteen different ways. It was found that the results agreed with past experiments in that: (1) The Articulation Index fails as a predictive tool when using band-pass filters. (2) Low frequencies seem to mask higher frequencies causing a decrease in intelligibility. It was concluded that: (1) It is inappropriate to relate the total fraction of the speech spectrum to a specific intelligibility score since the fraction remaining after filtering may be in the low-, mid-, or high-frequency range. (2) The relationship between intelligibility and the total area under the spectral curve is not monotonic. (3) The fourth formant region (2925 Hz to 4200 Hz) enhanced intelligibility when included with other formant regions. Methods for relating spectral regions and intelligibility were discussed.
The effect of perceived spatial differences on masking release was examined using a 4AFC speech detection paradigm. Targets were 20 words produced by a female talker. Maskers were recordings of continuous streams of nonsense sentences spoken by two female talkers and mixed into each of two channels (two talker, and the same masker time reversed). Two masker spatial conditions were employed: “RF” with a 4 ms time lead to the loudspeaker 60° horizontally to the right, and “FR” with the time lead to the front (0°) loudspeaker. The reference nonspatial “F” masker was presented from the front loudspeaker only. Target presentation was always from the front loudspeaker. In Experiment 1, target detection threshold for both natural and time-reversed spatial maskers was 17–20 dB lower than that for the nonspatial masker, suggesting that significant release from informational masking occurs with spatial speech maskers regardless of masker understandability. In Experiment 2, the effectiveness of the FR and RF maskers was evaluated as the right loudspeaker output was attenuated until the two-source maskers were indistinguishable from the F masker, as measured independently in a discrimination task. Results indicated that spatial release from masking can be observed with barely noticeable target-masker spatial differences.
This paper introduces the session on advanced speech recognition technology. The two papers comprising this session argue that current technology yields a performance that is only an order of magnitude in error rate away from human performance and that incremental improvements will bring us to that desired level. I argue that, to the contrary, present performance is far removed from human performance and a revolution in our thinking is required to achieve the goal. It is further asserted that to bring about the revolution more effort should be expended on basic research and less on trying to prematurely commercialize a deficient technology.
In this paper, we present a novel approach for speech signal modeling using fractional calculus. This approach is contrasted with the celebrated Linear Predictive Coding (LPC) approach which is based on integer order models. It is demonstrated via numerical simulations that by using a few integrals of fractional orders as basis functions, the speech signal can be modeled accurately. The
Previous research has shown that in different languages ironic speech is acoustically modulated compared to literal speech, and these modulations are assumed to aid the listener in the comprehension process by acting as cues that mark utterances as ironic. The present study was conducted to identify paraverbal features of German "ironic…
The study has applied speech recognition and text-mining technologies to a set of recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, a number of post-processing…
Cooper, James W.; Viswanathan, Mahesh; Byron, Donna; Chan, Margaret
An aging population still needs to access information, such as bus schedules. It is evident that they will be doing so using computers, and especially interfaces using speech input and output. This is a preliminary study of the use of synthetic speech for the elderly. In it, twenty persons between the ages of 60 and 80 were asked to listen
We systematically determined which spectrotemporal modulations in speech are necessary for comprehension by human listeners. Speech comprehension has been shown to be robust to spectral and temporal degradations, but the specific relevance of particular degradations is arguable due to the complexity of the joint spectral and temporal information in the speech signal. We applied a novel modulation filtering technique to recorded sentences to restrict acoustic information quantitatively and to obtain a joint spectrotemporal modulation transfer function for speech comprehension, the speech MTF. For American English, the speech MTF showed the criticality of low modulation frequencies in both time and frequency. Comprehension was significantly impaired when temporal modulations <12 Hz or spectral modulations <4 cycles/kHz were removed. More specifically, the MTF was bandpass in temporal modulations and low-pass in spectral modulations: temporal modulations from 1 to 7 Hz and spectral modulations <1 cycles/kHz were the most important. We evaluated the importance of spectrotemporal modulations for vocal gender identification and found a different region of interest: removing spectral modulations between 3 and 7 cycles/kHz significantly increases gender misidentifications of female speakers. The determination of the speech MTF furnishes an additional method for producing speech signals with reduced bandwidth but high intelligibility. Such compression could be used for audio applications such as file compression or noise removal and for clinical applications such as signal processing for cochlear implants.
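The modulation-filtering idea in the record above can be illustrated on a toy spectrogram: take the 2-D Fourier transform of the spectrogram, mask a range of temporal modulation frequencies, and invert. The 12 Hz cut-off comes from the abstract; the synthetic spectrogram, frame rate, and array sizes are illustrative assumptions:

```python
import numpy as np

n_freq, n_time = 64, 200
frame_rate = 100.0                          # spectrogram frames per second (assumed)
tt = np.arange(n_time) / frame_rate

# Toy spectrogram: a slow (4 Hz) and a fast (30 Hz) temporal modulation
# applied uniformly across 64 frequency channels.
spec = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * tt)
            + 0.5 * np.sin(2 * np.pi * 30 * tt))[None, :] * np.ones((n_freq, 1))

# Joint modulation spectrum: 2-D FFT of the spectrogram. Axis 1 maps to
# temporal modulation frequency (Hz), axis 0 to spectral modulation.
mod = np.fft.fft2(spec)
wt = np.fft.fftfreq(n_time, d=1.0 / frame_rate)

# Keep only temporal modulations up to 12 Hz (the region the study found
# critical for comprehension) and discard the faster ones.
mask = (np.abs(wt) <= 12.0)[None, :]
filtered = np.real(np.fft.ifft2(mod * mask))
```

Applying this to real speech additionally requires inverting the filtered spectrogram back to a waveform, which the authors' technique handles; this sketch stays in the spectrogram domain.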
A speech problem can be caused by different reasons, from psychological to organic. The existing diagnostic of speech pathologies relies on skilled doctors who can often diagnose by simply listening to the patient. We show that neural networks can simulate this ability and thus provide an automated (preliminary) diagnosis
Antonio P. Salvatore; N. Thome; C. M. Gorss; Michael P. Cannito
Although speech-language pathologists are expected to be able to administer and interpret oral examinations, there are currently no screening tests available that provide careful administration instructions and data for intra-examiner and inter-examiner reliability. The Oral Speech Mechanism Screening Examination (OSMSE) is designed primarily for…
Nearly perfect speech recognition was observed under conditions of greatly reduced spectral information. Temporal envelopes of speech were extracted from broad frequency bands and were used to modulate noises of the same bandwidths. This manipulation preserved temporal envelope cues in each band but restricted the listener to severely degraded information on the distribution of spectral energy. The identification of consonants,
Robert V. Shannon; Fan-Gang Zeng; Vivek Kamath; John Wygonski; Michael Ekelid
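The envelope-vocoding manipulation described by Shannon and colleagues can be sketched as follows. The band edges, filter orders, and envelope cut-off are illustrative assumptions rather than the study's exact parameters:

```python
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

def noise_vocode(x, fs, band_edges, env_cutoff=30.0, seed=0):
    """Replace the fine structure in each band with noise while keeping the
    band's temporal envelope (a sketch of the Shannon et al. manipulation)."""
    rng = np.random.default_rng(seed)
    out = np.zeros_like(x)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, x)
        # Envelope extraction: rectify, then low-pass.
        env_sos = butter(2, env_cutoff, btype="lowpass", fs=fs, output="sos")
        env = sosfiltfilt(env_sos, np.abs(band))
        # Modulate band-limited noise with the envelope and sum across bands.
        noise = sosfilt(sos, rng.standard_normal(len(x)))
        out += np.clip(env, 0.0, None) * noise
    return out

fs = 16000
t = np.arange(fs) / fs
# Toy input: a 500 Hz tone with a slowly varying amplitude envelope.
speechlike = np.sin(2 * np.pi * 3 * t) ** 2 * np.sin(2 * np.pi * 500 * t)
vocoded = noise_vocode(speechlike, fs, band_edges=[100, 800, 1500, 2500, 4000])
```

The output preserves the per-band envelope cues while discarding spectral fine structure, which is the degradation under which near-perfect recognition was reported.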
We present a new approach to automatic speech recognition (ASR) based on the formalism of Bayesian networks. We lay the foundations of new ASR systems for which the robustness relies on the fidelity in speech modeling and on the information contained in training data.
This study compared the self-reported speech anxiety of students who were asked to visualize themselves making an effective speech with that of students who were not asked to visualize themselves making an effective presentation. Students who were asked to visualize reported lower anxiety levels than those who were not asked to do so. It is argued that visualization is an effective, nondisruptive
Discusses a study of concord phenomena in spoken Brazilian Portuguese. Findings indicate the presence of disfluencies, including apparent corrections, in about 15% of the relevant tokens in the corpus of recorded speech data. It is concluded that speech is not overly laden with errors, and there is nothing in the data to mislead the language…
Naro, Anthony Julius; Scherre, Maria Marta Pereira
The role of attention in speech comprehension is not well understood. We used fMRI to study the neural correlates of auditory word, pseudoword, and nonspeech (spectrally rotated speech) perception during a bimodal (auditory, visual) selective attention task. In three conditions, Attend Auditory (ignore visual), Ignore Auditory (attend visual), and Visual (no auditory stimulation), 28 subjects performed a one-back matching task
Merav Sabri; Jeffrey R. Binder; Rutvik Desai; David A. Medler; Michael D. Leitl; Einat Liebenthal
Generally, the speaking aspect is not properly debated when discussing the positive and negative effects of television (TV), especially on children. So, to highlight this point, this study was first initialized by asking the question: "What are the effects of TV on speech?" and secondly, to transform the effects that TV has on speech in a…
The majority of automatic speech recognition (ASR) systems rely on hidden Markov models, in which Gaussian mixtures model the output distributions associated with sub-phone states. This approach, whilst successful, models consecutive feature vectors (augmented to include derivative information) as statistically independent. Furthermore, spatial correlations present in speech parameters are frequently ignored through the use of diagonal covariance matrices. This
This paper reports a simple nonlinear approach to online acoustic speech pathology detection for automatic screening purposes. Straightforward linear preprocessing followed by two nonlinear measures, based parsimoniously upon the biophysics of speech production, combined with subsequent linear classification, achieves an overall normal/pathological detection performance of 91.4%, and over 99% with rejection of 15% ambiguous cases. This compares favourably with more
A novel technique for speaker independent automated speech recognition is proposed. We take a segment model approach to Automated Speech Recognition (ASR), considering the trajectory of an utterance in vector space, then classify using a modified Probabilistic Neural Network (PNN) and maximum likelihood rule. The system performs favourably with established techniques. Our system achieves in excess of 94% with isolated
The currently dominant speech recognition technology, hidden Markov modeling, has long been criticized for its simplistic assumptions about speech, and especially for the naive Bayes combination rule inherent in it. Many sophisticated alternative models have been suggested over the last decade. These, however, have demonstrated only modest improvements and brought no paradigm shift in technology. The goal of this paper
The author discusses the content of John Milton's "Areopagitica: A Speech for the Liberty of Unlicensed Printing to the Parliament of England" (1985) and draws parallels with censorship practiced in higher education. Originally published in 1644, "Areopagitica" makes a powerful--and precocious--argument for freedom of speech and against…
The difficulties and the advantages of giving oral critiques of student speeches are discussed. The advantages of using oral critiques are such that many speech teachers will want to include them. It is argued in this paper that the difficulties associated with oral critiques can be overcome by using communication messages that are intended to…
According to Atkinson, speakers at political meetings invite applause through rhetorical devices, which indicate when and where applause is appropriate. Hence, speech and applause are characterized by a high degree of synchronization. Thus, incidences of unsynchronized applause are of considerable theoretical interest. An analysis of such mismatches is reported based on six speeches delivered by the three leaders of the
In order to test the internal evaluative processes and not merely the final reactions of an audience to a speaker, 97 Caucasian college students expressed their attitudes toward Malcolm X while listening to a 25-minute tape-recorded speech by him. Eight 30-second silent intervals at natural pauses in the speech gave the students time to respond…
The study investigated the Lombard effect (evoking increased speech intensity by applying masking noise to the ears of the talker) on the speech of esophageal talkers, artificial larynx users, and normal speakers. The noise condition produced the highest intensity increase in the esophageal speakers. (Author/DB)
Impromptu speech is characterized by the simultaneous processes of ideation (the elaboration and structuring of reasoning by the speaker as he improvises) and expression in the speaker. Other elements accompany this characteristic: division of speech flow into short segments, acoustic relief in the form of word stress following a pause, and both…
Purpose: To evaluate the effects of sequential and alternating repetition on speech-sound discrimination. Method: Typically hearing adults' discrimination of 3 pairs of speech-sound contrasts was assessed at 3 signal-to-noise ratios using the change/no-change procedure. On change trials, the standard and comparison stimuli differ; on no-change…
In the speech technology research community there is an increasing trend to use open source solutions. We present a new tool in that spirit, WaveSurfer, which has been developed at the Centre for Speech Technology at KTH. It has been designed for tasks such as viewing, editing, and labeling of audio data. WaveSurfer is built around a small core to
Models of speech perception are in general agreement with respect to the major cortical regions involved, but lack precision with regard to localization and lateralization of processing units. To refine these models we conducted two Activation Likelihood Estimation (ALE) meta-analyses of the neuroimaging literature on sublexical speech perception.…
Parametric watermarking is effected by modifying the linear predictor coefficients of speech. In this work, the parameter noise is analyzed when watermarked speech is subjected to additive white and colored noise in the time domain. The paper presents two detection techniques for parametric watermarking. The first approach uses the Neyman-Pearson criterion to solve a binary decision problem.
Every language has some way of reporting what someone else has said. To express what Jakobson [Jakobson, R., 1990. "Shifters, categories, and the Russian verb. Selected writings". "Word and Language". Mouton, The Hague, Paris, pp. 130-153] called "speech within speech", the speaker can use their own words, recasting the original text as their…
Speech is the most important form of human communication but ambient sounds and competing talkers often degrade its acoustics. Fortunately the brain can use visual information, especially its highly precise spatial information, to improve speech comprehension in noisy environments. Previous studies have demonstrated that audiovisual integration depends strongly on spatiotemporal factors. However, some integrative phenomena such as McGurk interference persist
Despite their remarkable clinical success, cochlear-implant listeners today still receive spectrally degraded information. Much research has examined normally hearing adult listeners' ability to interpret spectrally degraded signals, primarily using noise-vocoded speech to simulate cochlear implant processing. Far less research has explored infants' and toddlers' ability to interpret spectrally degraded signals, despite the fact that children in this age range are frequently implanted. This study examines 27-month-old typically developing toddlers' recognition of noise-vocoded speech in a language-guided looking study. Children saw two images on each trial and heard a voice instructing them to look at one item ("Find the cat!"). Full-spectrum sentences or their noise-vocoded versions were presented with varying numbers of spectral channels. Toddlers showed equivalent proportions of looking to the target object with full-speech and 24- or 8-channel noise-vocoded speech; they failed to look appropriately with 2-channel noise-vocoded speech and showed variable performance with 4-channel noise-vocoded speech. Despite accurate looking performance for speech with at least eight channels, children were slower to respond appropriately as the number of channels decreased. These results indicate that 2-yr-olds have developed the ability to interpret vocoded speech, even without practice, but that doing so requires additional processing. These findings have important implications for pediatric cochlear implantation. PMID:23297920
Some of our research efforts towards building Automatic Speech Recognition (ASR) systems designed to work in real-world conditions are presented. The methods we pro- pose exhibit improved performance in noisy environments and offer robustness against speaker variability. Advanced nonlinear signal processing techniques, modulation- and chaotic-based, are utilized for auditory feature extraction. The auditory features are complemented with visual speech cues
D. Dimitriadis; N. Katsamanis; P. Maragos; G. Papandreou; V. Pitsikalis
The purpose of this survey was to assess the current supply of speech and hearing manpower, and to develop a profile of characteristics for these professionals in the state of Washington. A mail-back questionnaire containing both open-ended and structured questions was used to survey all the active and inactive speech and hearing professionals in…
Intended for parents, the booklet focuses on the speech and language development of children with language delays. The following topics are among those considered: the parent's role in the initial diagnosis of deafness, intellectual handicap, and neurological difficulties; diagnoses and single causes of difficulty with speech; what to say to…
We present the first experimental evidence of a phenomenon in speech communication we call "analog acoustic expression." Speech is generally thought of as conveying information in two distinct ways: discrete linguistic-symbolic units such as words and sentences represent linguistic meaning, and continuous prosodic forms convey information about…
The concept of "monitoring" refers to our ability to control our actions on-line. Monitoring involved in speech production is often described in psycholinguistic models as an inherent part of the language system. We probed the specificity of speech monitoring in two psycholinguistic experiments where electroencephalographic activities were…
Ries, Stephanie; Janssen, Niels; Dufau, Stephane; Alario, F.-Xavier; Burle, Boris
Proposes a taxonomy of functions for direct reports of speech (and of writing and thought) in focus-group discussions. Reported speech both depicts the experience of the original utterance and detaches the reported utterance from the reporting speaker. (Author/VWL)
Introduction: When compared to human speech, synthesized speech is distinguished by insufficient intelligibility, inappropriate prosody, and inadequate expressiveness. These are serious drawbacks for conversational computer systems. Intelligibility is basic: intelligible phonemes are necessary for word recognition. Prosody (intonation, or melody, and rhythm) clarifies syntax and semantics and aids in discourse flow control. Expressiveness, or affect, provides information about the...
We report a 53-year-old patient (AWF) who has an acquired deficit of audiovisual speech integration, characterized by a perceived temporal mismatch between speech sounds and the sight of moving lips. AWF was less accurate on an auditory digit span task with vision of a speaker's face as compared to a condition in which no visual information from…
Hamilton, Roy H.; Shenton, Jeffrey T.; Coslett, H. Branch
The back-propagation neural network learning procedure was applied to the analysis and recognition of speech. Because this learning procedure requires only examples of input-output pairs, it is not necessary to provide it with any initial description of speech features. Rather, the network develops its own set of representational features…
In this paper we present an overview of LIPS2008: Visual Speech Synthesis Challenge. The aim of this challenge is to bring together researchers in the field of visual speech synthesis to firstly evaluate their systems within a common framework, and secondly to identify the needs of the wider community in terms of evaluation. In doing so we hope to better
Amplitude modulation (AM) and frequency modulation (FM) are commonly used in communication, but their relative contributions to speech recognition have not been fully explored. To bridge this gap, we derived slowly varying AM and FM from speech sounds and conducted listening tests using stimuli with different modulations in normal-hearing and cochlear-implant subjects. We found that although AM from a limited
Fan-Gang Zeng; Kaibao Nie; Ginger S. Stickney; Ying-Yee Kong; Michael Vongphoe; Ashish Bhargave; Chaogang Wei; Keli Cao
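One standard way to derive slowly varying AM and FM from a signal (not necessarily the exact procedure used in this study) is the analytic signal: its magnitude gives the amplitude envelope, and the derivative of its unwrapped phase gives the instantaneous frequency. A sketch on a synthetic tone, with all parameters chosen purely for illustration:

```python
import numpy as np
from scipy.signal import hilbert

fs = 8000
t = np.arange(fs) / fs               # one second of signal
# 500 Hz carrier with 4 Hz amplitude modulation and 2 Hz frequency
# modulation of +/- 50 Hz depth (phase = integral of instantaneous frequency).
am = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)
phase = 2 * np.pi * 500 * t + 25.0 * np.sin(2 * np.pi * 2 * t)
x = am * np.cos(phase)

analytic = hilbert(x)                          # x + j * Hilbert(x)
env = np.abs(analytic)                         # slowly varying AM
inst_phase = np.unwrap(np.angle(analytic))
inst_freq = np.gradient(inst_phase, t) / (2 * np.pi)   # instantaneous frequency, Hz
```

In vocoder-style experiments of this kind, the same decomposition is applied per frequency band, and the recovered AM and FM streams are then used to modulate new carriers presented to the listener.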
This paper reviews past work comparing modern speech recognition systems and humans to determine how far recent dramatic advances in technology have progressed towards the goal of human-like performance. Comparisons use six modern speech corpora with vocabularies ranging from 10 to more than 65,000 words and content ranging from read isolated words to spontaneous conversations. Error rates of machines are
Designed to provide an essential core of information, this book treats normal and abnormal development, structure, and function of the lips and palate and their relationships to cleft lip and cleft palate speech. Problems of personal and social adjustment, hearing, and speech in cleft lip or cleft palate individuals are discussed. Nasal resonance…
Speech emotion carries high-level semantic information, and its automatic analysis may have many applications, such as smart human-computer interaction or multimedia indexing. As a pattern recognition problem, feature selection and classifier structure are two important aspects of automatic speech emotion classification. In this paper, we propose a novel feature selection scheme based on evidence theory.
In this paper we develop a subspace speech enhancement approach for estimating a signal which has been degraded by additive uncorrelated noise. This problem has numerous applications such as in hearing aids and automatic speech recognition in noisy environments. The proposed approach utilizes the multistage Wiener filter (MWF). This filter is constructed from a Krylov subspace associated with the Wiener
New challenges for telecommunications operators include the development of methods and tools to improve customer knowledge and care. The diversity of usage patterns and commercial offers, together with the high degree of competitiveness in the voice-services sector, makes the relationship between customer satisfaction and speech quality more complex than ever. A review of how speech quality and its assessment
This paper outlines the general strategies followed in developing the CMU (Carnegie Mellon University) speech understanding system. Our system is oriented toward the extraction of information relevant to a task. It uses a flexible frame-based parser. Our system handles phenomena that are natural in spontaneous speech, for example, restarts, repeats and grammatically ill-formed utterances. It maintains a history of the
The purposes of this paper are (1) to review the background and nature of hypnosis, (2) to synthesize research on hypnosis related to speech communication, and (3) to delineate and compare two potential techniques for reducing speech anxiety--hypnosis and systematic desensitization. Hypnosis has been defined as a mental state characterised by…
The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesiser from that data. Because only limited data can be collected, and the domain of that data is constrained, it is difficult to obtain the type of phonetically-balanced corpus usually used in speech synthesis. As a consequence, building a synthe-
Oliver Watts; Junichi Yamagishi; Kay Berkling; Simon King
In this paper we present a method for extracting a speech signal of a target speaker from noisy convolutive mixtures of target speech and an interference source, when training utterances of the target speaker are available. We incorporate a statistical latent variable model into blind source separation (BSS), where we make use of spectral bases learned from the training utterances
Relatively little is known about the acoustical modifications speakers employ to meet the various constraints (auditory, linguistic, and otherwise) of their listeners. Similarly, the manner by which perceived listener constraints interact with speakers' adoption of specialized speech registers is poorly understood. Hyper- and hypospeech (H&H) theory offers a framework for examining the relationship between speech production and output-oriented goals for communication, suggesting that under certain circumstances speakers may attempt to minimize phonetic ambiguity by employing a "hyperarticulated" speaking style (Lindblom, 1990). It remains unclear, however, what the acoustic correlates of hyperarticulated speech are, and how, if at all, we might expect phonetic properties to change respective to different listener-constrained conditions. This paper is part of a preliminary investigation concerned with comparing the prosodic characteristics of speech produced across a range of listener constraints. Analyses are drawn from a corpus of read hyperarticulated speech data comprising eight adult, female speakers of English. Specialized registers include speech to foreigners, infant-directed speech, speech produced under noisy conditions, and human-machine interaction. The authors gratefully acknowledge financial support of the Irish Higher Education Authority, allocated to Fred Cummins for collaborative work with Media Lab Europe.
This study investigates pitch range variation in the affective speech of bilingual and monolingual children. Cross-linguistic differences in affective speech may lead bilingual children to express emotions differently in their two different languages. A cross-linguistically comparable corpus of 6 bilingual Scottish- French children and 12 monolingual peers was recorded ac- cording to the developed methodology. The results show that the
To discriminate and to recognize sound sources in a noisy, reverberant environment, listeners need to perceptually integrate the direct wave with the reflections of each sound source. It has been confirmed that perceptual fusion between direct and reflected waves of a speech sound helps listeners recognize this speech sound in a simulated reverberant environment with disrupting sound sources. When the
A study evaluated the Speech-Language Program in school district #68, Nanaimo, British Columbia, Canada. An external evaluator visited the district and spent 4 consecutive days in observing speech-language pathologists (SLPs), interviewing teachers, parents, administrators, and examining records. Results indicated an extremely positive response to…
Over the past several years there has been considerable attention focused on the problem of enhancement and bandwidth compression of speech degraded by additive background noise. This interest is motivated by several factors including a broad set of important applications, the apparent lack of robustness in current speech-compression systems and the development of several potentially promising and practical solutions. One
Three experiments elicited phonological speech errors using the SLIP procedure to investigate whether there is a tendency for speech errors on specific words to reoccur, and whether this effect can be attributed to implicit learning of an incorrect mapping from lemma to phonology for that word. In Experiment 1, when speakers made a phonological…
Humphreys, Karin R.; Menzies, Heather; Lake, Johanna K.
Infants prefer speech to non-vocal sounds and to non-human vocalizations, and they prefer happy-sounding speech to neutral speech. They also exhibit an interest in singing, but there is little knowledge of their relative interest in speech and singing. The present study explored infants' attention to unfamiliar audio samples of speech and singing. In Experiment 1, infants 4-13 months of age were exposed to happy-sounding infant-directed speech vs. hummed lullabies by the same woman. They listened significantly longer to the speech, which had considerably greater acoustic variability and expressiveness, than to the lullabies. In Experiment 2, infants of comparable age who heard the lyrics of a Turkish children's song spoken vs. sung in a joyful/happy manner did not exhibit differential listening. Infants in Experiment 3 heard the happily sung lyrics of the Turkish children's song vs. a version that was spoken in an adult-directed or affectively neutral manner. They listened significantly longer to the sung version. Overall, happy voice quality rather than vocal mode (speech or singing) was the principal contributor to infant attention, regardless of age. PMID:23805119
Immediate serial recall of visually presented verbal stimuli is impaired by the presence of irrelevant auditory background speech, the so-called irrelevant speech effect. Two of the three main accounts of this effect place restrictions on when it will be observed, limiting its occurrence either to items processed by the phonological loop (the phonological loop hypothesis) or to items that are
Ian Neath; Katherine Guérard; Annie Jalbert; Tamra J. Bireta; Aimée M. Surprenant
A non-acoustic sensor is used to measure a user's speech and then broadcasts an obscuring acoustic signal diminishing the user's vocal acoustic output intensity and/or distorting the voice sounds making them unintelligible to persons nearby. The non-acoustic sensor is positioned proximate or contacting a user's neck or head skin tissue for sensing speech production information.
The handbook contains State Education Department rules and regulations that govern speech-language pathology and audiology in New York State. The handbook also describes licensure and first registration as a licensed speech-language pathologist or audiologist. The introduction discusses professional regulation in New York State while the second…
New York State Education Dept., Albany. Office of the Professions.
In this paper we propose a technique for a syllable-based speech synthesis system. While syllable-based synthesizers produce better-sounding speech than diphone- or phone-based ones, coverage of all syllables is a non-trivial issue. We address the issue of syllable coverage by approximating the syllable when the required syllable is not found. To verify our hypothesis, we conducted
E. Veera Raghavendra; B. Yegnanarayana; Kishore Prahallad
The acoustic approach to speech recognition has an important advantage over the pattern recognition approach: it has lower complexity because it does not require explicit structures such as hidden Markov models. In this work, we show how to characterize some phonetic classes of the Italian language in order to obtain a speaker- and vocabulary-independent speech recognition system. A
Francesco Beritelli; Luca Borrometi; Antonino Cuce
Current Automatic Speech Recognition (ASR) systems fall well short of human speech recognition performance because they lack robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them, and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as 'beads-on-a-string', where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and their variation arises from a combination of factors including speech style and speaking rate, a phenomenon commonly known as 'coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task from the speech signal. Presently no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs.
Two natural speech databases: X-ray microbeam and Aurora-2 were annotated, where the former was used to train a TV-estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observation: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs (estimated from the acoustic speech signal). In this setup the articulatory gestures were modeled as hidden random variables, hence eliminating the necessity for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only can help to account for coarticulatory variations but can also significantly improve the noise robustness of ASR system.
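The MFCC observations mentioned above are a standard front end, concrete enough to sketch. Below is a minimal single-frame MFCC computation in NumPy; the function name, filter count, and cepstral order are illustrative assumptions, not the dissertation's actual configuration:

```python
import numpy as np

def mfcc_frame(frame, sr=16000, n_filt=20, n_ceps=13):
    """Minimal MFCC computation for one windowed frame (illustrative only)."""
    # Power spectrum of the frame
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Mel-spaced triangular filterbank
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2))
    fbank = np.zeros((n_filt, len(freqs)))
    for i in range(n_filt):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (freqs - lo) / (ctr - lo)
        down = (hi - freqs) / (hi - ctr)
        fbank[i] = np.maximum(0.0, np.minimum(up, down))
    # Log filterbank energies, then DCT-II to decorrelate
    fb_energy = np.log(fbank @ spec + 1e-10)
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filt))
    return dct @ fb_energy
```

In a full front end this would be applied to overlapping windowed frames, with deltas appended.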
Neurofibromatosis type 1 (NF1) is an autosomal dominant neurocutaneous disorder with an estimated prevalence of about 1/3000. Several authors mention the occurrence of various types of speech abnormalities associated with NF1. The present study investigated speech fluency in 21 Dutch-speaking adults with NF1. Speech samples were collected in five different speaking modalities (spontaneous speech, monologue, repetition, automatic series and reading) and subsequently analysed for type, number and distribution of dysfluencies. It was found that dysfluencies are a common feature of the speech of individuals with NF1. Although stuttering appears to occur in some NF1 patients, as a group, they display a dysfluency pattern that is not identical to stuttering. Educational objectives: The reader will be able to (1) summarize the clinical characteristics, prevalence and genetics of neurofibromatosis type 1 (NF1) and (2) describe the dysfluent behaviour displayed by individuals with NF1 regarding frequency, type and distribution of the dysfluencies. PMID:20412983
Cosyns, Marjan; Mortier, Geert; Janssens, Sandra; Saharan, Nidhi; Stevens, Elien; Van Borsel, John
The integration of speech recognition with natural language understanding raises issues of how to adapt natural language processing to the characteristics of spoken language; how to cope with errorful recognition output, including the use of natural language information to reduce recognition errors; and how to use information from the speech signal, beyond just the sequence of words, as an aid to understanding. This paper reviews current research addressing these questions in the Spoken Language Program sponsored by the Advanced Research Projects Agency (ARPA). I begin by reviewing some of the ways that spontaneous spoken language differs from standard written language and discuss methods of coping with the difficulties of spontaneous speech. I then look at how systems cope with errors in speech recognition and at attempts to use natural language information to reduce recognition errors. Finally, I discuss how prosodic information in the speech signal might be used to improve understanding.
Speech is the most important form of human communication but ambient sounds and competing talkers often degrade its acoustics. Fortunately the brain can use visual information, especially its highly precise spatial information, to improve speech comprehension in noisy environments. Previous studies have demonstrated that audiovisual integration depends strongly on spatiotemporal factors. However, some integrative phenomena such as McGurk interference persist even with gross spatial disparities, suggesting that spatial alignment is not necessary for robust integration of audiovisual place-of-articulation cues. It is therefore unclear how speech-cues interact with audiovisual spatial integration mechanisms. Here, we combine two well established psychophysical phenomena, the McGurk effect and the ventriloquist's illusion, to explore this dependency. Our results demonstrate that conflicting spatial cues may not interfere with audiovisual integration of speech, but conflicting speech-cues can impede integration in space. This suggests a direct but asymmetrical influence between ventral 'what' and dorsal 'where' pathways. PMID:21909378
Although adaptive coding is in widespread use, the availability of very large scale integrated digital signal processing chips makes filterbank analysis and synthesis of speech signals very economical. This ear modeling simplifies the application of masking properties of the ear. Experiments were conducted to determine the number of filterbank outputs needed to reconstruct speech signals using a logarithmic bandpass filterbank consisting of an analysis and a synthesis filterbank. With eight outputs, the original speech sentence is found to be very clear; addition of 32 more outputs restores the breathiness to the sentence. A surprising degree of intelligibility is retained even with one output. The bandwidth of the speech signals is limited. The use of these filterbanks for speech enhancement and modest bitrate transmission appears favorable.
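The analysis/synthesis idea in this record, splitting speech into logarithmically spaced bands and reconstructing by summation, can be sketched with FFT-domain masking standing in for real bandpass filters; the function name, band count, and band edges are illustrative:

```python
import numpy as np

def log_filterbank_split(x, sr=8000, n_bands=8, f_lo=100.0):
    """Split a signal into logarithmically spaced bands via FFT masking
    (an illustrative stand-in for the analysis filterbank)."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    edges = np.geomspace(f_lo, sr / 2.0, n_bands + 1)  # log-spaced band edges
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(np.fft.irfft(X * mask, n=len(x)))
    return bands
```

Because the masks partition the passband, summing the analysis outputs (the synthesis step) reconstructs the in-band signal exactly; dropping bands degrades the reconstruction gracefully, as the record describes.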
How the human auditory system extracts perceptually relevant acoustic features of speech is unknown. To address this question, we used intracranial recordings from nonprimary auditory cortex in the human superior temporal gyrus to determine what acoustic information in speech sounds can be reconstructed from population neural activity. We found that slow and intermediate temporal fluctuations, such as those corresponding to syllable rate, were accurately reconstructed using a linear model based on the auditory spectrogram. However, reconstruction of fast temporal fluctuations, such as syllable onsets and offsets, required a nonlinear sound representation based on temporal modulation energy. Reconstruction accuracy was highest within the range of spectro-temporal fluctuations that have been found to be critical for speech intelligibility. The decoded speech representations allowed readout and identification of individual words directly from brain activity during single trial sound presentations. These findings reveal neural encoding mechanisms of speech acoustic parameters in higher order human auditory cortex.
Pasley, Brian N.; David, Stephen V.; Mesgarani, Nima; Flinker, Adeen; Shamma, Shihab A.; Crone, Nathan E.; Knight, Robert T.; Chang, Edward F.
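The linear reconstruction model described in this record is, at its core, a regularized linear map from population neural activity to the auditory spectrogram. A hedged ridge-regression sketch; the variable names and regularizer are assumptions, not the authors' exact estimator:

```python
import numpy as np

def ridge_reconstruct(neural, stim, lam=1.0):
    """Fit a linear map from neural activity (time x channels) to stimulus
    features (time x spectrogram bins) via ridge regression, and return the
    reconstruction on the training data along with the weights."""
    k = neural.shape[1]
    W = np.linalg.solve(neural.T @ neural + lam * np.eye(k), neural.T @ stim)
    return neural @ W, W
```

In practice the weights would be fit with cross-validated `lam` and evaluated on held-out trials, which is where reconstruction accuracy figures come from.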
Improvement or maintenance of speech intelligibility is a central aim in a whole range of conditions in speech-language therapy, both developmental and acquired. Best clinical practice and pursuance of the evidence base for interventions would suggest measurement of intelligibility forms a vital role in clinical decision-making and monitoring. However, what should be measured to gauge intelligibility and how this is achieved and relates to clinical planning continues to be a topic of debate. This review considers the strengths and weaknesses of selected clinical approaches to intelligibility assessment, stressing the importance of explanatory, diagnostic testing as both a more sensitive and a clinically informative method. The worth of this, and any approach, is predicated, though, on awareness and control of key design, elicitation, transcription and listening/listener variables to maximize validity and reliability of assessments. These are discussed. A distinction is drawn between signal-dependent and -independent factors in intelligibility evaluation. Discussion broaches how these different perspectives might be reconciled to deliver comprehensive insights into intelligibility levels and their clinical/educational significance. The paper ends with a call for wider implementation of best practice around intelligibility assessment. PMID:24119170
It is well established that in the majority of the population language processing is lateralized to the left hemisphere. Evidence suggests that lateralization is also present in the brainstem. In the current study, the syllable /da/ was presented monaurally to the right and left ears and electrophysiological responses from the brainstem were recorded in adults with symmetrical interaural click-evoked responses. Responses to the right-ear presentation occurred earlier than those to left-ear presentation in two peaks of the frequency following response (FFR) and approached significance for the third peak of the FFR and the offset peak. Interestingly, there were no differences in interpeak latencies indicating the response to right-ear presentation simply occurred earlier over this region. Analyses also showed more robust frequency encoding when stimuli were presented to the right ear than the left ear. The effect was found for the harmonics of the fundamental that correspond to the first formant of the stimulus, but was not seen in the fundamental frequency range. The results suggest that left lateralization of processing acoustic elements important for discriminating speech extends to the auditory brainstem and that these effects are speech specific.
There is a lack of agreement on the features used to differentiate Childhood Apraxia of Speech (CAS) from Phonological Disorders (PD). One criterion which has gained consensus is lexical inconsistency of speech (ASHA, 2007); however, no accepted measure of this feature has been defined. Although lexical assessment provides information about consistency of an item across repeated trials, it may not capture the magnitude of inconsistency within an item. In contrast, segmental analysis provides more extensive information about consistency of phoneme usage across multiple contexts and word-positions. The current research compared segmental and lexical inconsistency metrics in preschool-aged children with PD, CAS, and typical development (TD) to determine how inconsistency varies with age in typical and disordered speakers, and whether CAS and PD were differentiated equally well by both assessment levels. Whereas lexical and segmental analyses may be influenced by listener characteristics or speaker intelligibility, the acoustic signal is less vulnerable to these factors. In addition, the acoustic signal may reveal information which is not evident in the perceptual signal. A second focus of the current research was motivated by Blumstein et al.'s (1980) classic study on voice onset time (VOT) in adults with acquired apraxia of speech (AOS) which demonstrated a motor impairment underlying AOS. In the current study, VOT analyses were conducted to determine the relationship between age and group with the voicing distribution for bilabial and alveolar plosives. Findings revealed that 3-year-olds evidenced significantly higher inconsistency than 5-year-olds; segmental inconsistency approached 0% in 5-year-olds with TD, whereas it persisted in children with PD and CAS, suggesting that for children in this age range, inconsistency is a feature of speech disorder rather than typical development (Holm et al., 2007).
Likewise, whereas segmental and lexical inconsistency were moderately-highly correlated, even the most highly-related segmental and lexical measures agreed on only 76% of classifications (i.e., to CAS and PD). Finally, VOT analyses revealed that CAS utilized a distinct distribution pattern relative to PD and TD. Discussion frames the current findings within a profile of CAS and provides a validated list of criteria for the differential diagnosis of CAS and PD.
This paper introduces a combinational feature extraction approach to improve speech recognition systems. The main idea is to simultaneously benefit from some features obtained from Poincaré section applied to speech reconstructed phase space (RPS) and typical Mel frequency cepstral coefficients (MFCCs) which have a proved role in speech recognition field. With an appropriate dimension, the reconstructed phase space of speech signal is assured to be topologically equivalent to the dynamics of the speech production system, and could therefore include information that may be absent in linear analysis approaches. Moreover, complicated systems such as speech production system can present cyclic and oscillatory patterns and Poincaré sections could be used as an effective tool in analysis of such trajectories. In this research, a statistical modeling approach based on Gaussian mixture models (GMMs) is applied to Poincaré sections of speech RPS. A final pruned feature set is obtained by applying an efficient feature selection approach to the combination of the parameters of the GMM model and MFCC-based features. A hidden Markov model-based speech recognition system and TIMIT speech database are used to evaluate the performance of the proposed feature set by conducting isolated and continuous speech recognition experiments. By the proposed feature set, 5.7% absolute isolated phoneme recognition improvement is obtained against only MFCC-based features.
Jafari, Ayyoob; Almasganj, Farshad; Bidhendi, Maryam Nabi
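The reconstructed phase space (RPS) and Poincaré section at the heart of this feature set can be made concrete with a time-delay embedding plus a crossing-based section; the dimension, delay, and crossing plane below are illustrative choices, not the paper's parameters:

```python
import numpy as np

def reconstruct_phase_space(x, dim=3, delay=10):
    """Time-delay embedding of a 1-D signal into an RPS (Takens-style),
    the representation on which Poincare sections are taken."""
    n = len(x) - (dim - 1) * delay
    return np.stack([x[i * delay : i * delay + n] for i in range(dim)], axis=1)

def poincare_section(points, axis=0, level=0.0):
    """Keep trajectory points where the chosen coordinate crosses `level`
    upward (a simple approximation of a Poincare section)."""
    v = points[:, axis]
    cross = (v[:-1] < level) & (v[1:] >= level)
    return points[1:][cross]
```

In the paper's pipeline, section points like these would then be modeled with a GMM whose parameters join the MFCCs in the combined feature set.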
Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.
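The data-driven likelihood computation that underlies HMM recognition can be made concrete with the scaled forward algorithm for a discrete-output HMM; this is a generic textbook sketch, not the author's articulatory model:

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Scaled forward algorithm: log-likelihood of an observation sequence
    under a discrete-output HMM.
    pi: initial state probabilities, A: state transitions, B: emissions."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()           # rescale to avoid numerical underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        s = alpha.sum()
        loglik += np.log(s)
        alpha /= s
    return loglik
```

Recognition then reduces to evaluating this likelihood under each word or phone model and picking the best; training adjusts pi, A, and B to maximize it over data.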
Surveillance registers monitor the prevalence of cerebral palsy and the severity of resulting impairments across time and place. The motor disorders of cerebral palsy can affect children's speech production and limit their intelligibility. We describe the development of a scale to classify children's speech performance for use in cerebral palsy surveillance registers, and its reliability across raters and across time. Speech and language therapists, other healthcare professionals and parents classified the speech of 139 children with cerebral palsy (85 boys, 54 girls; mean age 6.03 years, SD 1.09) from observation and previous knowledge of the children. Another group of health professionals rated children's speech from information in their medical notes. With the exception of parents, raters reclassified children's speech at least four weeks after their initial classification. Raters were asked to rate how easy the scale was to use and how well the scale described the child's speech production using Likert scales. Inter-rater reliability was moderate to substantial (k>.58 for all comparisons). Test-retest reliability was substantial to almost perfect for all groups (k>.68). Over 74% of raters found the scale easy or very easy to use; 66% of parents and over 70% of health care professionals judged the scale to describe children's speech well or very well. We conclude that the Viking Speech Scale is a reliable tool to describe the speech performance of children with cerebral palsy, which can be applied through direct observation of children or through case note review. PMID:23891732
Pennington, Lindsay; Virella, Daniel; Mjøen, Tone; da Graça Andrada, Maria; Murray, Janice; Colver, Allan; Himmelmann, Kate; Rackauskaite, Gija; Greitane, Andra; Prasauskiene, Audrone; Andersen, Guro; de la Cruz, Javier
Previous studies have shown that infant-directed speech ('motherese') exhibits overemphasized acoustic properties which may facilitate the acquisition of phonetic categories by infant learners. It has been suggested that the use of infant-directed data for training automatic speech recognition systems might also enhance the automatic learning and discrimination of phonetic categories. This study investigates the properties of infant-directed vs. adult-directed speech from the point of view of the statistical pattern recognition paradigm underlying automatic speech recognition. Isolated-word speech recognizers were trained on adult-directed vs. infant-directed data sets and were tested on both matched and mismatched data. Results show that recognizers trained on infant-directed speech did not always exhibit better recognition performance; however, their relative loss in performance on mismatched data was significantly less severe than that of recognizers trained on adult-directed speech and presented with infant-directed test data. An analysis of the statistical distributions of a subset of phonetic classes in both data sets showed that this pattern is caused by larger class overlaps in infant-directed speech. This finding has implications for both automatic speech recognition and theories of infant speech perception.
In this paper, a new feature enhancement algorithm called model-based feature enhancement (MBFE) is introduced for noise robust speech recognition. In MBFE, statistical models (i.e., Gaussian HMM's) of the clean speech feature vectors and of the perturbing noise feature vectors are used to construct the optimal MMSE estimator of the clean speech feature vectors. The estimated clean speech features are
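With a single Gaussian for speech and one for noise (rather than the per-state HMMs MBFE actually uses), the MMSE estimator of the clean features has a closed form; a hedged sketch with illustrative names:

```python
import numpy as np

def mmse_clean_estimate(y, mu_x, cov_x, mu_n, cov_n):
    """MMSE estimate of the clean feature vector x from noisy y = x + n,
    with Gaussian models for speech (mu_x, cov_x) and noise (mu_n, cov_n):
    E[x|y] = mu_x + cov_x (cov_x + cov_n)^-1 (y - mu_x - mu_n)."""
    gain = cov_x @ np.linalg.inv(cov_x + cov_n)
    return mu_x + gain @ (y - mu_x - mu_n)
```

The HMM case in MBFE mixes such state-conditional estimates, weighted by the posterior probability of each speech and noise state given y.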
We present an algorithm to perform blind, one-microphone speech separation. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We
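Segmenting the spectrogram into disjoint sets amounts to assigning each time-frequency cell to exactly one source. A binary-mask sketch, shown here with oracle source magnitudes for clarity (the paper's segmenter is learned and blind):

```python
import numpy as np

def binary_mask_separate(mix_spec, mag_a, mag_b):
    """Partition the mixture spectrogram into two disjoint sets of
    time-frequency cells, assigning each cell to the dominant source."""
    mask = mag_a > mag_b            # True where source A dominates
    return mix_spec * mask, mix_spec * ~mask
```

Because the two masks are complements, the separated spectrograms sum exactly to the mixture, which is the "disjoint sets" property the formulation relies on.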
This paper presents an overview of different approaches for providing automatic speech recognition (ASR) technology to mobile users. Three principal system architectures are analyzed in terms of how they employ the wireless communication link: Embedded Speech Recognition Systems, Network Speech Recognition (NSR), and Distributed Speech Recognition (DSR). An overview of the solutions that have by now become standards, as well as some critical analysis
Human listeners are able to understand speech in the presence of a noisy background. How to simulate this perceptual ability remains a great challenge. This paper describes a preliminary evaluation of intelligibility of the output of a monaural speech segregation system. The system performs speech segregation in two stages. The first stage segregates voiced speech using supervised learning of harmonic
Ke Hu; Pierre Divenyi; Dan Ellis; Zhaozhang Jin; Barbara G. Shinn-Cunningham; DeLiang Wang
This paper presents an application of missing data techniques in speech enhancement. The enhancement system consists of two stages: the first stage uses a recurrent neural network, which is supplied with noisy speech and produces enhanced speech; whereas the second stage uses missing data techniques to further improve the quality of enhanced speech. The results suggest that combining missing data
This paper reviews results from a number of field trials assessing speech recognition feasibility for telecommunications services. Several applications incorporating speech automation are explored: Directory Assistance Call Completion (DACC), partial speech automation of Directory Assistance (OSF - Operator Store and Forward), banking over the telephone (Money Talks) and partial speech automation of a customer calling center (PREVIU). The
Sara Basson; Stephen Springer; Cynthia Fong; Hong C. Leung; Edward Man; Michele Olson; John F. Pitrelli; Ranvir Singh; Suk Wong
Standardised tests of whole-word accuracy are popular in the speech pathology and developmental psychology literature as measures of children's speech performance. However, they may not be sensitive enough to measure changes in speech output in children with severe and persisting speech difficulties (SPSD). To identify the best ways of doing this,…
Newbold, Elisabeth Joy; Stackhouse, Joy; Wells, Bill
Reduced speech fluency is frequent in clinical paediatric populations, an unexplained finding. To investigate age-related effects on speech fluency variables, we analysed samples of narrative speech (picture description) of 308 healthy children, aged 5 to 17 years, and studied their relation with verbal fluency tasks. All studied measures showed significant developmental effects. Speech rate and verbal fluency scores increased,
Isabel Pavão Martins; Rosário Vieira; Clara Loureiro; M. Emilia Santos
Across all languages studied to date, audiovisual speech exhibits a consistent rhythmic structure. This rhythm is critical to speech perception. Some have suggested that the speech rhythm evolved "de novo" in humans. An alternative account--the one we explored here--is that the rhythm of speech evolved through the modification of rhythmic facial…
Morrill, Ryan J.; Paukner, Annika; Ferrari, Pier F.; Ghazanfar, Asif A.
Automatic speech animation remains a challenging problem that can be described as finding the optimal sequence of animation parameter configurations given some speech. In this paper we present a novel technique to automatically synthesise lip motion trajectories from a speech signal. The developed system predicts lip motion units from the speech signal and generates animation trajectories
Gregor Hofer; Junichi Yamagishi; Hiroshi Shimodaira
Comparative statistical data are presented on speech dynamic (as contrasted with lexical and rhetorical) aspects of major speech styles. Representative samples of story retelling, lectures, speeches, sermons, interviews, and panel discussions serve to determine posited differences between casual and careful speech. Data are drawn from 15,393…
In this paper, we review: (1) the acoustic and linguistic properties of children's speech for both read and spontaneous speech, and (2) the developments in automatic speech recognition for children with application to spoken dialogue and multimodal dialogue system design. First, the effect of developmental changes on the absolute values and variability of acoustic correlates is presented for read speech
Matteo Gerosa; Diego Giuliani; Shrikanth Narayanan; Alexandros Potamianos
Two experiments investigate the effectiveness of audiovisual (AV) speech cues (cues derived from both seeing and hearing a talker speak) in facilitating perceptual learning of spectrally distorted speech. Speech was distorted through an eight channel noise-vocoder which shifted the spectral envelope of the speech signal to simulate the properties…
Purpose: It is established that speaking clearly is an effective means of enhancing intelligibility. Because any signal-processing scheme modeled after known acoustic-phonetic features of clear speech will likely affect both target and competing speech, it is important to understand how speech recognition is affected when a competing speech…
Calandruccio, Lauren; Van Engen, Kristin; Dhar, Sumitrajit; Bradlow, Ann R.
Purpose: The present study examined associations of 5 endophenotypes (i.e., measurable skills that are closely associated with speech sound disorders and are useful in detecting genetic influences on speech sound production), oral motor skills, phonological memory, phonological awareness, vocabulary, and speeded naming, with 3 clinical criteria for classifying speech sound disorders: severity of speech sound disorders, our previously reported clinical subtypes (speech sound disorders alone, speech sound disorders with language impairment, and childhood apraxia of speech), and the comorbid condition of reading disorders. Participants and Method: Children with speech sound disorders and their siblings were assessed at early childhood (ages 4–7 years) on measures of the 5 endophenotypes. Severity of speech sound disorders was determined using the z score for Percent Consonants Correct—Revised (developed by Shriberg, Austin, Lewis, McSweeny, & Wilson, 1997). Analyses of variance were employed to determine how these endophenotypes differed among the clinical subtypes of speech sound disorders. Results and Conclusions: Phonological memory was related to all 3 clinical classifications of speech sound disorders. Our previous subtypes of speech sound disorders and comorbid conditions of language impairment and reading disorder were associated with phonological awareness, while severity of speech sound disorders was weakly associated with this endophenotype. Vocabulary was associated with mild versus moderate speech sound disorders, as well as comorbid conditions of language impairment and reading disorder. These 3 endophenotypes proved useful in differentiating subtypes of speech sound disorders and in validating current clinical classifications of speech sound disorders.
Lewis, Barbara A.; Avrich, Allison A.; Freebairn, Lisa A.; Taylor, H. Gerry; Iyengar, Sudha K.; Stein, Catherine M.
Summary form only given, as follows. The possibility of automatic speech recognition on computers has fuelled dreams for many years. Indeed, automatic speech recognition holds great promise if the spoken words can be recognised correctly by machines in fluent speech. We review the current state-of-the-art in automatic speech recognition and point to the new directions that research in this field
In "The Relation of Language to Mental Development and of Speech to Language Teaching," S.G. Davidson displayed several timeless insights into the role of speech in developing language and reasons for using speech as the basis for instruction for children who are deaf and hard of hearing. His understanding that speech includes more than merely…
The goal of the SpeechDat project is to develop spoken language resources for speech recognisers suited to realise voice-driven teleservices. SpeechDat created speech databases for all official languages of the European Union and some major dialectal varieties and minority languages. The size of the databases ranges between 500 and 5000 speakers. In total 20 databases are recorded
Harald Höge; Christoph Draxler; Henk van den Heuvel; Finn Tore Johansen; Eric Sanders; Herbert S. Tropf
The ability to transcribe disordered speech is a vital tool for speech-language pathologists, as accurate description of a client's speech output is needed for both diagnosis and effective intervention. Clients in the speech clinic often use sounds that are not part of the target sound system and which may, in some cases, be sounds not found in…
The majority of studies in second-language (L2) speech processing have involved unimodal (i.e., auditory) input; however, in many instances, speech communication involves both visual and auditory sources of information. Some researchers have argued that multimodal speech is the primary mode of speech perception (e.g., Rosenblum 2005). Research on…
The Center for Spoken Language Understanding (CSLU) provides free language resources to researchers and educators in all areas of speech and hearing science. These resources are of great potential value to speech scientists for analyzing speech, for diagnosing and treating speech and language problems, for researching and evaluating language technologies, and for training students in the theory and practice of
Predictive coding is a promising approach for speech coding. In this paper, we review the recent work on adaptive predictive coding of speech signals, with particular emphasis on achieving high speech quality at low bit rates (less than 10 kbits\\/s). Efficient prediction of the redundant structure in speech signals is obviously important for proper functioning of a predictive coder. It
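The efficient prediction of redundant structure that this record emphasizes comes down to solving for short-term linear-prediction coefficients from the autocorrelation, typically via Levinson-Durbin. An illustrative sketch (real coders add quantization and long-term pitch prediction):

```python
import numpy as np

def lpc(x, order):
    """Linear-prediction coefficients via the autocorrelation method
    (Levinson-Durbin recursion). Returns [1, a1, ..., ap] such that
    x[n] + a1*x[n-1] + ... + ap*x[n-p] is the prediction residual,
    together with the final prediction-error energy."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 :][: order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]   # sum_{j=1}^{i-1} a_j r_{i-j}
        k = -acc / err                        # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err
```

A predictive coder transmits the quantized residual (whose energy `err` is far below the signal energy for voiced speech) plus the predictor, which is how the bit rate drops below 10 kbit/s.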
Purpose: Advances in neurobiology are providing new opportunities to investigate the neurological systems underlying motor speech control. This study explores the perceptual characteristics of the speech of three genotypes of spinocerebellar ataxia (SCA) as manifest in four different speech tasks. Methods: Speech samples from 26 speakers with…
Sidtis, John J.; Ahn, Ji Sook; Gomez, Christopher; Sidtis, Diana
The purpose of this study is to visualize the similarities among multiple speech corpora. In order for users to easily utilize various speech corpora, we reported a visualization method based on the corpus attribute using MDS. We had proposed the eight attributes as the speech corpus features. However, these attributes contained no acoustical feature of the speech corpus. The acoustical
Kimiko Yamakawa; Hideaki Kikuchi; T. Matsui; Shuichi Itahashi
One of the drawbacks to the speech synthesis technique wherein speech parameters are directly generated from hidden Markov models (HMM-based speech synthesis) is the unnaturalness of the synthesized speech. This problem occurs owing to the rough excitation model employed during the waveform generation stage. This report introduces a new excitation approach that attempts to solve this problem. The
Developed as a high school quinmester unit on persuasive speaking, this guide provides the teacher with teaching strategies for a course which analyzes speeches from "Vital Speeches of the Day," political speeches, TV commercials, and other types of speeches. Practical use of persuasive methods for school, community, county, state, and national…
This paper briefly reviews current silent speech methodologies for normal and disabled individuals. Current techniques utilizing electromyographic (EMG) recordings of vocal tract movements are useful for physically healthy individuals but fail for tetraplegic individuals who do not have accurate voluntary control over the speech articulators. Alternative methods utilizing EMG from other body parts (e.g., hand, arm, or facial muscles) or electroencephalography (EEG) can provide capable silent communication to severely paralyzed users, though current interfaces are extremely slow relative to normal conversation rates and require constant attention to a computer screen that provides visual feedback and/or cueing. We present a novel approach to the problem of silent speech via an intracortical microelectrode brain computer interface (BCI) to predict intended speech information directly from the activity of neurons involved in speech production. The predicted speech is synthesized and acoustically fed back to the user with a delay under 50 ms. We demonstrate that the Neurotrophic Electrode used in the BCI is capable of providing useful neural recordings for over 4 years, a necessary property for BCIs that need to remain viable over the lifespan of the user. Other design considerations include neural decoding techniques based on previous research involving BCIs for computer cursor or robotic arm control via prediction of intended movement kinematics from motor cortical signals in monkeys and humans. Initial results from a study of continuous speech production with instantaneous acoustic feedback show the BCI user was able to improve his control over an artificial speech synthesizer both within and across recording sessions. The success of this initial trial validates the potential of the intracortical microelectrode-based approach for providing a speech prosthesis that can allow much more rapid communication rates.
Brumberg, Jonathan S.; Nieto-Castanon, Alfonso; Kennedy, Philip R.; Guenther, Frank H.
The lateralization of visual speech perception was examined in 3 experiments. Participants were presented with a realistic computer-animated face articulating 1 of 4 consonant-vowel syllables without sound. The face appeared at 1 of 5 locations in the visual field. The participants' task was to identify each test syllable. To prevent eye movement during the presentation of the face, participants had to carry out a fixation task simultaneously with the speechreading task. In one study, an eccentricity effect was found along with a small but significant difference in favor of the right visual field (left hemisphere). The same results were found with the face articulating nonlinguistic mouth movements (e.g., kiss). These results suggest that the left-hemisphere advantage is based on the processing of dynamic visual information rather than on the extraction of linguistic significance from facial movements. PMID:9706714
Smeele, P M; Massaro, D W; Cohen, M M; Sittig, A C
Audio is an information-rich component of multimedia. Information can be extracted from audio in a number of different ways, and thus there are several established audio signal analysis research fields. These fields include speech recognition, speaker recognition, audio segmentation and classification, and audio fingerprinting. The information that can be extracted from tools and methods developed in these fields can greatly enhance multimedia systems. In this paper, we present the current state of research in each of the major audio analysis fields. The goal is to introduce enough background for someone new in the field to quickly gain high-level understanding and to provide direction for further study.
Actively disguised speech is one problem to be taken into account in forensic speaker verification and identification processes. The verification process is usually carried out by comparing unknown samples with known samples, and active disguising can occur in both. To simulate the condition of disguised speech, the voices of wayang golek puppeteers were used, on the assumption that a wayang golek puppeteer is a master of disguise: he can manipulate his voice into many different characters' voices. This paper discusses the speech characteristics of two puppeteers, comparing each puppeteer's habitual voice with his manipulated voices.
Hakim, Faisal Abdul; Mandasari, Miranti Indar; Sarwono, Joko
The application of algorithms for automatic speech recognition is described. Parameters to describe a speech signal were investigated, from which log area ratio parameters were chosen. Based on these parameters, a segmentation algorithm was developed. The recognition process was then discussed. A new method was developed based on a new model: the Recursive Markov Model (RMM), which is an extension of the existing Hidden Markov Model (HMM) and has the advantages of shorter computations, hierarchical modeling, and state sharing. Results of tests of the implemented algorithms are given and the relative merits of the methods of speech recognition are considered.
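The log area ratio (LAR) parameters chosen above are a standard transformation of the reflection coefficients produced by linear-prediction analysis. One common convention (the report may use a different sign or scaling) can be sketched as:

```python
import math

def log_area_ratios(reflection_coeffs):
    # LAR_i = ln((1 + k_i) / (1 - k_i)); requires |k_i| < 1 (a stable filter)
    return [math.log((1 + k) / (1 - k)) for k in reflection_coeffs]

def inverse_lar(lars):
    # Inverse mapping: k_i = tanh(LAR_i / 2)
    return [math.tanh(g / 2.0) for g in lars]

ks = [0.5, -0.3, 0.9]
lars = log_area_ratios(ks)
recovered = inverse_lar(lars)
```

LARs are popular for quantization and comparison because, unlike raw reflection coefficients, their values are not crowded near the stability boundary at ±1.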
The investigation of speech intelligibility under quiet and noisy conditions is essential to evaluate the auditory and verbal rehabilitation of cochlear implant patients. A set of audiological tests is introduced that has been validated and optimized by empirical investigations in adult cochlear implant users. The Kiel logatom test, the Freiburg speech intelligibility test and the adaptive measurement of speech perception threshold in noise with the Oldenburg sentence test are methods suggested for clinical use, which are also applicable for scientific investigations. The test battery provides results that can be interpreted by every professional involved in the rehabilitation process of cochlear implant patients. PMID:19517086
Although single-microphone noise reduction methods perform well in stationary noise environments, their performance in non-stationary conditions remains unsatisfactory. Use of prior knowledge about speech and noise power spectral densities in the form of trained codebooks has been previously shown to address this limitation. While it is possible to use trained speech codebooks in a practical system, the variety of noise types encountered in practice makes the use of trained noise codebooks less practical. This letter presents a method that uses a generic noise codebook for speech enhancement that can be generated on-the-fly and provides good performance. PMID:22894316
Purpose: The purpose of this article was to understand the experience of speech impairment (speech sound disorders) in everyday life as described by children with speech impairment and their communication partners. Method: Interviews were undertaken with 13 preschool children with speech impairment (mild to severe) and 21 significant others…
McCormack, Jane; McLeod, Sharynne; McAllister, Lindy; Harrison, Linda J.
This paper studies the reliability and validity of naturalistic speech errors as a tool for language production research. Possible biases when collecting naturalistic speech errors are identified and specific predictions derived. These patterns are then contrasted with published reports from Germanic languages (English, German and Dutch) and one Romance language (Spanish). Unlike findings in the Germanic languages, Spanish speech errors
Elvira Pérez; Julio Santiago; Alfonso Palma; Padraig G. O’Seaghdha
In this study, movement data from lips, jaw and tongue were acquired using the AG-100 EMMA system from a relatively young individual with apraxia of speech (AOS) and Broca's aphasia. Two different analyses were performed. In the first analysis, kinematic and coordination data from error-free fluent speech samples were compared to the same type of data from a group of
Pascal H. H. M. van Lieshout; Arpita Bose; Paula A. Square; Catriona M. Steele
Background: Speech perception is often considered specific to the auditory modality, despite convincing evidence that speech processing is bimodal. The theoretical and clinical roles of speech-reading for speech perception, however, have received little attention in speech-language therapy. Aims: The role of speech-read information for speech…
The present study examined the neural correlates of speech and hand gesture comprehension in a naturalistic context. Fifteen participants watched audiovisual segments of speech and gesture while event-related potentials (ERPs) were recorded to the speech. Gesture influenced the ERPs to the speech. Specifically, there was a right-lateralized N400 effect—reflecting semantic integration—when gestures mismatched versus matched the speech. In addition, early
Spencer D. Kelly; Corinne Kravitz; Michael Hopkins
Asymmetry in auditory cortical oscillations could play a role in speech perception by fostering hemispheric triage of information across the two hemispheres. Due to this asymmetry, fast speech temporal modulations relevant for phonemic analysis could be best perceived by the left auditory cortex, while slower modulations conveying vocal and paralinguistic information would be better captured by the right one. It is unclear, however, whether and how early oscillation-based selection influences speech perception. Using a dichotic listening paradigm in human participants, where we provided different parts of the speech envelope to each ear, we show that word recognition is facilitated when the temporal properties of speech match the rhythmic properties of the auditory cortices. We further show that the interaction between the speech envelope and auditory cortical rhythms translates into their level of neural activity (as measured with fMRI). In the left auditory cortex, the neural activity level related to stimulus-brain rhythm interaction predicts speech perception facilitation. These data demonstrate that speech interacts with auditory cortical rhythms differently in the right and left auditory cortex, and that in the latter, the interaction directly impacts speech perception performance. PMID:22219289
Auditory and somatosensory systems play a key role in speech motor control. In the act of speaking, segmental speech movements are programmed to reach phonemic sensory goals, which in turn are used to estimate actual sensory feedback in order to further control production. The adult's tendency to automatically imitate a number of acoustic-phonetic characteristics in another speaker's speech however suggests that speech production not only relies on the intended phonemic sensory goals and actual sensory feedback but also on the processing of external speech inputs. These online adaptive changes in speech production, or phonetic convergence effects, are thought to facilitate conversational exchange by contributing to setting a common perceptuo-motor ground between the speaker and the listener. In line with previous studies on phonetic convergence, we here demonstrate, in a non-interactive situation of communication, online unintentional and voluntary imitative changes in relevant acoustic features of acoustic vowel targets (fundamental and first formant frequencies) during speech production and imitation. In addition, perceptuo-motor recalibration processes, or after-effects, occurred not only after vowel production and imitation but also after auditory categorization of the acoustic vowel targets. Altogether, these findings demonstrate adaptive plasticity of phonemic sensory-motor goals and suggest that, apart from sensory-motor knowledge, speech production continuously draws on perceptual learning from the external speech environment.
Electroglottography (EGG) has been used for investigating the functioning of the vocal folds during vibration. EGG is related to the patterns of vocal fold vibration during phonation and characterizes the temporal patterns of each vibratory cycle. The purpose of this study was to investigate the dynamic changes in EGG waveforms during continuous speech. Aerodynamic signals, air pressure and airflow were evaluated simultaneously with EGG waveforms. Fundamental frequency (F0), open quotient (OQ) and baseline shift of the EGG during speech production were measured for three types of Korean consonants using EGG waveforms. The glottal airway resistance during speech production was measured using aerodynamic waveforms, and evaluated for the relationship with F0, OQ and baseline shift. The results indicated that the EGG waveforms seem to be significantly affected by the articulatory activities of the larynx, airflow and subglottic pressure, and may be a useful method to describe dynamic laryngeal articulatory activity during continuous speech. PMID:9311157
A physiological front-end preprocessor for speech recognition was evaluated using a large isolated word database in noisy and quiet environments. The front end was based on the ensemble interval histogram (EIH) model developed by Oded Ghitza, which provid...
The purpose of this research was to improve performance in speech recognition. Specifically, a new approach was investigating by applying an integral transform known as the Mellin transform (MT) on the output of an auditory model to improve the recognitio...
The trajectory of this project parallels, in certain ways, the changing dynamics of speech and neuroscience research. When the project began in 2007, many aspects of these fields were formulated in concepts and methods originating more than 50 years ago. ...
Three inexpensive text-to-speech synthesizers are described, intelligibility data from a pilot experiment are reported, and software is offered that has been written to facilitate the phonemic programming of the Heathkit-Votrax synthesizer.
Phillip L. Emerson; Doris C. Karnisky; Carla J. Kastanis
This report describes research carried out in three related projects investigating the function and limitations of attention in speech perception. The projects were directed at investigating the distribution of attention in time during phoneme recognition...
A key problem in computational auditory scene analysis (CASA) is monaural speech segregation, which has proven to be very challenging. For monaural mixtures, one can only utilize the intrinsic properties of speech or interference to segregate target speech from background noise. Ideal binary mask (IBM) has been proposed as a main goal of sound segregation in CASA and has led to substantial improvements of human speech intelligibility in noise. This study proposes a classification approach to estimate the IBM and employs support vector machines to classify time-frequency units as either target- or interference-dominant. A re-thresholding method is incorporated to improve classification results and maximize hit minus false alarm rates. An auditory segmentation stage is utilized to further improve estimated masks. Systematic evaluations show that the proposed approach produces high quality estimated IBMs and outperforms a recent system in terms of classification accuracy. PMID:23145627
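The study above trains SVMs to estimate the ideal binary mask; the IBM itself, which serves as the training target, has a simple definition: keep a time-frequency unit when its local SNR exceeds a local criterion. A minimal sketch (the variable names and the 0 dB criterion are illustrative, and energies are assumed positive):

```python
import math

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    # mask[t][f] = 1 where local SNR (in dB) exceeds the local criterion lc_db
    mask = []
    for t_row, n_row in zip(target_energy, noise_energy):
        row = []
        for t, n in zip(t_row, n_row):
            snr_db = 10.0 * math.log10(t / n) if n > 0 else float("inf")
            row.append(1 if snr_db > lc_db else 0)
        mask.append(row)
    return mask

# Toy 2x2 time-frequency grids of premixed target and noise energies
target = [[4.0, 0.1], [0.5, 9.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]
mask = ideal_binary_mask(target, noise)
# → [[1, 0], [0, 1]]: only units dominated by the target are kept
```

In a real CASA system the premixed energies are unavailable at test time, which is exactly why the paper treats mask estimation as a classification problem.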
A variety of automatic speech recognition experiments have been executed that support a measure of confidence for utterance classification. The confidence measure tested was the ratio of the two best…
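The record is cut off mid-sentence, but a ratio-based confidence measure of the kind described (best hypothesis score divided by the runner-up score) can be sketched as follows; the hypothesis names and the acceptance threshold are illustrative:

```python
def confidence(scores):
    # Ratio of the best to the second-best hypothesis score (scores > 0)
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] / ranked[1]

def classify(scores, threshold=1.5):
    # Accept the top hypothesis only when its margin over the runner-up is
    # wide enough; otherwise reject the utterance (e.g., for re-prompting).
    best = max(scores, key=scores.get)
    return best if confidence(scores) >= threshold else None

hyps = {"yes": 0.9, "no": 0.3, "maybe": 0.2}
decision = classify(hyps)   # confidence 0.9 / 0.3 = 3.0, so it is accepted
```

The appeal of a ratio is that it normalizes away utterance-level score offsets, so one threshold can serve across utterances of different lengths and loudness.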
... not. Air no longer passes through the vocal folds so the person cannot produce sounds easily. In ... prevent enough air from moving through the vocal folds. This makes speech quite difficult. People with a ...
An initial stage of speech production is conceptual planning, where a speaker determines which information to convey first (the linearization problem). This fMRI study investigated the linearization process during the production of…
Zheng Ye; Boukje Habets; Bernadette M. Jansma; Thomas F. Münte
This paper describes the development of the Speech Pathology Database for Best Interventions and Treatment Efficacy (speechBITE) at The University of Sydney. The speechBITE database is designed to provide better access to the intervention research relevant to speech pathology and to help clinicians interpret treatment research. The challenges speech pathologists face when locating research to support evidence-based practice have been
Katherine Smith; Patricia McCabe; Leanne Togher; Emma Power; Natalie Munro; Elizabeth Murray; Michelle Lincoln
The role of attention in speech comprehension is not well understood. We used fMRI to study the neural correlates of auditory word, pseudoword, and nonspeech (spectrally-rotated speech) perception during a bimodal (auditory, visual) selective attention task. In three conditions, Attend Auditory (ignore visual), Ignore Auditory (attend visual), and Visual (no auditory stimulation), 28 subjects performed a one-back matching task in the assigned attended modality. The visual task, attending to rapidly presented Japanese characters, was designed to be highly demanding in order to prevent attention to the simultaneously presented auditory stimuli. Regardless of stimulus type, attention to the auditory channel enhanced activation by the auditory stimuli (Attend Auditory > Ignore Auditory) in bilateral posterior superior temporal regions and left inferior frontal cortex. Across attentional conditions, there were main effects of speech processing (word + pseudoword > rotated speech) in left orbitofrontal cortex and several posterior right hemisphere regions, though these areas also showed strong interactions with attention (larger speech effects in the Attend Auditory than in the Ignore Auditory condition) and no significant speech effects in the Ignore Auditory condition. Several other regions, including the postcentral gyri, left supramarginal gyrus, and temporal lobes bilaterally, showed similar interactions due to the presence of speech effects only in the Attend Auditory condition. Main effects of lexicality (word > pseudoword) were isolated to a small region of the left lateral prefrontal cortex. Examination of this region showed significant word > pseudoword activation only in the Attend Auditory condition. Several other brain regions, including left ventromedial frontal lobe, left dorsal prefrontal cortex, and left middle temporal gyrus, showed attention × lexicality interactions due to the presence of lexical activation only in the Attend Auditory condition. 
These results support a model in which neutral speech presented in an unattended sensory channel undergoes relatively little processing beyond the early perceptual level. Specifically, processing of phonetic and lexical-semantic information appears to be very limited in such circumstances, consistent with prior behavioral studies.
Sabri, Merav; Binder, Jeffrey R.; Desai, Rutvik; Medler, David A.; Leitl, Michael D.; Liebenthal, Einat
We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using
Functional and linguistic aspects of the speech of Dutch-speaking mothers from three social classes to their 2-year-old children were studied. Mothers' speech in Dutch showed the same characteristics of simplicity and redundancy found in other languages. In a free play situation, both academic and lower middle class mothers produced more expansions and used fewer imperatives, more substantive deixis, and fewer
C. E. Snow; A. Arlman-Rupp; Y. Hassing; J. Jobse; J. Joosten; J. Vorster
Proper data selection for training a speech recognizer can be important for reducing the costs of developing systems for new tasks and exploratory experiments, but it is also useful for efficient leveraging of the increasingly large speech resources available for training large vocabulary systems. In this work, we investigate various sampling methods, comparing the likelihood criterion to new acoustic measures motivated
Until the performance of automatic speech recognition (ASR) hardware surpasses human performance in accuracy and robustness, we stand to gain by understanding the basic principles behind human speech recognition (HSR). This problem was studied exhaustively at Bell Labs between the years of 1918 and 1950 by Harvey Fletcher and his colleagues. The motivation for these studies was to quantify the
Provided by Prague Castle, the seat of the head of state of the Czech Republic, this site contains selected writings and speeches by Vaclav Havel, the playwright turned president. Havel, an unflinching critic of the communist regime in Czechoslovakia from the late 1960s into the 1980s, has been internationally recognized for his stance on the defense of freedom and human rights. These themes resonate strongly in this collection, which includes texts from speeches given around the world.
Freedom of speech is a fundamental liberty that imposes a stringent duty of tolerance. Tolerance is limited by direct incitements to violence. False notions and bad laws on speech have obscured our view of this freedom. Hence, perhaps, the self-righteous intolerance, incitements and threats in response to Giubilini and Minerva. Those who disagree have the right to argue back but their attempts to shut us up are morally wrong. PMID:23637438
Speech technology has been regarded as one of the most interesting technologies for operating in-vehicle information systems. Cameron has pointed out that people are more likely to use a speech system when at least one of four criteria is met. These four criteria are the following: (1) they are offered no choice; (2) it corresponds to the privacy of their surroundings;
Research in mobile service robotics aims at the development of intuitive speech interfaces for human-robot interaction. We see a service robot as part of an intelligent environment and want to take a step forward by discussing a concept where a robot does not only offer its own features via natural speech interaction but also becomes a transactive agent featuring other services' interfaces. The
Jan Koch; Holger Jung; Jens Wettach; Geza Nemeth; Karsten Berns
This paper presents a new text-to-speech synthesis system based on HMM which includes dynamic features, i.e., delta and delta-delta parameters of speech. The system uses triphone HMMs as the synthesis units. The triphone HMMs share less than 2,000 clustered states, each of which is modelled by a single Gaussian distribution. For a given text to be synthesized, a sentence HMM
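The dynamic features mentioned above (delta and delta-delta parameters) are commonly computed as time differences of the static feature vectors. A minimal sketch using simple central differences with edge replication (real HMM synthesis systems typically use a longer regression window):

```python
def add_deltas(frames):
    # Append delta (first difference) and delta-delta (second difference)
    # coefficients to each static feature vector.
    n = len(frames)

    def at(i):
        # Replicate edge frames so differences are defined at the borders
        return frames[min(max(i, 0), n - 1)]

    deltas = [[(b - a) / 2.0 for a, b in zip(at(t - 1), at(t + 1))]
              for t in range(n)]
    out = []
    for t in range(n):
        dprev = deltas[max(t - 1, 0)]
        dnext = deltas[min(t + 1, n - 1)]
        dd = [(b - a) / 2.0 for a, b in zip(dprev, dnext)]
        out.append(at(t) + deltas[t] + dd)
    return out

frames = [[0.0], [1.0], [2.0], [3.0]]   # one static coefficient per frame
feats = add_deltas(frames)              # each vector: [static, delta, delta-delta]
```

During synthesis, constraining generated static parameters to be consistent with these dynamic terms is what yields smooth parameter trajectories.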
A tutorial on signal processing in state-of-the-art speech recognition systems is presented, reviewing those techniques most commonly used. The four basic operations of signal modeling, i.e. spectral shaping, spectral analysis, parametric transformation, and statistical modeling, are discussed. Three important trends that have developed in the last five years in speech recognition are examined. First, heterogeneous parameter sets that mix absolute
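Of the four operations listed above, spectral shaping is the simplest to illustrate: most front ends apply a first-order pre-emphasis filter to boost high frequencies before spectral analysis. A minimal sketch (the 0.97 coefficient is a conventional choice, not taken from the tutorial):

```python
def pre_emphasis(x, alpha=0.97):
    # First-order high-pass shaping: y[n] = x[n] - alpha * x[n-1]
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

samples = [1.0, 1.0, 1.0, 0.5]
shaped = pre_emphasis(samples)
# A constant (DC) signal is almost entirely suppressed after the first sample
```

Spectral analysis, parametric transformation, and statistical modeling then operate on frames of this shaped signal.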
The goal of this study was to explore the ability to discriminate languages using the visual correlates of speech (i.e., speech-reading). Participants were presented with silent video clips of an actor pronouncing two sentences (in Catalan and/or Spanish) and were asked to judge whether the sentences were in the same language or in different languages. Our results established that Spanish-Catalan
Salvador Soto-Faraco; Jordi Navarra; Whitney M. Weikum; Athena Vouloumanos; Núria Sebastián-Gallés; Janet F. Werker
The performance of current speech recognition systems is far below that of humans. Neural nets offer the potential of providing massive parallelism, adaptation, and new algorithmic approaches to problems in speech recognition. Initial studies have demonstrated that multilayer networks with time delays can provide excellent discrimination between small sets of pre-segmented difficult-to-discriminate words, consonants, and vowels. Performance for these small
Sentences were reduced to an array of sixteen effectively rectangular bands (RBs) having center frequencies ranging from 0.25 to 8 kHz spaced at 1/3-octave intervals. Four arrays were employed, each having uniform subcritical bandwidths which ranged from 40 Hz to 5 Hz. The 40 Hz width array had intelligibility near ceiling, and the 5 Hz array about 1%. The finding of interest was that when the subcritical speech RBs were used to modulate RBs of noise having the same center frequency as the speech but having bandwidths increased to a critical (ERBn) bandwidth at each center frequency, these spectrally smeared arrays were considerably more intelligible in all but the 40 Hz (ceiling) condition. For example, when the 10 Hz bandwidth speech array having an intelligibility of 8% modulated the ERBn noise array, intelligibility increased to 48%. This six-fold increase occurred despite elimination of spectral fine structure and addition of stochastic fluctuation to speech envelope cues. (As anticipated, conventional vocoding with matching bandwidths of speech and noise reduced the 10-Hz-speech array intelligibility from 8% to 1%). These effects of smearing confirm findings by Bashford, Warren, and Lenz (2010) that optimal temporal processing requires stimulation of a critical bandwidth. [Supported by NIH
The increased use of lingual appliances has meant a continued evolution in the design of lingual brackets. These changes in appliance and bracket design have tended to focus on reducing bracket thickness, with the aim of making appliances more comfortable. A thinner bracket design appears to have had some positive effects on the quality of speech, as well as comfort whilst appliances are in place. However, despite these improvements, some patients do struggle with their speech during treatment, far more than others. It is important therefore, when consenting patients for lingual orthodontic treatment, to ensure that they are made aware of the potential for speech to be disturbed, particularly in the early stages of treatment. The purpose of this article is to outline some of the issues associated with speech problems and discomfort during lingual appliance treatment, so that practitioners are able to advise patients who may be considering this kind of treatment. Advice given during the consent process, including appliance selection, procedures for maintaining oral comfort and management of individual speech issues, will all help lingual patients cope with any speech problems they may experience during their treatment. PMID:24005949
Speech perception integrates auditory and visual information. This is evidenced by the McGurk illusion where seeing the talking face influences the auditory phonetic percept and by the audiovisual detection advantage where seeing the talking face influences the detectability of the acoustic speech signal. Here, we show that identification of phonetic content and detection can be dissociated as speech-specific and non-specific audiovisual integration effects. To this end, we employed synthetically modified stimuli, sine wave speech (SWS), which is an impoverished speech signal that only observers informed of its speech-like nature recognize as speech. While the McGurk illusion only occurred for informed observers, the audiovisual detection advantage occurred for naïve observers as well. This finding supports a multistage account of audiovisual integration of speech in which the many attributes of the audiovisual speech signal are integrated by separate integration processes. PMID:21188364
Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias S
Inner speech is typically characterized as either the activation of abstract linguistic representations or a detailed articulatory simulation that lacks only the production of sound. We present a study of the ‘speech errors’ that occur during the inner recitation of tongue-twister like phrases. Two forms of inner speech were tested: inner speech without articulatory movements and articulated (mouthed) inner speech. While mouthing one’s inner speech could reasonably be assumed to require more articulatory planning, prominent theories assume that such planning should not affect the experience of inner speech and consequently the errors that are ‘heard’ during its production. The errors occurring in articulated inner speech exhibited the phonemic similarity effect and lexical bias effect, two speech-error phenomena that, in overt speech, have been localized to an articulatory-feature processing level and a lexical-phonological level, respectively. In contrast, errors in unarticulated inner speech did not exhibit the phonemic similarity effect—just the lexical bias effect. The results are interpreted as support for a flexible abstraction account of inner speech. This conclusion has ramifications for the embodiment of language and speech and for the theories of speech production.
When people speak, they often insinuate their intent indirectly rather than stating it as a bald proposition. Examples include sexual come-ons, veiled threats, polite requests, and concealed bribes. We propose a three-part theory of indirect speech, based on the idea that human communication involves a mixture of cooperation and conflict. First, indirect requests allow for plausible deniability, in which a cooperative listener can accept the request, but an uncooperative one cannot react adversarially to it. This intuition is supported by a game-theoretic model that predicts the costs and benefits to a speaker of direct and indirect requests. Second, language has two functions: to convey information and to negotiate the type of relationship holding between speaker and hearer (in particular, dominance, communality, or reciprocity). The emotional costs of a mismatch in the assumed relationship type can create a need for plausible deniability and, thereby, select for indirectness even when there are no tangible costs. Third, people perceive language as a digital medium, which allows a sentence to generate common knowledge, to propagate a message with high fidelity, and to serve as a reference point in coordination games. This feature makes an indirect request qualitatively different from a direct one even when the speaker and listener can infer each other's intentions with high confidence.
Speech recognition (SR) speeds patient care processes by reducing report turnaround times. However, concerns have emerged about prolonged training and an added secretarial burden for radiologists. We assessed how much proofing radiologists with years of SR experience and radiologists new to SR must perform, and estimated how quickly the new users become as skilled as the experienced users. We studied SR log entries for 0.25 million reports from 154 radiologists and, after careful exclusions, defined a group of 11 experienced radiologists and 71 radiologists new to SR (24,833 and 122,093 reports, respectively). Data were analyzed for sound file and report lengths, character-based error rates, and words unknown to the SR's dictionary. Experienced radiologists corrected 6 characters per report; new users corrected 11. Some users presented a very unfavorable learning curve, with error rates not declining as expected. New users' reports were longer, and data for the experienced users indicate that their reports, initially equally lengthy, shortened over a period of several years. For most radiologists, only minor corrections of dictated reports were necessary. While new users adopted SR quickly, with a subset outperforming experienced users from the start, identification of users struggling with SR will facilitate troubleshooting and support. PMID:23779151
Kauppinen, Tomi A; Kaipio, Johanna; Koivikko, Mika P
Purpose: This study was designed to identify and describe between-word simplification patterns in the continuous speech of children with speech sound disorders. It was hypothesized that word combinations would reveal phonological changes that were unobserved with single words, possibly accounting for discrepancies between the intelligibility of…
We describe and experimentally evaluate an efficient method for automatically determining small clause boundaries in spontaneous speech. Our method applies an artificial neural network to information about part of speech and trigger words. We find that with a limited amount of data (less than 2500 words for the training set), a small sliding context window (+/-3 to-
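As a sketch of the featurization such a method relies on, the snippet below builds a classifier input from the part-of-speech tags in a +/-3 context window plus a trigger-word flag. The tag set, window size, and trigger list are illustrative assumptions, not the paper's actual configuration.

```python
# Sliding-window featurization for clause-boundary classification:
# for each token position, collect the POS tags of a +/-3 window plus a
# flag marking known "trigger" words (e.g. conjunctions).  All names and
# values here are illustrative.

PAD = "<PAD>"

def window_features(tokens, pos_tags, i, triggers, half_width=3):
    """Feature vector for position i: POS tags of the surrounding
    window plus a trigger-word indicator for the center token."""
    feats = []
    for j in range(i - half_width, i + half_width + 1):
        feats.append(pos_tags[j] if 0 <= j < len(pos_tags) else PAD)
    feats.append(int(tokens[i].lower() in triggers))
    return feats

tokens = ["I", "think", "that", "she", "left"]
tags = ["PRP", "VBP", "IN", "PRP", "VBD"]
features = window_features(tokens, tags, 2, triggers={"that", "because"})
```

A one-hot encoding of such vectors would then feed the neural network's input layer.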
This paper presents a time-frequency estimator for enhancement of noisy speech in the DFT domain. The time-varying trajectories of the DFT of speech and noise in each channel are modelled by low order autoregressive processes incorporated in the state equation of Kalman filters. The parameters of the Kalman filters are estimated recursively from the signal and noise in
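The Kalman recursion at the heart of such an estimator can be sketched for a single channel. Here the state equation is a toy AR(1) model x_t = a·x_{t-1} + w_t with hand-picked coefficients a, q, r, standing in for the paper's recursively estimated low-order autoregressive processes.

```python
# Minimal scalar Kalman filter tracking one DFT-channel trajectory,
# assuming an AR(1) state equation and additive observation noise.
# Coefficients are illustrative, not estimated as in the paper.

def kalman_track(ys, a=0.95, q=0.1, r=1.0):
    x, p = 0.0, 1.0  # state estimate and its variance
    estimates = []
    for y in ys:
        # predict under the AR(1) state equation
        x_pred = a * x
        p_pred = a * a * p + q
        # correct with the noisy observation
        k = p_pred / (p_pred + r)  # Kalman gain
        x = x_pred + k * (y - x_pred)
        p = (1 - k) * p_pred
        estimates.append(x)
    return estimates, p

obs = [1.0, 1.2, 0.9, 1.1, 1.0, 0.95]
est, var = kalman_track(obs)
```

A full enhancer would run one such filter per DFT bin for speech and one for noise, combining their estimates into a gain applied to the noisy spectrum.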
This study presents survey data on 58 Dutch-speaking patients with neurogenic stuttering following various neurological injuries. Stroke was the most prevalent cause of stuttering in our patients, followed by traumatic brain injury, neurodegenerative diseases, and other causes. Speech and non-speech characteristics were analyzed separately for…
Theys, Catherine; van Wieringen, Astrid; De Nil, Luc F.
An examination of the impact of the situational salience of gender on males' and females' speech styles is identified as a lacuna in the sex and language literature. An experiment was conducted in which subjects' ratings of speakers for whom sex was more or less salient were employed to monitor real speech differences. Tape-recorded extracts of the spontaneous discourse of
For the past two decades, research in speech recognition has been intensively carried out worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small vocabulary keyword recognition over dial-up telephone lines, to medium size vocabulary voice interactive command and control systems on personal
This article describes a neural network model that addresses the acquisition of speaking skills by infants and subsequent motor equivalent production of speech sounds. The model learns two mappings during a babbling phase. A phonetic-to-orosensory mapping specifies a vocal tract target for each speech sound; these targets take the form of convex regions in orosensory coordinates defining the shape of
We investigated whether audiovisual synchrony perception for speech could change after observation of the audiovisual temporal mismatch. Previous studies have revealed that audiovisual synchrony perception is re-calibrated after exposure to a constant timing difference between auditory and visual signals in non-speech. In the present study, we examined whether this audiovisual temporal recalibration occurs at the perceptual level even for speech (monosyllables). In Experiment 1, participants performed an audiovisual simultaneity judgment task (i.e., a direct measurement of the audiovisual synchrony perception) in terms of the speech signal after observation of the speech stimuli which had a constant audiovisual lag. The results showed that the “simultaneous” responses (i.e., proportion of responses for which participants judged the auditory and visual stimuli to be synchronous) at least partly depended on exposure lag. In Experiment 2, we adopted the McGurk identification task (i.e., an indirect measurement of the audiovisual synchrony perception) to exclude the possibility that this modulation of synchrony perception was solely attributable to the response strategy using stimuli identical to those of Experiment 1. The characteristics of the McGurk effect reported by participants depended on exposure lag. Thus, it was shown that audiovisual synchrony perception for speech could be modulated following exposure to constant lag both in direct and indirect measurement. Our results suggest that temporal recalibration occurs not only in non-speech signals but also in monosyllabic speech at the perceptual level.
A study compared intelligibility and speech rate differences following speaker implementation of 3 strategies (topic, alphabet, and combined topic and alphabet supplementation) and a habitual speech control condition for 5 speakers with severe dysarthria. Combined cues and alphabet cues yielded significantly higher intelligibility scores and…
A study of cross examination speeches of males and females was conducted to determine gender differences in intercollegiate debate. The theory base for gender differences in speech is closely tied to the analysis of dyadic conversation. It is based on the belief that women are less forceful and dominant in cross examination, and will exhibit…
The ultimate test of the speech-action dichotomy, as it relates to symbolic speech to be considered by the courts, may be the fasting of prison inmates who use hunger strikes to protest the conditions of their confinement or to make political statements. While hunger strikes have been utilized by prisoners for years as a means of protest, it was…
Vouloumanos and Werker (2007) claim that human neonates have a (possibly innate) bias to listen to speech based on a preference for natural speech utterances over sine-wave analogues. We argue that this bias more likely arises from the strikingly different saliency of voice melody in the two kinds of sounds, a bias that has already been shown to…
One approach to the generation of natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text which is annotated with prosodic and phonetic context information. We propose that the units in a synthesis database
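Unit selection of this kind is usually cast as a shortest-path search over candidate units, minimizing a target cost (mismatch with the predicted phoneme context) plus a join cost (discontinuity between adjacent units). A minimal dynamic-programming sketch, with made-up unit names and costs:

```python
# Toy unit-selection search: pick one candidate unit per target phoneme,
# minimizing summed target cost plus join cost via dynamic programming.
# A real system derives these costs from prosodic/acoustic features.

def select_units(candidates, target_cost, join_cost):
    """candidates: list (one entry per target phoneme) of candidate unit ids.
    Returns the minimum-cost unit sequence and its total cost."""
    # best[i][u] = (cumulative cost, backpointer) for unit u at position i
    best = [{u: (target_cost(0, u), None) for u in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for u in candidates[i]:
            c, prev = min(
                (best[i - 1][p][0] + join_cost(p, u) + target_cost(i, u), p)
                for p in candidates[i - 1]
            )
            layer[u] = (c, prev)
        best.append(layer)
    # trace back the cheapest path
    u, (cost, _) = min(best[-1].items(), key=lambda kv: kv[1][0])
    path = [u]
    for i in range(len(candidates) - 1, 0, -1):
        u = best[i][u][1]
        path.append(u)
    return list(reversed(path)), cost

cands = [["a1", "a2"], ["b1", "b2"]]
t_cost = {"a1": 0.0, "a2": 1.0, "b1": 1.0, "b2": 0.0}
path, cost = select_units(
    cands,
    target_cost=lambda i, u: t_cost[u],
    join_cost=lambda p, u: 0.0 if (p, u) == ("a1", "b2") else 0.5,
)
```

The same search scales to large databases because each layer only needs costs against the previous layer's candidates.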
The paper describes experiments that utilised sound, synthesised speech and stereophonic output to communicate data in information systems. A stock control information system was the experimental platform for these experiments. The first experiment utilised rhythms and timbre to communicate windows and their associated functions in the system. The second experiment simultaneously utilised speech and sound to communicate a large amount
This letter investigates the impact of stress on monophone speech recognition accuracy and proposes a new set of acoustic parameters based on high resolution wavelet analysis. The two parameter schemes are entitled wavelet packet parameters (WPP) and subband-based cepstral parameters (SBC). The performance of these features is compared to traditional Mel-frequency cepstral coefficients (MFCC) for stressed speech monophone recognition. The
This paper studies the reliability and validity of naturalistic speech errors as a tool for language production research. Possible biases when collecting naturalistic speech errors are identified and specific predictions derived. These patterns are then contrasted with published reports from Germanic languages (English, German and Dutch) and one…
Perez, Elvira; Santiago, Julio; Palma, Alfonso; O'Seaghdha, Padraig G.
The acquisition of speech perception and speech production skills emerges over a protracted time course in congenitally deaf children with multichannel cochlear implants (CI). Only through comprehensive, longitudinal studies can the full impact of cochlear implantation be assessed. In this study, the performance of CI users was examined longitudinally on a battery of speech perception measures and compared with subjects with profound hearing loss who used conventional hearing aids (HA). The average performance of the multichannel cochlear implant users gradually increased over time and continued to improve even after 5 years of CI use. Speech intelligibility was assessed by playing recordings of the subjects' elicited speech to panels of listeners. Intelligibility was scored in terms of percentage of words correctly understood. The average scores for subjects who had used their CI for 4 years or more exceeded 40%. PMID:8725523
Miyamoto, R T; Kirk, K I; Robbins, A M; Todd, S; Riley, A
This study focuses on achieving speech privacy using a meaningless steady masking noise. To find the most effective index for achieving a satisfactory level of speech privacy, spectral distance and the articulation index were compared; spectral distance proved the better and more practical of the two. Next, speech was presented together with masking noise at sound pressure levels corresponding to various speech privacy levels, and subjects judged the psychological impression of each level. Theoretical calculations were in good agreement with the experimental results.
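A band-wise spectral distance of the kind compared in this study can be sketched as the mean absolute level difference, in dB, between the speech and masker spectra; the octave-band levels below are illustrative values, not measured data.

```python
# Mean absolute band-level difference (dB) between a speech spectrum and
# a masking-noise spectrum, as a simple speech-privacy index.

def spectral_distance(speech_db, masker_db):
    assert len(speech_db) == len(masker_db)
    return sum(abs(s - m) for s, m in zip(speech_db, masker_db)) / len(speech_db)

speech = [62.0, 60.0, 55.0, 48.0]  # dB per octave band (illustrative)
masker = [60.0, 58.0, 54.0, 50.0]
d = spectral_distance(speech, masker)
```

A smaller distance means the masker spectrum sits closer to the speech spectrum, which generally masks the speech more effectively.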
The paper is a review of ancient Sanskrit literature for information on the origin and development of speech and language, speech production, normality of speech and language, and disorders of speech and language and their treatment. (DB)
Three- to five-year-old children produce speech that is characterized by a high level of variability within and across individuals. This variability, which is manifest in speech movements, acoustics, and overt behaviors, can be input to subgroup discovery methods to identify cohesive subgroups of speakers or to reveal distinct developmental pathways or profiles. This investigation characterized three distinct groups of typically developing children and provided normative benchmarks for speech development. These speech development profiles, identified among 63 typically developing preschool-aged speakers (ages 36–59 mo), were derived from the children's performance on multiple measures. The profiles were obtained by submitting to a k-means cluster analysis 72 measures spanning three levels of speech analysis: behavioral (e.g., task accuracy, percentage of consonants correct), acoustic (e.g., syllable duration, syllable stress), and kinematic (e.g., variability of movements of the upper lip, lower lip, and jaw). Two of the discovered group profiles were distinguished by measures of variability but not by phonemic accuracy; the third group of children was characterized by their relatively low phonemic accuracy but not by an increase in measures of variability. Analyses revealed that of the original 72 measures, 8 key measures were sufficient to distinguish the 3 profile groups.
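The profile-discovery step can be sketched with a minimal k-means: each child is a vector of speech measures (here 2-D toy data instead of the study's 72 measures), and the algorithm partitions the speakers into k profile groups. Fixed initial centroids keep the toy run deterministic.

```python
# Minimal k-means: alternate between assigning points to their nearest
# centroid and moving each centroid to the mean of its assigned points.

def kmeans(points, centroids, iters=10):
    groups = [[] for _ in centroids]
    for _ in range(iters):
        # assign each point to its nearest centroid (squared distance)
        groups = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[d.index(min(d))].append(p)
        # move each centroid to the mean of its assigned points
        centroids = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

# toy "measures": (movement variability, proportion of consonants correct)
data = [(0.1, 0.9), (0.2, 0.8), (0.9, 0.3), (0.8, 0.2)]
cents, groups = kmeans(data, centroids=[(0.0, 1.0), (1.0, 0.0)])
```

The study's analysis would additionally standardize the 72 measures and select k by cluster-validity criteria; those steps are omitted here.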
Campbell, Thomas F.; Shriberg, Lawrence D.; Green, Jordan R.; Abdi, Herve; Rusiewicz, Heather Leavy; Venkatesh, Lakshmi; Moore, Christopher A.
To achieve robustness and efficiency for voice communication in noise, the noise suppression and bandwidth compression processes are combined to form a joint process using input from an array of microphones. An adaptive beamforming technique with a set of robust linear constraints and a single quadratic inequality constraint is used to preserve the desired signal and to cancel directional plus ambient noise in a small room environment. This robustly constrained array processor is found to be effective in limiting signal cancellation over a wide range of input SNRs (-10 dB to +10 dB). The resulting intelligibility gains (8-10 dB) provide significant improvement to subsequent CELP coding. In addition, the desired speech activity is detected by estimating Target-to-Jammer Ratios (TJR) using subband correlations between different microphone inputs or using signals within the Generalized Sidelobe Canceler directly. These two novel techniques of speech activity detection for coding are studied thoroughly in this dissertation. Each is subsequently incorporated with the adaptive array and a 4.8 kbps CELP coder to form a Variable Bit Rate (VBR) coder with noise canceling and Spatial Voice Activity Detection (SVAD) capabilities. This joint noise suppression and bandwidth compression system demonstrates large improvements in desired speech quality after coding, accurate desired speech activity detection in various types of interference, and a reduction in the information bits required to code the speech.
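The correlation-based activity cue can be sketched as follows: a directional target source appears coherently on both microphones, so a high normalized cross-correlation between channels suggests desired speech activity, while diffuse or independent noise does not. The signals and the 0.8 threshold below are illustrative, not the dissertation's actual subband procedure.

```python
# Normalized cross-correlation between two microphone channels as a
# simple coherence-based speech-activity cue.

import math

def channel_correlation(x, y):
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

# coherent "target" segment vs. an incoherent "noise-only" segment
mic1_speech = [0.0, 0.8, -0.6, 0.9, -0.7, 0.5]
mic2_speech = [0.1, 0.7, -0.5, 0.8, -0.6, 0.4]  # nearly the same waveform
mic1_noise = [0.3, -0.2, 0.5, -0.4, 0.1, -0.3]
mic2_noise = [-0.4, 0.5, -0.1, 0.2, -0.5, 0.3]

speech_active = channel_correlation(mic1_speech, mic2_speech) > 0.8
```

A subband version would apply the same statistic per frequency band before combining into a TJR estimate.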
This study aimed to determine the effects of speech and mastication on interhemispheric inhibition between the right and left primary motor areas (M1s) by using transcranial magnetic stimulation (TMS). Motor-evoked potentials (MEPs) were recorded from the first dorsal interossei (FDIs) of each hand of 10 healthy right-handed subjects under 3 conditions: at rest (control), during mastication (non-verbal oral movement), and during speech (reading aloud). Test TMS was delivered following conditioning TMS of the contralateral M1 at various interstimulus intervals. Under all conditions, the MEPs in the left FDIs were significantly inhibited after conditioning of the left M1 (i.e. inhibition of the right M1 by TMS of the left hemisphere). In contrast, the left M1 was significantly inhibited by the right hemisphere only during the control and mastication tasks, but not during the speech task. These results suggest that speech may facilitate the activity of the dominant M1 via functional connectivity between the speech area and the left M1, or may modify the balance of interhemispheric interactions, by suppressing inhibition of the dominant hemisphere by the non-dominant hemisphere. Our findings show a novel aspect of interhemispheric dominance and may improve therapeutic strategies for recovery from stroke. PMID:23123786
Disorders of music and speech perception, known as amusia and aphasia, have traditionally been regarded as dissociated deficits based on studies of brain damaged patients. This has been taken as evidence that music and speech are perceived by largely separate and independent networks in the brain. However, recent studies of congenital amusia have broadened this view by showing that the deficit is associated with problems in perceiving speech prosody, especially intonation and emotional prosody. In the present study the association between the perception of music and speech prosody was investigated with healthy Finnish adults (n = 61) using an on-line music perception test including the Scale subtest of Montreal Battery of Evaluation of Amusia (MBEA) and Off-Beat and Out-of-key tasks as well as a prosodic verbal task that measures the perception of word stress. Regression analyses showed that there was a clear association between prosody perception and music perception, especially in the domain of rhythm perception. This association was evident after controlling for music education, age, pitch perception, visuospatial perception, and working memory. Pitch perception was significantly associated with music perception but not with prosody perception. The association between music perception and visuospatial perception (measured using analogous tasks) was less clear. Overall, the pattern of results indicates that there is a robust link between music and speech perception and that this link can be mediated by rhythmic cues (time and stress). PMID:24032022
Spoken language exists because of a remarkable neural process. Inside a speaker's brain, an intended message gives rise to neural signals activating the muscles of the vocal tract. The process is remarkable because these muscles are activated in just the right way that the vocal tract produces sounds a listener understands as the intended message. What is the best approach to understanding the neural substrate of this crucial motor control process? One of the key recent modeling developments in neuroscience has been the use of state feedback control (SFC) theory to explain the role of the CNS in motor control. SFC postulates that the CNS controls motor output by (1) estimating the current dynamic state of the thing (e.g., arm) being controlled, and (2) generating controls based on this estimated state. SFC has successfully predicted a great range of non-speech motor phenomena, but as yet has not received attention in the speech motor control community. Here, we review some of the key characteristics of speech motor control and what they say about the role of the CNS in the process. We then discuss prior efforts to model the role of CNS in speech motor control, and argue that these models have inherent limitations – limitations that are overcome by an SFC model of speech motor control which we describe. We conclude by discussing a plausible neural substrate of our model.
Despite significant research and important clinical correlates, direct neural evidence for a phonological loop linking speech perception, short-term memory and production remains elusive. To investigate these processes, we acquired whole-head magnetoencephalographic (MEG) recordings from human subjects performing a variable-length syllable sequence reproduction task. The MEG sensor data was source-localized using a time-frequency spatially adaptive filter, and we examined the time-courses of cortical oscillatory power and the correlations of oscillatory power with behavior, between onset of the audio stimulus and the overt speech response. We found dissociations between time-courses of behaviorally relevant activations in a network of regions falling largely within the dorsal speech stream. In particular, verbal working memory load modulated high gamma power (HGP) in both Sylvian-Parietal-Temporal (Spt) and Broca’s Areas. The time-courses of the correlations between HGP and subject performance clearly alternated between these two regions throughout the task. Our results provide the first evidence of a reverberating input-output buffer system in the dorsal stream underlying speech sensorimotor integration, consistent with recent phonological loop, competitive queuing and speech-motor control models. These findings also shed new light on potential sources of speech dysfunction in aphasia and neuropsychiatric disorders, identifying anatomically and behaviorally dissociable activation time-windows critical for successful speech reproduction.
Herman, Alexander B.; Houde, John F.; Vinogradov, Sophia; Nagarajan, Srikantan
In order to acquire their native languages, children must learn richly structured systems with regularities at multiple levels. While structure at different levels could be learned serially, e.g., speech segmentation coming before word-object mapping, redundancies across levels make parallel learning more efficient. For instance, a series of syllables is likely to be a word not only because of high transitional probabilities, but also because of a consistently co-occurring object. But additional statistics require additional processing, and thus might not be useful to cognitively constrained learners. We show that the structure of child-directed speech makes simultaneous speech segmentation and word learning tractable for human learners. First, a corpus of child-directed speech was recorded from parents and children engaged in a naturalistic free-play task. Analyses revealed two consistent regularities in the sentence structure of naming events. These regularities were subsequently encoded in an artificial language to which adult participants were exposed in the context of simultaneous statistical speech segmentation and word learning. Either regularity was independently sufficient to support successful learning, but no learning occurred in the absence of both regularities. Thus, the structure of child-directed speech plays an important role in scaffolding speech segmentation and word learning in parallel.
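The transitional-probability cue used in statistical speech segmentation can be sketched directly: TP(A→B) = count(AB) / count(A) over a syllable stream, and dips in TP suggest word boundaries. The toy "language" below has two words, "bada" and "kupo", so within-word TPs are high and across-word TPs are lower; it stands in for the study's child-directed corpus.

```python
# Transitional probabilities over a syllable stream:
# TP(A -> B) = count of pair AB / count of A (as a pair-initial syllable).

from collections import Counter

def transitional_probs(syllables):
    pair_counts = Counter(zip(syllables, syllables[1:]))
    syll_counts = Counter(syllables[:-1])
    return {pair: n / syll_counts[pair[0]] for pair, n in pair_counts.items()}

# toy stream built from the "words" bada and kupo
stream = ["ba", "da", "ku", "po", "ku", "po",
          "ba", "da", "ku", "po", "ba", "da"]
tps = transitional_probs(stream)
```

In the study's design, a consistently co-occurring object provides a second, redundant cue on top of these statistics.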
Despite significant research and important clinical correlates, direct neural evidence for a phonological loop linking speech perception, short-term memory and production remains elusive. To investigate these processes, we acquired whole-head magnetoencephalographic (MEG) recordings from human subjects performing a variable-length syllable sequence reproduction task. The MEG sensor data were source localized using a time-frequency optimized spatially adaptive filter, and we examined the time courses of cortical oscillatory power and the correlations of oscillatory power with behavior between onset of the audio stimulus and the overt speech response. We found dissociations between time courses of behaviorally relevant activations in a network of regions falling primarily within the dorsal speech stream. In particular, verbal working memory load modulated high gamma power in both Sylvian-parietal-temporal and Broca's areas. The time courses of the correlations between high gamma power and subject performance clearly alternated between these two regions throughout the task. Our results provide the first evidence of a reverberating input-output buffer system in the dorsal stream underlying speech sensorimotor integration, consistent with recent phonological loop, competitive queuing, and speech-motor control models. These findings also shed new light on potential sources of speech dysfunction in aphasia and neuropsychiatric disorders, identifying anatomically and behaviorally dissociable activation time windows critical for successful speech reproduction. PMID:23536060
Herman, Alexander B; Houde, John F; Vinogradov, Sophia; Nagarajan, Srikantan S
Observing a speaker’s mouth profoundly influences speech perception. For example, listeners perceive an “illusory” “ta” when the video of a face producing /ka/ is dubbed onto an audio /pa/. Here, we show how cortical areas supporting speech production mediate this illusory percept and audiovisual (AV) speech perception more generally. Specifically, cortical activity during AV speech perception occurs in many of the same areas that are active during speech production. We find that different perceptions of the same syllable and the perception of different syllables are associated with different distributions of activity in frontal motor areas involved in speech production. Activity patterns in these frontal motor areas resulting from the illusory “ta” percept are more similar to the activity patterns evoked by AV/ta/ than they are to patterns evoked by AV/pa/ or AV/ka/. In contrast to the activity in frontal motor areas, stimulus-evoked activity for the illusory “ta” in auditory and somatosensory areas and visual areas initially resembles activity evoked by AV/pa/ and AV/ka/, respectively. Ultimately, though, activity in these regions comes to resemble activity evoked by AV/ta/. Together, these results suggest that AV speech elicits in the listener a motor plan for the production of the phoneme that the speaker might have been attempting to produce, and that feedback in the form of efference copy from the motor system ultimately influences the phonetic interpretation.
Skipper, Jeremy I.; van Wassenhove, Virginie; Nusbaum, Howard C.; Small, Steven L.
A set of computer programs was tested that can noticeably improve speech intelligibility for the hearing impaired. The processing first emphasizes speech features by removing noise and pitch irregularities from vowels and by adaptively enhancing the chara...
Treatments for stuttering based on variants of Goldiamond's prolonged-speech procedure involve teaching clients to speak with novel speech patterns. Those speech patterns consist of specific skills, described with such terms as soft contacts, gentle onsets, and continuous vocalization. It might be expected that effective client learning of such speech skills would be dependent on clinicians' ability to reliably identify any departures from the correct production of such speech targets. The present study investigated clinicians' reliability in detecting such errors during a prolonged-speech treatment program. Results showed questionable intraclinician agreement and poor interclinician agreement. Nonetheless, the prolonged-speech program in question is known to be effective in controlling stuttered speech. The clinical and theoretical implications of these findings are discussed. PMID:9771621
This paper presents text-independent speaker identification results for varying speaker population sizes up to 630 speakers for both clean, wideband speech, and telephone speech. A system based on Gaussian mixture speaker models is used for speaker identi...
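Gaussian-mixture speaker identification can be sketched as follows: each speaker is modeled by a GMM over feature frames, and a test utterance is assigned to the speaker whose model gives the highest total log-likelihood. The 1-D, two-component models below are illustrative stand-ins for real cepstral-feature GMMs.

```python
# GMM scoring for speaker identification: sum per-frame log-likelihoods
# under each speaker's mixture model and pick the best-scoring speaker.

import math

def gmm_loglik(frames, weights, means, variances):
    total = 0.0
    for x in frames:
        p = sum(
            w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        )
        total += math.log(p)
    return total

# two hypothetical speaker models (1-D toy features)
speakers = {
    "spk_a": dict(weights=[0.5, 0.5], means=[-1.0, 1.0], variances=[0.5, 0.5]),
    "spk_b": dict(weights=[0.5, 0.5], means=[4.0, 6.0], variances=[0.5, 0.5]),
}
test_frames = [-0.8, 1.1, 0.9, -1.2]

best = max(speakers, key=lambda s: gmm_loglik(test_frames, **speakers[s]))
```

Real systems train the mixtures with EM on MFCC-like frames; only the scoring step is shown here.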
The author describes the technology innovations necessary to accommodate the market need which is the driving force toward greater perceived computer intelligence. The author discusses aspects of both speech synthesis and speech recognition.
Our long-term research goal is the development and implementation of speaker-independent continuous speech recognition systems. It is our conviction that the proper utilization of speech-specific knowledge is essential for such advanced systems. Our resea...
The study describes the demographic and professional characteristics of the speech pathology/audiology work force and those students currently in training; identifies current patterns of manpower utilization in speech pathology and audiology; estimates th...
Suggests criteria which an ideal first course in speech might meet: (1) focus on the elements typical in all speech communication behavior, (2) delete formal individual oral performance, and (3) utilize the empirical data produced by the behavioral scientists. (Author)
We investigated whether the interpretation of auditory stimuli as speech or non-speech affects audiovisual (AV) speech integration at the neural level. Perceptually ambiguous sine-wave replicas (SWS) of natural speech were presented to listeners who were either in "speech mode" or "non-speech mode". At the behavioral level, incongruent lipread…
The discovery of mirror neurons, a class of neurons that respond when a monkey performs an action and also when the monkey observes others producing the same action, has promoted a renaissance for the Motor Theory (MT) of speech perception. This is because mirror neurons seem to accomplish the same kind of one-to-one mapping between perception and action that MT theorizes to be the basis of human speech communication. However, this seeming correspondence is superficial, and there are theoretical and empirical reasons to temper enthusiasm about the explanatory role mirror neurons might have for speech perception. In fact, rather than providing support for MT, mirror neurons are actually inconsistent with the central tenets of MT.
Lotto, Andrew J.; Hickok, Gregory S.; Holt, Lori L.
Electroencephalogram was recorded as healthy adults viewed short videos of spontaneous discourse in which a speaker used depictive gestures to complement information expressed through speech. Event-related potentials were computed time-locked to content words in the speech stream and to subsequent related and unrelated picture probes. Gestures modulated event-related potentials to content words co-timed with the first gesture in a discourse segment, relative to the same words presented with static freeze frames of the speaker. Effects were observed 200–550 ms after speech onset, a time interval associated with semantic processing. Gestures also increased sensitivity to picture probe relatedness. Effects of gestures on picture probe and spoken word analysis were inversely correlated, suggesting that gestures differentially impact verbal and image-based processes.
To determine the degree to which emotional changes in speech reflect factors other than arousal, such as valence, the authors used a computer game to induce natural emotional speech. Voice samples were elicited following game events that were either conducive or obstructive to the goal of winning and were accompanied by either pleasant or unpleasant sounds. Acoustic analysis of the speech recordings of 30 adolescents revealed that mean energy, fundamental-frequency level, utterance duration, and the proportion of an utterance that was voiced varied with goal conduciveness; spectral energy distribution depended on manipulations of pleasantness; and pitch dynamics depended on the interaction of pleasantness and goal conduciveness. The results suggest that a single arousal dimension does not adequately characterize a number of emotion-related vocal changes, lending weight to multidimensional theories of emotional response patterning. PMID:16366756
Johnstone, Tom; van Reekum, Carien M; Hird, Kathryn; Kirsner, Kim; Scherer, Klaus R
The American Speech-Language-Hearing-Association (ASHA) is the national professional, scientific, and credentialing association for more than 166,000 members in fields like audiology and speech-language pathology. New users might want to slide on over to the Information For area. Here they will find thematic sections for audiologists, students, academic programs, and the general public. Also on the homepage are six areas of note, including Publications, Events, Advocacy, and Continuing Education. In the Publications area, visitors can look over best-practice documents, listen to a podcast series, and also learn more about ASHA's academic journals, which include the American Journal of Audiology and the Journal of Speech, Language, and Hearing Research. [KMG
Cochlear implant systems are used in diverse environments and should function during work, exercise and play as people go about their daily lives. This is a demanding requirement, with exposure to liquid and other contaminant ingress from many sources. For reliability, it is desirable that the speech processor withstands these exposures. This design challenge has been addressed in the Nucleus(R) Freedom(TM) speech processor. The Nucleus Freedom speech processor complies with International Standard IEC 60529, as independently certified. Tests include spraying the processor with water followed by immediate verification of functionality including microphone response, radio frequency link and processor controls. The processor has met level IP44 of the Standard. PMID:18792383
Gibson, Peter; Capcelea, Edmond; Darley, Ian; Leavens, Jason; Parker, John
The variations in speech production due to stress have an adverse effect on the performance of speech and speaker recognition algorithms. In this work, different speech features, such as Sinusoidal Frequency Features (SFF), Sinusoidal Amplitude Features (SAF), Cepstral Coefficients (CC) and Mel Frequency Cepstral Coefficients (MFCC), are evaluated to find out their relative effectiveness in representing stressed speech. Different ...
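The MFCC features named in the record above are a standard short-term spectral representation. The following is a minimal numpy-only sketch of the usual pipeline (framing, power spectrum, mel filterbank, log, DCT); all parameter values here are common illustrative defaults, not the settings used in the study.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Toy MFCC: frame -> power spectrum -> mel filterbank -> log -> DCT-II."""
    # Frame the signal with a Hamming window.
    frames = np.array([signal[i:i + n_fft] * np.hamming(n_fft)
                       for i in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank spanning 0 .. sr/2.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)

    # DCT-II decorrelates the log-mel energies; keep the first n_ceps.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T

# Example: a 1-second synthetic tone stands in for a real utterance.
sr = 16000
t = np.arange(sr) / sr
coefs = mfcc(np.sin(2 * np.pi * 220 * t))
```

With these defaults the 16000-sample input yields a (97, 13) coefficient matrix (97 frames of 13 cepstral coefficients each).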
An equalizer to enhance the quality of reconstructed speech from an analysis-by-synthesis speech coder, e.g., a CELP coder, is described. The equalizer makes use of the set of short-term predictor parameters normally transmitted from the speech encoder to the decoder. In addition, the equalizer computes a matching set of parameters from the reconstructed speech. The function of the equalizer is ...
We investigate methods of improving the intelligibility of synthetic speech under noisy or low-fidelity acoustic conditions. The techniques explored improve speech in a natural manner, such that no training is required for the user to understand the enhanced speech. While the improvements are natural in this respect, the changes are not limited to creating only speech that is achievable by a human ...
Speech and gesture provide two different access routes to a learner's mental representation of a problem. We examined the gestures and speech produced by children learning the concept of mathematical equivalence, and found that children on the verge of acquiring the concept tended to express information in gesture which they did not express in speech. We explored what the production of such gesture-speech mismatches implies for ...
We describe an approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system. Since this speech synthesis system uses phoneme HMMs as speech units, voice characteristics conversion is achieved by changing the HMM parameters appropriately. To transform the voice characteristics of synthesized speech to the target speaker, we applied the maximum a posteriori estimation and vector field smoothing (MAP/VFS) ...
Takashi Masuko; Keiichi Tokuda; Takao Kobayashi; S. Imai
Recent attempts to regulate Crisis Pregnancy Centers, pseudoclinics that surreptitiously aim to dissuade pregnant women from choosing abortion, have confronted the thorny problem of how to define commercial speech. The Supreme Court has offered three potential answers to this definitional quandary. This Note uses the Crisis Pregnancy Center cases to demonstrate that courts should use one of these solutions, the factor-based approach of Bolger v. Youngs Drug Products Corp., to define commercial speech in the Crisis Pregnancy Center cases and elsewhere. In principle and in application, the Bolger factor-based approach succeeds in structuring commercial speech analysis at the margins of the doctrine. PMID:23461000
Phoneme recognition strongly depends on the intrinsic duration of speech segments, the durations of phoneme spectral changes, and the relative timing of two overlapping events. Excerpts of fluent speech are not very well perceived below a given duration threshold, and ...
The classification and separation of speech and music signals have attracted the attention of many researchers. The classification process is needed to build two different libraries, a speech library and a music library, from a stream of sounds. The separation process, however, is needed in a cocktail-party problem to separate speech from music and remove the undesired one. In this ...
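One classic cue for the speech/music decision described above is frame-level zero-crossing-rate (ZCR) variability: speech alternates voiced and unvoiced segments, so its ZCR fluctuates far more than a sustained musical tone's. The sketch below illustrates only that single heuristic; the threshold value is an illustrative assumption, not a figure from the record.

```python
import numpy as np

def zcr(frame):
    """Fraction of adjacent-sample sign changes in one frame."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def classify(signal, frame_len=400, hop=160, zcr_std_thresh=0.05):
    """Toy speech/music decision from the standard deviation of frame ZCRs.
    The 0.05 threshold is a made-up illustration, not a tuned value."""
    rates = [zcr(signal[i:i + frame_len])
             for i in range(0, len(signal) - frame_len + 1, hop)]
    return "speech" if np.std(rates) > zcr_std_thresh else "music"

# Synthetic stand-ins: a steady 440 Hz tone for music, and alternating
# noise / low-tone segments mimicking unvoiced-voiced speech structure.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
music_like = np.sin(2 * np.pi * 440 * t)
speech_like = np.concatenate([seg for _ in range(5) for seg in
                              (rng.standard_normal(1600),
                               np.sin(2 * np.pi * 100 * t[:1600]))])
```

On these stand-ins the tone's ZCR is nearly constant per frame while the alternating signal swings between roughly 0.5 (noise) and 0.01 (low tone), so the variance test separates them. Real systems combine many such features rather than relying on one.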
The two principal areas of natural language processing research in pragmatics are belief modelling and speech act processing. Belief modelling is the development of techniques to represent the mental attitudes of a dialogue participant. The latter approach, speech act processing, is based on speech act theory and involves viewing dialogue in planning terms. Utterances in a dialogue are modelled as steps in ...