audio machine text-to-speech: Topics by Science.gov

Sample records for audio machine text-to-speech

Using Text-to-Speech (TTS) for Audio Computer-Assisted Self-Interviewing (ACASI)

ERIC Educational Resources Information Center

Couper, Mick P.; Berglund, Patricia; Kirgis, Nicole; Buageila, Sarrah

2016-01-01

We evaluate the use of text-to-speech (TTS) technology for audio computer-assisted self-interviewing (ACASI). We use a quasi-experimental design, comparing the use of recorded human voice in the 2006-2010 National Survey of Family Growth with the use of TTS in the first year of the 2011-2013 survey, where the essential survey conditions are…
Robot Command Interface Using an Audio-Visual Speech Recognition System

NASA Astrophysics Data System (ADS)

Ceballos, Alexánder; Gómez, Juan; Prieto, Flavio; Redarce, Tanneguy

In recent years audio-visual speech recognition has emerged as an active field of research thanks to advances in pattern recognition, signal processing and machine vision. Its ultimate goal is to allow human-computer communication using voice, taking into account the visual information contained in the audio-visual speech signal. This document presents a command's automatic recognition system using audio-visual information. The system is expected to control the laparoscopic robot da Vinci. The audio signal is treated using the Mel Frequency Cepstral Coefficients parametrization method. Besides, features based on the points that define the mouth's outer contour according to the MPEG-4 standard are used in order to extract the visual speech information.
[Intermodal timing cues for audio-visual speech recognition].

PubMed

Hashimoto, Masahiro; Kumashiro, Masaharu

2004-06-01

The purpose of this study was to investigate the limitations of lip-reading advantages for Japanese young adults by desynchronizing visual and auditory information in speech. In the experiment, audio-visual speech stimuli were presented under the six test conditions: audio-alone, and audio-visually with either 0, 60, 120, 240 or 480 ms of audio delay. The stimuli were the video recordings of a face of a female Japanese speaking long and short Japanese sentences. The intelligibility of the audio-visual stimuli was measured as a function of audio delays in sixteen untrained young subjects. Speech intelligibility under the audio-delay condition of less than 120 ms was significantly better than that under the audio-alone condition. On the other hand, the delay of 120 ms corresponded to the mean mora duration measured for the audio stimuli. The results implied that audio delays of up to 120 ms would not disrupt lip-reading advantage, because visual and auditory information in speech seemed to be integrated on a syllabic time scale. Potential applications of this research include noisy workplace in which a worker must extract relevant speech from all the other competing noises.
Electrophysiological evidence for Audio-visuo-lingual speech integration.

PubMed

Treille, Avril; Vilain, Coriandre; Schwartz, Jean-Luc; Hueber, Thomas; Sato, Marc

2018-01-31

Recent neurophysiological studies demonstrate that audio-visual speech integration partly operates through temporal expectations and speech-specific predictions. From these results, one common view is that the binding of auditory and visual, lipread, speech cues relies on their joint probability and prior associative audio-visual experience. The present EEG study examined whether visual tongue movements integrate with relevant speech sounds, despite little associative audio-visual experience between the two modalities. A second objective was to determine possible similarities and differences of audio-visual speech integration between unusual audio-visuo-lingual and classical audio-visuo-labial modalities. To this aim, participants were presented with auditory, visual, and audio-visual isolated syllables, with the visual presentation related to either a sagittal view of the tongue movements or a facial view of the lip movements of a speaker, with lingual and facial movements previously recorded by an ultrasound imaging system and a video camera. In line with previous EEG studies, our results revealed an amplitude decrease and a latency facilitation of P2 auditory evoked potentials in both audio-visual-lingual and audio-visuo-labial conditions compared to the sum of unimodal conditions. These results argue against the view that auditory and visual speech cues solely integrate based on prior associative audio-visual perceptual experience. Rather, they suggest that dynamic and phonetic informational cues are sharable across sensory modalities, possibly through a cross-modal transfer of implicit articulatory motor knowledge. Copyright © 2017 Elsevier Ltd. All rights reserved.
Speech to Text Translation for Malay Language

NASA Astrophysics Data System (ADS)

Al-khulaidi, Rami Ali; Akmeliawati, Rini

2017-11-01

The speech recognition system is a front end and a back-end process that receives an audio signal uttered by a speaker and converts it into a text transcription. The speech system can be used in several fields including: therapeutic technology, education, social robotics and computer entertainments. In most cases in control tasks, which is the purpose of proposing our system, wherein the speed of performance and response concern as the system should integrate with other controlling platforms such as in voiced controlled robots. Therefore, the need for flexible platforms, that can be easily edited to jibe with functionality of the surroundings, came to the scene; unlike other software programs that require recording audios and multiple training for every entry such as MATLAB and Phoenix. In this paper, a speech recognition system for Malay language is implemented using Microsoft Visual Studio C#. 90 (ninety) Malay phrases were tested by 10 (ten) speakers from both genders in different contexts. The result shows that the overall accuracy (calculated from Confusion Matrix) is satisfactory as it is 92.69%.
Talker variability in audio-visual speech perception

PubMed Central

Heald, Shannon L. M.; Nusbaum, Howard C.

2014-01-01

A change in talker is a change in the context for the phonetic interpretation of acoustic patterns of speech. Different talkers have different mappings between acoustic patterns and phonetic categories and listeners need to adapt to these differences. Despite this complexity, listeners are adept at comprehending speech in multiple-talker contexts, albeit at a slight but measurable performance cost (e.g., slower recognition). So far, this talker variability cost has been demonstrated only in audio-only speech. Other research in single-talker contexts have shown, however, that when listeners are able to see a talker’s face, speech recognition is improved under adverse listening (e.g., noise or distortion) conditions that can increase uncertainty in the mapping between acoustic patterns and phonetic categories. Does seeing a talker’s face reduce the cost of word recognition in multiple-talker contexts? We used a speeded word-monitoring task in which listeners make quick judgments about target word recognition in single- and multiple-talker contexts. Results show faster recognition performance in single-talker conditions compared to multiple-talker conditions for both audio-only and audio-visual speech. However, recognition time in a multiple-talker context was slower in the audio-visual condition compared to audio-only condition. These results suggest that seeing a talker’s face during speech perception may slow recognition by increasing the importance of talker identification, signaling to the listener a change in talker has occurred. PMID:25076919
Talker variability in audio-visual speech perception.

PubMed

Heald, Shannon L M; Nusbaum, Howard C

2014-01-01

A change in talker is a change in the context for the phonetic interpretation of acoustic patterns of speech. Different talkers have different mappings between acoustic patterns and phonetic categories and listeners need to adapt to these differences. Despite this complexity, listeners are adept at comprehending speech in multiple-talker contexts, albeit at a slight but measurable performance cost (e.g., slower recognition). So far, this talker variability cost has been demonstrated only in audio-only speech. Other research in single-talker contexts have shown, however, that when listeners are able to see a talker's face, speech recognition is improved under adverse listening (e.g., noise or distortion) conditions that can increase uncertainty in the mapping between acoustic patterns and phonetic categories. Does seeing a talker's face reduce the cost of word recognition in multiple-talker contexts? We used a speeded word-monitoring task in which listeners make quick judgments about target word recognition in single- and multiple-talker contexts. Results show faster recognition performance in single-talker conditions compared to multiple-talker conditions for both audio-only and audio-visual speech. However, recognition time in a multiple-talker context was slower in the audio-visual condition compared to audio-only condition. These results suggest that seeing a talker's face during speech perception may slow recognition by increasing the importance of talker identification, signaling to the listener a change in talker has occurred.
Audio-video feature correlation: faces and speech

NASA Astrophysics Data System (ADS)

Durand, Gwenael; Montacie, Claude; Caraty, Marie-Jose; Faudemay, Pascal

1999-08-01

This paper presents a study of the correlation of features automatically extracted from the audio stream and the video stream of audiovisual documents. In particular, we were interested in finding out whether speech analysis tools could be combined with face detection methods, and to what extend they should be combined. A generic audio signal partitioning algorithm as first used to detect Silence/Noise/Music/Speech segments in a full length movie. A generic object detection method was applied to the keyframes extracted from the movie in order to detect the presence or absence of faces. The correlation between the presence of a face in the keyframes and of the corresponding voice in the audio stream was studied. A third stream, which is the script of the movie, is warped on the speech channel in order to automatically label faces appearing in the keyframes with the name of the corresponding character. We naturally found that extracted audio and video features were related in many cases, and that significant benefits can be obtained from the joint use of audio and video analysis methods.
Audio-Visual Speech Perception Is Special

ERIC Educational Resources Information Center

Tuomainen, J.; Andersen, T.S.; Tiippana, K.; Sams, M.

2005-01-01

In face-to-face conversation speech is perceived by ear and eye. We studied the prerequisites of audio-visual speech perception by using perceptually ambiguous sine wave replicas of natural speech as auditory stimuli. When the subjects were not aware that the auditory stimuli were speech, they showed only negligible integration of auditory and…
Is automatic speech-to-text transcription ready for use in psychological experiments?

PubMed

Ziman, Kirsten; Heusser, Andrew C; Fitzpatrick, Paxton C; Field, Campbell E; Manning, Jeremy R

2018-04-23

Verbal responses are a convenient and naturalistic way for participants to provide data in psychological experiments (Salzinger, The Journal of General Psychology, 61(1),65-94:1959). However, audio recordings of verbal responses typically require additional processing, such as transcribing the recordings into text, as compared with other behavioral response modalities (e.g., typed responses, button presses, etc.). Further, the transcription process is often tedious and time-intensive, requiring human listeners to manually examine each moment of recorded speech. Here we evaluate the performance of a state-of-the-art speech recognition algorithm (Halpern et al., 2016) in transcribing audio data into text during a list-learning experiment. We compare transcripts made by human annotators to the computer-generated transcripts. Both sets of transcripts matched to a high degree and exhibited similar statistical properties, in terms of the participants' recall performance and recall dynamics that the transcripts captured. This proof-of-concept study suggests that speech-to-text engines could provide a cheap, reliable, and rapid means of automatically transcribing speech data in psychological experiments. Further, our findings open the door for verbal response experiments that scale to thousands of participants (e.g., administered online), as well as a new generation of experiments that decode speech on the fly and adapt experimental parameters based on participants' prior responses.
No, There Is No 150 ms Lead of Visual Speech on Auditory Speech, but a Range of Audiovisual Asynchronies Varying from Small Audio Lead to Large Audio Lag

PubMed Central

Schwartz, Jean-Luc; Savariaux, Christophe

2014-01-01

An increasing number of neuroscience papers capitalize on the assumption published in this journal that visual speech would be typically 150 ms ahead of auditory speech. It happens that the estimation of audiovisual asynchrony in the reference paper is valid only in very specific cases, for isolated consonant-vowel syllables or at the beginning of a speech utterance, in what we call “preparatory gestures”. However, when syllables are chained in sequences, as they are typically in most parts of a natural speech utterance, asynchrony should be defined in a different way. This is what we call “comodulatory gestures” providing auditory and visual events more or less in synchrony. We provide audiovisual data on sequences of plosive-vowel syllables (pa, ta, ka, ba, da, ga, ma, na) showing that audiovisual synchrony is actually rather precise, varying between 20 ms audio lead and 70 ms audio lag. We show how more complex speech material should result in a range typically varying between 40 ms audio lead and 200 ms audio lag, and we discuss how this natural coordination is reflected in the so-called temporal integration window for audiovisual speech perception. Finally we present a toy model of auditory and audiovisual predictive coding, showing that visual lead is actually not necessary for visual prediction. PMID:25079216
Bridging music and speech rhythm: rhythmic priming and audio-motor training affect speech perception.

PubMed

Cason, Nia; Astésano, Corine; Schön, Daniele

2015-02-01

Following findings that musical rhythmic priming enhances subsequent speech perception, we investigated whether rhythmic priming for spoken sentences can enhance phonological processing - the building blocks of speech - and whether audio-motor training enhances this effect. Participants heard a metrical prime followed by a sentence (with a matching/mismatching prosodic structure), for which they performed a phoneme detection task. Behavioural (RT) data was collected from two groups: one who received audio-motor training, and one who did not. We hypothesised that 1) phonological processing would be enhanced in matching conditions, and 2) audio-motor training with the musical rhythms would enhance this effect. Indeed, providing a matching rhythmic prime context resulted in faster phoneme detection, thus revealing a cross-domain effect of musical rhythm on phonological processing. In addition, our results indicate that rhythmic audio-motor training enhances this priming effect. These results have important implications for rhythm-based speech therapies, and suggest that metrical rhythm in music and speech may rely on shared temporal processing brain resources. Copyright © 2015 Elsevier B.V. All rights reserved.
Transitioning from analog to digital audio recording in childhood speech sound disorders.

PubMed

Shriberg, Lawrence D; McSweeny, Jane L; Anderson, Bruce E; Campbell, Thomas F; Chial, Michael R; Green, Jordan R; Hauner, Katherina K; Moore, Christopher A; Rusiewicz, Heather L; Wilson, David L

2005-06-01

Few empirical findings or technical guidelines are available on the current transition from analog to digital audio recording in childhood speech sound disorders. Of particular concern in the present context was whether a transition from analog- to digital-based transcription and coding of prosody and voice features might require re-standardizing a reference database for research in childhood speech sound disorders. Two research transcribers with different levels of experience glossed, transcribed, and prosody-voice coded conversational speech samples from eight children with mild to severe speech disorders of unknown origin. The samples were recorded, stored, and played back using representative analog and digital audio systems. Effect sizes calculated for an array of analog versus digital comparisons ranged from negligible to medium, with a trend for participants' speech competency scores to be slightly lower for samples obtained and transcribed using the digital system. We discuss the implications of these and other findings for research and clinical practise.
Transitioning from analog to digital audio recording in childhood speech sound disorders

PubMed Central

Shriberg, Lawrence D.; McSweeny, Jane L.; Anderson, Bruce E.; Campbell, Thomas F.; Chial, Michael R.; Green, Jordan R.; Hauner, Katherina K.; Moore, Christopher A.; Rusiewicz, Heather L.; Wilson, David L.

2014-01-01

Few empirical findings or technical guidelines are available on the current transition from analog to digital audio recording in childhood speech sound disorders. Of particular concern in the present context was whether a transition from analog- to digital-based transcription and coding of prosody and voice features might require re-standardizing a reference database for research in childhood speech sound disorders. Two research transcribers with different levels of experience glossed, transcribed, and prosody-voice coded conversational speech samples from eight children with mild to severe speech disorders of unknown origin. The samples were recorded, stored, and played back using representative analog and digital audio systems. Effect sizes calculated for an array of analog versus digital comparisons ranged from negligible to medium, with a trend for participants’ speech competency scores to be slightly lower for samples obtained and transcribed using the digital system. We discuss the implications of these and other findings for research and clinical practise. PMID:16019779
Cross-Modal Matching of Audio-Visual German and French Fluent Speech in Infancy

PubMed Central

Kubicek, Claudia; Hillairet de Boisferon, Anne; Dupierrix, Eve; Pascalis, Olivier; Lœvenbruck, Hélène; Gervain, Judit; Schwarzer, Gudrun

2014-01-01

The present study examined when and how the ability to cross-modally match audio-visual fluent speech develops in 4.5-, 6- and 12-month-old German-learning infants. In Experiment 1, 4.5- and 6-month-old infants’ audio-visual matching ability of native (German) and non-native (French) fluent speech was assessed by presenting auditory and visual speech information sequentially, that is, in the absence of temporal synchrony cues. The results showed that 4.5-month-old infants were capable of matching native as well as non-native audio and visual speech stimuli, whereas 6-month-olds perceived the audio-visual correspondence of native language stimuli only. This suggests that intersensory matching narrows for fluent speech between 4.5 and 6 months of age. In Experiment 2, auditory and visual speech information was presented simultaneously, therefore, providing temporal synchrony cues. Here, 6-month-olds were found to match native as well as non-native speech indicating facilitation of temporal synchrony cues on the intersensory perception of non-native fluent speech. Intriguingly, despite the fact that audio and visual stimuli cohered temporally, 12-month-olds matched the non-native language only. Results were discussed with regard to multisensory perceptual narrowing during the first year of life. PMID:24586651
Effects of Audio-Visual Information on the Intelligibility of Alaryngeal Speech

ERIC Educational Resources Information Center

Evitts, Paul M.; Portugal, Lindsay; Van Dine, Ami; Holler, Aline

2010-01-01

Background: There is minimal research on the contribution of visual information on speech intelligibility for individuals with a laryngectomy (IWL). Aims: The purpose of this project was to determine the effects of mode of presentation (audio-only, audio-visual) on alaryngeal speech intelligibility. Method: Twenty-three naive listeners were…
Robust audio-visual speech recognition under noisy audio-video conditions.

PubMed

Stewart, Darryl; Seymour, Rowan; Pass, Adrian; Ming, Ji

2014-02-01

This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audio-visual speech recognition in environments, where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements and can be used alongside many of the other approaches that have been proposed in the literature for this problem. For evaluation we used the large XM2VTS database for speaker-independent audio-visual speech recognition. The extensive tests include both clean and corrupted utterances with corruption added in either/both the video and audio streams using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison to another well-known dynamic stream weighting approach and also compared to any fixed-weighted integration approach in both clean conditions or when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.
Support vector machine and mel frequency Cepstral coefficient based algorithm for hand gestures and bidirectional speech to text device

NASA Astrophysics Data System (ADS)

Balbin, Jessie R.; Padilla, Dionis A.; Fausto, Janette C.; Vergara, Ernesto M.; Garcia, Ramon G.; Delos Angeles, Bethsedea Joy S.; Dizon, Neil John A.; Mardo, Mark Kevin N.

2017-02-01

This research is about translating series of hand gesture to form a word and produce its equivalent sound on how it is read and said in Filipino accent using Support Vector Machine and Mel Frequency Cepstral Coefficient analysis. The concept is to detect Filipino speech input and translate the spoken words to their text form in Filipino. This study is trying to help the Filipino deaf community to impart their thoughts through the use of hand gestures and be able to communicate to people who do not know how to read hand gestures. This also helps literate deaf to simply read the spoken words relayed to them using the Filipino speech to text system.
Audio visual speech source separation via improved context dependent association model

NASA Astrophysics Data System (ADS)

Kazemi, Alireza; Boostani, Reza; Sobhanmanesh, Fariborz

2014-12-01

In this paper, we exploit the non-linear relation between a speech source and its associated lip video as a source of extra information to propose an improved audio-visual speech source separation (AVSS) algorithm. The audio-visual association is modeled using a neural associator which estimates the visual lip parameters from a temporal context of acoustic observation frames. We define an objective function based on mean square error (MSE) measure between estimated and target visual parameters. This function is minimized for estimation of the de-mixing vector/filters to separate the relevant source from linear instantaneous or time-domain convolutive mixtures. We have also proposed a hybrid criterion which uses AV coherency together with kurtosis as a non-Gaussianity measure. Experimental results are presented and compared in terms of visually relevant speech detection accuracy and output signal-to-interference ratio (SIR) of source separation. The suggested audio-visual model significantly improves relevant speech classification accuracy compared to existing GMM-based model and the proposed AVSS algorithm improves the speech separation quality compared to reference ICA- and AVSS-based methods.
Interactive MPEG-4 low-bit-rate speech/audio transmission over the Internet

NASA Astrophysics Data System (ADS)

Liu, Fang; Kim, JongWon; Kuo, C.-C. Jay

1999-11-01

The recently developed MPEG-4 technology enables the coding and transmission of natural and synthetic audio-visual data in the form of objects. In an effort to extend the object-based functionality of MPEG-4 to real-time Internet applications, architectural prototypes of multiplex layer and transport layer tailored for transmission of MPEG-4 data over IP are under debate among Internet Engineering Task Force (IETF), and MPEG-4 systems Ad Hoc group. In this paper, we present an architecture for interactive MPEG-4 speech/audio transmission system over the Internet. It utilities a framework of Real Time Streaming Protocol (RTSP) over Real-time Transport Protocol (RTP) to provide controlled, on-demand delivery of real time speech/audio data. Based on a client-server model, a couple of low bit-rate bit streams (real-time speech/audio, pre- encoded speech/audio) are multiplexed and transmitted via a single RTP channel to the receiver. The MPEG-4 Scene Description (SD) and Object Descriptor (OD) bit streams are securely sent through the RTSP control channel. Upon receiving, an initial MPEG-4 audio- visual scene is constructed after de-multiplexing, decoding of bit streams, and scene composition. A receiver is allowed to manipulate the initial audio-visual scene presentation locally, or interactively arrange scene changes by sending requests to the server. A server may also choose to update the client with new streams and list of contents for user selection.

Incorporating Auditory Models in Speech/Audio Applications

NASA Astrophysics Data System (ADS)

Krishnamoorthi, Harish

2011-12-01

Following the success in incorporating perceptual models in audio coding algorithms, their application in other speech/audio processing systems is expanding. In general, all perceptual speech/audio processing algorithms involve minimization of an objective function that directly/indirectly incorporates properties of human perception. This dissertation primarily investigates the problems associated with directly embedding an auditory model in the objective function formulation and proposes possible solutions to overcome high complexity issues for use in real-time speech/audio algorithms. Specific problems addressed in this dissertation include: 1) the development of approximate but computationally efficient auditory model implementations that are consistent with the principles of psychoacoustics, 2) the development of a mapping scheme that allows synthesizing a time/frequency domain representation from its equivalent auditory model output. The first problem is aimed at addressing the high computational complexity involved in solving perceptual objective functions that require repeated application of auditory model for evaluation of different candidate solutions. In this dissertation, a frequency pruning and a detector pruning algorithm is developed that efficiently implements the various auditory model stages. The performance of the pruned model is compared to that of the original auditory model for different types of test signals in the SQAM database. Experimental results indicate only a 4-7% relative error in loudness while attaining up to 80-90 % reduction in computational complexity. Similarly, a hybrid algorithm is developed specifically for use with sinusoidal signals and employs the proposed auditory pattern combining technique together with a look-up table to store representative auditory patterns. The second problem obtains an estimate of the auditory representation that minimizes a perceptual objective function and transforms the auditory pattern back to
Audio-visual speech processing in age-related hearing loss: Stronger integration and increased frontal lobe recruitment.

PubMed

Rosemann, Stephanie; Thiel, Christiane M

2018-07-15

Hearing loss is associated with difficulties in understanding speech, especially under adverse listening conditions. In these situations, seeing the speaker improves speech intelligibility in hearing-impaired participants. On the neuronal level, previous research has shown cross-modal plastic reorganization in the auditory cortex following hearing loss leading to altered processing of auditory, visual and audio-visual information. However, how reduced auditory input effects audio-visual speech perception in hearing-impaired subjects is largely unknown. We here investigated the impact of mild to moderate age-related hearing loss on processing audio-visual speech using functional magnetic resonance imaging. Normal-hearing and hearing-impaired participants performed two audio-visual speech integration tasks: a sentence detection task inside the scanner and the McGurk illusion outside the scanner. Both tasks consisted of congruent and incongruent audio-visual conditions, as well as auditory-only and visual-only conditions. We found a significantly stronger McGurk illusion in the hearing-impaired participants, which indicates stronger audio-visual integration. Neurally, hearing loss was associated with an increased recruitment of frontal brain areas when processing incongruent audio-visual, auditory and also visual speech stimuli, which may reflect the increased effort to perform the task. Hearing loss modulated both the audio-visual integration strength measured with the McGurk illusion and brain activation in frontal areas in the sentence task, showing stronger integration and higher brain activation with increasing hearing loss. Incongruent compared to congruent audio-visual speech revealed an opposite brain activation pattern in left ventral postcentral gyrus in both groups, with higher activation in hearing-impaired participants in the incongruent condition. Our results indicate that already mild to moderate hearing loss impacts audio-visual speech processing
Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

NASA Technical Reports Server (NTRS)

Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

2010-01-01

A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It can enjoy a fast rate of data/text entry, small overall size, and can be lightweight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaption, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMMbased ASR components were developed. They can help real-time ASR system designers select proper tasks when in the face of constraints in computational resources.
A video, text, and speech-driven realistic 3-d virtual head for human-machine interface.

PubMed

Yu, Jun; Wang, Zeng-Fu

2015-05-01

A multiple inputs-driven realistic facial animation system based on 3-D virtual head for human-machine interface is proposed. The system can be driven independently by video, text, and speech, thus can interact with humans through diverse interfaces. The combination of parameterized model and muscular model is used to obtain a tradeoff between computational efficiency and high realism of 3-D facial animation. The online appearance model is used to track 3-D facial motion from video in the framework of particle filtering, and multiple measurements, i.e., pixel color value of input image and Gabor wavelet coefficient of illumination ratio image, are infused to reduce the influence of lighting and person dependence for the construction of online appearance model. The tri-phone model is used to reduce the computational consumption of visual co-articulation in speech synchronized viseme synthesis without sacrificing any performance. The objective and subjective experiments show that the system is suitable for human-machine interaction.
Mistaking minds and machines: How speech affects dehumanization and anthropomorphism.

PubMed

Schroeder, Juliana; Epley, Nicholas

2016-11-01

Treating a human mind like a machine is an essential component of dehumanization, whereas attributing a humanlike mind to a machine is an essential component of anthropomorphism. Here we tested how a cue closely connected to a person's actual mental experience-a humanlike voice-affects the likelihood of mistaking a person for a machine, or a machine for a person. We predicted that paralinguistic cues in speech are particularly likely to convey the presence of a humanlike mind, such that removing voice from communication (leaving only text) would increase the likelihood of mistaking the text's creator for a machine. Conversely, adding voice to a computer-generated script (resulting in speech) would increase the likelihood of mistaking the text's creator for a human. Four experiments confirmed these hypotheses, demonstrating that people are more likely to infer a human (vs. computer) creator when they hear a voice expressing thoughts than when they read the same thoughts in text. Adding human visual cues to text (i.e., seeing a person perform a script in a subtitled video clip), did not increase the likelihood of inferring a human creator compared with only reading text, suggesting that defining features of personhood may be conveyed more clearly in speech (Experiments 1 and 2). Removing the naturalistic paralinguistic cues that convey humanlike capacity for thinking and feeling, such as varied pace and intonation, eliminates the humanizing effect of speech (Experiment 4). We discuss implications for dehumanizing others through text-based media, and for anthropomorphizing machines through speech-based media. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
Audio-visual speech perception: a developmental ERP investigation

PubMed Central

Knowland, Victoria CP; Mercure, Evelyne; Karmiloff-Smith, Annette; Dick, Fred; Thomas, Michael SC

2014-01-01

Being able to see a talking face confers a considerable advantage for speech perception in adulthood. However, behavioural data currently suggest that children fail to make full use of these available visual speech cues until age 8 or 9. This is particularly surprising given the potential utility of multiple informational cues during language learning. We therefore explored this at the neural level. The event-related potential (ERP) technique has been used to assess the mechanisms of audio-visual speech perception in adults, with visual cues reliably modulating auditory ERP responses to speech. Previous work has shown congruence-dependent shortening of auditory N1/P2 latency and congruence-independent attenuation of amplitude in the presence of auditory and visual speech signals, compared to auditory alone. The aim of this study was to chart the development of these well-established modulatory effects over mid-to-late childhood. Experiment 1 employed an adult sample to validate a child-friendly stimulus set and paradigm by replicating previously observed effects of N1/P2 amplitude and latency modulation by visual speech cues; it also revealed greater attenuation of component amplitude given incongruent audio-visual stimuli, pointing to a new interpretation of the amplitude modulation effect. Experiment 2 used the same paradigm to map cross-sectional developmental change in these ERP responses between 6 and 11 years of age. The effect of amplitude modulation by visual cues emerged over development, while the effect of latency modulation was stable over the child sample. These data suggest that auditory ERP modulation by visual speech represents separable underlying cognitive processes, some of which show earlier maturation than others over the course of development. PMID:24176002
The Study and Implementation of Text-to-Speech System for Agricultural Information

NASA Astrophysics Data System (ADS)

Zheng, Huoguo; Hu, Haiyan; Liu, Shihong; Meng, Hong

The Broadcast and Television coverage has increased to more than 98% in china. Information services by radio have wide coverage, low cost, easy-to-grass-roots farmers to accept etc. characteristics. In order to play the better role of broadcast information service, as well as aim at the problem of lack of information resource in rural, we R & D the text-to-speech system. The system includes two parts, software and hardware device, both of them can translate text into audio file. The software subsystem was implemented basic on third-part middleware, and the hardware subsystem was realized with microelectronics technology. Results indicate that the hardware is better than software. The system has been applied in huailai city hebei province, which has conversed more than 8000 audio files as programming materials for the local radio station.
Infant Perception of Audio-Visual Speech Synchrony

ERIC Educational Resources Information Center

Lewkowicz, David J.

2010-01-01

Three experiments investigated perception of audio-visual (A-V) speech synchrony in 4- to 10-month-old infants. Experiments 1 and 2 used a convergent-operations approach by habituating infants to an audiovisually synchronous syllable (Experiment 1) and then testing for detection of increasing degrees of A-V asynchrony (366, 500, and 666 ms) or by…
Listening to an Audio Drama Activates Two Processing Networks, One for All Sounds, Another Exclusively for Speech

PubMed Central

Boldt, Robert; Malinen, Sanna; Seppä, Mika; Tikka, Pia; Savolainen, Petri; Hari, Riitta; Carlson, Synnöve

2013-01-01

Earlier studies have shown considerable intersubject synchronization of brain activity when subjects watch the same movie or listen to the same story. Here we investigated the across-subjects similarity of brain responses to speech and non-speech sounds in a continuous audio drama designed for blind people. Thirteen healthy adults listened for ∼19 min to the audio drama while their brain activity was measured with 3 T functional magnetic resonance imaging (fMRI). An intersubject-correlation (ISC) map, computed across the whole experiment to assess the stimulus-driven extrinsic brain network, indicated statistically significant ISC in temporal, frontal and parietal cortices, cingulate cortex, and amygdala. Group-level independent component (IC) analysis was used to parcel out the brain signals into functionally coupled networks, and the dependence of the ICs on external stimuli was tested by comparing them with the ISC map. This procedure revealed four extrinsic ICs of which two–covering non-overlapping areas of the auditory cortex–were modulated by both speech and non-speech sounds. The two other extrinsic ICs, one left-hemisphere-lateralized and the other right-hemisphere-lateralized, were speech-related and comprised the superior and middle temporal gyri, temporal poles, and the left angular and inferior orbital gyri. In areas of low ISC four ICs that were defined intrinsic fluctuated similarly as the time-courses of either the speech-sound-related or all-sounds-related extrinsic ICs. These ICs included the superior temporal gyrus, the anterior insula, and the frontal, parietal and midline occipital cortices. Taken together, substantial intersubject synchronization of cortical activity was observed in subjects listening to an audio drama, with results suggesting that speech is processed in two separate networks, one dedicated to the processing of speech sounds and the other to both speech and non-speech sounds. PMID:23734202
Listening to an audio drama activates two processing networks, one for all sounds, another exclusively for speech.

PubMed

Boldt, Robert; Malinen, Sanna; Seppä, Mika; Tikka, Pia; Savolainen, Petri; Hari, Riitta; Carlson, Synnöve

2013-01-01

Earlier studies have shown considerable intersubject synchronization of brain activity when subjects watch the same movie or listen to the same story. Here we investigated the across-subjects similarity of brain responses to speech and non-speech sounds in a continuous audio drama designed for blind people. Thirteen healthy adults listened for ∼19 min to the audio drama while their brain activity was measured with 3 T functional magnetic resonance imaging (fMRI). An intersubject-correlation (ISC) map, computed across the whole experiment to assess the stimulus-driven extrinsic brain network, indicated statistically significant ISC in temporal, frontal and parietal cortices, cingulate cortex, and amygdala. Group-level independent component (IC) analysis was used to parcel out the brain signals into functionally coupled networks, and the dependence of the ICs on external stimuli was tested by comparing them with the ISC map. This procedure revealed four extrinsic ICs of which two-covering non-overlapping areas of the auditory cortex-were modulated by both speech and non-speech sounds. The two other extrinsic ICs, one left-hemisphere-lateralized and the other right-hemisphere-lateralized, were speech-related and comprised the superior and middle temporal gyri, temporal poles, and the left angular and inferior orbital gyri. In areas of low ISC four ICs that were defined intrinsic fluctuated similarly as the time-courses of either the speech-sound-related or all-sounds-related extrinsic ICs. These ICs included the superior temporal gyrus, the anterior insula, and the frontal, parietal and midline occipital cortices. Taken together, substantial intersubject synchronization of cortical activity was observed in subjects listening to an audio drama, with results suggesting that speech is processed in two separate networks, one dedicated to the processing of speech sounds and the other to both speech and non-speech sounds.
Transitioning from Analog to Digital Audio Recording in Childhood Speech Sound Disorders

ERIC Educational Resources Information Center

Shriberg, Lawrence D.; Mcsweeny, Jane L.; Anderson, Bruce E.; Campbell, Thomas F.; Chial, Michael R.; Green, Jordan R.; Hauner, Katherina K.; Moore, Christopher A.; Rusiewicz, Heather L.; Wilson, David L.

2005-01-01

Few empirical findings or technical guidelines are available on the current transition from analog to digital audio recording in childhood speech sound disorders. Of particular concern in the present context was whether a transition from analog- to digital-based transcription and coding of prosody and voice features might require re-standardizing…
Integrated Spacesuit Audio System Enhances Speech Quality and Reduces Noise

NASA Technical Reports Server (NTRS)

Huang, Yiteng Arden; Chen, Jingdong; Chen, Shaoyan Sharyl

2009-01-01

A new approach has been proposed for increasing astronaut comfort and speech capture. Currently, the special design of a spacesuit forms an extreme acoustic environment making it difficult to capture clear speech without compromising comfort. The proposed Integrated Spacesuit Audio (ISA) system is to incorporate the microphones into the helmet and use software to extract voice signals from background noise.
The cortical representation of the speech envelope is earlier for audiovisual speech than audio speech.

PubMed

Crosse, Michael J; Lalor, Edmund C

2014-04-01

Visual speech can greatly enhance a listener's comprehension of auditory speech when they are presented simultaneously. Efforts to determine the neural underpinnings of this phenomenon have been hampered by the limited temporal resolution of hemodynamic imaging and the fact that EEG and magnetoencephalographic data are usually analyzed in response to simple, discrete stimuli. Recent research has shown that neuronal activity in human auditory cortex tracks the envelope of natural speech. Here, we exploit this finding by estimating a linear forward-mapping between the speech envelope and EEG data and show that the latency at which the envelope of natural speech is represented in cortex is shortened by >10 ms when continuous audiovisual speech is presented compared with audio-only speech. In addition, we use a reverse-mapping approach to reconstruct an estimate of the speech stimulus from the EEG data and, by comparing the bimodal estimate with the sum of the unimodal estimates, find no evidence of any nonlinear additive effects in the audiovisual speech condition. These findings point to an underlying mechanism that could account for enhanced comprehension during audiovisual speech. Specifically, we hypothesize that low-level acoustic features that are temporally coherent with the preceding visual stream may be synthesized into a speech object at an earlier latency, which may provide an extended period of low-level processing before extraction of semantic information.
The effect of simultaneous text on the recall of noise-degraded speech.

PubMed

Grossman, Irina; Rajan, Ramesh

2017-05-01

Written and spoken language utilize the same processing system, enabling text to modulate speech processing. We investigated how simultaneously presented text affected speech recall in babble noise using a retrospective recall task. Participants were presented with text-speech sentence pairs in multitalker babble noise and then prompted to recall what they heard or what they read. In Experiment 1, sentence pairs were either congruent or incongruent and they were presented in silence or at 1 of 4 noise levels. Audio and Visual control groups were also tested with sentences presented in only 1 modality. Congruent text facilitated accurate recall of degraded speech; incongruent text had no effect. Text and speech were seldom confused for each other. A consideration of the effects of the language background found that monolingual English speakers outperformed early multilinguals at recalling degraded speech; however the effects of text on speech processing were analogous. Experiment 2 considered if the benefit provided by matching text was maintained when the congruency of the text and speech becomes more ambiguous because of the addition of partially mismatching text-speech sentence pairs that differed only on their final keyword and because of the use of low signal-to-noise ratios. The experiment focused on monolingual English speakers; the results showed that even though participants commonly confused text-for-speech during incongruent text-speech pairings, these confusions could not fully account for the benefit provided by matching text. Thus, we uniquely demonstrate that congruent text benefits the recall of noise-degraded speech. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Audio-visual speech cue combination.

PubMed

Arnold, Derek H; Tear, Morgan; Schindel, Ryan; Roseboom, Warrick

2010-04-16

Different sources of sensory information can interact, often shaping what we think we have seen or heard. This can enhance the precision of perceptual decisions relative to those made on the basis of a single source of information. From a computational perspective, there are multiple reasons why this might happen, and each predicts a different degree of enhanced precision. Relatively slight improvements can arise when perceptual decisions are made on the basis of multiple independent sensory estimates, as opposed to just one. These improvements can arise as a consequence of probability summation. Greater improvements can occur if two initially independent estimates are summated to form a single integrated code, especially if the summation is weighted in accordance with the variance associated with each independent estimate. This form of combination is often described as a Bayesian maximum likelihood estimate. Still greater improvements are possible if the two sources of information are encoded via a common physiological process. Here we show that the provision of simultaneous audio and visual speech cues can result in substantial sensitivity improvements, relative to single sensory modality based decisions. The magnitude of the improvements is greater than can be predicted on the basis of either a Bayesian maximum likelihood estimate or a probability summation. Our data suggest that primary estimates of speech content are determined by a physiological process that takes input from both visual and auditory processing, resulting in greater sensitivity than would be possible if initially independent audio and visual estimates were formed and then subsequently combined.
Machine Translation from Text

NASA Astrophysics Data System (ADS)

Habash, Nizar; Olive, Joseph; Christianson, Caitlin; McCary, John

Machine translation (MT) from text, the topic of this chapter, is perhaps the heart of the GALE project. Beyond being a well defined application that stands on its own, MT from text is the link between the automatic speech recognition component and the distillation component. The focus of MT in GALE is on translating from Arabic or Chinese to English. The three languages represent a wide range of linguistic diversity and make the GALE MT task rather challenging and exciting.
Audio-visual speech intelligibility benefits with bilateral cochlear implants when talker location varies.

PubMed

van Hoesel, Richard J M

2015-04-01

One of the key benefits of using cochlear implants (CIs) in both ears rather than just one is improved localization. It is likely that in complex listening scenes, improved localization allows bilateral CI users to orient toward talkers to improve signal-to-noise ratios and gain access to visual cues, but to date, that conjecture has not been tested. To obtain an objective measure of that benefit, seven bilateral CI users were assessed for both auditory-only and audio-visual speech intelligibility in noise using a novel dynamic spatial audio-visual test paradigm. For each trial conducted in spatially distributed noise, first, an auditory-only cueing phrase that was spoken by one of four talkers was selected and presented from one of four locations. Shortly afterward, a target sentence was presented that was either audio-visual or, in another test configuration, audio-only and was spoken by the same talker and from the same location as the cueing phrase. During the target presentation, visual distractors were added at other spatial locations. Results showed that in terms of speech reception thresholds (SRTs), the average improvement for bilateral listening over the better performing ear alone was 9 dB for the audio-visual mode, and 3 dB for audition-alone. Comparison of bilateral performance for audio-visual and audition-alone showed that inclusion of visual cues led to an average SRT improvement of 5 dB. For unilateral device use, no such benefit arose, presumably due to the greatly reduced ability to localize the target talker to acquire visual information. The bilateral CI speech intelligibility advantage over the better ear in the present study is much larger than that previously reported for static talker locations and indicates greater everyday speech benefits and improved cost-benefit than estimated to date.
Effects of aging on audio-visual speech integration.

PubMed

Huyse, Aurélie; Leybaert, Jacqueline; Berthommier, Frédéric

2014-10-01

This study investigated the impact of aging on audio-visual speech integration. A syllable identification task was presented in auditory-only, visual-only, and audio-visual congruent and incongruent conditions. Visual cues were either degraded or unmodified. Stimuli were embedded in stationary noise alternating with modulated noise. Fifteen young adults and 15 older adults participated in this study. Results showed that older adults had preserved lipreading abilities when the visual input was clear but not when it was degraded. The impact of aging on audio-visual integration also depended on the quality of the visual cues. In the visual clear condition, the audio-visual gain was similar in both groups and analyses in the framework of the fuzzy-logical model of perception confirmed that older adults did not differ from younger adults in their audio-visual integration abilities. In the visual reduction condition, the audio-visual gain was reduced in the older group, but only when the noise was stationary, suggesting that older participants could compensate for the loss of lipreading abilities by using the auditory information available in the valleys of the noise. The fuzzy-logical model of perception confirmed the significant impact of aging on audio-visual integration by showing an increased weight of audition in the older group.
Audio Steganography with Embedded Text

NASA Astrophysics Data System (ADS)

Teck Jian, Chua; Chai Wen, Chuah; Rahman, Nurul Hidayah Binti Ab.; Hamid, Isredza Rahmi Binti A.

2017-08-01

Audio steganography is about hiding the secret message into the audio. It is a technique uses to secure the transmission of secret information or hide their existence. It also may provide confidentiality to secret message if the message is encrypted. To date most of the steganography software such as Mp3Stego and DeepSound use block cipher such as Advanced Encryption Standard or Data Encryption Standard to encrypt the secret message. It is a good practice for security. However, the encrypted message may become too long to embed in audio and cause distortion of cover audio if the secret message is too long. Hence, there is a need to encrypt the message with stream cipher before embedding the message into the audio. This is because stream cipher provides bit by bit encryption meanwhile block cipher provide a fixed length of bits encryption which result a longer output compare to stream cipher. Hence, an audio steganography with embedding text with Rivest Cipher 4 encryption cipher is design, develop and test in this project.
Investigating Perceptual Biases, Data Reliability, and Data Discovery in a Methodology for Collecting Speech Errors From Audio Recordings.

PubMed

Alderete, John; Davies, Monica

2018-04-01

This work describes a methodology of collecting speech errors from audio recordings and investigates how some of its assumptions affect data quality and composition. Speech errors of all types (sound, lexical, syntactic, etc.) were collected by eight data collectors from audio recordings of unscripted English speech. Analysis of these errors showed that: (i) different listeners find different errors in the same audio recordings, but (ii) the frequencies of error patterns are similar across listeners; (iii) errors collected "online" using on the spot observational techniques are more likely to be affected by perceptual biases than "offline" errors collected from audio recordings; and (iv) datasets built from audio recordings can be explored and extended in a number of ways that traditional corpus studies cannot be.

Relating dynamic brain states to dynamic machine states: Human and machine solutions to the speech recognition problem

PubMed Central

Liu, Xunying; Zhang, Chao; Woodland, Phil; Fonteneau, Elisabeth

2017-01-01

There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental ‘machine states’, generated as the ASR analysis progresses over time, to the incremental ‘brain states’, measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain. PMID:28945744
Semantic Indexing of Multimedia Content Using Visual, Audio, and Text Cues

NASA Astrophysics Data System (ADS)

Adams, W. H.; Iyengar, Giridharan; Lin, Ching-Yung; Naphade, Milind Ramesh; Neti, Chalapathy; Nock, Harriet J.; Smith, John R.

2003-12-01

We present a learning-based approach to the semantic indexing of multimedia content using cues derived from audio, visual, and text features. We approach the problem by developing a set of statistical models for a predefined lexicon. Novel concepts are then mapped in terms of the concepts in the lexicon. To achieve robust detection of concepts, we exploit features from multiple modalities, namely, audio, video, and text. Concept representations are modeled using Gaussian mixture models (GMM), hidden Markov models (HMM), and support vector machines (SVM). Models such as Bayesian networks and SVMs are used in a late-fusion approach to model concepts that are not explicitly modeled in terms of features. Our experiments indicate promise in the proposed classification and fusion methodologies: our proposed fusion scheme achieves more than 10% relative improvement over the best unimodal concept detector.
Detecting Parkinson's disease from sustained phonation and speech signals.

PubMed

Vaiciukynas, Evaldas; Verikas, Antanas; Gelzinis, Adas; Bacauskiene, Marija

2017-01-01

This study investigates signals from sustained phonation and text-dependent speech modalities for Parkinson's disease screening. Phonation corresponds to the vowel /a/ voicing task and speech to the pronunciation of a short sentence in Lithuanian language. Signals were recorded through two channels simultaneously, namely, acoustic cardioid (AC) and smart phone (SP) microphones. Additional modalities were obtained by splitting speech recording into voiced and unvoiced parts. Information in each modality is summarized by 18 well-known audio feature sets. Random forest (RF) is used as a machine learning algorithm, both for individual feature sets and for decision-level fusion. Detection performance is measured by the out-of-bag equal error rate (EER) and the cost of log-likelihood-ratio. Essentia audio feature set was the best using the AC speech modality and YAAFE audio feature set was the best using the SP unvoiced modality, achieving EER of 20.30% and 25.57%, respectively. Fusion of all feature sets and modalities resulted in EER of 19.27% for the AC and 23.00% for the SP channel. Non-linear projection of a RF-based proximity matrix into the 2D space enriched medical decision support by visualization.
Off the ear with no loss in speech understanding: comparing the RONDO and the OPUS 2 cochlear implant audio processors.

PubMed

Dazert, Stefan; Thomas, Jan Peter; Büchner, Andreas; Müller, Joachim; Hempel, John Martin; Löwenheim, Hubert; Mlynski, Robert

2017-03-01

The RONDO is a single-unit cochlear implant audio processor, which omits the need for a behind-the-ear (BTE) audio processor. The primary aim was to compare speech perception results in quiet and in noise with the RONDO and the OPUS 2, a BTE audio processor. Secondary aims were to determine subjects' self-assessed levels of sound quality and gather subjective feedback on RONDO use. All speech perception tests were performed with the RONDO and the OPUS 2 behind-the-ear audio processor at 3 test intervals. Subjects were required to use the RONDO between test intervals. Subjects were tested at upgrade from the OPUS 2 to the RONDO and at 1 and 6 months after upgrade. Speech perception was determined using the Freiburg Monosyllables in quiet test and the Oldenburg Sentence Test (OLSA) in noise. Subjective perception was determined using the Hearing Implant Sound Quality Index (HISQUI 19 ), and a RONDO device-specific questionnaire. 50 subjects participated in the study. Neither speech perception scores nor self-perceived sound quality scores were significantly different at any interval between the RONDO and the OPUS 2. Subjects reported high levels of satisfaction with the RONDO. The RONDO provides comparable speech perception to the OPUS 2 while providing users with high levels of satisfaction and comfort without increasing health risk. The RONDO is a suitable and safe alternative to traditional BTE audio processors.
Choosing and Using Text-to-Speech Software

ERIC Educational Resources Information Center

Peters, Tom; Bell, Lori

2007-01-01

This article describes a computer-based technology for generating speech called text-to-speech (TTS). This software is ready for widespread use by libraries, other organizations, and individual users. It offers the affordable ability to turn just about any electronic text that is not image-based into an artificially spoken communication. The…
Speech entrainment enables patients with Broca’s aphasia to produce fluent speech

PubMed Central

Hubbard, H. Isabel; Hudspeth, Sarah Grace; Holland, Audrey L.; Bonilha, Leonardo; Fromm, Davida; Rorden, Chris

2012-01-01

A distinguishing feature of Broca’s aphasia is non-fluent halting speech typically involving one to three words per utterance. Yet, despite such profound impairments, some patients can mimic audio-visual speech stimuli enabling them to produce fluent speech in real time. We call this effect ‘speech entrainment’ and reveal its neural mechanism as well as explore its usefulness as a treatment for speech production in Broca’s aphasia. In Experiment 1, 13 patients with Broca’s aphasia were tested in three conditions: (i) speech entrainment with audio-visual feedback where they attempted to mimic a speaker whose mouth was seen on an iPod screen; (ii) speech entrainment with audio-only feedback where patients mimicked heard speech; and (iii) spontaneous speech where patients spoke freely about assigned topics. The patients produced a greater variety of words using audio-visual feedback compared with audio-only feedback and spontaneous speech. No difference was found between audio-only feedback and spontaneous speech. In Experiment 2, 10 of the 13 patients included in Experiment 1 and 20 control subjects underwent functional magnetic resonance imaging to determine the neural mechanism that supports speech entrainment. Group results with patients and controls revealed greater bilateral cortical activation for speech produced during speech entrainment compared with spontaneous speech at the junction of the anterior insula and Brodmann area 47, in Brodmann area 37, and unilaterally in the left middle temporal gyrus and the dorsal portion of Broca’s area. Probabilistic white matter tracts constructed for these regions in the normal subjects revealed a structural network connected via the corpus callosum and ventral fibres through the extreme capsule. Unilateral areas were connected via the arcuate fasciculus. In Experiment 3, all patients included in Experiment 1 participated in a 6-week treatment phase using speech entrainment to improve speech production
Blind speech separation system for humanoid robot with FastICA for audio filtering and separation

NASA Astrophysics Data System (ADS)

Budiharto, Widodo; Santoso Gunawan, Alexander Agung

2016-07-01

Nowadays, there are many developments in building intelligent humanoid robot, mainly in order to handle voice and image. In this research, we propose blind speech separation system using FastICA for audio filtering and separation that can be used in education or entertainment. Our main problem is to separate the multi speech sources and also to filter irrelevant noises. After speech separation step, the results will be integrated with our previous speech and face recognition system which is based on Bioloid GP robot and Raspberry Pi 2 as controller. The experimental results show the accuracy of our blind speech separation system is about 88% in command and query recognition cases.
Audio-visual speech experience with age influences perceived audio-visual asynchrony in speech.

PubMed

Alm, Magnus; Behne, Dawn

2013-10-01

Previous research indicates that perception of audio-visual (AV) synchrony changes in adulthood. Possible explanations for these age differences include a decline in hearing acuity, a decline in cognitive processing speed, and increased experience with AV binding. The current study aims to isolate the effect of AV experience by comparing synchrony judgments from 20 young adults (20 to 30 yrs) and 20 normal-hearing middle-aged adults (50 to 60 yrs), an age range for which a decline of cognitive processing speed is expected to be minimal. When presented with AV stop consonant syllables with asynchronies ranging from 440 ms audio-lead to 440 ms visual-lead, middle-aged adults showed significantly less tolerance for audio-lead than young adults. Middle-aged adults also showed a greater shift in their point of subjective simultaneity than young adults. Natural audio-lead asynchronies are arguably more predictable than natural visual-lead asynchronies, and this predictability may render audio-lead thresholds more prone to experience-related fine-tuning.
Applying Spatial Audio to Human Interfaces: 25 Years of NASA Experience

NASA Technical Reports Server (NTRS)

Begault, Durand R.; Wenzel, Elizabeth M.; Godfrey, Martine; Miller, Joel D.; Anderson, Mark R.

2010-01-01

From the perspective of human factors engineering, the inclusion of spatial audio within a human-machine interface is advantageous from several perspectives. Demonstrated benefits include the ability to monitor multiple streams of speech and non-speech warning tones using a cocktail party advantage, and for aurally-guided visual search. Other potential benefits include the spatial coordination and interaction of multimodal events, and evaluation of new communication technologies and alerting systems using virtual simulation. Many of these technologies were developed at NASA Ames Research Center, beginning in 1985. This paper reviews examples and describes the advantages of spatial sound in NASA-related technologies, including space operations, aeronautics, and search and rescue. The work has involved hardware and software development as well as basic and applied research.
Audio-visual speech perception in adult readers with dyslexia: an fMRI study.

PubMed

Rüsseler, Jascha; Ye, Zheng; Gerth, Ivonne; Szycik, Gregor R; Münte, Thomas F

2018-04-01

Developmental dyslexia is a specific deficit in reading and spelling that often persists into adulthood. In the present study, we used slow event-related fMRI and independent component analysis to identify brain networks involved in perception of audio-visual speech in a group of adult readers with dyslexia (RD) and a group of fluent readers (FR). Participants saw a video of a female speaker saying a disyllabic word. In the congruent condition, audio and video input were identical whereas in the incongruent condition, the two inputs differed. Participants had to respond to occasionally occurring animal names. The independent components analysis (ICA) identified several components that were differently modulated in FR and RD. Two of these components including fusiform gyrus and occipital gyrus showed less activation in RD compared to FR possibly indicating a deficit to extract face information that is needed to integrate auditory and visual information in natural speech perception. A further component centered on the superior temporal sulcus (STS) also exhibited less activation in RD compared to FR. This finding is corroborated in the univariate analysis that shows less activation in STS for RD compared to FR. These findings suggest a general impairment in recruitment of audiovisual processing areas in dyslexia during the perception of natural speech.
Evaluating Text-to-Speech Synthesizers

ERIC Educational Resources Information Center

Cardoso, Walcir; Smith, George; Fuentes, Cesar Garcia

2015-01-01

Text-To-Speech (TTS) synthesizers have piqued the interest of researchers for their potential to enhance the L2 acquisition of writing (Kirstein, 2006), vocabulary and reading (Proctor, Dalton, & Grisham, 2007) and pronunciation (Cardoso, Collins, & White, 2012; Soler-Urzua, 2011). Despite their proven effectiveness, there is a need for…
Steganalysis of recorded speech

NASA Astrophysics Data System (ADS)

Johnson, Micah K.; Lyu, Siwei; Farid, Hany

2005-03-01

Digital audio provides a suitable cover for high-throughput steganography. At 16 bits per sample and sampled at a rate of 44,100 Hz, digital audio has the bit-rate to support large messages. In addition, audio is often transient and unpredictable, facilitating the hiding of messages. Using an approach similar to our universal image steganalysis, we show that hidden messages alter the underlying statistics of audio signals. Our statistical model begins by building a linear basis that captures certain statistical properties of audio signals. A low-dimensional statistical feature vector is extracted from this basis representation and used by a non-linear support vector machine for classification. We show the efficacy of this approach on LSB embedding and Hide4PGP. While no explicit assumptions about the content of the audio are made, our technique has been developed and tested on high-quality recorded speech.
Do gender differences in audio-visual benefit and visual influence in audio-visual speech perception emerge with age?

PubMed Central

Alm, Magnus; Behne, Dawn

2015-01-01

Gender and age have been found to affect adults’ audio-visual (AV) speech perception. However, research on adult aging focuses on adults over 60 years, who have an increasing likelihood for cognitive and sensory decline, which may confound positive effects of age-related AV-experience and its interaction with gender. Observed age and gender differences in AV speech perception may also depend on measurement sensitivity and AV task difficulty. Consequently both AV benefit and visual influence were used to measure visual contribution for gender-balanced groups of young (20–30 years) and middle-aged adults (50–60 years) with task difficulty varied using AV syllables from different talkers in alternative auditory backgrounds. Females had better speech-reading performance than males. Whereas no gender differences in AV benefit or visual influence were observed for young adults, visually influenced responses were significantly greater for middle-aged females than middle-aged males. That speech-reading performance did not influence AV benefit may be explained by visual speech extraction and AV integration constituting independent abilities. Contrastingly, the gender difference in visually influenced responses in middle adulthood may reflect an experience-related shift in females’ general AV perceptual strategy. Although young females’ speech-reading proficiency may not readily contribute to greater visual influence, between young and middle-adulthood recurrent confirmation of the contribution of visual cues induced by speech-reading proficiency may gradually shift females AV perceptual strategy toward more visually dominated responses. PMID:26236274
pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis.

PubMed

Giannakopoulos, Theodoros

2015-01-01

Audio information plays a rather important role in the increasing digital content that is available today, resulting in a need for methodologies that automatically analyze such content: audio event recognition for home automations and surveillance systems, speech recognition, music information retrieval, multimodal analysis (e.g. audio-visual analysis of online videos for content-based recommendation), etc. This paper presents pyAudioAnalysis, an open-source Python library that provides a wide range of audio analysis procedures including: feature extraction, classification of audio signals, supervised and unsupervised segmentation and content visualization. pyAudioAnalysis is licensed under the Apache License and is available at GitHub (https://github.com/tyiannak/pyAudioAnalysis/). Here we present the theoretical background behind the wide range of the implemented methodologies, along with evaluation metrics for some of the methods. pyAudioAnalysis has been already used in several audio analysis research applications: smart-home functionalities through audio event detection, speech emotion recognition, depression classification based on audio-visual features, music segmentation, multimodal content-based movie recommendation and health applications (e.g. monitoring eating habits). The feedback provided from all these particular audio applications has led to practical enhancement of the library.
pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis

PubMed Central

Giannakopoulos, Theodoros

2015-01-01

Audio information plays a rather important role in the increasing digital content that is available today, resulting in a need for methodologies that automatically analyze such content: audio event recognition for home automations and surveillance systems, speech recognition, music information retrieval, multimodal analysis (e.g. audio-visual analysis of online videos for content-based recommendation), etc. This paper presents pyAudioAnalysis, an open-source Python library that provides a wide range of audio analysis procedures including: feature extraction, classification of audio signals, supervised and unsupervised segmentation and content visualization. pyAudioAnalysis is licensed under the Apache License and is available at GitHub (https://github.com/tyiannak/pyAudioAnalysis/). Here we present the theoretical background behind the wide range of the implemented methodologies, along with evaluation metrics for some of the methods. pyAudioAnalysis has been already used in several audio analysis research applications: smart-home functionalities through audio event detection, speech emotion recognition, depression classification based on audio-visual features, music segmentation, multimodal content-based movie recommendation and health applications (e.g. monitoring eating habits). The feedback provided from all these particular audio applications has led to practical enhancement of the library. PMID:26656189
Texting while driving: is speech-based text entry less risky than handheld text entry?

PubMed

He, J; Chaparro, A; Nguyen, B; Burge, R J; Crandall, J; Chaparro, B; Ni, R; Cao, S

2014-11-01

Research indicates that using a cell phone to talk or text while maneuvering a vehicle impairs driving performance. However, few published studies directly compare the distracting effects of texting using a hands-free (i.e., speech-based interface) versus handheld cell phone, which is an important issue for legislation, automotive interface design and driving safety training. This study compared the effect of speech-based versus handheld text entries on simulated driving performance by asking participants to perform a car following task while controlling the duration of a secondary text-entry task. Results showed that both speech-based and handheld text entries impaired driving performance relative to the drive-only condition by causing more variation in speed and lane position. Handheld text entry also increased the brake response time and increased variation in headway distance. Text entry using a speech-based cell phone was less detrimental to driving performance than handheld text entry. Nevertheless, the speech-based text entry task still significantly impaired driving compared to the drive-only condition. These results suggest that speech-based text entry disrupts driving, but reduces the level of performance interference compared to text entry with a handheld device. In addition, the difference in the distraction effect caused by speech-based and handheld text entry is not simply due to the difference in task duration. Copyright © 2014 Elsevier Ltd. All rights reserved.
Machine Learning Through Signature Trees. Applications to Human Speech.

ERIC Educational Resources Information Center

White, George M.

A signature tree is a binary decision tree used to classify unknown patterns. An attempt was made to develop a computer program for manipulating signature trees as a general research tool for exploring machine learning and pattern recognition. The program was applied to the problem of speech recognition to test its effectiveness for a specific…
Communicating headings and preview sentences in text and speech.

PubMed

Lorch, Robert F; Chen, Hung-Tao; Lemarié, Julie

2012-09-01

Two experiments tested the effects of preview sentences and headings on the quality of college students' outlines of informational texts. Experiment 1 found that performance was much better in the preview sentences condition than in a no-signals condition for both printed text and text-to-speech (TTS) audio rendering of the printed text. In contrast, performance in the headings condition was good for the printed text but poor for the auditory presentation because the TTS software failed to communicate nonverbal information carried by the visual headings. Experiment 2 compared outlining performance for five headings conditions during TTS presentation. Using a theoretical framework, "signaling available, relevant, accessible" (SARA) information, to provide an analysis of the information content of headings in the printed text, the manipulation of the headings systematically restored information that was omitted by the TTS application in Experiment 1. The result was that outlining performance improved to levels similar to the visual headings condition of Experiment 1. It is argued that SARA is a useful framework for guiding future development of TTS software for a wide variety of text signaling devices, not just headings.
Method and apparatus for obtaining complete speech signals for speech recognition applications

NASA Technical Reports Server (NTRS)

Abrash, Victor (Inventor); Cesari, Federico (Inventor); Franco, Horacio (Inventor); George, Christopher (Inventor); Zheng, Jing (Inventor)

2009-01-01

The present invention relates to a method and apparatus for obtaining complete speech signals for speech recognition applications. In one embodiment, the method continuously records an audio stream comprising a sequence of frames to a circular buffer. When a user command to commence or terminate speech recognition is received, the method obtains a number of frames of the audio stream occurring before or after the user command in order to identify an augmented audio signal for speech recognition processing. In further embodiments, the method analyzes the augmented audio signal in order to locate starting and ending speech endpoints that bound at least a portion of speech to be processed for recognition. At least one of the speech endpoints is located using a Hidden Markov Model.
Involvement of Right STS in Audio-Visual Integration for Affective Speech Demonstrated Using MEG

PubMed Central

Hagan, Cindy C.; Woods, Will; Johnson, Sam; Green, Gary G. R.; Young, Andrew W.

2013-01-01

Speech and emotion perception are dynamic processes in which it may be optimal to integrate synchronous signals emitted from different sources. Studies of audio-visual (AV) perception of neutrally expressed speech demonstrate supra-additive (i.e., where AV>[unimodal auditory+unimodal visual]) responses in left STS to crossmodal speech stimuli. However, emotions are often conveyed simultaneously with speech; through the voice in the form of speech prosody and through the face in the form of facial expression. Previous studies of AV nonverbal emotion integration showed a role for right (rather than left) STS. The current study therefore examined whether the integration of facial and prosodic signals of emotional speech is associated with supra-additive responses in left (cf. results for speech integration) or right (due to emotional content) STS. As emotional displays are sometimes difficult to interpret, we also examined whether supra-additive responses were affected by emotional incongruence (i.e., ambiguity). Using magnetoencephalography, we continuously recorded eighteen participants as they viewed and heard AV congruent emotional and AV incongruent emotional speech stimuli. Significant supra-additive responses were observed in right STS within the first 250 ms for emotionally incongruent and emotionally congruent AV speech stimuli, which further underscores the role of right STS in processing crossmodal emotive signals. PMID:23950977

Involvement of right STS in audio-visual integration for affective speech demonstrated using MEG.

PubMed

Hagan, Cindy C; Woods, Will; Johnson, Sam; Green, Gary G R; Young, Andrew W

2013-01-01

Speech and emotion perception are dynamic processes in which it may be optimal to integrate synchronous signals emitted from different sources. Studies of audio-visual (AV) perception of neutrally expressed speech demonstrate supra-additive (i.e., where AV>[unimodal auditory+unimodal visual]) responses in left STS to crossmodal speech stimuli. However, emotions are often conveyed simultaneously with speech; through the voice in the form of speech prosody and through the face in the form of facial expression. Previous studies of AV nonverbal emotion integration showed a role for right (rather than left) STS. The current study therefore examined whether the integration of facial and prosodic signals of emotional speech is associated with supra-additive responses in left (cf. results for speech integration) or right (due to emotional content) STS. As emotional displays are sometimes difficult to interpret, we also examined whether supra-additive responses were affected by emotional incongruence (i.e., ambiguity). Using magnetoencephalography, we continuously recorded eighteen participants as they viewed and heard AV congruent emotional and AV incongruent emotional speech stimuli. Significant supra-additive responses were observed in right STS within the first 250 ms for emotionally incongruent and emotionally congruent AV speech stimuli, which further underscores the role of right STS in processing crossmodal emotive signals.
Automatic Speech Acquisition and Recognition for Spacesuit Audio Systems

NASA Technical Reports Server (NTRS)

Ye, Sherry

2015-01-01

NASA has a widely recognized but unmet need for novel human-machine interface technologies that can facilitate communication during astronaut extravehicular activities (EVAs), when loud noises and strong reverberations inside spacesuits make communication challenging. WeVoice, Inc., has developed a multichannel signal-processing method for speech acquisition in noisy and reverberant environments that enables automatic speech recognition (ASR) technology inside spacesuits. The technology reduces noise by exploiting differences between the statistical nature of signals (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, ASR accuracy can be improved to the level at which crewmembers will find the speech interface useful. System components and features include beam forming/multichannel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, and ASR decoding. Arithmetic complexity models were developed and will help designers of real-time ASR systems select proper tasks when confronted with constraints in computational resources. In Phase I of the project, WeVoice validated the technology. The company further refined the technology in Phase II and developed a prototype for testing and use by suited astronauts.
Anthropomorphic Coding of Speech and Audio: A Model Inversion Approach

NASA Astrophysics Data System (ADS)

Feldbauer, Christian; Kubin, Gernot; Kleijn, W. Bastiaan

2005-12-01

Auditory modeling is a well-established methodology that provides insight into human perception and that facilitates the extraction of signal features that are most relevant to the listener. The aim of this paper is to provide a tutorial on perceptual speech and audio coding using an invertible auditory model. In this approach, the audio signal is converted into an auditory representation using an invertible auditory model. The auditory representation is quantized and coded. Upon decoding, it is then transformed back into the acoustic domain. This transformation converts a complex distortion criterion into a simple one, thus facilitating quantization with low complexity. We briefly review past work on auditory models and describe in more detail the components of our invertible model and its inversion procedure, that is, the method to reconstruct the signal from the output of the auditory model. We summarize attempts to use the auditory representation for low-bit-rate coding. Our approach also allows the exploitation of the inherent redundancy of the human auditory system for the purpose of multiple description (joint source-channel) coding.
Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems.

PubMed

Greene, Beth G; Logan, John S; Pisoni, David B

1986-03-01

We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered.
Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems

PubMed Central

GREENE, BETH G.; LOGAN, JOHN S.; PISONI, DAVID B.

2012-01-01

We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered. PMID:23225916
Paper-Based Textbooks with Audio Support for Print-Disabled Students.

PubMed

Fujiyoshi, Akio; Ohsawa, Akiko; Takaira, Takuya; Tani, Yoshiaki; Fujiyoshi, Mamoru; Ota, Yuko

2015-01-01

Utilizing invisible 2-dimensional codes and digital audio players with a 2-dimensional code scanner, we developed paper-based textbooks with audio support for students with print disabilities, called "multimodal textbooks." Multimodal textbooks can be read with the combination of the two modes: "reading printed text" and "listening to the speech of the text from a digital audio player with a 2-dimensional code scanner." Since multimodal textbooks look the same as regular textbooks and the price of a digital audio player is reasonable (about 30 euro), we think multimodal textbooks are suitable for students with print disabilities in ordinary classrooms.
Laboratory and in-flight experiments to evaluate 3-D audio display technology

NASA Technical Reports Server (NTRS)

Ericson, Mark; Mckinley, Richard; Kibbe, Marion; Francis, Daniel

1994-01-01

Laboratory and in-flight experiments were conducted to evaluate 3-D audio display technology for cockpit applications. A 3-D audio display generator was developed which digitally encodes naturally occurring direction information onto any audio signal and presents the binaural sound over headphones. The acoustic image is stabilized for head movement by use of an electromagnetic head-tracking device. In the laboratory, a 3-D audio display generator was used to spatially separate competing speech messages to improve the intelligibility of each message. Up to a 25 percent improvement in intelligibility was measured for spatially separated speech at high ambient noise levels (115 dB SPL). During the in-flight experiments, pilots reported that spatial separation of speech communications provided a noticeable improvement in intelligibility. The use of 3-D audio for target acquisition was also investigated. In the laboratory, 3-D audio enabled the acquisition of visual targets in about two seconds average response time at 17 degrees accuracy. During the in-flight experiments, pilots correctly identified ground targets 50, 75, and 100 percent of the time at separation angles of 12, 20, and 35 degrees, respectively. In general, pilot performance in the field with the 3-D audio display generator was as expected, based on data from laboratory experiments.
On the Acoustics of Emotion in Audio: What Speech, Music, and Sound have in Common

PubMed Central

Weninger, Felix; Eyben, Florian; Schuller, Björn W.; Mortillaro, Marcello; Scherer, Klaus R.

2013-01-01

Without doubt, there is emotional information in almost any kind of sound received by humans every day: be it the affective state of a person transmitted by means of speech; the emotion intended by a composer while writing a musical piece, or conveyed by a musician while performing it; or the affective state connected to an acoustic event occurring in the environment, in the soundtrack of a movie, or in a radio play. In the field of affective computing, there is currently some loosely connected research concerning either of these phenomena, but a holistic computational model of affect in sound is still lacking. In turn, for tomorrow’s pervasive technical systems, including affective companions and robots, it is expected to be highly beneficial to understand the affective dimensions of “the sound that something makes,” in order to evaluate the system’s auditory environment and its own audio output. This article aims at a first step toward a holistic computational model: starting from standard acoustic feature extraction schemes in the domains of speech, music, and sound analysis, we interpret the worth of individual features across these three domains, considering four audio databases with observer annotations in the arousal and valence dimensions. In the results, we find that by selection of appropriate descriptors, cross-domain arousal, and valence regression is feasible achieving significant correlations with the observer annotations of up to 0.78 for arousal (training on sound and testing on enacted speech) and 0.60 for valence (training on enacted speech and testing on music). The high degree of cross-domain consistency in encoding the two main dimensions of affect may be attributable to the co-evolution of speech and music from multimodal affect bursts, including the integration of nature sounds for expressive effects. PMID:23750144
On the Acoustics of Emotion in Audio: What Speech, Music, and Sound have in Common.

PubMed

Weninger, Felix; Eyben, Florian; Schuller, Björn W; Mortillaro, Marcello; Scherer, Klaus R

2013-01-01

WITHOUT DOUBT, THERE IS EMOTIONAL INFORMATION IN ALMOST ANY KIND OF SOUND RECEIVED BY HUMANS EVERY DAY: be it the affective state of a person transmitted by means of speech; the emotion intended by a composer while writing a musical piece, or conveyed by a musician while performing it; or the affective state connected to an acoustic event occurring in the environment, in the soundtrack of a movie, or in a radio play. In the field of affective computing, there is currently some loosely connected research concerning either of these phenomena, but a holistic computational model of affect in sound is still lacking. In turn, for tomorrow's pervasive technical systems, including affective companions and robots, it is expected to be highly beneficial to understand the affective dimensions of "the sound that something makes," in order to evaluate the system's auditory environment and its own audio output. This article aims at a first step toward a holistic computational model: starting from standard acoustic feature extraction schemes in the domains of speech, music, and sound analysis, we interpret the worth of individual features across these three domains, considering four audio databases with observer annotations in the arousal and valence dimensions. In the results, we find that by selection of appropriate descriptors, cross-domain arousal, and valence regression is feasible achieving significant correlations with the observer annotations of up to 0.78 for arousal (training on sound and testing on enacted speech) and 0.60 for valence (training on enacted speech and testing on music). The high degree of cross-domain consistency in encoding the two main dimensions of affect may be attributable to the co-evolution of speech and music from multimodal affect bursts, including the integration of nature sounds for expressive effects.
Audio in Courseware: Design Knowledge Issues.

ERIC Educational Resources Information Center

Aarntzen, Diana

1993-01-01

Considers issues that need to be addressed when incorporating audio in courseware design. Topics discussed include functions of audio in courseware; the relationship between auditive and visual information; learner characteristics in relation to audio; events of instruction; and audio characteristics, including interactivity and speech technology.…
Text as a Supplement to Speech in Young and Older Adults a)

PubMed Central

Krull, Vidya; Humes, Larry E.

2015-01-01

Objective The purpose of this experiment was to quantify the contribution of visual text to auditory speech recognition in background noise. Specifically, we tested the hypothesis that partially accurate visual text from an automatic speech recognizer could be used successfully to supplement speech understanding in difficult listening conditions in older adults, with normal or impaired hearing. Our working hypotheses were based on what is known regarding audiovisual speech perception in the elderly from speechreading literature. We hypothesized that: 1) combining auditory and visual text information will result in improved recognition accuracy compared to auditory or visual text information alone; 2) benefit from supplementing speech with visual text (auditory and visual enhancement) in young adults will be greater than that in older adults; and 3) individual differences in performance on perceptual measures would be associated with cognitive abilities. Design Fifteen young adults with normal hearing, fifteen older adults with normal hearing, and fifteen older adults with hearing loss participated in this study. All participants completed sentence recognition tasks in auditory-only, text-only, and combined auditory-text conditions. The auditory sentence stimuli were spectrally shaped to restore audibility for the older participants with impaired hearing. All participants also completed various cognitive measures, including measures of working memory, processing speed, verbal comprehension, perceptual and cognitive speed, processing efficiency, inhibition, and the ability to form wholes from parts. Group effects were examined for each of the perceptual and cognitive measures. Audiovisual benefit was calculated relative to performance on auditory-only and visual-text only conditions. Finally, the relationship between perceptual measures and other independent measures were examined using principal-component factor analyses, followed by regression analyses. Results
Audio Classification in Speech and Music: A Comparison between a Statistical and a Neural Approach

NASA Astrophysics Data System (ADS)

Bugatti, Alessandro; Flammini, Alessandra; Migliorati, Pierangelo

2002-12-01

We focus the attention on the problem of audio classification in speech and music for multimedia applications. In particular, we present a comparison between two different techniques for speech/music discrimination. The first method is based on Zero crossing rate and Bayesian classification. It is very simple from a computational point of view, and gives good results in case of pure music or speech. The simulation results show that some performance degradation arises when the music segment contains also some speech superimposed on music, or strong rhythmic components. To overcome these problems, we propose a second method, that uses more features, and is based on neural networks (specifically a multi-layer Perceptron). In this case we obtain better performance, at the expense of a limited growth in the computational complexity. In practice, the proposed neural network is simple to be implemented if a suitable polynomial is used as the activation function, and a real-time implementation is possible even if low-cost embedded systems are used.
Use of Computer Speech Technologies To Enhance Learning.

ERIC Educational Resources Information Center

Ferrell, Joe

1999-01-01

Discusses the design of an innovative learning system that uses new technologies for the man-machine interface, incorporating a combination of Automatic Speech Recognition (ASR) and Text To Speech (TTS) synthesis. Highlights include using speech technologies to mimic the attributes of the ideal tutor and design features. (AEF)
Advances in audio source seperation and multisource audio content retrieval

NASA Astrophysics Data System (ADS)

Vincent, Emmanuel

2012-06-01

Audio source separation aims to extract the signals of individual sound sources from a given recording. In this paper, we review three recent advances which improve the robustness of source separation in real-world challenging scenarios and enable its use for multisource content retrieval tasks, such as automatic speech recognition (ASR) or acoustic event detection (AED) in noisy environments. We present a Flexible Audio Source Separation Toolkit (FASST) and discuss its advantages compared to earlier approaches such as independent component analysis (ICA) and sparse component analysis (SCA). We explain how cues as diverse as harmonicity, spectral envelope, temporal fine structure or spatial location can be jointly exploited by this toolkit. We subsequently present the uncertainty decoding (UD) framework for the integration of audio source separation and audio content retrieval. We show how the uncertainty about the separated source signals can be accurately estimated and propagated to the features. Finally, we explain how this uncertainty can be efficiently exploited by a classifier, both at the training and the decoding stage. We illustrate the resulting performance improvements in terms of speech separation quality and speaker recognition accuracy.
Speech endpoint detection with non-language speech sounds for generic speech processing applications

NASA Astrophysics Data System (ADS)

McClain, Matthew; Romanowski, Brian

2009-05-01

Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known apriori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden-Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detection certain types of NLSS, such as breaths and clicks, while detection of other types of NLSS such as filled pauses will require future research.
Fuzzy Logic-Based Audio Pattern Recognition

NASA Astrophysics Data System (ADS)

Malcangi, M.

2008-11-01

Audio and audio-pattern recognition is becoming one of the most important technologies to automatically control embedded systems. Fuzzy logic may be the most important enabling methodology due to its ability to rapidly and economically model such application. An audio and audio-pattern recognition engine based on fuzzy logic has been developed for use in very low-cost and deeply embedded systems to automate human-to-machine and machine-to-machine interaction. This engine consists of simple digital signal-processing algorithms for feature extraction and normalization, and a set of pattern-recognition rules manually tuned or automatically tuned by a self-learning process.
Comparing Learning Gains: Audio Versus Text-based Instructor Communication in a Blended Online Learning Environment

NASA Astrophysics Data System (ADS)

Shimizu, Dominique

Though blended course audio feedback has been associated with several measures of course satisfaction at the postsecondary and graduate levels compared to text feedback, it may take longer to prepare and positive results are largely unverified in K-12 literature. The purpose of this quantitative study was to investigate the time investment and learning impact of audio communications with 228 secondary students in a blended online learning biology unit at a central Florida public high school. A short, individualized audio message regarding the student's progress was given to each student in the audio group; similar text-based messages were given to each student in the text-based group on the same schedule; a control got no feedback. A pretest and posttest were employed to measure learning gains in the three groups. To compare the learning gains in two types of feedback with each other and to no feedback, a controlled, randomized, experimental design was implemented. In addition, the creation and posting of audio and text feedback communications were timed in order to assess whether audio feedback took longer to produce than text only feedback. While audio feedback communications did take longer to create and post, there was no difference between learning gains as measured by posttest scores when student received audio, text-based, or no feedback. Future studies using a similar randomized, controlled experimental design are recommended to verify these results and test whether the trend holds in a broader range of subjects, over different time frames, and using a variety of assessment types to measure student learning.
Effects of Text, Audio and Learner Control on Text-Sound Association and Cognitive Load of EFL Learners

ERIC Educational Resources Information Center

Enciso Bernal, Ana Maria

2014-01-01

This study investigated the effects of concurrent audio and equivalent onscreen text on the ability of learners of English as a foreign language (EFL) to form associations between textual and aural forms of target vocabulary words. The study also looked at the effects of learner control over an audio sequence on the association of textual and…
Development and testing of an audio forensic software for enhancing speech signals masked by loud music

NASA Astrophysics Data System (ADS)

Dobre, Robert A.; Negrescu, Cristian; Stanomir, Dumitru

2016-12-01

In many situations audio recordings can decide the fate of a trial when accepted as evidence. But until they can be taken into account they must be authenticated at first, but also the quality of the targeted content (speech in most cases) must be good enough to remove any doubt. In this scope two main directions of multimedia forensics come into play: content authentication and noise reduction. This paper presents an application that is included in the latter. If someone would like to conceal their conversation, the easiest way to do it would be to turn loud the nearest audio system. In this situation, if a microphone was placed close by, the recorded signal would be apparently useless because the speech signal would be masked by the loud music signal. The paper proposes an adaptive filters based solution to remove the musical content from a previously described signal mixture in order to recover the masked vocal signal. Two adaptive filtering algorithms were tested in the proposed solution: the Normalised Least Mean Squares (NLMS) and Recursive Least Squares (RLS). Their performances in the described situation were evaluated using Simulink, compared and included in the paper.
Vowel Imagery Decoding toward Silent Speech BCI Using Extreme Learning Machine with Electroencephalogram

PubMed Central

Kim, Jongin; Park, Hyeong-jun

2016-01-01

The purpose of this study is to classify EEG data on imagined speech in a single trial. We recorded EEG data while five subjects imagined different vowels, /a/, /e/, /i/, /o/, and /u/. We divided each single trial dataset into thirty segments and extracted features (mean, variance, standard deviation, and skewness) from all segments. To reduce the dimension of the feature vector, we applied a feature selection algorithm based on the sparse regression model. These features were classified using a support vector machine with a radial basis function kernel, an extreme learning machine, and two variants of an extreme learning machine with different kernels. Because each single trial consisted of thirty segments, our algorithm decided the label of the single trial by selecting the most frequent output among the outputs of the thirty segments. As a result, we observed that the extreme learning machine and its variants achieved better classification rates than the support vector machine with a radial basis function kernel and linear discrimination analysis. Thus, our results suggested that EEG responses to imagined speech could be successfully classified in a single trial using an extreme learning machine with a radial basis function and linear kernel. This study with classification of imagined speech might contribute to the development of silent speech BCI systems. PMID:28097128

Contributions of local speech encoding and functional connectivity to audio-visual speech perception

PubMed Central

Giordano, Bruno L; Ince, Robin A A; Gross, Joachim; Schyns, Philippe G; Panzeri, Stefano; Kayser, Christoph

2017-01-01

Seeing a speaker’s face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data obtained while human participants listened to speech of varying acoustic SNR and visual context. During high acoustic SNR speech encoding by temporally entrained brain activity was strong in temporal and inferior frontal cortex, while during low SNR strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioral benefit arising from seeing the speaker’s face was not predicted by changes in local encoding but rather by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments. DOI: http://dx.doi.org/10.7554/eLife.24763.001 PMID:28590903
Audio-visual onset differences are used to determine syllable identity for ambiguous audio-visual stimulus pairs

PubMed Central

ten Oever, Sanne; Sack, Alexander T.; Wheat, Katherine L.; Bien, Nina; van Atteveldt, Nienke

2013-01-01

Content and temporal cues have been shown to interact during audio-visual (AV) speech identification. Typically, the most reliable unimodal cue is used more strongly to identify specific speech features; however, visual cues are only used if the AV stimuli are presented within a certain temporal window of integration (TWI). This suggests that temporal cues denote whether unimodal stimuli belong together, that is, whether they should be integrated. It is not known whether temporal cues also provide information about the identity of a syllable. Since spoken syllables have naturally varying AV onset asynchronies, we hypothesize that for suboptimal AV cues presented within the TWI, information about the natural AV onset differences can aid in speech identification. To test this, we presented low-intensity auditory syllables concurrently with visual speech signals, and varied the stimulus onset asynchronies (SOA) of the AV pair, while participants were instructed to identify the auditory syllables. We revealed that specific speech features (e.g., voicing) were identified by relying primarily on one modality (e.g., auditory). Additionally, we showed a wide window in which visual information influenced auditory perception, that seemed even wider for congruent stimulus pairs. Finally, we found a specific response pattern across the SOA range for syllables that were not reliably identified by the unimodal cues, which we explained as the result of the use of natural onset differences between AV speech signals. This indicates that temporal cues not only provide information about the temporal integration of AV stimuli, but additionally convey information about the identity of AV pairs. These results provide a detailed behavioral basis for further neuro-imaging and stimulation studies to unravel the neurofunctional mechanisms of the audio-visual-temporal interplay within speech perception. PMID:23805110
Audio-visual onset differences are used to determine syllable identity for ambiguous audio-visual stimulus pairs.

PubMed

Ten Oever, Sanne; Sack, Alexander T; Wheat, Katherine L; Bien, Nina; van Atteveldt, Nienke

2013-01-01

Content and temporal cues have been shown to interact during audio-visual (AV) speech identification. Typically, the most reliable unimodal cue is used more strongly to identify specific speech features; however, visual cues are only used if the AV stimuli are presented within a certain temporal window of integration (TWI). This suggests that temporal cues denote whether unimodal stimuli belong together, that is, whether they should be integrated. It is not known whether temporal cues also provide information about the identity of a syllable. Since spoken syllables have naturally varying AV onset asynchronies, we hypothesize that for suboptimal AV cues presented within the TWI, information about the natural AV onset differences can aid in speech identification. To test this, we presented low-intensity auditory syllables concurrently with visual speech signals, and varied the stimulus onset asynchronies (SOA) of the AV pair, while participants were instructed to identify the auditory syllables. We revealed that specific speech features (e.g., voicing) were identified by relying primarily on one modality (e.g., auditory). Additionally, we showed a wide window in which visual information influenced auditory perception, that seemed even wider for congruent stimulus pairs. Finally, we found a specific response pattern across the SOA range for syllables that were not reliably identified by the unimodal cues, which we explained as the result of the use of natural onset differences between AV speech signals. This indicates that temporal cues not only provide information about the temporal integration of AV stimuli, but additionally convey information about the identity of AV pairs. These results provide a detailed behavioral basis for further neuro-imaging and stimulation studies to unravel the neurofunctional mechanisms of the audio-visual-temporal interplay within speech perception.
Audio Visual Integration with Competing Sources in the Framework of Audio Visual Speech Scene Analysis.

PubMed

Ganesh, Attigodu Chandrashekara; Berthommier, Frédéric; Schwartz, Jean-Luc

2016-01-01

We introduce "Audio-Visual Speech Scene Analysis" (AVSSA) as an extension of the two-stage Auditory Scene Analysis model towards audiovisual scenes made of mixtures of speakers. AVSSA assumes that a coherence index between the auditory and the visual input is computed prior to audiovisual fusion, enabling to determine whether the sensory inputs should be bound together. Previous experiments on the modulation of the McGurk effect by audiovisual coherent vs. incoherent contexts presented before the McGurk target have provided experimental evidence supporting AVSSA. Indeed, incoherent contexts appear to decrease the McGurk effect, suggesting that they produce lower audiovisual coherence hence less audiovisual fusion. The present experiments extend the AVSSA paradigm by creating contexts made of competing audiovisual sources and measuring their effect on McGurk targets. The competing audiovisual sources have respectively a high and a low audiovisual coherence (that is, large vs. small audiovisual comodulations in time). The first experiment involves contexts made of two auditory sources and one video source associated to either the first or the second audio source. It appears that the McGurk effect is smaller after the context made of the visual source associated to the auditory source with less audiovisual coherence. In the second experiment with the same stimuli, the participants are asked to attend to either one or the other source. The data show that the modulation of fusion depends on the attentional focus. Altogether, these two experiments shed light on audiovisual binding, the AVSSA process and the role of attention.
Cortical Integration of Audio-Visual Information

PubMed Central

Vander Wyk, Brent C.; Ramsay, Gordon J.; Hudac, Caitlin M.; Jones, Warren; Lin, David; Klin, Ami; Lee, Su Mei; Pelphrey, Kevin A.

2013-01-01

We investigated the neural basis of audio-visual processing in speech and non-speech stimuli. Physically identical auditory stimuli (speech and sinusoidal tones) and visual stimuli (animated circles and ellipses) were used in this fMRI experiment. Relative to unimodal stimuli, each of the multimodal conjunctions showed increased activation in largely non-overlapping areas. The conjunction of Ellipse and Speech, which most resembles naturalistic audiovisual speech, showed higher activation in the right inferior frontal gyrus, fusiform gyri, left posterior superior temporal sulcus, and lateral occipital cortex. The conjunction of Circle and Tone, an arbitrary audio-visual pairing with no speech association, activated middle temporal gyri and lateral occipital cortex. The conjunction of Circle and Speech showed activation in lateral occipital cortex, and the conjunction of Ellipse and Tone did not show increased activation relative to unimodal stimuli. Further analysis revealed that middle temporal regions, although identified as multimodal only in the Circle-Tone condition, were more strongly active to Ellipse-Speech or Circle-Speech, but regions that were identified as multimodal for Ellipse-Speech were always strongest for Ellipse-Speech. Our results suggest that combinations of auditory and visual stimuli may together be processed by different cortical networks, depending on the extent to which speech or non-speech percepts are evoked. PMID:20709442
Text-to-audiovisual speech synthesizer for children with learning disabilities.

PubMed

Mendi, Engin; Bayrak, Coskun

2013-01-01

Learning disabilities affect the ability of children to learn, despite their having normal intelligence. Assistive tools can highly increase functional capabilities of children with learning disorders such as writing, reading, or listening. In this article, we describe a text-to-audiovisual synthesizer that can serve as an assistive tool for such children. The system automatically converts an input text to audiovisual speech, providing synchronization of the head, eye, and lip movements of the three-dimensional face model with appropriate facial expressions and word flow of the text. The proposed system can enhance speech perception and help children having learning deficits to improve their chances of success.
Auditory Support in Linguistically Diverse Classrooms: Factors Related to Bilingual Text-to-Speech Use

ERIC Educational Resources Information Center

Van Laere, E.; Braak, J.

2017-01-01

Text-to-speech technology can act as an important support tool in computer-based learning environments (CBLEs) as it provides auditory input, next to on-screen text. Particularly for students who use a language at home other than the language of instruction (LOI) applied at school, text-to-speech can be useful. The CBLE E-Validiv offers content in…
Highlight summarization in golf videos using audio signals

NASA Astrophysics Data System (ADS)

Kim, Hyoung-Gook; Kim, Jin Young

2008-01-01

In this paper, we present an automatic summarization of highlights in golf videos based on audio information alone without video information. The proposed highlight summarization system is carried out based on semantic audio segmentation and detection on action units from audio signals. Studio speech, field speech, music, and applause are segmented by means of sound classification. Swing is detected by the methods of impulse onset detection. Sounds like swing and applause form a complete action unit, while studio speech and music parts are used to anchor the program structure. With the advantage of highly precise detection of applause, highlights are extracted effectively. Our experimental results obtain high classification precision on 18 golf games. It proves that the proposed system is very effective and computationally efficient to apply the technology to embedded consumer electronic devices.
Audio-Tutorial Instruction in Medicine.

ERIC Educational Resources Information Center

Boyle, Gloria J.; Herrick, Merlyn C.

This progress report concerns an audio-tutorial approach used at the University of Missouri-Columbia School of Medicine. Instructional techniques such as slide-tape presentations, compressed speech audio tapes, computer-assisted instruction (CAI), motion pictures, television, microfiche, and graphic and printed materials have been implemented,…
Audio Frequency Analysis in Mobile Phones

ERIC Educational Resources Information Center

Aguilar, Horacio Munguía

2016-01-01

A new experiment using mobile phones is proposed in which its audio frequency response is analyzed using the audio port for inputting external signal and getting a measurable output. This experiment shows how the limited audio bandwidth used in mobile telephony is the main cause of the poor speech quality in this service. A brief discussion is…
Speech recognition technology: an outlook for human-to-machine interaction.

PubMed

Erdel, T; Crooks, S

2000-01-01

Speech recognition, as an enabling technology in healthcare-systems computing, is a topic that has been discussed for quite some time, but is just now coming to fruition. Traditionally, speech-recognition software has been constrained by hardware, but improved processors and increased memory capacities are starting to remove some of these limitations. With these barriers removed, companies that create software for the healthcare setting have the opportunity to write more successful applications. Among the criticisms of speech-recognition applications are the high rates of error and steep training curves. However, even in the face of such negative perceptions, there remains significant opportunities for speech recognition to allow healthcare providers and, more specifically, physicians, to work more efficiently and ultimately spend more time with their patients and less time completing necessary documentation. This article will identify opportunities for inclusion of speech-recognition technology in the healthcare setting and examine major categories of speech-recognition software--continuous speech recognition, command and control, and text-to-speech. We will discuss the advantages and disadvantages of each area, the limitations of the software today, and how future trends might affect them.
Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion.

PubMed

Gebru, Israel D; Ba, Sileye; Li, Xiaofei; Horaud, Radu

2018-05-01

Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons. The main advantage of this method over previous work is that it processes in a principled way speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process, executed at each time slice, and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset, that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue, is introduced. The proposed method is thoroughly tested and benchmarked with respect to several state-of-the art diarization algorithms.
Using speech recognition to enhance the Tongue Drive System functionality in computer access.

PubMed

Huo, Xueliang; Ghovanloo, Maysam

2011-01-01

Tongue Drive System (TDS) is a wireless tongue operated assistive technology (AT), which can enable people with severe physical disabilities to access computers and drive powered wheelchairs using their volitional tongue movements. TDS offers six discrete commands, simultaneously available to the users, for pointing and typing as a substitute for mouse and keyboard in computer access, respectively. To enhance the TDS performance in typing, we have added a microphone, an audio codec, and a wireless audio link to its readily available 3-axial magnetic sensor array, and combined it with a commercially available speech recognition software, the Dragon Naturally Speaking, which is regarded as one of the most efficient ways for text entry. Our preliminary evaluations indicate that the combined TDS and speech recognition technologies can provide end users with significantly higher performance than using each technology alone, particularly in completing tasks that require both pointing and text entry, such as web surfing.
Using Speech Recognition to Enhance the Tongue Drive System Functionality in Computer Access

PubMed Central

Huo, Xueliang; Ghovanloo, Maysam

2013-01-01

Tongue Drive System (TDS) is a wireless tongue operated assistive technology (AT), which can enable people with severe physical disabilities to access computers and drive powered wheelchairs using their volitional tongue movements. TDS offers six discrete commands, simultaneously available to the users, for pointing and typing as a substitute for mouse and keyboard in computer access, respectively. To enhance the TDS performance in typing, we have added a microphone, an audio codec, and a wireless audio link to its readily available 3-axial magnetic sensor array, and combined it with a commercially available speech recognition software, the Dragon Naturally Speaking, which is regarded as one of the most efficient ways for text entry. Our preliminary evaluations indicate that the combined TDS and speech recognition technologies can provide end users with significantly higher performance than using each technology alone, particularly in completing tasks that require both pointing and text entry, such as web surfing. PMID:22255801
Learning diagnostic models using speech and language measures.

PubMed

Peintner, Bart; Jarrold, William; Vergyriy, Dimitra; Richey, Colleen; Tempini, Maria Luisa Gorno; Ogar, Jennifer

2008-01-01

We describe results that show the effectiveness of machine learning in the automatic diagnosis of certain neurodegenerative diseases, several of which alter speech and language production. We analyzed audio from 9 control subjects and 30 patients diagnosed with one of three subtypes of Frontotemporal Lobar Degeneration. From this data, we extracted features of the audio signal and the words the patient used, which were obtained using our automated transcription technologies. We then automatically learned models that predict the diagnosis of the patient using these features. Our results show that learned models over these features predict diagnosis with accuracy significantly better than random. Future studies using higher quality recordings will likely improve these results.
"Look What I Did!": Student Conferences with Text-to-Speech Software

ERIC Educational Resources Information Center

Young, Chase; Stover, Katie

2014-01-01

The authors describe a strategy that empowers students to edit and revise their own writing. Students input their writing in to text-to-speech software that rereads the text aloud. While listening, students make necessary revisions and edits.
A centralized audio presentation manager

DOE Office of Scientific and Technical Information (OSTI.GOV)

Papp, A.L. III; Blattner, M.M.

1994-05-16

The centralized audio presentation manager addresses the problems which occur when multiple programs running simultaneously attempt to use the audio output of a computer system. Time dependence of sound means that certain auditory messages must be scheduled simultaneously, which can lead to perceptual problems due to psychoacoustic phenomena. Furthermore, the combination of speech and nonspeech audio is examined; each presents its own problems of perceptibility in an acoustic environment composed of multiple auditory streams. The centralized audio presentation manager receives abstract parameterized message requests from the currently running programs, and attempts to create and present a sonic representation in themore » most perceptible manner through the use of a theoretically and empirically designed rule set.« less
Structuring Broadcast Audio for Information Access

NASA Astrophysics Data System (ADS)

Gauvain, Jean-Luc; Lamel, Lori

2003-12-01

One rapidly expanding application area for state-of-the-art speech recognition technology is the automatic processing of broadcast audiovisual data for information access. Since much of the linguistic information is found in the audio channel, speech recognition is a key enabling technology which, when combined with information retrieval techniques, can be used for searching large audiovisual document collections. Audio indexing must take into account the specificities of audio data such as needing to deal with the continuous data stream and an imperfect word transcription. Other important considerations are dealing with language specificities and facilitating language portability. At Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), broadcast news transcription systems have been developed for seven languages: English, French, German, Mandarin, Portuguese, Spanish, and Arabic. The transcription systems have been integrated into prototype demonstrators for several application areas such as audio data mining, structuring audiovisual archives, selective dissemination of information, and topic tracking for media monitoring. As examples, this paper addresses the spoken document retrieval and topic tracking tasks.
Audio-Visual and Meaningful Semantic Context Enhancements in Older and Younger Adults.

PubMed

Smayda, Kirsten E; Van Engen, Kristin J; Maddox, W Todd; Chandrasekaran, Bharath

2016-01-01

Speech perception is critical to everyday life. Oftentimes noise can degrade a speech signal; however, because of the cues available to the listener, such as visual and semantic cues, noise rarely prevents conversations from continuing. The interaction of visual and semantic cues in aiding speech perception has been studied in young adults, but the extent to which these two cues interact for older adults has not been studied. To investigate the effect of visual and semantic cues on speech perception in older and younger adults, we recruited forty-five young adults (ages 18-35) and thirty-three older adults (ages 60-90) to participate in a speech perception task. Participants were presented with semantically meaningful and anomalous sentences in audio-only and audio-visual conditions. We hypothesized that young adults would outperform older adults across SNRs, modalities, and semantic contexts. In addition, we hypothesized that both young and older adults would receive a greater benefit from a semantically meaningful context in the audio-visual relative to audio-only modality. We predicted that young adults would receive greater visual benefit in semantically meaningful contexts relative to anomalous contexts. However, we predicted that older adults could receive a greater visual benefit in either semantically meaningful or anomalous contexts. Results suggested that in the most supportive context, that is, semantically meaningful sentences presented in the audiovisual modality, older adults performed similarly to young adults. In addition, both groups received the same amount of visual and meaningful benefit. Lastly, across groups, a semantically meaningful context provided more benefit in the audio-visual modality relative to the audio-only modality, and the presence of visual cues provided more benefit in semantically meaningful contexts relative to anomalous contexts. These results suggest that older adults can perceive speech as well as younger adults when both
Audio-Visual and Meaningful Semantic Context Enhancements in Older and Younger Adults

PubMed Central

Smayda, Kirsten E.; Van Engen, Kristin J.; Maddox, W. Todd; Chandrasekaran, Bharath

2016-01-01

Speech perception is critical to everyday life. Oftentimes noise can degrade a speech signal; however, because of the cues available to the listener, such as visual and semantic cues, noise rarely prevents conversations from continuing. The interaction of visual and semantic cues in aiding speech perception has been studied in young adults, but the extent to which these two cues interact for older adults has not been studied. To investigate the effect of visual and semantic cues on speech perception in older and younger adults, we recruited forty-five young adults (ages 18–35) and thirty-three older adults (ages 60–90) to participate in a speech perception task. Participants were presented with semantically meaningful and anomalous sentences in audio-only and audio-visual conditions. We hypothesized that young adults would outperform older adults across SNRs, modalities, and semantic contexts. In addition, we hypothesized that both young and older adults would receive a greater benefit from a semantically meaningful context in the audio-visual relative to audio-only modality. We predicted that young adults would receive greater visual benefit in semantically meaningful contexts relative to anomalous contexts. However, we predicted that older adults could receive a greater visual benefit in either semantically meaningful or anomalous contexts. Results suggested that in the most supportive context, that is, semantically meaningful sentences presented in the audiovisual modality, older adults performed similarly to young adults. In addition, both groups received the same amount of visual and meaningful benefit. Lastly, across groups, a semantically meaningful context provided more benefit in the audio-visual modality relative to the audio-only modality, and the presence of visual cues provided more benefit in semantically meaningful contexts relative to anomalous contexts. These results suggest that older adults can perceive speech as well as younger adults when

Detecting Abnormal Word Utterances in Children With Autism Spectrum Disorders: Machine-Learning-Based Voice Analysis Versus Speech Therapists.

PubMed

Nakai, Yasushi; Takiguchi, Tetsuya; Matsui, Gakuyo; Yamaoka, Noriko; Takada, Satoshi

2017-10-01

Abnormal prosody is often evident in the voice intonations of individuals with autism spectrum disorders. We compared a machine-learning-based voice analysis with human hearing judgments made by 10 speech therapists for classifying children with autism spectrum disorders ( n = 30) and typical development ( n = 51). Using stimuli limited to single-word utterances, machine-learning-based voice analysis was superior to speech therapist judgments. There was a significantly higher true-positive than false-negative rate for machine-learning-based voice analysis but not for speech therapists. Results are discussed in terms of some artificiality of clinician judgments based on single-word utterances, and the objectivity machine-learning-based voice analysis adds to judging abnormal prosody.
Study of an Audio Playback Machine Storage, Distribution, and Repair System. Options for Machine Operation. Study II, Part 1, Phase 2, Final Report.

ERIC Educational Resources Information Center

ManTech Technical Services Corp., Fairfax, VA.

This report presents the results of a management study of audio playback equipment operations conducted by the National Library Service, Library of Congress, its associated network of state and local machine lending agencies (MLA), and other parties that play a role in current operations. The objectives were to document current operations,…
Synchronized and noise-robust audio recordings during realtime magnetic resonance imaging scans.

PubMed

Bresch, Erik; Nielsen, Jon; Nayak, Krishna; Narayanan, Shrikanth

2006-10-01

This letter describes a data acquisition setup for recording, and processing, running speech from a person in a magnetic resonance imaging (MRI) scanner. The main focus is on ensuring synchronicity between image and audio acquisition, and in obtaining good signal to noise ratio to facilitate further speech analysis and modeling. A field-programmable gate array based hardware design for synchronizing the scanner image acquisition to other external data such as audio is described. The audio setup itself features two fiber optical microphones and a noise-canceling filter. Two noise cancellation methods are described including a novel approach using a pulse sequence specific model of the gradient noise of the MRI scanner. The setup is useful for scientific speech production studies. Sample results of speech and singing data acquired and processed using the proposed method are given.
Brain-to-text: decoding spoken phrases from phone representations in the brain.

PubMed

Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja

2015-01-01

It has long been speculated whether communication between humans and machines based on natural speech related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings.Specifically, we implemented a system, which we call Brain-To-Text that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech.
Brain-to-text: decoding spoken phrases from phone representations in the brain

PubMed Central

Herff, Christian; Heger, Dominic; de Pesters, Adriana; Telaar, Dominic; Brunner, Peter; Schalk, Gerwin; Schultz, Tanja

2015-01-01

It has long been speculated whether communication between humans and machines based on natural speech related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings.Specifically, we implemented a system, which we call Brain-To-Text that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech. PMID:26124702
(abstract) Synthesis of Speaker Facial Movements to Match Selected Speech Sequences

NASA Technical Reports Server (NTRS)

Scott, Kenneth C.

1994-01-01

We are developing a system for synthesizing image sequences the simulate the facial motion of a speaker. To perform this synthesis, we are pursuing two major areas of effort. We are developing the necessary computer graphics technology to synthesize a realistic image sequence of a person speaking selected speech sequences. Next, we are developing a model that expresses the relation between spoken phonemes and face/mouth shape. A subject is video taped speaking an arbitrary text that contains expression of the full list of desired database phonemes. The subject is video taped from the front speaking normally, recording both audio and video detail simultaneously. Using the audio track, we identify the specific video frames on the tape relating to each spoken phoneme. From this range we digitize the video frame which represents the extreme of mouth motion/shape. Thus, we construct a database of images of face/mouth shape related to spoken phonemes. A selected audio speech sequence is recorded which is the basis for synthesizing a matching video sequence; the speaker need not be the same as used for constructing the database. The audio sequence is analyzed to determine the spoken phoneme sequence and the relative timing of the enunciation of those phonemes. Synthesizing an image sequence corresponding to the spoken phoneme sequence is accomplished using a graphics technique known as morphing. Image sequence keyframes necessary for this processing are based on the spoken phoneme sequence and timing. We have been successful in synthesizing the facial motion of a native English speaker for a small set of arbitrary speech segments. Our future work will focus on advancement of the face shape/phoneme model and independent control of facial features.
Implementing Audio-CASI on Windows’ Platforms

PubMed Central

Cooley, Philip C.; Turner, Charles F.

2011-01-01

Audio computer-assisted self interviewing (Audio-CASI) technologies have recently been shown to provide important and sometimes dramatic improvements in the quality of survey measurements. This is particularly true for measurements requiring respondents to divulge highly sensitive information such as their sexual, drug use, or other sensitive behaviors. However, DOS-based Audio-CASI systems that were designed and adopted in the early 1990s have important limitations. Most salient is the poor control they provide for manipulating the video presentation of survey questions. This article reports our experiences adapting Audio-CASI to Microsoft Windows 3.1 and Windows 95 platforms. Overall, our Windows-based system provided the desired control over video presentation and afforded other advantages including compatibility with a much wider array of audio devices than our DOS-based Audio-CASI technologies. These advantages came at the cost of increased system requirements --including the need for both more RAM and larger hard disks. While these costs will be an issue for organizations converting large inventories of PCS to Windows Audio-CASI today, this will not be a serious constraint for organizations and individuals with small inventories of machines to upgrade or those purchasing new machines today. PMID:22081743
Synchronized and noise-robust audio recordings during realtime magnetic resonance imaging scans (L)

PubMed Central

Bresch, Erik; Nielsen, Jon; Nayak, Krishna; Narayanan, Shrikanth

2007-01-01

This letter describes a data acquisition setup for recording, and processing, running speech from a person in a magnetic resonance imaging (MRI) scanner. The main focus is on ensuring synchronicity between image and audio acquisition, and in obtaining good signal to noise ratio to facilitate further speech analysis and modeling. A field-programmable gate array based hardware design for synchronizing the scanner image acquisition to other external data such as audio is described. The audio setup itself features two fiber optical microphones and a noise-canceling filter. Two noise cancellation methods are described including a novel approach using a pulse sequence specific model of the gradient noise of the MRI scanner. The setup is useful for scientific speech production studies. Sample results of speech and singing data acquired and processed using the proposed method are given. PMID:17069275
Influences of selective adaptation on perception of audiovisual speech

PubMed Central

Dias, James W.; Cook, Theresa C.; Rosenblum, Lawrence D.

2016-01-01

Research suggests that selective adaptation in speech is a low-level process dependent on sensory-specific information shared between the adaptor and test-stimuli. However, previous research has only examined how adaptors shift perception of unimodal test stimuli, either auditory or visual. In the current series of experiments, we investigated whether adaptation to cross-sensory phonetic information can influence perception of integrated audio-visual phonetic information. We examined how selective adaptation to audio and visual adaptors shift perception of speech along an audiovisual test continuum. This test-continuum consisted of nine audio-/ba/-visual-/va/ stimuli, ranging in visual clarity of the mouth. When the mouth was clearly visible, perceivers “heard” the audio-visual stimulus as an integrated “va” percept 93.7% of the time (e.g., McGurk & MacDonald, 1976). As visibility of the mouth became less clear across the nine-item continuum, the audio-visual “va” percept weakened, resulting in a continuum ranging in audio-visual percepts from /va/ to /ba/. Perception of the test-stimuli was tested before and after adaptation. Changes in audiovisual speech perception were observed following adaptation to visual-/va/ and audiovisual-/va/, but not following adaptation to auditory-/va/, auditory-/ba/, or visual-/ba/. Adaptation modulates perception of integrated audio-visual speech by modulating the processing of sensory-specific information. The results suggest that auditory and visual speech information are not completely integrated at the level of selective adaptation. PMID:27041781
Visually Impaired Persons' Comprehension of Text Presented with Speech Synthesis.

ERIC Educational Resources Information Center

Hjelmquist, E.; And Others

1992-01-01

This study of 48 individuals with visual impairments (16 middle-aged with experience in synthetic speech, 16 middle-aged inexperienced, and 16 older inexperienced) found that speech synthesis, compared to natural speech, generally yielded lower results with respect to memory and understanding of texts. Experience had no effect on performance.…
Cue Integration in Categorical Tasks: Insights from Audio-Visual Speech Perception

PubMed Central

Bejjanki, Vikranth Rao; Clayards, Meghan; Knill, David C.; Aslin, Richard N.

2011-01-01

Previous cue integration studies have examined continuous perceptual dimensions (e.g., size) and have shown that human cue integration is well described by a normative model in which cues are weighted in proportion to their sensory reliability, as estimated from single-cue performance. However, this normative model may not be applicable to categorical perceptual dimensions (e.g., phonemes). In tasks defined over categorical perceptual dimensions, optimal cue weights should depend not only on the sensory variance affecting the perception of each cue but also on the environmental variance inherent in each task-relevant category. Here, we present a computational and experimental investigation of cue integration in a categorical audio-visual (articulatory) speech perception task. Our results show that human performance during audio-visual phonemic labeling is qualitatively consistent with the behavior of a Bayes-optimal observer. Specifically, we show that the participants in our task are sensitive, on a trial-by-trial basis, to the sensory uncertainty associated with the auditory and visual cues, during phonemic categorization. In addition, we show that while sensory uncertainty is a significant factor in determining cue weights, it is not the only one and participants' performance is consistent with an optimal model in which environmental, within category variability also plays a role in determining cue weights. Furthermore, we show that in our task, the sensory variability affecting the visual modality during cue-combination is not well estimated from single-cue performance, but can be estimated from multi-cue performance. The findings and computational principles described here represent a principled first step towards characterizing the mechanisms underlying human cue integration in categorical tasks. PMID:21637344
Review of Speech-to-Text Recognition Technology for Enhancing Learning

ERIC Educational Resources Information Center

Shadiev, Rustam; Hwang, Wu-Yuin; Chen, Nian-Shing; Huang, Yueh-Min

2014-01-01

This paper reviewed literature from 1999 to 2014 inclusively on how Speech-to-Text Recognition (STR) technology has been applied to enhance learning. The first aim of this review is to understand how STR technology has been used to support learning over the past fifteen years, and the second is to analyze all research evidence to understand how…
Speech sound classification and detection of articulation disorders with support vector machines and wavelets.

PubMed

Georgoulas, George; Georgopoulos, Voula C; Stylios, Chrysostomos D

2006-01-01

This paper proposes a novel integrated methodology to extract features and classify speech sounds with intent to detect the possible existence of a speech articulation disorder in a speaker. Articulation, in effect, is the specific and characteristic way that an individual produces the speech sounds. A methodology to process the speech signal, extract features and finally classify the signal and detect articulation problems in a speaker is presented. The use of support vector machines (SVMs), for the classification of speech sounds and detection of articulation disorders is introduced. The proposed method is implemented on a data set where different sets of features and different schemes of SVMs are tested leading to satisfactory performance.
Exploring expressivity and emotion with artificial voice and speech technologies.

PubMed

Pauletto, Sandra; Balentine, Bruce; Pidcock, Chris; Jones, Kevin; Bottaci, Leonardo; Aretoulaki, Maria; Wells, Jez; Mundy, Darren P; Balentine, James

2013-10-01

Emotion in audio-voice signals, as synthesized by text-to-speech (TTS) technologies, was investigated to formulate a theory of expression for user interface design. Emotional parameters were specified with markup tags, and the resulting audio was further modulated with post-processing techniques. Software was then developed to link a selected TTS synthesizer with an automatic speech recognition (ASR) engine, producing a chatbot that could speak and listen. Using these two artificial voice subsystems, investigators explored both artistic and psychological implications of artificial speech emotion. Goals of the investigation were interdisciplinary, with interest in musical composition, augmentative and alternative communication (AAC), commercial voice announcement applications, human-computer interaction (HCI), and artificial intelligence (AI). The work-in-progress points towards an emerging interdisciplinary ontology for artificial voices. As one study output, HCI tools are proposed for future collaboration.
Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications.

PubMed

VanDam, Mark; Silbert, Noah H

2016-01-01

Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on of identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 judgements of classification of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output.
Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications

PubMed Central

2016-01-01

Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on of identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 judgements of classification of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output. PMID:27529813
McGurk stimuli for the investigation of multisensory integration in cochlear implant users: The Oldenburg Audio Visual Speech Stimuli (OLAVS).

PubMed

Stropahl, Maren; Schellhardt, Sebastian; Debener, Stefan

2017-06-01

The concurrent presentation of different auditory and visual syllables may result in the perception of a third syllable, reflecting an illusory fusion of visual and auditory information. This well-known McGurk effect is frequently used for the study of audio-visual integration. Recently, it was shown that the McGurk effect is strongly stimulus-dependent, which complicates comparisons across perceivers and inferences across studies. To overcome this limitation, we developed the freely available Oldenburg audio-visual speech stimuli (OLAVS), consisting of 8 different talkers and 12 different syllable combinations. The quality of the OLAVS set was evaluated with 24 normal-hearing subjects. All 96 stimuli were characterized based on their stimulus disparity, which was obtained from a probabilistic model (cf. Magnotti & Beauchamp, 2015). Moreover, the McGurk effect was studied in eight adult cochlear implant (CI) users. By applying the individual, stimulus-independent parameters of the probabilistic model, the predicted effect of stronger audio-visual integration in CI users could be confirmed, demonstrating the validity of the new stimulus material.
[Ventriloquism and audio-visual integration of voice and face].

PubMed

Yokosawa, Kazuhiko; Kanaya, Shoko

2012-07-01

Presenting synchronous auditory and visual stimuli in separate locations creates the illusion that the sound originates from the direction of the visual stimulus. Participants' auditory localization bias, called the ventriloquism effect, has revealed factors affecting the perceptual integration of audio-visual stimuli. However, many studies on audio-visual processes have focused on performance in simplified experimental situations, with a single stimulus in each sensory modality. These results cannot necessarily explain our perceptual behavior in natural scenes, where various signals exist within a single sensory modality. In the present study we report the contributions of a cognitive factor, that is, the audio-visual congruency of speech, although this factor has often been underestimated in previous ventriloquism research. Thus, we investigated the contribution of speech congruency on the ventriloquism effect using a spoken utterance and two videos of a talking face. The salience of facial movements was also manipulated. As a result, when bilateral visual stimuli are presented in synchrony with a single voice, cross-modal speech congruency was found to have a significant impact on the ventriloquism effect. This result also indicated that more salient visual utterances attracted participants' auditory localization. The congruent pairing of audio-visual utterances elicited greater localization bias than did incongruent pairing, whereas previous studies have reported little dependency on the reality of stimuli in ventriloquism. Moreover, audio-visual illusory congruency, owing to the McGurk effect, caused substantial visual interference to auditory localization. This suggests that a greater flexibility in responding to multi-sensory environments exists than has been previously considered.
Visual speech segmentation: using facial cues to locate word boundaries in continuous speech

PubMed Central

Mitchel, Aaron D.; Weiss, Daniel J.

2014-01-01

Speech is typically a multimodal phenomenon, yet few studies have focused on the exclusive contributions of visual cues to language acquisition. To address this gap, we investigated whether visual prosodic information can facilitate speech segmentation. Previous research has demonstrated that language learners can use lexical stress and pitch cues to segment speech and that learners can extract this information from talking faces. Thus, we created an artificial speech stream that contained minimal segmentation cues and paired it with two synchronous facial displays in which visual prosody was either informative or uninformative for identifying word boundaries. Across three familiarisation conditions (audio stream alone, facial streams alone, and paired audiovisual), learning occurred only when the facial displays were informative to word boundaries, suggesting that facial cues can help learners solve the early challenges of language acquisition. PMID:25018577
Text to Speech (TTS) Capabilities for the Common Driver Trainer (CDT)

DTIC Science & Technology

2010-10-01

harnessing in’leigle jalClpeno jocelyn linu ~ los angeles lottery margarine mathematlze mathematized mathematized meme memes memol...including Julie, Kate, and Paul . Based upon the names of the voices, it may be that the VoiceText capability is the technology being used currently on...DFTTSExportToFileEx(O, " Paul ", 1, 1033, "Testing the Digital Future Text-to-Speech SDK.", -1, -1, -1, -1, -1, DFTTS_ TEXT_ TYPE_ XML, "test.wav", 0, "", -1

An Algorithm for Controlled Integration of Sound and Text.

ERIC Educational Resources Information Center

Wohlert, Harry S.; McCormick, Martin

1985-01-01

A serious drawback in introducing sound into computer programs for teaching foreign language speech has been the lack of an algorithm to turn off the cassette recorder immediately to keep screen text and audio in synchronization. This article describes a program which solves that problem. (SED)
Evaluating the Effort Expended to Understand Speech in Noise Using a Dual-Task Paradigm: The Effects of Providing Visual Speech Cues

ERIC Educational Resources Information Center

Fraser, Sarah; Gagne, Jean-Pierre; Alepins, Majolaine; Dubois, Pascale

2010-01-01

Purpose: Using a dual-task paradigm, 2 experiments (Experiments 1 and 2) were conducted to assess differences in the amount of listening effort expended to understand speech in noise in audiovisual (AV) and audio-only (A-only) modalities. Experiment 1 had equivalent noise levels in both modalities, and Experiment 2 equated speech recognition…
Use of speech-to-text technology for documentation by healthcare providers.

PubMed

Ajami, Sima

2016-01-01

Medical records are a critical component of a patient's treatment. However, documentation of patient-related information is considered a secondary activity in the provision of healthcare services, often leading to incomplete medical records and patient data of low quality. Advances in information technology (IT) in the health system and registration of information in electronic health records (EHR) using speechto- text conversion software have facilitated service delivery. This narrative review is a literature search with the help of libraries, books, conference proceedings, databases of Science Direct, PubMed, Proquest, Springer, SID (Scientific Information Database), and search engines such as Yahoo, and Google. I used the following keywords and their combinations: speech recognition, automatic report documentation, voice to text software, healthcare, information, and voice recognition. Due to lack of knowledge of other languages, I searched all texts in English or Persian with no time limits. Of a total of 70, only 42 articles were selected. Speech-to-text conversion technology offers opportunities to improve the documentation process of medical records, reduce cost and time of recording information, enhance the quality of documentation, improve the quality of services provided to patients, and support healthcare providers in legal matters. Healthcare providers should recognize the impact of this technology on service delivery.
Ontology-based structured cosine similarity in document summarization: with applications to mobile audio-based knowledge management.

PubMed

Yuan, Soe-Tsyr; Sun, Jerry

2005-10-01

Development of algorithms for automated text categorization in massive text document sets is an important research area of data mining and knowledge discovery. Most of the text-clustering methods were grounded in the term-based measurement of distance or similarity, ignoring the structure of the documents. In this paper, we present a novel method named structured cosine similarity (SCS) that furnishes document clustering with a new way of modeling on document summarization, considering the structure of the documents so as to improve the performance of document clustering in terms of quality, stability, and efficiency. This study was motivated by the problem of clustering speech documents (of no rich document features) attained from the wireless experience oral sharing conducted by mobile workforce of enterprises, fulfilling audio-based knowledge management. In other words, this problem aims to facilitate knowledge acquisition and sharing by speech. The evaluations also show fairly promising results on our method of structured cosine similarity.
Audio-visual imposture

NASA Astrophysics Data System (ADS)

Karam, Walid; Mokbel, Chafic; Greige, Hanna; Chollet, Gerard

2006-05-01

A GMM based audio visual speaker verification system is described and an Active Appearance Model with a linear speaker transformation system is used to evaluate the robustness of the verification. An Active Appearance Model (AAM) is used to automatically locate and track a speaker's face in a video recording. A Gaussian Mixture Model (GMM) based classifier (BECARS) is used for face verification. GMM training and testing is accomplished on DCT based extracted features of the detected faces. On the audio side, speech features are extracted and used for speaker verification with the GMM based classifier. Fusion of both audio and video modalities for audio visual speaker verification is compared with face verification and speaker verification systems. To improve the robustness of the multimodal biometric identity verification system, an audio visual imposture system is envisioned. It consists of an automatic voice transformation technique that an impostor may use to assume the identity of an authorized client. Features of the transformed voice are then combined with the corresponding appearance features and fed into the GMM based system BECARS for training. An attempt is made to increase the acceptance rate of the impostor and to analyzing the robustness of the verification system. Experiments are being conducted on the BANCA database, with a prospect of experimenting on the newly developed PDAtabase developed within the scope of the SecurePhone project.
A Digital Liquid State Machine With Biologically Inspired Learning and Its Application to Speech Recognition.

PubMed

Zhang, Yong; Li, Peng; Jin, Yingyezhe; Choe, Yoonsuck

2015-11-01

This paper presents a bioinspired digital liquid-state machine (LSM) for low-power very-large-scale-integration (VLSI)-based machine learning applications. To the best of the authors' knowledge, this is the first work that employs a bioinspired spike-based learning algorithm for the LSM. With the proposed online learning, the LSM extracts information from input patterns on the fly without needing intermediate data storage as required in offline learning methods such as ridge regression. The proposed learning rule is local such that each synaptic weight update is based only upon the firing activities of the corresponding presynaptic and postsynaptic neurons without incurring global communications across the neural network. Compared with the backpropagation-based learning, the locality of computation in the proposed approach lends itself to efficient parallel VLSI implementation. We use subsets of the TI46 speech corpus to benchmark the bioinspired digital LSM. To reduce the complexity of the spiking neural network model without performance degradation for speech recognition, we study the impacts of synaptic models on the fading memory of the reservoir and hence the network performance. Moreover, we examine the tradeoffs between synaptic weight resolution, reservoir size, and recognition performance and present techniques to further reduce the overhead of hardware implementation. Our simulation results show that in terms of isolated word recognition evaluated using the TI46 speech corpus, the proposed digital LSM rivals the state-of-the-art hidden Markov-model-based recognizer Sphinx-4 and outperforms all other reported recognizers including the ones that are based upon the LSM or neural networks.
Reduced efficiency of audiovisual integration for nonnative speech.

PubMed

Yi, Han-Gyol; Phelps, Jasmine E B; Smiljanic, Rajka; Chandrasekaran, Bharath

2013-11-01

The role of visual cues in native listeners' perception of speech produced by nonnative speakers has not been extensively studied. Native perception of English sentences produced by native English and Korean speakers in audio-only and audiovisual conditions was examined. Korean speakers were rated as more accented in audiovisual than in the audio-only condition. Visual cues enhanced word intelligibility for native English speech but less so for Korean-accented speech. Reduced intelligibility of Korean-accented audiovisual speech was associated with implicit visual biases, suggesting that listener-related factors partially influence the efficiency of audiovisual integration for nonnative speech perception.
Accessible Text-to-Speech Options for Students Who Struggle with Reading

ERIC Educational Resources Information Center

Bone, Erin K.; Bouck, Emily C.

2017-01-01

As students progress through school they spend more time reading to obtain information. Reading to learn can be a struggle for any student, but it tends to be a bigger obstacle for students with disabilities. Using text-to-speech applications and extensions is one way to assist students with disabilities who struggle to independently complete…
The enhancement of beneficial effects following audio feedback by cognitive preparation in the treatment of social anxiety: a single-session experiment.

PubMed

Nilsson, Jan-Erik; Lundh, Lars-Gunnar; Faghihi, Shahriar; Roth-Andersson, Gun

2011-12-01

According to cognitive models, negatively biased processing of the publicly observable self is an important aspect of social phobia; if this is true, effective methods for producing corrective feedback concerning the public self should be strived for. Video feedback is proven effective, but since one's voice represents another aspect of the self, audio feedback should produce equivalent results. This is the first study to assess the enhancement of audio feedback by cognitive preparation in a single-session randomized controlled experiment. Forty socially anxious participants were asked to give a speech, then to listen to and evaluate a taped recording of their performance. Half of the sample was given cognitive preparation prior to the audio feedback and the remainder received audio feedback only. Cognitive preparation involved asking participants to (1) predict in detail what they would hear on the audiotape, (2) form an image of themselves giving the speech and (3) listen to the audio recording as though they were listening to a stranger. To assess generalization effects all participants were asked to give a second speech. Audio feedback with cognitive preparation was shown to produce less negative ratings after the first speech, and effects generalized to the evaluation of the second speech. More positive speech evaluations were associated with corresponding reductions of state anxiety. Social anxiety as indexed by the Implicit Association Test was reduced in participants given cognitive preparation. Small sample size; analogue study. Audio feedback with cognitive preparation may be utilized as a treatment intervention for social phobia. Copyright © 2011 Elsevier Ltd. All rights reserved.
Video-assisted segmentation of speech and audio track

NASA Astrophysics Data System (ADS)

Pandit, Medha; Yusoff, Yusseri; Kittler, Josef; Christmas, William J.; Chilton, E. H. S.

1999-08-01

Video database research is commonly concerned with the storage and retrieval of visual information invovling sequence segmentation, shot representation and video clip retrieval. In multimedia applications, video sequences are usually accompanied by a sound track. The sound track contains potential cues to aid shot segmentation such as different speakers, background music, singing and distinctive sounds. These different acoustic categories can be modeled to allow for an effective database retrieval. In this paper, we address the problem of automatic segmentation of audio track of multimedia material. This audio based segmentation can be combined with video scene shot detection in order to achieve partitioning of the multimedia material into semantically significant segments.
Kernel-Based Sensor Fusion With Application to Audio-Visual Voice Activity Detection

NASA Astrophysics Data System (ADS)

Dov, David; Talmon, Ronen; Cohen, Israel

2016-12-01

In this paper, we address the problem of multiple view data fusion in the presence of noise and interferences. Recent studies have approached this problem using kernel methods, by relying particularly on a product of kernels constructed separately for each view. From a graph theory point of view, we analyze this fusion approach in a discrete setting. More specifically, based on a statistical model for the connectivity between data points, we propose an algorithm for the selection of the kernel bandwidth, a parameter, which, as we show, has important implications on the robustness of this fusion approach to interferences. Then, we consider the fusion of audio-visual speech signals measured by a single microphone and by a video camera pointed to the face of the speaker. Specifically, we address the task of voice activity detection, i.e., the detection of speech and non-speech segments, in the presence of structured interferences such as keyboard taps and office noise. We propose an algorithm for voice activity detection based on the audio-visual signal. Simulation results show that the proposed algorithm outperforms competing fusion and voice activity detection approaches. In addition, we demonstrate that a proper selection of the kernel bandwidth indeed leads to improved performance.
Use of Video and Audio Texts in EFL Listening Test

ERIC Educational Resources Information Center

Basal, Ahmet; Gülözer, Kaine; Demir, Ibrahim

2015-01-01

The study aims to discover whether audio or video modality in a listening test is more beneficial to test takers. In this study, the posttest-only control group design was utilized and quantitative data were collected in order to measure participant performances concerning two types of modality (audio or video) in a listening test. The…
A Study of Text-to-Speech (TTS) in Children's English Learning

ERIC Educational Resources Information Center

Huang, Yi-Ching; Liao, Lung-Chuan

2015-01-01

The purpose of this study was to explore the effects of the digital material incorporated into Text-to- Speech system for students' English spelling. The digital material was made on the basis of the Spelling Bee vocabulary list (approximately 300 words) issued by the selected school. 21 third graders from a private bilingual school in Taiwan were…
Neural Entrainment to Rhythmically Presented Auditory, Visual, and Audio-Visual Speech in Children

PubMed Central

Power, Alan James; Mead, Natasha; Barnes, Lisa; Goswami, Usha

2012-01-01

Auditory cortical oscillations have been proposed to play an important role in speech perception. It is suggested that the brain may take temporal “samples” of information from the speech stream at different rates, phase resetting ongoing oscillations so that they are aligned with similar frequency bands in the input (“phase locking”). Information from these frequency bands is then bound together for speech perception. To date, there are no explorations of neural phase locking and entrainment to speech input in children. However, it is clear from studies of language acquisition that infants use both visual speech information and auditory speech information in learning. In order to study neural entrainment to speech in typically developing children, we use a rhythmic entrainment paradigm (underlying 2 Hz or delta rate) based on repetition of the syllable “ba,” presented in either the auditory modality alone, the visual modality alone, or as auditory-visual speech (via a “talking head”). To ensure attention to the task, children aged 13 years were asked to press a button as fast as possible when the “ba” stimulus violated the rhythm for each stream type. Rhythmic violation depended on delaying the occurrence of a “ba” in the isochronous stream. Neural entrainment was demonstrated for all stream types, and individual differences in standardized measures of language processing were related to auditory entrainment at the theta rate. Further, there was significant modulation of the preferred phase of auditory entrainment in the theta band when visual speech cues were present, indicating cross-modal phase resetting. The rhythmic entrainment paradigm developed here offers a method for exploring individual differences in oscillatory phase locking during development. In particular, a method for assessing neural entrainment and cross-modal phase resetting would be useful for exploring developmental learning difficulties thought to involve temporal
A Selective Deficit in Phonetic Recalibration by Text in Developmental Dyslexia.

PubMed

Keetels, Mirjam; Bonte, Milene; Vroomen, Jean

2018-01-01

Upon hearing an ambiguous speech sound, listeners may adjust their perceptual interpretation of the speech input in accordance with contextual information, like accompanying text or lipread speech (i.e., phonetic recalibration; Bertelson et al., 2003). As developmental dyslexia (DD) has been associated with reduced integration of text and speech sounds, we investigated whether this deficit becomes manifest when text is used to induce this type of audiovisual learning. Adults with DD and normal readers were exposed to ambiguous consonants halfway between /aba/ and /ada/ together with text or lipread speech. After this audiovisual exposure phase, they categorized auditory-only ambiguous test sounds. Results showed that individuals with DD, unlike normal readers, did not use text to recalibrate their phoneme categories, whereas their recalibration by lipread speech was spared. Individuals with DD demonstrated similar deficits when ambiguous vowels (halfway between /wIt/ and /wet/) were recalibrated by text. These findings indicate that DD is related to a specific letter-speech sound association deficit that extends over phoneme classes (vowels and consonants), but - as lipreading was spared - does not extend to a more general audio-visual integration deficit. In particular, these results highlight diminished reading-related audiovisual learning in addition to the commonly reported phonological problems in developmental dyslexia.
Impact of Audio-Visual Asynchrony on Lip-Reading Effects -Neuromagnetic and Psychophysical Study-

PubMed Central

Yahata, Izumi; Kanno, Akitake; Sakamoto, Shuichi; Takanashi, Yoshitaka; Takata, Shiho; Nakasato, Nobukazu; Kawashima, Ryuta; Katori, Yukio

2016-01-01

The effects of asynchrony between audio and visual (A/V) stimuli on the N100m responses of magnetoencephalography in the left hemisphere were compared with those on the psychophysical responses in 11 participants. The latency and amplitude of N100m were significantly shortened and reduced in the left hemisphere by the presentation of visual speech as long as the temporal asynchrony between A/V stimuli was within 100 ms, but were not significantly affected with audio lags of -500 and +500 ms. However, some small effects were still preserved on average with audio lags of 500 ms, suggesting similar asymmetry of the temporal window to that observed in psychophysical measurements, which tended to be more robust (wider) for audio lags; i.e., the pattern of visual-speech effects as a function of A/V lag observed in the N100m in the left hemisphere grossly resembled that in psychophysical measurements on average, although the individual responses were somewhat varied. The present results suggest that the basic configuration of the temporal window of visual effects on auditory-speech perception could be observed from the early auditory processing stage. PMID:28030631
Advancements in text-to-speech technology and implications for AAC applications

NASA Astrophysics Data System (ADS)

Syrdal, Ann K.

2003-10-01

Intelligibility was the initial focus in text-to-speech (TTS) research, since it is clearly a necessary condition for the application of the technology. Sufficiently high intelligibility (approximating human speech) has been achieved in the last decade by the better formant-based and concatenative TTS systems. This led to commercially available TTS systems for highly motivated users, particularly the blind and vocally impaired. Some unnatural qualities of TTS were exploited by these users, such as very fast speaking rates and altered pitch ranges for flagging relevant information. Recently, the focus in TTS research has turned to improving naturalness, so that synthetic speech sounds more human and less robotic. Unit selection approaches to concatenative synthesis have dramatically improved TTS quality, although at the cost of larger and more complex systems. This advancement in naturalness has made TTS technology more acceptable to the general public. The vocally impaired appreciate a more natural voice with which to represent themselves when communicating with others. Unit selection TTS does not achieve such high speaking rates as the earlier TTS systems, however, which is a disadvantage to some AAC device users. An important new research emphasis is to improve and increase the range of emotional expressiveness of TTS.
Speech information retrieval: a review

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hafen, Ryan P.; Henry, Michael J.

Audio is an information-rich component of multimedia. Information can be extracted from audio in a number of different ways, and thus there are several established audio signal analysis research fields. These fields include speech recognition, speaker recognition, audio segmentation and classification, and audio finger-printing. The information that can be extracted from tools and methods developed in these fields can greatly enhance multimedia systems. In this paper, we present the current state of research in each of the major audio analysis fields. The goal is to introduce enough back-ground for someone new in the field to quickly gain high-level understanding andmore » to provide direction for further study.« less
A technology prototype system for rating therapist empathy from audio recordings in addiction counseling.

PubMed

Xiao, Bo; Huang, Chewei; Imel, Zac E; Atkins, David C; Georgiou, Panayiotis; Narayanan, Shrikanth S

2016-04-01

Scaling up psychotherapy services such as for addiction counseling is a critical societal need. One challenge is ensuring quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy-a key therapy quality index-from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimated therapy-session level empathy codes using utterance level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality insurance and therapist training.
A technology prototype system for rating therapist empathy from audio recordings in addiction counseling

PubMed Central

Xiao, Bo; Huang, Chewei; Imel, Zac E.; Atkins, David C.; Georgiou, Panayiotis; Narayanan, Shrikanth S.

2016-01-01

Scaling up psychotherapy services such as for addiction counseling is a critical societal need. One challenge is ensuring quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy—a key therapy quality index—from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimated therapy-session level empathy codes using utterance level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality insurance and therapist training. PMID:28286867

A Bit Stream Scalable Speech/Audio Coder Combining Enhanced Regular Pulse Excitation and Parametric Coding

NASA Astrophysics Data System (ADS)

Riera-Palou, Felip; den Brinker, Albertus C.

2007-12-01

This paper introduces a new audio and speech broadband coding technique based on the combination of a pulse excitation coder and a standardized parametric coder, namely, MPEG-4 high-quality parametric coder. After presenting a series of enhancements to regular pulse excitation (RPE) to make it suitable for the modeling of broadband signals, it is shown how pulse and parametric codings complement each other and how they can be merged to yield a layered bit stream scalable coder able to operate at different points in the quality bit rate plane. The performance of the proposed coder is evaluated in a listening test. The major result is that the extra functionality of the bit stream scalability does not come at the price of a reduced performance since the coder is competitive with standardized coders (MP3, AAC, SSC).
Orthographic learning and the role of text-to-speech software in Dutch disabled readers.

PubMed

Staels, Eva; Van den Broeck, Wim

2015-01-01

In this study, we examined whether orthographic learning can be demonstrated in disabled readers learning to read in a transparent orthography (Dutch). In addition, we tested the effect of the use of text-to-speech software, a new form of direct instruction, on orthographic learning. Both research goals were investigated by replicating Share's self-teaching paradigm. A total of 65 disabled Dutch readers were asked to read eight stories containing embedded homophonic pseudoword targets (e.g., Blot/Blod), with or without the support of text-to-speech software. The amount of orthographic learning was assessed 3 or 7 days later by three measures of orthographic learning. First, the results supported the presence of orthographic learning during independent silent reading by demonstrating that target spellings were correctly identified more often, named more quickly, and spelled more accurately than their homophone foils. Our results support the hypothesis that all readers, even poor readers of transparent orthographies, are capable of developing word-specific knowledge. Second, a negative effect of text-to-speech software on orthographic learning was demonstrated in this study. This negative effect was interpreted as the consequence of passively listening to the auditory presentation of the text. We clarify how these results can be interpreted within current theoretical accounts of orthographic learning and briefly discuss implications for remedial interventions. © Hammill Institute on Disabilities 2013.
Audio-Visual Speech Perception: A Developmental ERP Investigation

ERIC Educational Resources Information Center

Knowland, Victoria C. P.; Mercure, Evelyne; Karmiloff-Smith, Annette; Dick, Fred; Thomas, Michael S. C.

2014-01-01

Being able to see a talking face confers a considerable advantage for speech perception in adulthood. However, behavioural data currently suggest that children fail to make full use of these available visual speech cues until age 8 or 9. This is particularly surprising given the potential utility of multiple informational cues during language…
Cortical Tracking of Global and Local Variations of Speech Rhythm during Connected Natural Speech Perception.

PubMed

Alexandrou, Anna Maria; Saarinen, Timo; Kujala, Jan; Salmelin, Riitta

2018-06-19

During natural speech perception, listeners must track the global speaking rate, that is, the overall rate of incoming linguistic information, as well as transient, local speaking rate variations occurring within the global speaking rate. Here, we address the hypothesis that this tracking mechanism is achieved through coupling of cortical signals to the amplitude envelope of the perceived acoustic speech signals. Cortical signals were recorded with magnetoencephalography (MEG) while participants perceived spontaneously produced speech stimuli at three global speaking rates (slow, normal/habitual, and fast). Inherently to spontaneously produced speech, these stimuli also featured local variations in speaking rate. The coupling between cortical and acoustic speech signals was evaluated using audio-MEG coherence. Modulations in audio-MEG coherence spatially differentiated between tracking of global speaking rate, highlighting the temporal cortex bilaterally and the right parietal cortex, and sensitivity to local speaking rate variations, emphasizing the left parietal cortex. Cortical tuning to the temporal structure of natural connected speech thus seems to require the joint contribution of both auditory and parietal regions. These findings suggest that cortical tuning to speech rhythm operates on two functionally distinct levels: one encoding the global rhythmic structure of speech and the other associated with online, rapidly evolving temporal predictions. Thus, it may be proposed that speech perception is shaped by evolutionary tuning, a preference for certain speaking rates, and predictive tuning, associated with cortical tracking of the constantly changing rate of linguistic information in a speech stream.
Audio-visual affective expression recognition

NASA Astrophysics Data System (ADS)

Huang, Thomas S.; Zeng, Zhihong

2007-11-01

Automatic affective expression recognition has attracted more and more attention of researchers from different disciplines, which will significantly contribute to a new paradigm for human computer interaction (affect-sensitive interfaces, socially intelligent environments) and advance the research in the affect-related fields including psychology, psychiatry, and education. Multimodal information integration is a process that enables human to assess affective states robustly and flexibly. In order to understand the richness and subtleness of human emotion behavior, the computer should be able to integrate information from multiple sensors. We introduce in this paper our efforts toward machine understanding of audio-visual affective behavior, based on both deliberate and spontaneous displays. Some promising methods are presented to integrate information from both audio and visual modalities. Our experiments show the advantage of audio-visual fusion in affective expression recognition over audio-only or visual-only approaches.
Speech Music Discrimination Using Class-Specific Features

DTIC Science & Technology

2004-08-01

Speech Music Discrimination Using Class-Specific Features Thomas Beierholm...between speech and music . Feature extraction is class-specific and can therefore be tailored to each class meaning that segment size, model orders...interest. Some of the applications of audio signal classification are speech/ music classification [1], acoustical environmental classification [2][3
Fall Detection Using Smartphone Audio Features.

PubMed

Cheffena, Michael

2016-07-01

An automated fall detection system based on smartphone audio features is developed. The spectrogram, mel frequency cepstral coefficents (MFCCs), linear predictive coding (LPC), and matching pursuit (MP) features of different fall and no-fall sound events are extracted from experimental data. Based on the extracted audio features, four different machine learning classifiers: k-nearest neighbor classifier (k-NN), support vector machine (SVM), least squares method (LSM), and artificial neural network (ANN) are investigated for distinguishing between fall and no-fall events. For each audio feature, the performance of each classifier in terms of sensitivity, specificity, accuracy, and computational complexity is evaluated. The best performance is achieved using spectrogram features with ANN classifier with sensitivity, specificity, and accuracy all above 98%. The classifier also has acceptable computational requirement for training and testing. The system is applicable in home environments where the phone is placed in the vicinity of the user.
Acoustics of Clear Speech: Effect of Instruction

ERIC Educational Resources Information Center

Lam, Jennifer; Tjaden, Kris; Wilding, Greg

2012-01-01

Purpose: This study investigated how different instructions for eliciting clear speech affected selected acoustic measures of speech. Method: Twelve speakers were audio-recorded reading 18 different sentences from the Assessment of Intelligibility of Dysarthric Speech (Yorkston & Beukelman, 1984). Sentences were produced in habitual, clear,…
The sweet-home project: audio technology in smart homes to improve well-being and reliance.

PubMed

Vacher, Michel; Istrate, Dan; Portet, François; Joubert, Thierry; Chevalier, Thierry; Smidtas, Serge; Meillon, Brigitte; Lecouteux, Benjamin; Sehili, Mohamed; Chahuara, Pedro; Méniard, Sylvain

2011-01-01

The Sweet-Home project aims at providing audio-based interaction technology that lets the user have full control over their home environment, at detecting distress situations and at easing the social inclusion of the elderly and frail population. This paper presents an overview of the project focusing on the multimodal sound corpus acquisition and labelling and on the investigated techniques for speech and sound recognition. The user study and the recognition performances show the interest of this audio technology.
Does Use of Text-to-Speech and Related Read-Aloud Tools Improve Reading Comprehension for Students with Reading Disabilities? A Meta-Analysis

ERIC Educational Resources Information Center

Wood, Sarah G.; Moxley, Jerad H.; Tighe, Elizabeth L.; Wagner, Richard K.

2018-01-01

Text-to-speech and related read-aloud tools are being widely implemented in an attempt to assist students' reading comprehension skills. Read-aloud software, including text-to-speech, is used to translate written text into spoken text, enabling one to listen to written text while reading along. It is not clear how effective text-to-speech is at…
Integrating Text-to-Speech Software into Pedagogically Sound Teaching and Learning Scenarios

ERIC Educational Resources Information Center

Rughooputh, S. D. D. V.; Santally, M. I.

2009-01-01

This paper presents a new technique of delivery of classes--an instructional technique which will no doubt revolutionize the teaching and learning, whether for on-campus, blended or online modules. This is based on the simple task of instructionally incorporating text-to-speech software embedded in the lecture slides that will simulate exactly the…
Sensitivity to audio-visual synchrony and its relation to language abilities in children with and without ASD.

PubMed

Righi, Giulia; Tenenbaum, Elena J; McCormick, Carolyn; Blossom, Megan; Amso, Dima; Sheinkopf, Stephen J

2018-04-01

Autism Spectrum Disorder (ASD) is often accompanied by deficits in speech and language processing. Speech processing relies heavily on the integration of auditory and visual information, and it has been suggested that the ability to detect correspondence between auditory and visual signals helps to lay the foundation for successful language development. The goal of the present study was to examine whether young children with ASD show reduced sensitivity to temporal asynchronies in a speech processing task when compared to typically developing controls, and to examine how this sensitivity might relate to language proficiency. Using automated eye tracking methods, we found that children with ASD failed to demonstrate sensitivity to asynchronies of 0.3s, 0.6s, or 1.0s between a video of a woman speaking and the corresponding audio track. In contrast, typically developing children who were language-matched to the ASD group, were sensitive to both 0.6s and 1.0s asynchronies. We also demonstrated that individual differences in sensitivity to audiovisual asynchronies and individual differences in orientation to relevant facial features were both correlated with scores on a standardized measure of language abilities. Results are discussed in the context of attention to visual language and audio-visual processing as potential precursors to language impairment in ASD. Autism Res 2018, 11: 645-653. © 2018 International Society for Autism Research, Wiley Periodicals, Inc. Speech processing relies heavily on the integration of auditory and visual information, and it has been suggested that the ability to detect correspondence between auditory and visual signals helps to lay the foundation for successful language development. The goal of the present study was to explore whether children with ASD process audio-visual synchrony in ways comparable to their typically developing peers, and the relationship between preference for synchrony and language ability. Results showed that
Research in speech communication.

PubMed

Flanagan, J

1995-10-24

Advances in digital speech processing are now supporting application and deployment of a variety of speech technologies for human/machine communication. In fact, new businesses are rapidly forming about these technologies. But these capabilities are of little use unless society can afford them. Happily, explosive advances in microelectronics over the past two decades have assured affordable access to this sophistication as well as to the underlying computing technology. The research challenges in speech processing remain in the traditionally identified areas of recognition, synthesis, and coding. These three areas have typically been addressed individually, often with significant isolation among the efforts. But they are all facets of the same fundamental issue--how to represent and quantify the information in the speech signal. This implies deeper understanding of the physics of speech production, the constraints that the conventions of language impose, and the mechanism for information processing in the auditory system. In ongoing research, therefore, we seek more accurate models of speech generation, better computational formulations of language, and realistic perceptual guides for speech processing--along with ways to coalesce the fundamental issues of recognition, synthesis, and coding. Successful solution will yield the long-sought dictation machine, high-quality synthesis from text, and the ultimate in low bit-rate transmission of speech. It will also open the door to language-translating telephony, where the synthetic foreign translation can be in the voice of the originating talker.
Audio-vocal responses of vocal fundamental frequency and formant during sustained vowel vocalizations in different noises.

PubMed

Lee, Shao-Hsuan; Hsiao, Tzu-Yu; Lee, Guo-She

2015-06-01

Sustained vocalizations of vowels [a], [i], and syllable [mə] were collected in twenty normal-hearing individuals. On vocalizations, five conditions of different audio-vocal feedback were introduced separately to the speakers including no masking, wearing supra-aural headphones only, speech-noise masking, high-pass noise masking, and broad-band-noise masking. Power spectral analysis of vocal fundamental frequency (F0) was used to evaluate the modulations of F0 and linear-predictive-coding was used to acquire first two formants. The results showed that while the formant frequencies were not significantly shifted, low-frequency modulations (<3 Hz) of F0 significantly increased with reduced audio-vocal feedback across speech sounds and were significantly correlated with auditory awareness of speakers' own voices. For sustained speech production, the motor speech controls on F0 may depend on a feedback mechanism while articulation should rely more on a feedforward mechanism. Power spectral analysis of F0 might be applied to evaluate audio-vocal control for various hearing and neurological disorders in the future. Copyright © 2015 Elsevier B.V. All rights reserved.
The Effect of Speech-to-Text Technology on Learning a Writing Strategy

ERIC Educational Resources Information Center

Haug, Katrina N.; Klein, Perry D.

2018-01-01

Previous research has shown that speech-to-text (STT) software can support students in producing a given piece of writing. This is the 1st study to investigate the use of STT to teach a writing strategy. We pretested 45 Grade 5 students on argument writing and trained them to use STT. Students participated in 4 lessons on an argument writing…
Semantic Context Detection Using Audio Event Fusion

NASA Astrophysics Data System (ADS)

Chu, Wei-Ta; Cheng, Wen-Huang; Wu, Ja-Ling

2006-12-01

Semantic-level content analysis is a crucial issue in achieving efficient content retrieval and management. We propose a hierarchical approach that models audio events over a time series in order to accomplish semantic context detection. Two levels of modeling, audio event and semantic context modeling, are devised to bridge the gap between physical audio features and semantic concepts. In this work, hidden Markov models (HMMs) are used to model four representative audio events, that is, gunshot, explosion, engine, and car braking, in action movies. At the semantic context level, generative (ergodic hidden Markov model) and discriminative (support vector machine (SVM)) approaches are investigated to fuse the characteristics and correlations among audio events, which provide cues for detecting gunplay and car-chasing scenes. The experimental results demonstrate the effectiveness of the proposed approaches and provide a preliminary framework for information mining by using audio characteristics.
Audio-Visual Speech in Noise Perception in Dyslexia

ERIC Educational Resources Information Center

van Laarhoven, Thijs; Keetels, Mirjam; Schakel, Lemmy; Vroomen, Jean

2018-01-01

Individuals with developmental dyslexia (DD) may experience, besides reading problems, other speech-related processing deficits. Here, we examined the influence of visual articulatory information (lip-read speech) at various levels of background noise on auditory word recognition in children and adults with DD. We found that children with a…
Estimation of Teacher Practices Based on Text Transcripts of Teacher Speech Using a Support Vector Machine Algorithm

ERIC Educational Resources Information Center

Araya, Roberto; Plana, Francisco; Dartnell, Pablo; Soto-Andrade, Jorge; Luci, Gina; Salinas, Elena; Araya, Marylen

2012-01-01

Teacher practice is normally assessed by observers who watch classes or videos of classes. Here, we analyse an alternative strategy that uses text transcripts and a support vector machine classifier. For each one of the 710 videos of mathematics classes from the 2005 Chilean National Teacher Assessment Programme, a single 4-minute slice was…
SNR-adaptive stream weighting for audio-MES ASR.

PubMed

Lee, Ki-Seung

2008-08-01

Myoelectric signals (MESs) from the speaker's mouth region have been successfully shown to improve the noise robustness of automatic speech recognizers (ASRs), thus promising to extend their usability in implementing noise-robust ASR. In the recognition system presented herein, extracted audio and facial MES features were integrated by a decision fusion method, where the likelihood score of the audio-MES observation vector was given by a linear combination of class-conditional observation log-likelihoods of two classifiers, using appropriate weights. We developed a weighting process adaptive to SNRs. The main objective of the paper involves determining the optimal SNR classification boundaries and constructing a set of optimum stream weights for each SNR class. These two parameters were determined by a method based on a maximum mutual information criterion. Acoustic and facial MES data were collected from five subjects, using a 60-word vocabulary. Four types of acoustic noise including babble, car, aircraft, and white noise were acoustically added to clean speech signals with SNR ranging from -14 to 31 dB. The classification accuracy of the audio ASR was as low as 25.5%. Whereas, the classification accuracy of the MES ASR was 85.2%. The classification accuracy could be further improved by employing the proposed audio-MES weighting method, which was as high as 89.4% in the case of babble noise. A similar result was also found for the other types of noise.
Combining Machine Learning and Natural Language Processing to Assess Literary Text Comprehension

ERIC Educational Resources Information Center

Balyan, Renu; McCarthy, Kathryn S.; McNamara, Danielle S.

2017-01-01

This study examined how machine learning and natural language processing (NLP) techniques can be leveraged to assess the interpretive behavior that is required for successful literary text comprehension. We compared the accuracy of seven different machine learning classification algorithms in predicting human ratings of student essays about…

End-to-End ASR-Free Keyword Search From Speech

NASA Astrophysics Data System (ADS)

Audhkhasi, Kartik; Rosenberg, Andrew; Sethy, Abhinav; Ramabhadran, Bhuvana; Kingsbury, Brian

2017-12-01

End-to-end (E2E) systems have achieved competitive results compared to conventional hybrid hidden Markov model (HMM)-deep neural network based automatic speech recognition (ASR) systems. Such E2E systems are attractive due to the lack of dependence on alignments between input acoustic and output grapheme or HMM state sequence during training. This paper explores the design of an ASR-free end-to-end system for text query-based keyword search (KWS) from speech trained with minimal supervision. Our E2E KWS system consists of three sub-systems. The first sub-system is a recurrent neural network (RNN)-based acoustic auto-encoder trained to reconstruct the audio through a finite-dimensional representation. The second sub-system is a character-level RNN language model using embeddings learned from a convolutional neural network. Since the acoustic and text query embeddings occupy different representation spaces, they are input to a third feed-forward neural network that predicts whether the query occurs in the acoustic utterance or not. This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system and trains much faster.
An Introduction to Boiler Water Chemistry for the Marine Engineer: A Text of Audio-Tutorial Instruction.

ERIC Educational Resources Information Center

Schlenker, Richard M.; And Others

Presented is a manuscript for an introductory boiler water chemistry course for marine engineer education. The course is modular, self-paced, audio-tutorial, contract graded and combined lecture-laboratory instructed. Lectures are presented to students individually via audio-tapes and 35 mm slides. The course consists of a total of 17 modules -…
MPEG-7 audio-visual indexing test-bed for video retrieval

NASA Astrophysics Data System (ADS)

Gagnon, Langis; Foucher, Samuel; Gouaillier, Valerie; Brun, Christelle; Brousseau, Julie; Boulianne, Gilles; Osterrath, Frederic; Chapdelaine, Claude; Dutrisac, Julie; St-Onge, Francis; Champagne, Benoit; Lu, Xiaojian

2003-12-01

This paper reports on the development status of a Multimedia Asset Management (MAM) test-bed for content-based indexing and retrieval of audio-visual documents within the MPEG-7 standard. The project, called "MPEG-7 Audio-Visual Document Indexing System" (MADIS), specifically targets the indexing and retrieval of video shots and key frames from documentary film archives, based on audio-visual content like face recognition, motion activity, speech recognition and semantic clustering. The MPEG-7/XML encoding of the film database is done off-line. The description decomposition is based on a temporal decomposition into visual segments (shots), key frames and audio/speech sub-segments. The visible outcome will be a web site that allows video retrieval using a proprietary XQuery-based search engine and accessible to members at the Canadian National Film Board (NFB) Cineroute site. For example, end-user will be able to ask to point on movie shots in the database that have been produced in a specific year, that contain the face of a specific actor who tells a specific word and in which there is no motion activity. Video streaming is performed over the high bandwidth CA*net network deployed by CANARIE, a public Canadian Internet development organization.
Detecting double compression of audio signal

NASA Astrophysics Data System (ADS)

Yang, Rui; Shi, Yun Q.; Huang, Jiwu

2010-01-01

MP3 is the most popular audio format nowadays in our daily life, for example music downloaded from the Internet and file saved in the digital recorder are often in MP3 format. However, low bitrate MP3s are often transcoded to high bitrate since high bitrate ones are of high commercial value. Also audio recording in digital recorder can be doctored easily by pervasive audio editing software. This paper presents two methods for the detection of double MP3 compression. The methods are essential for finding out fake-quality MP3 and audio forensics. The proposed methods use support vector machine classifiers with feature vectors formed by the distributions of the first digits of the quantized MDCT (modified discrete cosine transform) coefficients. Extensive experiments demonstrate the effectiveness of the proposed methods. To the best of our knowledge, this piece of work is the first one to detect double compression of audio signal.
Research in speech communication.

PubMed Central

Flanagan, J

1995-01-01

Advances in digital speech processing are now supporting application and deployment of a variety of speech technologies for human/machine communication. In fact, new businesses are rapidly forming about these technologies. But these capabilities are of little use unless society can afford them. Happily, explosive advances in microelectronics over the past two decades have assured affordable access to this sophistication as well as to the underlying computing technology. The research challenges in speech processing remain in the traditionally identified areas of recognition, synthesis, and coding. These three areas have typically been addressed individually, often with significant isolation among the efforts. But they are all facets of the same fundamental issue--how to represent and quantify the information in the speech signal. This implies deeper understanding of the physics of speech production, the constraints that the conventions of language impose, and the mechanism for information processing in the auditory system. In ongoing research, therefore, we seek more accurate models of speech generation, better computational formulations of language, and realistic perceptual guides for speech processing--along with ways to coalesce the fundamental issues of recognition, synthesis, and coding. Successful solution will yield the long-sought dictation machine, high-quality synthesis from text, and the ultimate in low bit-rate transmission of speech. It will also open the door to language-translating telephony, where the synthetic foreign translation can be in the voice of the originating talker. Images Fig. 1 Fig. 2 Fig. 5 Fig. 8 Fig. 11 Fig. 12 Fig. 13 PMID:7479806
Large Vocabulary Audio-Visual Speech Recognition

DTIC Science & Technology

2002-06-12

www.is.cs.cmu.edu Email: waibel(a)cs.cmu~edu Inttractive Systenms Labs ttoctis Ssstms Labs Meeting Browser - -- Interpreting Human Communication "Why did...Speech Interacti Stams Labs t-cive Systms Focus of Attention Tracking Conclusion - Complete Model of Human Communication is Needed - Include all
Multisensory and modality specific processing of visual speech in different regions of the premotor cortex

PubMed Central

Callan, Daniel E.; Jones, Jeffery A.; Callan, Akiko

2014-01-01

Behavioral and neuroimaging studies have demonstrated that brain regions involved with speech production also support speech perception, especially under degraded conditions. The premotor cortex (PMC) has been shown to be active during both observation and execution of action (“Mirror System” properties), and may facilitate speech perception by mapping unimodal and multimodal sensory features onto articulatory speech gestures. For this functional magnetic resonance imaging (fMRI) study, participants identified vowels produced by a speaker in audio-visual (saw the speaker's articulating face and heard her voice), visual only (only saw the speaker's articulating face), and audio only (only heard the speaker's voice) conditions with varying audio signal-to-noise ratios in order to determine the regions of the PMC involved with multisensory and modality specific processing of visual speech gestures. The task was designed so that identification could be made with a high level of accuracy from visual only stimuli to control for task difficulty and differences in intelligibility. The results of the functional magnetic resonance imaging (fMRI) analysis for visual only and audio-visual conditions showed overlapping activity in inferior frontal gyrus and PMC. The left ventral inferior premotor cortex (PMvi) showed properties of multimodal (audio-visual) enhancement with a degraded auditory signal. The left inferior parietal lobule and right cerebellum also showed these properties. The left ventral superior and dorsal premotor cortex (PMvs/PMd) did not show this multisensory enhancement effect, but there was greater activity for the visual only over audio-visual conditions in these areas. The results suggest that the inferior regions of the ventral premotor cortex are involved with integrating multisensory information, whereas, more superior and dorsal regions of the PMC are involved with mapping unimodal (in this case visual) sensory features of the speech signal with
Linguistic experience and audio-visual perception of non-native fricatives.

PubMed

Wang, Yue; Behne, Dawn M; Jiang, Haisheng

2008-09-01

This study examined the effects of linguistic experience on audio-visual (AV) perception of non-native (L2) speech. Canadian English natives and Mandarin Chinese natives differing in degree of English exposure [long and short length of residence (LOR) in Canada] were presented with English fricatives of three visually distinct places of articulation: interdentals nonexistent in Mandarin and labiodentals and alveolars common in both languages. Stimuli were presented in quiet and in a cafe-noise background in four ways: audio only (A), visual only (V), congruent AV (AVc), and incongruent AV (AVi). Identification results showed that overall performance was better in the AVc than in the A or V condition and better in quiet than in cafe noise. While the Mandarin long LOR group approximated the native English patterns, the short LOR group showed poorer interdental identification, more reliance on visual information, and greater AV-fusion with the AVi materials, indicating the failure of L2 visual speech category formation with the short LOR non-natives and the positive effects of linguistic experience with the long LOR non-natives. These results point to an integrated network in AV speech processing as a function of linguistic background and provide evidence to extend auditory-based L2 speech learning theories to the visual domain.
Solid State Audio/Speech Processor Analysis.

DTIC Science & Technology

1980-03-01

techniques. The techniques were demonstrated to be worthwhile in an efficient realtime AWR system. Finally, microprocessor architectures were designed to...do not include custom chip development, detailed hardware design , construction or testing. ITTDCD is very encouraged by the results obtained in this...California, Berkley, was responsible for furnishing the simulation data of OD speech analysis techniques and for the design and development of the hardware OD
Presentation video retrieval using automatically recovered slide and spoken text

NASA Astrophysics Data System (ADS)

Cooper, Matthew

2013-03-01

Video is becoming a prevalent medium for e-learning. Lecture videos contain text information in both the presentation slides and lecturer's speech. This paper examines the relative utility of automatically recovered text from these sources for lecture video retrieval. To extract the visual information, we automatically detect slides within the videos and apply optical character recognition to obtain their text. Automatic speech recognition is used similarly to extract spoken text from the recorded audio. We perform controlled experiments with manually created ground truth for both the slide and spoken text from more than 60 hours of lecture video. We compare the automatically extracted slide and spoken text in terms of accuracy relative to ground truth, overlap with one another, and utility for video retrieval. Results reveal that automatically recovered slide text and spoken text contain different content with varying error profiles. Experiments demonstrate that automatically extracted slide text enables higher precision video retrieval than automatically recovered spoken text.
A Diagnostic Marker to Discriminate Childhood Apraxia of Speech From Speech Delay: I. Development and Description of the Pause Marker

PubMed Central

Strand, Edythe A.; Fourakis, Marios; Jakielski, Kathy J.; Hall, Sheryl D.; Karlsson, Heather B.; Mabie, Heather L.; McSweeny, Jane L.; Tilkens, Christie M.; Wilson, David L.

2017-01-01

Purpose The goal of this article (PM I) is to describe the rationale for and development of the Pause Marker (PM), a single-sign diagnostic marker proposed to discriminate early or persistent childhood apraxia of speech from speech delay. Method The authors describe and prioritize 7 criteria with which to evaluate the research and clinical utility of a diagnostic marker for childhood apraxia of speech, including evaluation of the present proposal. An overview is given of the Speech Disorders Classification System, including extensions completed in the same approximately 3-year period in which the PM was developed. Results The finalized Speech Disorders Classification System includes a nosology and cross-classification procedures for childhood and persistent speech disorders and motor speech disorders (Shriberg, Strand, & Mabie, 2017). A PM is developed that provides procedural and scoring information, and citations to papers and technical reports that include audio exemplars of the PM and reference data used to standardize PM scores are provided. Conclusions The PM described here is an acoustic-aided perceptual sign that quantifies one aspect of speech precision in the linguistic domain of phrasing. This diagnostic marker can be used to discriminate early or persistent childhood apraxia of speech from speech delay. PMID:28384779
Audio stream classification for multimedia database search

NASA Astrophysics Data System (ADS)

Artese, M.; Bianco, S.; Gagliardi, I.; Gasparini, F.

2013-03-01

Search and retrieval of huge archives of Multimedia data is a challenging task. A classification step is often used to reduce the number of entries on which to perform the subsequent search. In particular, when new entries of the database are continuously added, a fast classification based on simple threshold evaluation is desirable. In this work we present a CART-based (Classification And Regression Tree [1]) classification framework for audio streams belonging to multimedia databases. The database considered is the Archive of Ethnography and Social History (AESS) [2], which is mainly composed of popular songs and other audio records describing the popular traditions handed down generation by generation, such as traditional fairs, and customs. The peculiarities of this database are that it is continuously updated; the audio recordings are acquired in unconstrained environment; and for the non-expert human user is difficult to create the ground truth labels. In our experiments, half of all the available audio files have been randomly extracted and used as training set. The remaining ones have been used as test set. The classifier has been trained to distinguish among three different classes: speech, music, and song. All the audio files in the dataset have been previously manually labeled into the three classes above defined by domain experts.
An Evaluation of Text-to-Speech Synthesizers in the Foreign Language Classroom: Learners' Perceptions

ERIC Educational Resources Information Center

Bione, Tiago; Grimshaw, Jennica; Cardoso, Walcir

2016-01-01

As stated in Cardoso, Smith, and Garcia Fuentes (2015), second language researchers and practitioners have explored the pedagogical capabilities of Text-To-Speech synthesizers (TTS) for their potential to enhance the acquisition of writing (e.g. Kirstein, 2006), vocabulary and reading (e.g. Proctor, Dalton, & Grisham, 2007), and pronunciation…
Auditory and audio-visual processing in patients with cochlear, auditory brainstem, and auditory midbrain implants: An EEG study.

PubMed

Schierholz, Irina; Finke, Mareike; Kral, Andrej; Büchner, Andreas; Rach, Stefan; Lenarz, Thomas; Dengler, Reinhard; Sandmann, Pascale

2017-04-01

There is substantial variability in speech recognition ability across patients with cochlear implants (CIs), auditory brainstem implants (ABIs), and auditory midbrain implants (AMIs). To better understand how this variability is related to central processing differences, the current electroencephalography (EEG) study compared hearing abilities and auditory-cortex activation in patients with electrical stimulation at different sites of the auditory pathway. Three different groups of patients with auditory implants (Hannover Medical School; ABI: n = 6, CI: n = 6; AMI: n = 2) performed a speeded response task and a speech recognition test with auditory, visual, and audio-visual stimuli. Behavioral performance and cortical processing of auditory and audio-visual stimuli were compared between groups. ABI and AMI patients showed prolonged response times on auditory and audio-visual stimuli compared with NH listeners and CI patients. This was confirmed by prolonged N1 latencies and reduced N1 amplitudes in ABI and AMI patients. However, patients with central auditory implants showed a remarkable gain in performance when visual and auditory input was combined, in both speech and non-speech conditions, which was reflected by a strong visual modulation of auditory-cortex activation in these individuals. In sum, the results suggest that the behavioral improvement for audio-visual conditions in central auditory implant patients is based on enhanced audio-visual interactions in the auditory cortex. Their findings may provide important implications for the optimization of electrical stimulation and rehabilitation strategies in patients with central auditory prostheses. Hum Brain Mapp 38:2206-2225, 2017. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
Vector adaptive predictive coder for speech and audio

NASA Technical Reports Server (NTRS)

Chen, Juin-Hwey (Inventor); Gersho, Allen (Inventor)

1990-01-01

A real-time vector adaptive predictive coder which approximates each vector of K speech samples by using each of M fixed vectors in a first codebook to excite a time-varying synthesis filter and picking the vector that minimizes distortion. Predictive analysis for each frame determines parameters used for computing from vectors in the first codebook zero-state response vectors that are stored at the same address (index) in a second codebook. Encoding of input speech vectors s.sub.n is then carried out using the second codebook. When the vector that minimizes distortion is found, its index is transmitted to a decoder which has a codebook identical to the first codebook of the decoder. There the index is used to read out a vector that is used to synthesize an output speech vector s.sub.n. The parameters used in the encoder are quantized, for example by using a table, and the indices are transmitted to the decoder where they are decoded to specify transfer characteristics of filters used in producing the vector s.sub.n from the receiver codebook vector selected by the vector index transmitted.
Parametric Representation of the Speaker's Lips for Multimodal Sign Language and Speech Recognition

NASA Astrophysics Data System (ADS)

Ryumin, D.; Karpov, A. A.

2017-05-01

In this article, we propose a new method for parametric representation of human's lips region. The functional diagram of the method is described and implementation details with the explanation of its key stages and features are given. The results of automatic detection of the regions of interest are illustrated. A speed of the method work using several computers with different performances is reported. This universal method allows applying parametrical representation of the speaker's lipsfor the tasks of biometrics, computer vision, machine learning, and automatic recognition of face, elements of sign languages, and audio-visual speech, including lip-reading.
The Sweet-Home project: audio processing and decision making in smart home to improve well-being and reliance.

PubMed

Vacher, Michel; Chahuara, Pedro; Lecouteux, Benjamin; Istrate, Dan; Portet, Francois; Joubert, Thierry; Sehili, Mohamed; Meillon, Brigitte; Bonnefond, Nicolas; Fabre, Sébastien; Roux, Camille; Caffiau, Sybille

2013-01-01

The Sweet-Home project aims at providing audio-based interaction technology that lets the user have full control over their home environment, at detecting distress situations and at easing the social inclusion of the elderly and frail population. This paper presents an overview of the project focusing on the implemented techniques for speech and sound recognition as context-aware decision making with uncertainty. A user experiment in a smart home demonstrates the interest of this audio-based technology.
The influence of selective attention to auditory and visual speech on the integration of audiovisual speech information.

PubMed

Buchan, Julie N; Munhall, Kevin G

2011-01-01

Conflicting visual speech information can influence the perception of acoustic speech, causing an illusory percept of a sound not present in the actual acoustic speech (the McGurk effect). We examined whether participants can voluntarily selectively attend to either the auditory or visual modality by instructing participants to pay attention to the information in one modality and to ignore competing information from the other modality. We also examined how performance under these instructions was affected by weakening the influence of the visual information by manipulating the temporal offset between the audio and video channels (experiment 1), and the spatial frequency information present in the video (experiment 2). Gaze behaviour was also monitored to examine whether attentional instructions influenced the gathering of visual information. While task instructions did have an influence on the observed integration of auditory and visual speech information, participants were unable to completely ignore conflicting information, particularly information from the visual stream. Manipulating temporal offset had a more pronounced interaction with task instructions than manipulating the amount of visual information. Participants' gaze behaviour suggests that the attended modality influences the gathering of visual information in audiovisual speech perception.
NaturalReader: A New Generation Text Reader

ERIC Educational Resources Information Center

Flood, Jacqueline

2007-01-01

NaturalReader (http://www.naturalreaders.com/) is a new generation text reader, which means that it reads any machine readable text using synthesized speech without having to copy and paste the selected text into the NaturalReader application window. It installs a toolbar directly into all of the Microsoft Office[TM] programs and uses a mini-board…
Lip movements affect infants' audiovisual speech perception.

PubMed

Yeung, H Henny; Werker, Janet F

2013-05-01

Speech is robustly audiovisual from early in infancy. Here we show that audiovisual speech perception in 4.5-month-old infants is influenced by sensorimotor information related to the lip movements they make while chewing or sucking. Experiment 1 consisted of a classic audiovisual matching procedure, in which two simultaneously displayed talking faces (visual [i] and [u]) were presented with a synchronous vowel sound (audio /i/ or /u/). Infants' looking patterns were selectively biased away from the audiovisual matching face when the infants were producing lip movements similar to those needed to produce the heard vowel. Infants' looking patterns returned to those of a baseline condition (no lip movements, looking longer at the audiovisual matching face) when they were producing lip movements that did not match the heard vowel. Experiment 2 confirmed that these sensorimotor effects interacted with the heard vowel, as looking patterns differed when infants produced these same lip movements while seeing and hearing a talking face producing an unrelated vowel (audio /a/). These findings suggest that the development of speech perception and speech production may be mutually informative.

Visual speech information: a help or hindrance in perceptual processing of dysarthric speech.

PubMed

Borrie, Stephanie A

2015-03-01

This study investigated the influence of visual speech information on perceptual processing of neurologically degraded speech. Fifty listeners identified spastic dysarthric speech under both audio (A) and audiovisual (AV) conditions. Condition comparisons revealed that the addition of visual speech information enhanced processing of the neurologically degraded input in terms of (a) acuity (percent phonemes correct) of vowels and consonants and (b) recognition (percent words correct) of predictive and nonpredictive phrases. Listeners exploited stress-based segmentation strategies more readily in AV conditions, suggesting that the perceptual benefit associated with adding visual speech information to the auditory signal-the AV advantage-has both segmental and suprasegmental origins. Results also revealed that the magnitude of the AV advantage can be predicted, to some degree, by the extent to which an individual utilizes syllabic stress cues to inform word recognition in AV conditions. Findings inform the development of a listener-specific model of speech perception that applies to processing of dysarthric speech in everyday communication contexts.
Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation

PubMed Central

Banks, Briony; Gowen, Emma; Munro, Kevin J.; Adank, Patti

2015-01-01

Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker’s facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants’ eye gaze was recorded to verify that they looked at the speaker’s face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation. PMID:26283946
Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.

PubMed

Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti

2015-01-01

Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.
Orthographic Learning and the Role of Text-to-Speech Software in Dutch Disabled Readers

ERIC Educational Resources Information Center

Staels, Eva; Van den Broeck, Wim

2015-01-01

In this study, we examined whether orthographic learning can be demonstrated in disabled readers learning to read in a transparent orthography (Dutch). In addition, we tested the effect of the use of text-to-speech software, a new form of direct instruction, on orthographic learning. Both research goals were investigated by replicating Share's…
Noise-Canceling Helmet Audio System

NASA Technical Reports Server (NTRS)

Seibert, Marc A.; Culotta, Anthony J.

2007-01-01

A prototype helmet audio system has been developed to improve voice communication for the wearer in a noisy environment. The system was originally intended to be used in a space suit, wherein noise generated by airflow of the spacesuit life-support system can make it difficult for remote listeners to understand the astronaut s speech and can interfere with the astronaut s attempt to issue vocal commands to a voice-controlled robot. The system could be adapted to terrestrial use in helmets of protective suits that are typically worn in noisy settings: examples include biohazard, fire, rescue, and diving suits. The system (see figure) includes an array of microphones and small loudspeakers mounted at fixed positions in a helmet, amplifiers and signal-routing circuitry, and a commercial digital signal processor (DSP). Notwithstanding the fixed positions of the microphones and loudspeakers, the system can accommodate itself to any normal motion of the wearer s head within the helmet. The system operates in conjunction with a radio transceiver. An audio signal arriving via the transceiver intended to be heard by the wearer is adjusted in volume and otherwise conditioned and sent to the loudspeakers. The wearer s speech is collected by the microphones, the outputs of which are logically combined (phased) so as to form a microphone- array directional sensitivity pattern that discriminates in favor of sounds coming from vicinity of the wearer s mouth and against sounds coming from elsewhere. In the DSP, digitized samples of the microphone outputs are processed to filter out airflow noise and to eliminate feedback from the loudspeakers to the microphones. The resulting conditioned version of the wearer s speech signal is sent to the transceiver.
An experimental version of the MZT (speech-from-text) system with external F(sub 0) control

NASA Astrophysics Data System (ADS)

Nowak, Ignacy

1994-12-01

The version of a Polish speech from text system described in this article was developed using the speech-from-text system. The new system has additional functions which make it possible to enter commands in edited orthographic text to control the phrase component and accentuation parameters. This makes it possible to generate a series of modified intonation contours in the texts spoken by the system. The effects obtained are made easier to control by a graphic illustration of the base frequency pattern in phrases that were last 'spoken' by the system. This version of the system was designed as a test prototype which will help us expand and refine our set of rules for automatic generation of intonation contours, which in turn will enable the fully automated speech-from-text system to generate speech with a more varied and precisely formed fundamental frequency pattern.
Audio-based queries for video retrieval over Java enabled mobile devices

NASA Astrophysics Data System (ADS)

Ahmad, Iftikhar; Cheikh, Faouzi Alaya; Kiranyaz, Serkan; Gabbouj, Moncef

2006-02-01

In this paper we propose a generic framework for efficient retrieval of audiovisual media based on its audio content. This framework is implemented in a client-server architecture where the client application is developed in Java to be platform independent whereas the server application is implemented for the PC platform. The client application adapts to the characteristics of the mobile device where it runs such as screen size and commands. The entire framework is designed to take advantage of the high-level segmentation and classification of audio content to improve speed and accuracy of audio-based media retrieval. Therefore, the primary objective of this framework is to provide an adaptive basis for performing efficient video retrieval operations based on the audio content and types (i.e. speech, music, fuzzy and silence). Experimental results approve that such an audio based video retrieval scheme can be used from mobile devices to search and retrieve video clips efficiently over wireless networks.
Implementation of Three Text to Speech Systems for Kurdish Language

NASA Astrophysics Data System (ADS)

Bahrampour, Anvar; Barkhoda, Wafa; Azami, Bahram Zahir

Nowadays, concatenative method is used in most modern TTS systems to produce artificial speech. The most important challenge in this method is choosing appropriate unit for creating database. This unit must warranty smoothness and high quality speech, and also, creating database for it must reasonable and inexpensive. For example, syllable, phoneme, allophone, and, diphone are appropriate units for all-purpose systems. In this paper, we implemented three synthesis systems for Kurdish language based on syllable, allophone, and diphone and compare their quality using subjective testing.
Reading's SLiCK with New Audio Texts and Strategies.

ERIC Educational Resources Information Center

Boyle, Elizabeth A.; Washburn, Shari Gallin; Rosenberg, Michael S.; Connelly, Vincent J.; Brinckerhoff, Loring C.; Banerjee, Manju

2002-01-01

This article discusses challenges for secondary students with disabilities and alternative instructional methods that teachers of students with poor reading skills can use to convey content information effectively and efficiently. The use of audio textbooks on CD-ROMs is emphasized and the SLiCK strategy is explained as a support for the CD-ROM.…
Visual-Auditory Integration during Speech Imitation in Autism

ERIC Educational Resources Information Center

Williams, Justin H. G.; Massaro, Dominic W.; Peel, Natalie J.; Bosseler, Alexis; Suddendorf, Thomas

2004-01-01

Children with autistic spectrum disorder (ASD) may have poor audio-visual integration, possibly reflecting dysfunctional "mirror neuron" systems which have been hypothesised to be at the core of the condition. In the present study, a computer program, utilizing speech synthesizer software and a "virtual" head (Baldi), delivered speech stimuli for…
Measuring Implicit and Explicit Attitudes toward Foreign-Accented Speech

ERIC Educational Resources Information Center

Pantos, Andrew J.

2010-01-01

The purpose of this research was to investigate the nature of listeners' attitudes toward foreign-accented speech and the manner in which those attitudes are formed. This study measured 165 participants' implicit and explicit attitudes toward US- and foreign-accented audio stimuli. Implicit attitudes were measured with an audio Implicit…
Investigating an Application of Speech-to-Text Recognition: A Study on Visual Attention and Learning Behaviour

ERIC Educational Resources Information Center

Huang, Y-M.; Liu, C-J.; Shadiev, Rustam; Shen, M-H.; Hwang, W-Y.

2015-01-01

One major drawback of previous research on speech-to-text recognition (STR) is that most findings showing the effectiveness of STR for learning were based upon subjective evidence. Very few studies have used eye-tracking techniques to investigate visual attention of students on STR-generated text. Furthermore, not much attention was paid to…
A hybrid technique for speech segregation and classification using a sophisticated deep neural network

PubMed Central

Nawaz, Tabassam; Mehmood, Zahid; Rashid, Muhammad; Habib, Hafiz Adnan

2018-01-01

Recent research on speech segregation and music fingerprinting has led to improvements in speech segregation and music identification algorithms. Speech and music segregation generally involves the identification of music followed by speech segregation. However, music segregation becomes a challenging task in the presence of noise. This paper proposes a novel method of speech segregation for unlabelled stationary noisy audio signals using the deep belief network (DBN) model. The proposed method successfully segregates a music signal from noisy audio streams. A recurrent neural network (RNN)-based hidden layer segregation model is applied to remove stationary noise. Dictionary-based fisher algorithms are employed for speech classification. The proposed method is tested on three datasets (TIMIT, MIR-1K, and MusicBrainz), and the results indicate the robustness of proposed method for speech segregation. The qualitative and quantitative analysis carried out on three datasets demonstrate the efficiency of the proposed method compared to the state-of-the-art speech segregation and classification-based methods. PMID:29558485
Fifty years of progress in speech and speaker recognition

NASA Astrophysics Data System (ADS)

Furui, Sadaoki

2004-10-01

Speech and speaker recognition technology has made very significant progress in the past 50 years. The progress can be summarized by the following changes: (1) from template matching to corpus-base statistical modeling, e.g., HMM and n-grams, (2) from filter bank/spectral resonance to Cepstral features (Cepstrum + DCepstrum + DDCepstrum), (3) from heuristic time-normalization to DTW/DP matching, (4) from gdistanceh-based to likelihood-based methods, (5) from maximum likelihood to discriminative approach, e.g., MCE/GPD and MMI, (6) from isolated word to continuous speech recognition, (7) from small vocabulary to large vocabulary recognition, (8) from context-independent units to context-dependent units for recognition, (9) from clean speech to noisy/telephone speech recognition, (10) from single speaker to speaker-independent/adaptive recognition, (11) from monologue to dialogue/conversation recognition, (12) from read speech to spontaneous speech recognition, (13) from recognition to understanding, (14) from single-modality (audio signal only) to multi-modal (audio/visual) speech recognition, (15) from hardware recognizer to software recognizer, and (16) from no commercial application to many practical commercial applications. Most of these advances have taken place in both the fields of speech recognition and speaker recognition. The majority of technological changes have been directed toward the purpose of increasing robustness of recognition, including many other additional important techniques not noted above.
Audio-guided audiovisual data segmentation, indexing, and retrieval

NASA Astrophysics Data System (ADS)

Zhang, Tong; Kuo, C.-C. Jay

1998-12-01

While current approaches for video segmentation and indexing are mostly focused on visual information, audio signals may actually play a primary role in video content parsing. In this paper, we present an approach for automatic segmentation, indexing, and retrieval of audiovisual data, based on audio content analysis. The accompanying audio signal of audiovisual data is first segmented and classified into basic types, i.e., speech, music, environmental sound, and silence. This coarse-level segmentation and indexing step is based upon morphological and statistical analysis of several short-term features of the audio signals. Then, environmental sounds are classified into finer classes, such as applause, explosions, bird sounds, etc. This fine-level classification and indexing step is based upon time- frequency analysis of audio signals and the use of the hidden Markov model as the classifier. On top of this archiving scheme, an audiovisual data retrieval system is proposed. Experimental results show that the proposed approach has an accuracy rate higher than 90 percent for the coarse-level classification, and higher than 85 percent for the fine-level classification. Examples of audiovisual data segmentation and retrieval are also provided.
Cepstral domain modification of audio signals for data embedding: preliminary results

NASA Astrophysics Data System (ADS)

Gopalan, Kaliappan

2004-06-01

A method of embedding data in an audio signal using cepstral domain modification is described. Based on successful embedding in the spectral points of perceptually masked regions in each frame of speech, first the technique was extended to embedding in the log spectral domain. This extension resulted at approximately 62 bits /s of embedding with less than 2 percent of bit error rate (BER) for a clean cover speech (from the TIMIT database), and about 2.5 percent for a noisy speech (from an air traffic controller database), when all frames - including silence and transition between voiced and unvoiced segments - were used. Bit error rate increased significantly when the log spectrum in the vicinity of a formant was modified. In the next procedure, embedding by altering the mean cepstral values of two ranges of indices was studied. Tests on both a noisy utterance and a clean utterance indicated barely noticeable perceptual change in speech quality when lower range of cepstral indices - corresponding to vocal tract region - was modified in accordance with data. With an embedding capacity of approximately 62 bits/s - using one bit per each frame regardless of frame energy or type of speech - initial results showed a BER of less than 1.5 percent for a payload capacity of 208 embedded bits using the clean cover speech. BER of less than 1.3 percent resulted for the noisy host with a capacity was 316 bits. When the cepstrum was modified in the region of excitation, BER increased to over 10 percent. With quantization causing no significant problem, the technique warrants further studies with different cepstral ranges and sizes. Pitch-synchronous cepstrum modification, for example, may be more robust to attacks. In addition, cepstrum modification in regions of speech that are perceptually masked - analogous to embedding in frequency masked regions - may yield imperceptible stego audio with low BER.
Machine Learning Methods for Articulatory Data

ERIC Educational Resources Information Center

Berry, Jeffrey James

2012-01-01

Humans make use of more than just the audio signal to perceive speech. Behavioral and neurological research has shown that a person's knowledge of how speech is produced influences what is perceived. With methods for collecting articulatory data becoming more ubiquitous, methods for extracting useful information are needed to make this data…
Ultrasonic speech translator and communications system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Akerman, M.A.; Ayers, C.W.; Haynes, H.D.

1996-07-23

A wireless communication system undetectable by radio frequency methods for converting audio signals, including human voice, to electronic signals in the ultrasonic frequency range, transmitting the ultrasonic signal by way of acoustical pressure waves across a carrier medium, including gases, liquids, or solids, and reconverting the ultrasonic acoustical pressure waves back to the original audio signal. The ultrasonic speech translator and communication system includes an ultrasonic transmitting device and an ultrasonic receiving device. The ultrasonic transmitting device accepts as input an audio signal such as human voice input from a microphone or tape deck. The ultrasonic transmitting device frequency modulatesmore » an ultrasonic carrier signal with the audio signal producing a frequency modulated ultrasonic carrier signal, which is transmitted via acoustical pressure waves across a carrier medium such as gases, liquids or solids. The ultrasonic receiving device converts the frequency modulated ultrasonic acoustical pressure waves to a frequency modulated electronic signal, demodulates the audio signal from the ultrasonic carrier signal, and conditions the demodulated audio signal to reproduce the original audio signal at its output. 7 figs.« less
Ultrasonic speech translator and communications system

DOEpatents

Akerman, M.A.; Ayers, C.W.; Haynes, H.D.

1996-07-23

A wireless communication system undetectable by radio frequency methods for converting audio signals, including human voice, to electronic signals in the ultrasonic frequency range, transmitting the ultrasonic signal by way of acoustical pressure waves across a carrier medium, including gases, liquids, or solids, and reconverting the ultrasonic acoustical pressure waves back to the original audio signal. The ultrasonic speech translator and communication system includes an ultrasonic transmitting device and an ultrasonic receiving device. The ultrasonic transmitting device accepts as input an audio signal such as human voice input from a microphone or tape deck. The ultrasonic transmitting device frequency modulates an ultrasonic carrier signal with the audio signal producing a frequency modulated ultrasonic carrier signal, which is transmitted via acoustical pressure waves across a carrier medium such as gases, liquids or solids. The ultrasonic receiving device converts the frequency modulated ultrasonic acoustical pressure waves to a frequency modulated electronic signal, demodulates the audio signal from the ultrasonic carrier signal, and conditions the demodulated audio signal to reproduce the original audio signal at its output. 7 figs.
Ultrasonic speech translator and communications system

DOEpatents

Akerman, M. Alfred; Ayers, Curtis W.; Haynes, Howard D.

1996-01-01

A wireless communication system undetectable by radio frequency methods for converting audio signals, including human voice, to electronic signals in the ultrasonic frequency range, transmitting the ultrasonic signal by way of acoustical pressure waves across a carrier medium, including gases, liquids, or solids, and reconverting the ultrasonic acoustical pressure waves back to the original audio signal. The ultrasonic speech translator and communication system (20) includes an ultrasonic transmitting device (100) and an ultrasonic receiving device (200). The ultrasonic transmitting device (100) accepts as input (115) an audio signal such as human voice input from a microphone (114) or tape deck. The ultrasonic transmitting device (100) frequency modulates an ultrasonic carrier signal with the audio signal producing a frequency modulated ultrasonic carrier signal, which is transmitted via acoustical pressure waves across a carrier medium such as gases, liquids or solids. The ultrasonic receiving device (200) converts the frequency modulated ultrasonic acoustical pressure waves to a frequency modulated electronic signal, demodulates the audio signal from the ultrasonic carrier signal, and conditions the demodulated audio signal to reproduce the original audio signal at its output (250).

Dialogue enabling speech-to-text user assistive agent system for hearing-impaired person.

PubMed

Lee, Seongjae; Kang, Sunmee; Han, David K; Ko, Hanseok

2016-06-01

A novel approach for assisting bidirectional communication between people of normal hearing and hearing-impaired is presented. While the existing hearing-impaired assistive devices such as hearing aids and cochlear implants are vulnerable in extreme noise conditions or post-surgery side effects, the proposed concept is an alternative approach wherein spoken dialogue is achieved by means of employing a robust speech recognition technique which takes into consideration of noisy environmental factors without any attachment into human body. The proposed system is a portable device with an acoustic beamformer for directional noise reduction and capable of performing speech-to-text transcription function, which adopts a keyword spotting method. It is also equipped with an optimized user interface for hearing-impaired people, rendering intuitive and natural device usage with diverse domain contexts. The relevant experimental results confirm that the proposed interface design is feasible for realizing an effective and efficient intelligent agent for hearing-impaired.
Neural entrainment to rhythmic speech in children with developmental dyslexia

PubMed Central

Power, Alan J.; Mead, Natasha; Barnes, Lisa; Goswami, Usha

2013-01-01

A rhythmic paradigm based on repetition of the syllable “ba” was used to study auditory, visual, and audio-visual oscillatory entrainment to speech in children with and without dyslexia using EEG. Children pressed a button whenever they identified a delay in the isochronous stimulus delivery (500 ms; 2 Hz delta band rate). Response power, strength of entrainment and preferred phase of entrainment in the delta and theta frequency bands were compared between groups. The quality of stimulus representation was also measured using cross-correlation of the stimulus envelope with the neural response. The data showed a significant group difference in the preferred phase of entrainment in the delta band in response to the auditory and audio-visual stimulus streams. A different preferred phase has significant implications for the quality of speech information that is encoded neurally, as it implies enhanced neuronal processing (phase alignment) at less informative temporal points in the incoming signal. Consistent with this possibility, the cross-correlogram analysis revealed superior stimulus representation by the control children, who showed a trend for larger peak r-values and significantly later lags in peak r-values compared to participants with dyslexia. Significant relationships between both peak r-values and peak lags were found with behavioral measures of reading. The data indicate that the auditory temporal reference frame for speech processing is atypical in developmental dyslexia, with low frequency (delta) oscillations entraining to a different phase of the rhythmic syllabic input. This would affect the quality of encoding of speech, and could underlie the cognitive impairments in phonological representation that are the behavioral hallmark of this developmental disorder across languages. PMID:24376407
Audio-visual speech perception in prelingually deafened Japanese children following sequential bilateral cochlear implantation.

PubMed

Yamamoto, Ryosuke; Naito, Yasushi; Tona, Risa; Moroto, Saburo; Tamaya, Rinko; Fujiwara, Keizo; Shinohara, Shogo; Takebayashi, Shinji; Kikuchi, Masahiro; Michida, Tetsuhiko

2017-11-01

An effect of audio-visual (AV) integration is observed when the auditory and visual stimuli are incongruent (the McGurk effect). In general, AV integration is helpful especially in subjects wearing hearing aids or cochlear implants (CIs). However, the influence of AV integration on spoken word recognition in individuals with bilateral CIs (Bi-CIs) has not been fully investigated so far. In this study, we investigated AV integration in children with Bi-CIs. The study sample included thirty one prelingually deafened children who underwent sequential bilateral cochlear implantation. We assessed their responses to congruent and incongruent AV stimuli with three CI-listening modes: only the 1st CI, only the 2nd CI, and Bi-CIs. The responses were assessed in the whole group as well as in two sub-groups: a proficient group (syllable intelligibility ≥80% with the 1st CI) and a non-proficient group (syllable intelligibility < 80% with the 1st CI). We found evidence of the McGurk effect in each of the three CI-listening modes. AV integration responses were observed in a subset of incongruent AV stimuli, and the patterns observed with the 1st CI and with Bi-CIs were similar. In the proficient group, the responses with the 2nd CI were not significantly different from those with the 1st CI whereas in the non-proficient group the responses with the 2nd CI were driven by visual stimuli more than those with the 1st CI. Our results suggested that prelingually deafened Japanese children who underwent sequential bilateral cochlear implantation exhibit AV integration abilities, both in monaural listening as well as in binaural listening. We also observed a higher influence of visual stimuli on speech perception with the 2nd CI in the non-proficient group, suggesting that Bi-CIs listeners with poorer speech recognition rely on visual information more compared to the proficient subjects to compensate for poorer auditory input. Nevertheless, poorer quality auditory input with the 2nd
Perceptual congruency of audio-visual speech affects ventriloquism with bilateral visual stimuli.

PubMed

Kanaya, Shoko; Yokosawa, Kazuhiko

2011-02-01

Many studies on multisensory processes have focused on performance in simplified experimental situations, with a single stimulus in each sensory modality. However, these results cannot necessarily be applied to explain our perceptual behavior in natural scenes where various signals exist within one sensory modality. We investigated the role of audio-visual syllable congruency on participants' auditory localization bias or the ventriloquism effect using spoken utterances and two videos of a talking face. Salience of facial movements was also manipulated. Results indicated that more salient visual utterances attracted participants' auditory localization. Congruent pairing of audio-visual utterances elicited greater localization bias than incongruent pairing, while previous studies have reported little dependency on the reality of stimuli in ventriloquism. Moreover, audio-visual illusory congruency, owing to the McGurk effect, caused substantial visual interference on auditory localization. Multisensory performance appears more flexible and adaptive in this complex environment than in previous studies.
Toward diagnostic and phenotype markers for genetically transmitted speech delay.

PubMed

Shriberg, Lawrence D; Lewis, Barbara A; Tomblin, J Bruce; McSweeny, Jane L; Karlsson, Heather B; Scheer, Alison R

2005-08-01

Converging evidence supports the hypothesis that the most common subtype of childhood speech sound disorder (SSD) of currently unknown origin is genetically transmitted. We report the first findings toward a set of diagnostic markers to differentiate this proposed etiological subtype (provisionally termed speech delay-genetic) from other proposed subtypes of SSD of unknown origin. Conversational speech samples from 72 preschool children with speech delay of unknown origin from 3 research centers were selected from an audio archive. Participants differed on the number of biological, nuclear family members (0 or 2+) classified as positive for current and/or prior speech-language disorder. Although participants in the 2 groups were found to have similar speech competence, as indexed by their Percentage of Consonants Correct scores, their speech error patterns differed significantly in 3 ways. Compared with children who may have reduced genetic load for speech delay (no affected nuclear family members), children with possibly higher genetic load (2+ affected members) had (a) a significantly higher proportion of relative omission errors on the Late-8 consonants; (b) a significantly lower proportion of relative distortion errors on these consonants, particularly on the sibilant fricatives /s/, /z/, and //; and (c) a significantly lower proportion of backed /s/ distortions, as assessed by both perceptual and acoustic methods. Machine learning routines identified a 3-part classification rule that included differential weightings of these variables. The classification rule had diagnostic accuracy value of 0.83 (95% confidence limits = 0.74-0.92), with positive and negative likelihood ratios of 9.6 (95% confidence limits = 3.1-29.9) and 0.40 (95% confidence limits = 0.24-0.68), respectively. The diagnostic accuracy findings are viewed as promising. The error pattern for this proposed subtype of SSD is viewed as consistent with the cognitive-linguistic processing deficits
The Downside of Greater Lexical Influences: Selectively Poorer Speech Perception in Noise

PubMed Central

Xie, Zilong; Tessmer, Rachel; Chandrasekaran, Bharath

2017-01-01

Purpose Although lexical information influences phoneme perception, the extent to which reliance on lexical information enhances speech processing in challenging listening environments is unclear. We examined the extent to which individual differences in lexical influences on phonemic processing impact speech processing in maskers containing varying degrees of linguistic information (2-talker babble or pink noise). Method Twenty-nine monolingual English speakers were instructed to ignore the lexical status of spoken syllables (e.g., gift vs. kift) and to only categorize the initial phonemes (/g/ vs. /k/). The same participants then performed speech recognition tasks in the presence of 2-talker babble or pink noise in audio-only and audiovisual conditions. Results Individuals who demonstrated greater lexical influences on phonemic processing experienced greater speech processing difficulties in 2-talker babble than in pink noise. These selective difficulties were present across audio-only and audiovisual conditions. Conclusion Individuals with greater reliance on lexical processes during speech perception exhibit impaired speech recognition in listening conditions in which competing talkers introduce audible linguistic interferences. Future studies should examine the locus of lexical influences/interferences on phonemic processing and speech-in-speech processing. PMID:28586824
Machine printed text and handwriting identification in noisy document images.

PubMed

Zheng, Yefeng; Li, Huiping; Doermann, David

2004-03-01

In this paper, we address the problem of the identification of text in noisy document images. We are especially focused on segmenting and identifying between handwriting and machine printed text because: 1) Handwriting in a document often indicates corrections, additions, or other supplemental information that should be treated differently from the main content and 2) the segmentation and recognition techniques requested for machine printed and handwritten text are significantly different. A novel aspect of our approach is that we treat noise as a separate class and model noise based on selected features. Trained Fisher classifiers are used to identify machine printed text and handwriting from noise and we further exploit context to refine the classification. A Markov Random Field-based (MRF) approach is used to model the geometrical structure of the printed text, handwriting, and noise to rectify misclassifications. Experimental results show that our approach is robust and can significantly improve page segmentation in noisy document collections.
Investigating the Effectiveness of Speech-To-Text Recognition Applications on Learning Performance, Attention, and Meditation

ERIC Educational Resources Information Center

Shadiev, Rustam; Huang, Yueh-Min; Hwang, Jan-Pan

2017-01-01

In this study, the effectiveness of the application of speech-to-text recognition (STR) technology on enhancing learning and concentration in a calm state of mind, hereafter referred to as meditation (An intentional and self-regulated focusing of attention in order to relax and calm the mind), was investigated. This effectiveness was further…
Using Text-to-Speech Reading Support for an Adult with Mild Aphasia and Cognitive Impairment

ERIC Educational Resources Information Center

Harvey, Judy; Hux, Karen; Snell, Jeffry

2013-01-01

This single case study served to examine text-to-speech (TTS) effects on reading rate and comprehension in an individual with mild aphasia and cognitive impairment. Findings showed faster reading, given TTS presented at a normal speaking rate, but no significant comprehension changes. TTS may support reading in people with aphasia when time…
Language/Culture Modulates Brain and Gaze Processes in Audiovisual Speech Perception.

PubMed

Hisanaga, Satoko; Sekiyama, Kaoru; Igasaki, Tomohiko; Murayama, Nobuki

2016-10-13

Several behavioural studies have shown that the interplay between voice and face information in audiovisual speech perception is not universal. Native English speakers (ESs) are influenced by visual mouth movement to a greater degree than native Japanese speakers (JSs) when listening to speech. However, the biological basis of these group differences is unknown. Here, we demonstrate the time-varying processes of group differences in terms of event-related brain potentials (ERP) and eye gaze for audiovisual and audio-only speech perception. On a behavioural level, while congruent mouth movement shortened the ESs' response time for speech perception, the opposite effect was observed in JSs. Eye-tracking data revealed a gaze bias to the mouth for the ESs but not the JSs, especially before the audio onset. Additionally, the ERP P2 amplitude indicated that ESs processed multisensory speech more efficiently than auditory-only speech; however, the JSs exhibited the opposite pattern. Taken together, the ESs' early visual attention to the mouth was likely to promote phonetic anticipation, which was not the case for the JSs. These results clearly indicate the impact of language and/or culture on multisensory speech processing, suggesting that linguistic/cultural experiences lead to the development of unique neural systems for audiovisual speech perception.
Contributions of speech science to the technology of man-machine voice interactions

NASA Technical Reports Server (NTRS)

Lea, Wayne A.

1977-01-01

Research in speech understanding was reviewed. Plans which include prosodics research, phonological rules for speech understanding systems, and continued interdisciplinary phonetics research are discussed. Improved acoustic phonetic analysis capabilities in speech recognizers are suggested.
Intentional Voice Command Detection for Trigger-Free Speech Interface

NASA Astrophysics Data System (ADS)

Obuchi, Yasunari; Sumiyoshi, Takashi

In this paper we introduce a new framework of audio processing, which is essential to achieve a trigger-free speech interface for home appliances. If the speech interface works continually in real environments, it must extract occasional voice commands and reject everything else. It is extremely important to reduce the number of false alarms because the number of irrelevant inputs is much larger than the number of voice commands even for heavy users of appliances. The framework, called Intentional Voice Command Detection, is based on voice activity detection, but enhanced by various speech/audio processing techniques such as emotion recognition. The effectiveness of the proposed framework is evaluated using a newly-collected large-scale corpus. The advantages of combining various features were tested and confirmed, and the simple LDA-based classifier demonstrated acceptable performance. The effectiveness of various methods of user adaptation is also discussed.
Introduction to Human Services, Chapter III. Video Script Package, Text, and Audio Script Package.

ERIC Educational Resources Information Center

Miami-Dade Community Coll., FL.

Video, textual, and audio components of the third module of a multi-media, introductory course on Human Services are presented. The module packages, developed at Miami-Dade Community College, deal with technology, social change, and problem dependencies. A video cassette script is first provided that explores the "traditional,""inner," and "other…
Audio-Visual Temporal Recalibration Can be Constrained by Content Cues Regardless of Spatial Overlap.

PubMed

Roseboom, Warrick; Kawabe, Takahiro; Nishida, Shin'ya

2013-01-01

It has now been well established that the point of subjective synchrony for audio and visual events can be shifted following exposure to asynchronous audio-visual presentations, an effect often referred to as temporal recalibration. Recently it was further demonstrated that it is possible to concurrently maintain two such recalibrated estimates of audio-visual temporal synchrony. However, it remains unclear precisely what defines a given audio-visual pair such that it is possible to maintain a temporal relationship distinct from other pairs. It has been suggested that spatial separation of the different audio-visual pairs is necessary to achieve multiple distinct audio-visual synchrony estimates. Here we investigated if this is necessarily true. Specifically, we examined whether it is possible to obtain two distinct temporal recalibrations for stimuli that differed only in featural content. Using both complex (audio visual speech; see Experiment 1) and simple stimuli (high and low pitch audio matched with either vertically or horizontally oriented Gabors; see Experiment 2) we found concurrent, and opposite, recalibrations despite there being no spatial difference in presentation location at any point throughout the experiment. This result supports the notion that the content of an audio-visual pair alone can be used to constrain distinct audio-visual synchrony estimates regardless of spatial overlap.
Audio-Visual Temporal Recalibration Can be Constrained by Content Cues Regardless of Spatial Overlap

PubMed Central

Roseboom, Warrick; Kawabe, Takahiro; Nishida, Shin’Ya

2013-01-01

It has now been well established that the point of subjective synchrony for audio and visual events can be shifted following exposure to asynchronous audio-visual presentations, an effect often referred to as temporal recalibration. Recently it was further demonstrated that it is possible to concurrently maintain two such recalibrated estimates of audio-visual temporal synchrony. However, it remains unclear precisely what defines a given audio-visual pair such that it is possible to maintain a temporal relationship distinct from other pairs. It has been suggested that spatial separation of the different audio-visual pairs is necessary to achieve multiple distinct audio-visual synchrony estimates. Here we investigated if this is necessarily true. Specifically, we examined whether it is possible to obtain two distinct temporal recalibrations for stimuli that differed only in featural content. Using both complex (audio visual speech; see Experiment 1) and simple stimuli (high and low pitch audio matched with either vertically or horizontally oriented Gabors; see Experiment 2) we found concurrent, and opposite, recalibrations despite there being no spatial difference in presentation location at any point throughout the experiment. This result supports the notion that the content of an audio-visual pair alone can be used to constrain distinct audio-visual synchrony estimates regardless of spatial overlap. PMID:23658549
The Role of Speech Prosody and Text Reading Prosody in Children's Reading Comprehension

ERIC Educational Resources Information Center

Veenendaal, Nathalie J.; Groen, Margriet A.; Verhoeven, Ludo

2014-01-01

Background: Text reading prosody has been associated with reading comprehension. However, text reading prosody is a reading-dependent measure that relies heavily on decoding skills. Investigation of the contribution of speech prosody--which is independent from reading skills--in addition to text reading prosody, to reading comprehension could…
Multilevel Analysis in Analyzing Speech Data

ERIC Educational Resources Information Center

Guddattu, Vasudeva; Krishna, Y.

2011-01-01

The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…
Comparison of Acoustic and Kinematic Approaches to Measuring Utterance-Level Speech Variability

ERIC Educational Resources Information Center

Howell, Peter; Anderson, Andrew J.; Bartrip, Jon; Bailey, Eleanor

2009-01-01

Purpose: The spatiotemporal index (STI) is one measure of variability. As currently implemented, kinematic data are used, requiring equipment that cannot be used with some patient groups or in scanners. An experiment is reported that addressed whether STI can be extended to an audio measure of sound pressure of the speech envelope over time that…
Visual contribution to the multistable perception of speech.

PubMed

Sato, Marc; Basirat, Anahita; Schwartz, Jean-Luc

2007-11-01

The multistable perception of speech, or verbal transformation effect, refers to perceptual changes experienced while listening to a speech form that is repeated rapidly and continuously. In order to test whether visual information from the speaker's articulatory gestures may modify the emergence and stability of verbal auditory percepts, subjects were instructed to report any perceptual changes during unimodal, audiovisual, and incongruent audiovisual presentations of distinct repeated syllables. In a first experiment, the perceptual stability of reported auditory percepts was significantly modulated by the modality of presentation. In a second experiment, when audiovisual stimuli consisting of a stable audio track dubbed with a video track that alternated between congruent and incongruent stimuli were presented, a strong correlation between the timing of perceptual transitions and the timing of video switches was found. Finally, a third experiment showed that the vocal tract opening onset event provided by the visual input could play the role of a bootstrap mechanism in the search for transformations. Altogether, these results demonstrate the capacity of visual information to control the multistable perception of speech in its phonetic content and temporal course. The verbal transformation effect thus provides a useful experimental paradigm to explore audiovisual interactions in speech perception.
"Look who's talking!" Gaze Patterns for Implicit and Explicit Audio-Visual Speech Synchrony Detection in Children With High-Functioning Autism.

PubMed

Grossman, Ruth B; Steinhart, Erin; Mitchell, Teresa; McIlvane, William

2015-06-01

Conversation requires integration of information from faces and voices to fully understand the speaker's message. To detect auditory-visual asynchrony of speech, listeners must integrate visual movements of the face, particularly the mouth, with auditory speech information. Individuals with autism spectrum disorder may be less successful at such multisensory integration, despite their demonstrated preference for looking at the mouth region of a speaker. We showed participants (individuals with and without high-functioning autism (HFA) aged 8-19) a split-screen video of two identical individuals speaking side by side. Only one of the speakers was in synchrony with the corresponding audio track and synchrony switched between the two speakers every few seconds. Participants were asked to watch the video without further instructions (implicit condition) or to specifically watch the in-synch speaker (explicit condition). We recorded which part of the screen and face their eyes targeted. Both groups looked at the in-synch video significantly more with explicit instructions. However, participants with HFA looked at the in-synch video less than typically developing (TD) peers and did not increase their gaze time as much as TD participants in the explicit task. Importantly, the HFA group looked significantly less at the mouth than their TD peers, and significantly more at non-face regions of the image. There were no between-group differences for eye-directed gaze. Overall, individuals with HFA spend less time looking at the crucially important mouth region of the face during auditory-visual speech integration, which is maladaptive gaze behavior for this type of task. © 2015 International Society for Autism Research, Wiley Periodicals, Inc.

Computational validation of the motor contribution to speech perception.

PubMed

Badino, Leonardo; D'Ausilio, Alessandro; Fadiga, Luciano; Metta, Giorgio

2014-07-01

Action perception and recognition are core abilities fundamental for human social interaction. A parieto-frontal network (the mirror neuron system) matches visually presented biological motion information onto observers' motor representations. This process of matching the actions of others onto our own sensorimotor repertoire is thought to be important for action recognition, providing a non-mediated "motor perception" based on a bidirectional flow of information along the mirror parieto-frontal circuits. State-of-the-art machine learning strategies for hand action identification have shown better performances when sensorimotor data, as opposed to visual information only, are available during learning. As speech is a particular type of action (with acoustic targets), it is expected to activate a mirror neuron mechanism. Indeed, in speech perception, motor centers have been shown to be causally involved in the discrimination of speech sounds. In this paper, we review recent neurophysiological and machine learning-based studies showing (a) the specific contribution of the motor system to speech perception and (b) that automatic phone recognition is significantly improved when motor data are used during training of classifiers (as opposed to learning from purely auditory data). Copyright © 2014 Cognitive Science Society, Inc.
Reading Machines for Blind People.

ERIC Educational Resources Information Center

Fender, Derek H.

1983-01-01

Ten stages of developing reading machines for blind people are analyzed: handling of text material; optics; electro-optics; pattern recognition; character recognition; storage; speech synthesizers; browsing and place finding; computer indexing; and other sources of input. Cost considerations of the final product are emphasized. (CL)
Mandarin Visual Speech Information

ERIC Educational Resources Information Center

Chen, Trevor H.

2010-01-01

While the auditory-only aspects of Mandarin speech are heavily-researched and well-known in the field, this dissertation addresses its lesser-known aspects: The visual and audio-visual perception of Mandarin segmental information and lexical-tone information. Chapter II of this dissertation focuses on the audiovisual perception of Mandarin…
Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.

PubMed

Agarwalla, Swapna; Sarma, Kandarpa Kumar

2016-06-01

Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human Computer Interaction (HCI). With assimilation of emerging concepts like big data and Internet of Things (IoT) as extended elements of HCI, ASR techniques are found to be passing through a paradigm shift. Oflate, learning based techniques have started to receive greater attention from research communities related to ASR owing to the fact that former possess natural ability to mimic biological behavior and that way aids ASR modeling and processing. The current learning based ASR techniques are found to be evolving further with incorporation of big data, IoT like concepts. Here, in this paper, we report certain approaches based on machine learning (ML) used for extraction of relevant samples from big data space and apply them for ASR using certain soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising of the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms using raw speech, extracted features and frequency domain forms are considered. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, from a large storage, relevant samples are selected and assimilated. Next, a few conventional methods are used for feature extraction of a few selected types. The features comprise of both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time
Speech enhancement on smartphone voice recording

NASA Astrophysics Data System (ADS)

Tris Atmaja, Bagus; Nur Farid, Mifta; Arifianto, Dhany

2016-11-01

Speech enhancement is challenging task in audio signal processing to enhance the quality of targeted speech signal while suppress other noises. In the beginning, the speech enhancement algorithm growth rapidly from spectral subtraction, Wiener filtering, spectral amplitude MMSE estimator to Non-negative Matrix Factorization (NMF). Smartphone as revolutionary device now is being used in all aspect of life including journalism; personally and professionally. Although many smartphones have two microphones (main and rear) the only main microphone is widely used for voice recording. This is why the NMF algorithm widely used for this purpose of speech enhancement. This paper evaluate speech enhancement on smartphone voice recording by using some algorithms mentioned previously. We also extend the NMF algorithm to Kulback-Leibler NMF with supervised separation. The last algorithm shows improved result compared to others by spectrogram and PESQ score evaluation.
New Measures of Masked Text Recognition in Relation to Speech-in-Noise Perception and Their Associations with Age and Cognitive Abilities

ERIC Educational Resources Information Center

Besser, Jana; Zekveld, Adriana A.; Kramer, Sophia E.; Ronnberg, Jerker; Festen, Joost M.

2012-01-01

Purpose: In this research, the authors aimed to increase the analogy between Text Reception Threshold (TRT; Zekveld, George, Kramer, Goverts, & Houtgast, 2007) and Speech Reception Threshold (SRT; Plomp & Mimpen, 1979) and to examine the TRT's value in estimating cognitive abilities that are important for speech comprehension in noise. Method: The…
Emotion to emotion speech conversion in phoneme level

NASA Astrophysics Data System (ADS)

Bulut, Murtaza; Yildirim, Serdar; Busso, Carlos; Lee, Chul Min; Kazemzadeh, Ebrahim; Lee, Sungbok; Narayanan, Shrikanth

2004-10-01

Having an ability to synthesize emotional speech can make human-machine interaction more natural in spoken dialogue management. This study investigates the effectiveness of prosodic and spectral modification in phoneme level on emotion-to-emotion speech conversion. The prosody modification is performed with the TD-PSOLA algorithm (Moulines and Charpentier, 1990). We also transform the spectral envelopes of source phonemes to match those of target phonemes using LPC-based spectral transformation approach (Kain, 2001). Prosodic speech parameters (F0, duration, and energy) for target phonemes are estimated from the statistics obtained from the analysis of an emotional speech database of happy, angry, sad, and neutral utterances collected from actors. Listening experiments conducted with native American English speakers indicate that the modification of prosody only or spectrum only is not sufficient to elicit targeted emotions. The simultaneous modification of both prosody and spectrum results in higher acceptance rates of target emotions, suggesting that not only modeling speech prosody but also modeling spectral patterns that reflect underlying speech articulations are equally important to synthesize emotional speech with good quality. We are investigating suprasegmental level modifications for further improvement in speech quality and expressiveness.
An Introduction to Topic Modeling as an Unsupervised Machine Learning Way to Organize Text Information

ERIC Educational Resources Information Center

Snyder, Robin M.

2015-01-01

The field of topic modeling has become increasingly important over the past few years. Topic modeling is an unsupervised machine learning way to organize text (or image or DNA, etc.) information such that related pieces of text can be identified. This paper/session will present/discuss the current state of topic modeling, why it is important, and…
Speech Synthesis Applied to Language Teaching.

ERIC Educational Resources Information Center

Sherwood, Bruce

1981-01-01

The experimental addition of speech output to computer-based Esperanto lessons using speech synthesized from text is described. Because of Esperanto's phonetic spelling and simple rhythm, it is particularly easy to describe the mechanisms of Esperanto synthesis. Attention is directed to how the text-to-speech conversion is performed and the ways…
Acoustic analysis of speech under stress.

PubMed

Sondhi, Savita; Khan, Munna; Vijay, Ritu; Salhan, Ashok K; Chouhan, Satish

2015-01-01

When a person is emotionally charged, stress could be discerned in his voice. This paper presents a simplified and a non-invasive approach to detect psycho-physiological stress by monitoring the acoustic modifications during a stressful conversation. Voice database consists of audio clips from eight different popular FM broadcasts wherein the host of the show vexes the subjects who are otherwise unaware of the charade. The audio clips are obtained from real-life stressful conversations (no simulated emotions). Analysis is done using PRAAT software to evaluate mean fundamental frequency (F0) and formant frequencies (F1, F2, F3, F4) both in neutral and stressed state. Results suggest that F0 increases with stress; however, formant frequency decreases with stress. Comparison of Fourier and chirp spectra of short vowel segment shows that for relaxed speech, the two spectra are similar; however, for stressed speech, they differ in the high frequency range due to increased pitch modulation.
Expanding Audio Access to Mathematics Expressions by Students with Visual Impairments via MathML. Research Report. ETS RR-17-13

ERIC Educational Resources Information Center

Frankel, Lois; Brownstein, Beth; Soiffer, Neil

2017-01-01

This report describes the pilot conducted in the final phase of a project, Expanding Audio Access to Mathematics Expressions by Students With Visual Impairments via MathML, to provide easy-to-use tools for authoring and rendering secondary-school algebra-level math expressions in synthesized speech that is useful for students with blindness or low…
Audio-vocal interaction in single neurons of the monkey ventrolateral prefrontal cortex.

PubMed

Hage, Steffen R; Nieder, Andreas

2015-05-06

Complex audio-vocal integration systems depend on a strong interconnection between the auditory and the vocal motor system. To gain cognitive control over audio-vocal interaction during vocal motor control, the PFC needs to be involved. Neurons in the ventrolateral PFC (VLPFC) have been shown to separately encode the sensory perceptions and motor production of vocalizations. It is unknown, however, whether single neurons in the PFC reflect audio-vocal interactions. We therefore recorded single-unit activity in the VLPFC of rhesus monkeys (Macaca mulatta) while they produced vocalizations on command or passively listened to monkey calls. We found that 12% of randomly selected neurons in VLPFC modulated their discharge rate in response to acoustic stimulation with species-specific calls. Almost three-fourths of these auditory neurons showed an additional modulation of their discharge rates either before and/or during the monkeys' motor production of vocalization. Based on these audio-vocal interactions, the VLPFC might be well positioned to combine higher order auditory processing with cognitive control of the vocal motor output. Such audio-vocal integration processes in the VLPFC might constitute a precursor for the evolution of complex learned audio-vocal integration systems, ultimately giving rise to human speech. Copyright © 2015 the authors 0270-6474/15/357030-11$15.00/0.
Interactions between Text Chat and Audio Modalities for L2 Communication and Feedback in the Synthetic World "Second Life"

ERIC Educational Resources Information Center

Wigham, Ciara R.; Chanier, Thierry

2015-01-01

This paper reports on a study of the interactions between text chat and audio modalities in L2 communication in a synthetic (virtual) world and observes whether the text chat modality was used for corrective feedback and the characteristics of the latter. This is examined within the context of a hybrid content and language integrated learning…
Hearing impaired speech in noisy classrooms

NASA Astrophysics Data System (ADS)

Shahin, Kimary; McKellin, William H.; Jamieson, Janet; Hodgson, Murray; Pichora-Fuller, M. Kathleen

2005-04-01

Noisy classrooms have been shown to induce among students patterns of interaction similar to those used by hearing impaired people [W. H. McKellin et al., GURT (2003)]. In this research, the speech of children in a noisy classroom setting was investigated to determine if noisy classrooms have an effect on students' speech. Audio recordings were made of the speech of students during group work in their regular classrooms (grades 1-7), and of the speech of the same students in a sound booth. Noise level readings in the classrooms were also recorded. Each student's noisy and quiet environment speech samples were acoustically analyzed for prosodic and segmental properties (f0, pitch range, pitch variation, phoneme duration, vowel formants), and compared. The analysis showed that the students' speech in the noisy classrooms had characteristics of the speech of hearing-impaired persons [e.g., R. O'Halpin, Clin. Ling. and Phon. 15, 529-550 (2001)]. Some educational implications of our findings were identified. [Work supported by the Peter Wall Institute for Advanced Studies, University of British Columbia.
Automatic detection of obstructive sleep apnea using speech signals.

PubMed

Goldshtein, Evgenia; Tarasiuk, Ariel; Zigel, Yaniv

2011-05-01

Obstructive sleep apnea (OSA) is a common disorder associated with anatomical abnormalities of the upper airways that affects 5% of the population. Acoustic parameters may be influenced by the vocal tract structure and soft tissue properties. We hypothesize that speech signal properties of OSA patients will be different than those of control subjects not having OSA. Using speech signal processing techniques, we explored acoustic speech features of 93 subjects who were recorded using a text-dependent speech protocol and a digital audio recorder immediately prior to polysomnography study. Following analysis of the study, subjects were divided into OSA (n=67) and non-OSA (n=26) groups. A Gaussian mixture model-based system was developed to model and classify between the groups; discriminative features such as vocal tract length and linear prediction coefficients were selected using feature selection technique. Specificity and sensitivity of 83% and 79% were achieved for the male OSA and 86% and 84% for the female OSA patients, respectively. We conclude that acoustic features from speech signals during wakefulness can detect OSA patients with good specificity and sensitivity. Such a system can be used as a basis for future development of a tool for OSA screening. © 2011 IEEE
Speech Intelligibility and Psychosocial Functioning in Deaf Children and Teens with Cochlear Implants

ERIC Educational Resources Information Center

Freeman, Valerie; Pisoni, David B.; Kronenberger, William G.; Castellanos, Irina

2017-01-01

Deaf children with cochlear implants (CIs) are at risk for psychosocial adjustment problems, possibly due to delayed speech-language skills. This study investigated associations between a core component of spoken-language ability--speech intelligibility--and the psychosocial development of prelingually deaf CI users. Audio-transcription measures…
Automated Desensitization for the Clinical Treatment of Speech Anxiety

ERIC Educational Resources Information Center

McManus, Marianne; Lohr, James

1976-01-01

A self-guided, audio-tape, desensitization treatment procedure, using standard cassette recorders in a counseling service office, is an effective means for modifying self-report of speech anxiety. (MB)
Crossmodal and incremental perception of audiovisual cues to emotional speech.

PubMed

Barkhuysen, Pashiera; Krahmer, Emiel; Swerts, Marc

2010-01-01

In this article we report on two experiments about the perception of audiovisual cues to emotional speech. The article addresses two questions: 1) how do visual cues from a speaker's face to emotion relate to auditory cues, and (2) what is the recognition speed for various facial cues to emotion? Both experiments reported below are based on tests with video clips of emotional utterances collected via a variant of the well-known Velten method. More specifically, we recorded speakers who displayed positive or negative emotions, which were congruent or incongruent with the (emotional) lexical content of the uttered sentence. In order to test this, we conducted two experiments. The first experiment is a perception experiment in which Czech participants, who do not speak Dutch, rate the perceived emotional state of Dutch speakers in a bimodal (audiovisual) or a unimodal (audio- or vision-only) condition. It was found that incongruent emotional speech leads to significantly more extreme perceived emotion scores than congruent emotional speech, where the difference between congruent and incongruent emotional speech is larger for the negative than for the positive conditions. Interestingly, the largest overall differences between congruent and incongruent emotions were found for the audio-only condition, which suggests that posing an incongruent emotion has a particularly strong effect on the spoken realization of emotions. The second experiment uses a gating paradigm to test the recognition speed for various emotional expressions from a speaker's face. In this experiment participants were presented with the same clips as experiment I, but this time presented vision-only. The clips were shown in successive segments (gates) of increasing duration. Results show that participants are surprisingly accurate in their recognition of the various emotions, as they already reach high recognition scores in the first gate (after only 160 ms). Interestingly, the recognition scores
Speech-Act and Text-Act Theory: "Theme-ing" in Freshman Composition.

ERIC Educational Resources Information Center

Horner, Winifred B.

In contrast to a speech-act theory that is limited by a simple speaker/hearer relationship, a text-act theory of written language allows for the historical or personal context of a writer and reader, both in the written work itself and in the act of reading. This theory can be applied to theme writing, essay examinations, and revision in the…
Using ultrasound visual biofeedback to treat persistent primary speech sound disorders.

PubMed

Cleland, Joanne; Scobbie, James M; Wrench, Alan A

2015-01-01

Growing evidence suggests that speech intervention using visual biofeedback may benefit people for whom visual skills are stronger than auditory skills (for example, the hearing-impaired population), especially when the target articulation is hard to describe or see. Diagnostic ultrasound can be used to image the tongue and has recently become more compact and affordable leading to renewed interest in it as a practical, non-invasive visual biofeedback tool. In this study, we evaluate its effectiveness in treating children with persistent speech sound disorders that have been unresponsive to traditional therapy approaches. A case series of seven different children (aged 6-11) with persistent speech sound disorders were evaluated. For each child, high-speed ultrasound (121 fps), audio and lip video recordings were made while probing each child's specific errors at five different time points (before, during and after intervention). After intervention, all the children made significant progress on targeted segments, evidenced by both perceptual measures and changes in tongue-shape.

Objective voice and speech analysis of persons with chronic hoarseness by prosodic analysis of speech samples.

PubMed

Haderlein, Tino; Döllinger, Michael; Matoušek, Václav; Nöth, Elmar

2016-10-01

Automatic voice assessment is often performed using sustained vowels. In contrast, speech analysis of read-out texts can be applied to voice and speech assessment. Automatic speech recognition and prosodic analysis were used to find regression formulae between automatic and perceptual assessment of four voice and four speech criteria. The regression was trained with 21 men and 62 women (average age 49.2 years) and tested with another set of 24 men and 49 women (48.3 years), all suffering from chronic hoarseness. They read the text 'Der Nordwind und die Sonne' ('The North Wind and the Sun'). Five voice and speech therapists evaluated the data on 5-point Likert scales. Ten prosodic and recognition accuracy measures (features) were identified which describe all the examined criteria. Inter-rater correlation within the expert group was between r = 0.63 for the criterion 'match of breath and sense units' and r = 0.87 for the overall voice quality. Human-machine correlation was between r = 0.40 for the match of breath and sense units and r = 0.82 for intelligibility. The perceptual ratings of different criteria were highly correlated with each other. Likewise, the feature sets modeling the criteria were very similar. The automatic method is suitable for assessing chronic hoarseness in general and for subgroups of functional and organic dysphonia. In its current version, it is almost as reliable as a randomly picked rater from a group of voice and speech therapists.
Audio feature extraction using probability distribution function

NASA Astrophysics Data System (ADS)

Suhaib, A.; Wan, Khairunizam; Aziz, Azri A.; Hazry, D.; Razlan, Zuradzman M.; Shahriman A., B.

2015-05-01

Voice recognition has been one of the popular applications in robotic field. It is also known to be recently used for biometric and multimedia information retrieval system. This technology is attained from successive research on audio feature extraction analysis. Probability Distribution Function (PDF) is a statistical method which is usually used as one of the processes in complex feature extraction methods such as GMM and PCA. In this paper, a new method for audio feature extraction is proposed which is by using only PDF as a feature extraction method itself for speech analysis purpose. Certain pre-processing techniques are performed in prior to the proposed feature extraction method. Subsequently, the PDF result values for each frame of sampled voice signals obtained from certain numbers of individuals are plotted. From the experimental results obtained, it can be seen visually from the plotted data that each individuals' voice has comparable PDF values and shapes.
Wavelet-based audio embedding and audio/video compression

NASA Astrophysics Data System (ADS)

Mendenhall, Michael J.; Claypoole, Roger L., Jr.

2001-12-01

Watermarking, traditionally used for copyright protection, is used in a new and exciting way. An efficient wavelet-based watermarking technique embeds audio information into a video signal. Several effective compression techniques are applied to compress the resulting audio/video signal in an embedded fashion. This wavelet-based compression algorithm incorporates bit-plane coding, index coding, and Huffman coding. To demonstrate the potential of this audio embedding and audio/video compression algorithm, we embed an audio signal into a video signal and then compress. Results show that overall compression rates of 15:1 can be achieved. The video signal is reconstructed with a median PSNR of nearly 33 dB. Finally, the audio signal is extracted from the compressed audio/video signal without error.
78 FR 49693 - Speech-to-Speech and Internet Protocol (IP) Speech-to-Speech Telecommunications Relay Services...

Federal Register 2010, 2011, 2012, 2013, 2014

2013-08-15

...] Speech-to-Speech and Internet Protocol (IP) Speech-to-Speech Telecommunications Relay Services...: This is a summary of the Commission's Speech-to-Speech and Internet Protocol (IP) Speech-to-Speech...), Internet Protocol Relay (IP Relay), and IP captioned telephone service (IP CTS) as compensable forms of TRS...
Non-fluent speech following stroke is caused by impaired efference copy.

PubMed

Feenaughty, Lynda; Basilakos, Alexandra; Bonilha, Leonardo; den Ouden, Dirk-Bart; Rorden, Chris; Stark, Brielle; Fridriksson, Julius

2017-09-01

Efference copy is a cognitive mechanism argued to be critical for initiating and monitoring speech: however, the extent to which breakdown of efference copy mechanisms impact speech production is unclear. This study examined the best mechanistic predictors of non-fluent speech among 88 stroke survivors. Objective speech fluency measures were subjected to a principal component analysis (PCA). The primary PCA factor was then entered into a multiple stepwise linear regression analysis as the dependent variable, with a set of independent mechanistic variables. Participants' ability to mimic audio-visual speech ("speech entrainment response") was the best independent predictor of non-fluent speech. We suggest that this "speech entrainment" factor reflects integrity of internal monitoring (i.e., efference copy) of speech production, which affects speech initiation and maintenance. Results support models of normal speech production and suggest that therapy focused on speech initiation and maintenance may improve speech fluency for individuals with chronic non-fluent aphasia post stroke.
Modeling Driving Performance Using In-Vehicle Speech Data From a Naturalistic Driving Study.

PubMed

Kuo, Jonny; Charlton, Judith L; Koppel, Sjaan; Rudin-Brown, Christina M; Cross, Suzanne

2016-09-01

We aimed to (a) describe the development and application of an automated approach for processing in-vehicle speech data from a naturalistic driving study (NDS), (b) examine the influence of child passenger presence on driving performance, and (c) model this relationship using in-vehicle speech data. Parent drivers frequently engage in child-related secondary behaviors, but the impact on driving performance is unknown. Applying automated speech-processing techniques to NDS audio data would facilitate the analysis of in-vehicle driver-child interactions and their influence on driving performance. Speech activity detection and speaker diarization algorithms were applied to audio data from a Melbourne-based NDS involving 42 families. Multilevel models were developed to evaluate the effect of speech activity and the presence of child passengers on driving performance. Speech activity was significantly associated with velocity and steering angle variability. Child passenger presence alone was not associated with changes in driving performance. However, speech activity in the presence of two child passengers was associated with the most variability in driving performance. The effects of in-vehicle speech on driving performance in the presence of child passengers appear to be heterogeneous, and multiple factors may need to be considered in evaluating their impact. This goal can potentially be achieved within large-scale NDS through the automated processing of observational data, including speech. Speech-processing algorithms enable new perspectives on driving performance to be gained from existing NDS data, and variables that were once labor-intensive to process can be readily utilized in future research. © 2016, Human Factors and Ergonomics Society.
Computationally Efficient Clustering of Audio-Visual Meeting Data

NASA Astrophysics Data System (ADS)

Hung, Hayley; Friedland, Gerald; Yeo, Chuohao

This chapter presents novel computationally efficient algorithms to extract semantically meaningful acoustic and visual events related to each of the participants in a group discussion using the example of business meeting recordings. The recording setup involves relatively few audio-visual sensors, comprising a limited number of cameras and microphones. We first demonstrate computationally efficient algorithms that can identify who spoke and when, a problem in speech processing known as speaker diarization. We also extract visual activity features efficiently from MPEG4 video by taking advantage of the processing that was already done for video compression. Then, we present a method of associating the audio-visual data together so that the content of each participant can be managed individually. The methods presented in this article can be used as a principal component that enables many higher-level semantic analysis tasks needed in search, retrieval, and navigation.
Divergent neural responses to narrative speech in disorders of consciousness.

PubMed

Iotzov, Ivan; Fidali, Brian C; Petroni, Agustin; Conte, Mary M; Schiff, Nicholas D; Parra, Lucas C

2017-11-01

Clinical assessment of auditory attention in patients with disorders of consciousness is often limited by motor impairment. Here, we employ intersubject correlations among electroencephalography responses to naturalistic speech in order to assay auditory attention among patients and healthy controls. Electroencephalographic data were recorded from 20 subjects with disorders of consciousness and 14 healthy controls during of two narrative audio stimuli, presented both forwards and time-reversed. Intersubject correlation of evoked electroencephalography signals were calculated, comparing responses of both groups to those of the healthy control subjects. This analysis was performed blinded and subsequently compared to the diagnostic status of each patient based on the Coma Recovery Scale-Revised. Subjects with disorders of consciousness exhibit significantly lower intersubject correlation than healthy controls during narrative speech. Additionally, while healthy subjects had higher intersubject correlation values in forwards versus backwards presentation, neural responses did not vary significantly with the direction of playback in subjects with disorders of consciousness. Increased intersubject correlation values in the backward speech condition were noted with improving disorder of consciousness diagnosis, both in cross-sectional analysis and in a subset of patients with longitudinal data. Intersubject correlation of neural responses to narrative speech audition differentiates healthy controls from patients and appears to index clinical diagnoses in disorders of consciousness.
Audio Visual Technology and the Teaching of Foreign Languages.

ERIC Educational Resources Information Center

Halbig, Michael C.

Skills in comprehending the spoken language source are becoming increasingly important due to the audio-visual orientation of our culture. It would seem natural, therefore, to adjust the learning goals and environment accordingly. The video-cassette machine is an ideal means for creating this learning environment and developing the listening…
Comparisons of Audio and Audiovisual Measures of Stuttering Frequency and Severity in Preschool-Age Children

ERIC Educational Resources Information Center

Rousseau, Isabelle; Onslow, Mark; Packman, Ann; Jones, Mark

2008-01-01

Purpose: To determine whether measures of stuttering frequency and measures of overall stuttering severity in preschoolers differ when made from audio-only recordings compared with audiovisual recordings. Method: Four blinded speech-language pathologists who had extensive experience with preschoolers who stutter measured stuttering frequency and…
Reconstruction of audio waveforms from spike trains of artificial cochlea models

PubMed Central

Zai, Anja T.; Bhargava, Saurabh; Mesgarani, Nima; Liu, Shih-Chii

2015-01-01

Spiking cochlea models describe the analog processing and spike generation process within the biological cochlea. Reconstructing the audio input from the artificial cochlea spikes is therefore useful for understanding the fidelity of the information preserved in the spikes. The reconstruction process is challenging particularly for spikes from the mixed signal (analog/digital) integrated circuit (IC) cochleas because of multiple non-linearities in the model and the additional variance caused by random transistor mismatch. This work proposes an offline method for reconstructing the audio input from spike responses of both a particular spike-based hardware model called the AEREAR2 cochlea and an equivalent software cochlea model. This method was previously used to reconstruct the auditory stimulus based on the peri-stimulus histogram of spike responses recorded in the ferret auditory cortex. The reconstructed audio from the hardware cochlea is evaluated against an analogous software model using objective measures of speech quality and intelligibility; and further tested in a word recognition task. The reconstructed audio under low signal-to-noise (SNR) conditions (SNR < –5 dB) gives a better classification performance than the original SNR input in this word recognition task. PMID:26528113
TongueToSpeech (TTS): Wearable wireless assistive device for augmented speech.

PubMed

Marjanovic, Nicholas; Piccinini, Giacomo; Kerr, Kevin; Esmailbeigi, Hananeh

2017-07-01

Speech is an important aspect of human communication; individuals with speech impairment are unable to communicate vocally in real time. Our team has developed the TongueToSpeech (TTS) device with the goal of augmenting speech communication for the vocally impaired. The proposed device is a wearable wireless assistive device that incorporates a capacitive touch keyboard interface embedded inside a discrete retainer. This device connects to a computer, tablet or a smartphone via Bluetooth connection. The developed TTS application converts text typed by the tongue into audible speech. Our studies have concluded that an 8-contact point configuration between the tongue and the TTS device would yield the best user precision and speed performance. On average using the TTS device inside the oral cavity takes 2.5 times longer than the pointer finger using a T9 (Text on 9 keys) keyboard configuration to type the same phrase. In conclusion, we have developed a discrete noninvasive wearable device that allows the vocally impaired individuals to communicate in real time.
Cued Speech for Enhancing Speech Perception and First Language Development of Children With Cochlear Implants

PubMed Central

Leybaert, Jacqueline; LaSasso, Carol J.

2010-01-01

Nearly 300 million people worldwide have moderate to profound hearing loss. Hearing impairment, if not adequately managed, has strong socioeconomic and affective impact on individuals. Cochlear implants have become the most effective vehicle for helping profoundly deaf children and adults to understand spoken language, to be sensitive to environmental sounds, and, to some extent, to listen to music. The auditory information delivered by the cochlear implant remains non-optimal for speech perception because it delivers a spectrally degraded signal and lacks some of the fine temporal acoustic structure. In this article, we discuss research revealing the multimodal nature of speech perception in normally-hearing individuals, with important inter-subject variability in the weighting of auditory or visual information. We also discuss how audio-visual training, via Cued Speech, can improve speech perception in cochlear implantees, particularly in noisy contexts. Cued Speech is a system that makes use of visual information from speechreading combined with hand shapes positioned in different places around the face in order to deliver completely unambiguous information about the syllables and the phonemes of spoken language. We support our view that exposure to Cued Speech before or after the implantation could be important in the aural rehabilitation process of cochlear implantees. We describe five lines of research that are converging to support the view that Cued Speech can enhance speech perception in individuals with cochlear implants. PMID:20724357
Speech vs. singing: infants choose happier sounds

PubMed Central

Corbeil, Marieve; Trehub, Sandra E.; Peretz, Isabelle

2013-01-01

Infants prefer speech to non-vocal sounds and to non-human vocalizations, and they prefer happy-sounding speech to neutral speech. They also exhibit an interest in singing, but there is little knowledge of their relative interest in speech and singing. The present study explored infants' attention to unfamiliar audio samples of speech and singing. In Experiment 1, infants 4–13 months of age were exposed to happy-sounding infant-directed speech vs. hummed lullabies by the same woman. They listened significantly longer to the speech, which had considerably greater acoustic variability and expressiveness, than to the lullabies. In Experiment 2, infants of comparable age who heard the lyrics of a Turkish children's song spoken vs. sung in a joyful/happy manner did not exhibit differential listening. Infants in Experiment 3 heard the happily sung lyrics of the Turkish children's song vs. a version that was spoken in an adult-directed or affectively neutral manner. They listened significantly longer to the sung version. Overall, happy voice quality rather than vocal mode (speech or singing) was the principal contributor to infant attention, regardless of age. PMID:23805119
Audio Spectrogram Representations for Processing with Convolutional Neural Networks

NASA Astrophysics Data System (ADS)

Wyse, L.

2017-05-01

One of the decisions that arise when designing a neural network for any application is how the data should be represented in order to be presented to, and possibly generated by, a neural network. For audio, the choice is less obvious than it seems to be for visual images, and a variety of representations have been used for different applications including the raw digitized sample stream, hand-crafted features, machine discovered features, MFCCs and variants that include deltas, and a variety of spectral representations. This paper reviews some of these representations and issues that arise, focusing particularly on spectrograms for generating audio using neural networks for style transfer.
Listeners' Perceptions of Speech and Language Disorders

ERIC Educational Resources Information Center

Allard, Emily R.; Williams, Dale F.

2008-01-01

Using semantic differential scales with nine trait pairs, 445 adults rated five audio-taped speech samples, one depicting an individual without a disorder and four portraying communication disorders. Statistical analyses indicated that the no disorder sample was rated higher with respect to the trait of employability than were the articulation,…
78 FR 49717 - Speech-to-Speech and Internet Protocol (IP) Speech-to-Speech Telecommunications Relay Services...

Federal Register 2010, 2011, 2012, 2013, 2014

2013-08-15

...] Speech-to-Speech and Internet Protocol (IP) Speech-to-Speech Telecommunications Relay Services... Internet Protocol (IP) Speech-to-Speech Telecommunications Relay Services; Telecommunications Relay... (IP Relay) and video relay service (VRS), the Commission should bundle national STS outreach efforts...
Virtual Acoustic Displays for Teleconferencing: Intelligibility Advantage for "Telephone Grade" Audio

NASA Technical Reports Server (NTRS)

Begault, Durand R.; Null, Cynthia H. (Technical Monitor)

1994-01-01

Speech intelligibility was evaluated using a virtual acoustic ("3-D audio") display using the method specified by ANSI. Ten subjects were evaluated with stimuli either unfiltered or low-pass filtered at 4 kHz. Results show virtual acoustic techniques are advantageous for both full-bandwidth (44.1 kHz srate) and low (8 kHz srate) bandwidth "telephone-grade" teleconferencing systems.
Zipf's Law in Short-Time Timbral Codings of Speech, Music, and Environmental Sound Signals

PubMed Central

Haro, Martín; Serrà, Joan; Herrera, Perfecto; Corral, Álvaro

2012-01-01

Timbre is a key perceptual feature that allows discrimination between different sounds. Timbral sensations are highly dependent on the temporal evolution of the power spectrum of an audio signal. In order to quantitatively characterize such sensations, the shape of the power spectrum has to be encoded in a way that preserves certain physical and perceptual properties. Therefore, it is common practice to encode short-time power spectra using psychoacoustical frequency scales. In this paper, we study and characterize the statistical properties of such encodings, here called timbral code-words. In particular, we report on rank-frequency distributions of timbral code-words extracted from 740 hours of audio coming from disparate sources such as speech, music, and environmental sounds. Analogously to text corpora, we find a heavy-tailed Zipfian distribution with exponent close to one. Importantly, this distribution is found independently of different encoding decisions and regardless of the audio source. Further analysis on the intrinsic characteristics of most and least frequent code-words reveals that the most frequent code-words tend to have a more homogeneous structure. We also find that speech and music databases have specific, distinctive code-words while, in the case of the environmental sounds, this database-specific code-words are not present. Finally, we find that a Yule-Simon process with memory provides a reasonable quantitative approximation for our data, suggesting the existence of a common simple generative mechanism for all considered sound sources. PMID:22479497
International Meeting To Discuss Audio Technology as Applied to Library Services for Blind Individuals (3rd, Toronto, Ontario, Canada, April 20-22, 1995). Volumes 1-3.

ERIC Educational Resources Information Center

Library of Congress, Washington, DC. National Library Service for the Blind and Physically Handicapped.

This three-day conference on the subject of audio technology for the production of materials for the blind, takes the court reporter approach to recording the speeches and discussions of the meeting. The result is a three volume set of complete transcripts, one volume for each day of the meeting, but continuous in form. The highlights of each…

Cortical processing during smartphone text messaging.

PubMed

Tatum, William O; DiCiaccio, Benedetto; Yelvington, Kirsten H

2016-06-01

The objective of this study was to report the EEG features of text messaging using smartphones. One hundred twenty-nine patients were prospectively evaluated during video-EEG monitoring (VEM) over 16months. A reproducible texting rhythm (TR) present during active text messaging with a smartphone was compared with passive and forced audio telephone use, thumb/finger movements, cognitive testing/calculation, scanning eye movements, and speech/language tasks in patients with and without epilepsy. Statistical significance was set at p<0.05. Twenty-seven patients with a TR were identified from a cohort of 129 (93 female, mean age: 36; range: 18-71) unselected VEM patients. Fifty-three out of 129 patients had epileptic seizures (ES), 74/129 had nonepileptic seizures (NES), and 2/129 were dual-diagnosed. A reproducible TR was present in 27/129 (20.9%) specific to text messaging (p<0.0001) and present in 28% of patients with ES and 16% of patients with NES (p=NS). The TR was absent during independent tasks and audio cellular telephone use (p<0.0001). Age, gender, epilepsy type, MRI results, and EEG lateralization in patients with focal seizures were unrelated (p=NS). Our results suggest that the TR on scalp EEG represents a novel technology-specific neurophysiological alteration of brain networks. We propose that cortical processing in the contemporary brain is uniquely activated by the use of PEDs. These findings have practical implications that could impact industry and research in nonverbal communication. Copyright © 2016 Elsevier Inc. All rights reserved.
An introduction to quantum machine learning

NASA Astrophysics Data System (ADS)

Schuld, Maria; Sinayskiy, Ilya; Petruccione, Francesco

2015-04-01

Machine learning algorithms learn a desired input-output relation from examples in order to interpret new inputs. This is important for tasks such as image and speech recognition or strategy optimisation, with growing applications in the IT industry. In the last couple of years, researchers investigated if quantum computing can help to improve classical machine learning algorithms. Ideas range from running computationally costly algorithms or their subroutines efficiently on a quantum computer to the translation of stochastic methods into the language of quantum theory. This contribution gives a systematic overview of the emerging field of quantum machine learning. It presents the approaches as well as technical details in an accessible way, and discusses the potential of a future theory of quantum learning.
Deep learning

NASA Astrophysics Data System (ADS)

Lecun, Yann; Bengio, Yoshua; Hinton, Geoffrey

2015-05-01

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Deep learning.

PubMed

LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey

2015-05-28

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Speech Processing and Recognition (SPaRe)

DTIC Science & Technology

2011-01-01

results in the areas of automatic speech recognition (ASR), speech processing, machine translation (MT), natural language processing ( NLP ), and...Processing ( NLP ), Information Retrieval (IR) 16. SECURITY CLASSIFICATION OF: UNCLASSIFED 17. LIMITATION OF ABSTRACT 18. NUMBER OF PAGES 19a. NAME...Figure 9, the IOC was only expected to provide document submission and search; automatic speech recognition (ASR) for English, Spanish, Arabic , and
Effects of Dictation, Speech to Text, and Handwriting on the Written Composition of Elementary School English Language Learners

ERIC Educational Resources Information Center

Arcon, Nina; Klein, Perry D.; Dombroski, Jill D.

2017-01-01

Previous research has shown that both dictation and speech-to-text (STT) software can increase the quality of writing for native English speakers. The purpose of this study was to investigate the effect of these modalities on the written composition and cognitive load of elementary school English language learners (ELLs). In a within-subjects…
Emerging Realities of Text-to-Speech Software for Nonnative-English-Speaking Community College Students in the Freshman Year

ERIC Educational Resources Information Center

Baker, Fiona S.

2015-01-01

This study explores the expectations and early and subsequent realities of text-to-speech software for 24 nonnative-English-speaking college students who were experiencing reading difficulties in their freshman year of college. The study took place over two semesters in one academic year (from September to June) at a community college on the…
Real-time speech-driven animation of expressive talking faces

NASA Astrophysics Data System (ADS)

Liu, Jia; You, Mingyu; Chen, Chun; Song, Mingli

2011-05-01

In this paper, we present a real-time facial animation system in which speech drives mouth movements and facial expressions synchronously. Considering five basic emotions, a hierarchical structure with an upper layer of emotion classification is established. Based on the recognized emotion label, the under-layer classification at sub-phonemic level has been modelled on the relationship between acoustic features of frames and audio labels in phonemes. Using certain constraint, the predicted emotion labels of speech are adjusted to gain the facial expression labels which are combined with sub-phonemic labels. The combinations are mapped into facial action units (FAUs), and audio-visual synchronized animation with mouth movements and facial expressions is generated by morphing between FAUs. The experimental results demonstrate that the two-layer structure succeeds in both emotion and sub-phonemic classifications, and the synthesized facial sequences reach a comparative convincing quality.
A machine learning based approach to identify protected health information in Chinese clinical text.

PubMed

Du, Liting; Xia, Chenxi; Deng, Zhaohua; Lu, Gary; Xia, Shuxu; Ma, Jingdong

2018-08-01

With the increasing application of electronic health records (EHRs) in the world, protecting private information in clinical text has drawn extensive attention from healthcare providers to researchers. De-identification, the process of identifying and removing protected health information (PHI) from clinical text, has been central to the discourse on medical privacy since 2006. While de-identification is becoming the global norm for handling medical records, there is a paucity of studies on its application on Chinese clinical text. Without efficient and effective privacy protection algorithms in place, the use of indispensable clinical information would be confined. We aimed to (i) describe the current process for PHI in China, (ii) propose a machine learning based approach to identify PHI in Chinese clinical text, and (iii) validate the effectiveness of the machine learning algorithm for de-identification in Chinese clinical text. Based on 14,719 discharge summaries from regional health centers in Ya'an City, Sichuan province, China, we built a conditional random fields (CRF) model to identify PHI in clinical text, and then used the regular expressions to optimize the recognition results of the PHI categories with fewer samples. We constructed a Chinese clinical text corpus with PHI tags through substantial manual annotation, wherein the descriptive statistics of PHI manifested its wide range and diverse categories. The evaluation showed with a high F-measure of 0.9878 that our CRF-based model had a good performance for identifying PHI in Chinese clinical text. The rapid adoption of EHR in the health sector has created an urgent need for tools that can parse patient specific information from Chinese clinical text. Our application of CRF algorithms for de-identification has shown the potential to meet this need by offering a highly accurate and flexible solution to analyzing Chinese clinical text. Copyright © 2018 Elsevier B.V. All rights reserved.
The Development of the Text Reception Threshold Test: A Visual Analogue of the Speech Reception Threshold Test

ERIC Educational Resources Information Center

Zekveld, Adriana A.; George, Erwin L. J.; Kramer, Sophia E.; Goverts, S. Theo; Houtgast, Tammo

2007-01-01

Purpose: In this study, the authors aimed to develop a visual analogue of the widely used Speech Reception Threshold (SRT; R. Plomp & A. M. Mimpen, 1979b) test. The Text Reception Threshold (TRT) test, in which visually presented sentences are masked by a bar pattern, enables the quantification of modality-aspecific variance in speech-in-noise…
Audio spectrum and sound pressure levels vary between pulse oximeters.

PubMed

Chandra, Deven; Tessler, Michael J; Usher, John

2006-01-01

The variable-pitch pulse oximeter is an important intraoperative patient monitor. Our ability to hear its auditory signal depends on its acoustical properties and our hearing. This study quantitatively describes the audio spectrum and sound pressure levels of the monitoring tones produced by five variable-pitch pulse oximeters. We compared the Datex-Ohmeda Capnomac Ultima, Hewlett-Packard M1166A, Datex-Engstrom AS/3, Ohmeda Biox 3700, and Datex-Ohmeda 3800 oximeters. Three machines of each of the five models were assessed for sound pressure levels (using a precision sound level meter) and audio spectrum (using a hanning windowed fast Fourier trans-form of three beats at saturations of 99%, 90%, and 85%). The widest range of sound pressure levels was produced by the Hewlett-Packard M1166A (46.5 +/- 1.74 dB to 76.9 +/- 2.77 dB). The loudest model was the Datex-Engstrom AS/3 (89.2 +/- 5.36 dB). Three oximeters, when set to the lower ranges of their volume settings, were indistinguishable from background operating room noise. Each model produced sounds with different audio spectra. Although each model produced a fundamental tone with multiple harmonic overtones, the number of harmonics varied with each model; from three harmonic tones on the Hewlett-Packard M1166A, to 12 on the Ohmeda Biox 3700. There were variations between models, and individual machines of the same model with respect to the fundamental tone associated with a given saturation. There is considerable variance in the sound pressure and audio spectrum of commercially-available pulse oximeters. Further studies are warranted in order to establish standards.
Neurophysiological evidence for the interplay of speech segmentation and word-referent mapping during novel word learning.

PubMed

François, Clément; Cunillera, Toni; Garcia, Enara; Laine, Matti; Rodriguez-Fornells, Antoni

2017-04-01

Learning a new language requires the identification of word units from continuous speech (the speech segmentation problem) and mapping them onto conceptual representation (the word to world mapping problem). Recent behavioral studies have revealed that the statistical properties found within and across modalities can serve as cues for both processes. However, segmentation and mapping have been largely studied separately, and thus it remains unclear whether both processes can be accomplished at the same time and if they share common neurophysiological features. To address this question, we recorded EEG of 20 adult participants during both an audio alone speech segmentation task and an audiovisual word-to-picture association task. The participants were tested for both the implicit detection of online mismatches (structural auditory and visual semantic violations) as well as for the explicit recognition of words and word-to-picture associations. The ERP results from the learning phase revealed a delayed learning-related fronto-central negativity (FN400) in the audiovisual condition compared to the audio alone condition. Interestingly, while online structural auditory violations elicited clear MMN/N200 components in the audio alone condition, visual-semantic violations induced meaning-related N400 modulations in the audiovisual condition. The present results support the idea that speech segmentation and meaning mapping can take place in parallel and act in synergy to enhance novel word learning. Copyright © 2016 Elsevier Ltd. All rights reserved.
A scheme for racquet sports video analysis with the combination of audio-visual information

NASA Astrophysics Data System (ADS)

Xing, Liyuan; Ye, Qixiang; Zhang, Weigang; Huang, Qingming; Yu, Hua

2005-07-01

As a very important category in sports video, racquet sports video, e.g. table tennis, tennis and badminton, has been paid little attention in the past years. Considering the characteristics of this kind of sports video, we propose a new scheme for structure indexing and highlight generating based on the combination of audio and visual information. Firstly, a supervised classification method is employed to detect important audio symbols including impact (ball hit), audience cheers, commentator speech, etc. Meanwhile an unsupervised algorithm is proposed to group video shots into various clusters. Then, by taking advantage of temporal relationship between audio and visual signals, we can specify the scene clusters with semantic labels including rally scenes and break scenes. Thirdly, a refinement procedure is developed to reduce false rally scenes by further audio analysis. Finally, an exciting model is proposed to rank the detected rally scenes from which many exciting video clips such as game (match) points can be correctly retrieved. Experiments on two types of representative racquet sports video, table tennis video and tennis video, demonstrate encouraging results.
Dual Key Speech Encryption Algorithm Based Underdetermined BSS

PubMed Central

Zhao, Huan; Chen, Zuo; Zhang, Xixiang

2014-01-01

When the number of the mixed signals is less than that of the source signals, the underdetermined blind source separation (BSS) is a significant difficult problem. Due to the fact that the great amount data of speech communications and real-time communication has been required, we utilize the intractability of the underdetermined BSS problem to present a dual key speech encryption method. The original speech is mixed with dual key signals which consist of random key signals (one-time pad) generated by secret seed and chaotic signals generated from chaotic system. In the decryption process, approximate calculation is used to recover the original speech signals. The proposed algorithm for speech signals encryption can resist traditional attacks against the encryption system, and owing to approximate calculation, decryption becomes faster and more accurate. It is demonstrated that the proposed method has high level of security and can recover the original signals quickly and efficiently yet maintaining excellent audio quality. PMID:24955430
Are the Literacy Difficulties That Characterize Developmental Dyslexia Associated with a Failure to Integrate Letters and Speech Sounds?

ERIC Educational Resources Information Center

Nash, Hannah M.; Gooch, Debbie; Hulme, Charles; Mahajan, Yatin; McArthur, Genevieve; Steinmetzger, Kurt; Snowling, Margaret J.

2017-01-01

The "automatic letter-sound integration hypothesis" (Blomert, [Blomert, L., 2011]) proposes that dyslexia results from a failure to fully integrate letters and speech sounds into automated audio-visual objects. We tested this hypothesis in a sample of English-speaking children with dyslexic difficulties (N = 13) and samples of…
Multisensory integration of speech sounds with letters vs. visual speech: only visual speech induces the mismatch negativity.

PubMed

Stekelenburg, Jeroen J; Keetels, Mirjam; Vroomen, Jean

2018-05-01

Numerous studies have demonstrated that the vision of lip movements can alter the perception of auditory speech syllables (McGurk effect). While there is ample evidence for integration of text and auditory speech, there are only a few studies on the orthographic equivalent of the McGurk effect. Here, we examined whether written text, like visual speech, can induce an illusory change in the perception of speech sounds on both the behavioural and neural levels. In a sound categorization task, we found that both text and visual speech changed the identity of speech sounds from an /aba/-/ada/ continuum, but the size of this audiovisual effect was considerably smaller for text than visual speech. To examine at which level in the information processing hierarchy these multisensory interactions occur, we recorded electroencephalography in an audiovisual mismatch negativity (MMN, a component of the event-related potential reflecting preattentive auditory change detection) paradigm in which deviant text or visual speech was used to induce an illusory change in a sequence of ambiguous sounds halfway between /aba/ and /ada/. We found that only deviant visual speech induced an MMN, but not deviant text, which induced a late P3-like positive potential. These results demonstrate that text has much weaker effects on sound processing than visual speech does, possibly because text has different biological roots than visual speech. © 2018 The Authors. European Journal of Neuroscience published by Federation of European Neuroscience Societies and John Wiley & Sons Ltd.
Effects of Audio-Visual Integration on the Detection of Masked Speech and Non-Speech Sounds

ERIC Educational Resources Information Center

Eramudugolla, Ranmalee; Henderson, Rachel; Mattingley, Jason B.

2011-01-01

Integration of simultaneous auditory and visual information about an event can enhance our ability to detect that event. This is particularly evident in the perception of speech, where the articulatory gestures of the speaker's lips and face can significantly improve the listener's detection and identification of the message, especially when that…
Visual Cortical Entrainment to Motion and Categorical Speech Features during Silent Lipreading

PubMed Central

O’Sullivan, Aisling E.; Crosse, Michael J.; Di Liberto, Giovanni M.; Lalor, Edmund C.

2017-01-01

Speech is a multisensory percept, comprising an auditory and visual component. While the content and processing pathways of audio speech have been well characterized, the visual component is less well understood. In this work, we expand current methodologies using system identification to introduce a framework that facilitates the study of visual speech in its natural, continuous form. Specifically, we use models based on the unheard acoustic envelope (E), the motion signal (M) and categorical visual speech features (V) to predict EEG activity during silent lipreading. Our results show that each of these models performs similarly at predicting EEG in visual regions and that respective combinations of the individual models (EV, MV, EM and EMV) provide an improved prediction of the neural activity over their constituent models. In comparing these different combinations, we find that the model incorporating all three types of features (EMV) outperforms the individual models, as well as both the EV and MV models, while it performs similarly to the EM model. Importantly, EM does not outperform EV and MV, which, considering the higher dimensionality of the V model, suggests that more data is needed to clarify this finding. Nevertheless, the performance of EMV, and comparisons of the subject performances for the three individual models, provides further evidence to suggest that visual regions are involved in both low-level processing of stimulus dynamics and categorical speech perception. This framework may prove useful for investigating modality-specific processing of visual speech under naturalistic conditions. PMID:28123363
Perception of the Multisensory Coherence of Fluent Audiovisual Speech in Infancy: Its Emergence & the Role of Experience

PubMed Central

Lewkowicz, David J.; Minar, Nicholas J.; Tift, Amy H.; Brandon, Melissa

2014-01-01

To investigate the developmental emergence of the ability to perceive the multisensory coherence of native and non-native audiovisual fluent speech, we tested 4-, 8–10, and 12–14 month-old English-learning infants. Infants first viewed two identical female faces articulating two different monologues in silence and then in the presence of an audible monologue that matched the visible articulations of one of the faces. Neither the 4-month-old nor the 8–10 month-old infants exhibited audio-visual matching in that neither group exhibited greater looking at the matching monologue. In contrast, the 12–14 month-old infants exhibited matching and, consistent with the emergence of perceptual expertise for the native language, they perceived the multisensory coherence of native-language monologues earlier in the test trials than of non-native language monologues. Moreover, the matching of native audible and visible speech streams observed in the 12–14 month olds did not depend on audio-visual synchrony whereas the matching of non-native audible and visible speech streams did depend on synchrony. Overall, the current findings indicate that the perception of the multisensory coherence of fluent audiovisual speech emerges late in infancy, that audio-visual synchrony cues are more important in the perception of the multisensory coherence of non-native than native audiovisual speech, and that the emergence of this skill most likely is affected by perceptual narrowing. PMID:25462038
Can you hear my age? Influences of speech rate and speech spontaneity on estimation of speaker age

PubMed Central

Skoog Waller, Sara; Eriksson, Mårten; Sörqvist, Patrik

2015-01-01

Cognitive hearing science is mainly about the study of how cognitive factors contribute to speech comprehension, but cognitive factors also partake in speech processing to infer non-linguistic information from speech signals, such as the intentions of the talker and the speaker’s age. Here, we report two experiments on age estimation by “naïve” listeners. The aim was to study how speech rate influences estimation of speaker age by comparing the speakers’ natural speech rate with increased or decreased speech rate. In Experiment 1, listeners were presented with audio samples of read speech from three different speaker age groups (young, middle aged, and old adults). They estimated the speakers as younger when speech rate was faster than normal and as older when speech rate was slower than normal. This speech rate effect was slightly greater in magnitude for older (60–65 years) speakers in comparison with younger (20–25 years) speakers, suggesting that speech rate may gain greater importance as a perceptual age cue with increased speaker age. This pattern was more pronounced in Experiment 2, in which listeners estimated age from spontaneous speech. Faster speech rate was associated with lower age estimates, but only for older and middle aged (40–45 years) speakers. Taken together, speakers of all age groups were estimated as older when speech rate decreased, except for the youngest speakers in Experiment 2. The absence of a linear speech rate effect in estimates of younger speakers, for spontaneous speech, implies that listeners use different age estimation strategies or cues (possibly vocabulary) depending on the age of the speaker and the spontaneity of the speech. Potential implications for forensic investigations and other applied domains are discussed. PMID:26236259

The contribution of visual information to the perception of speech in noise with and without informative temporal fine structure

PubMed Central

Stacey, Paula C.; Kitterick, Pádraig T.; Morris, Saffron D.; Sumner, Christian J.

2017-01-01

Understanding what is said in demanding listening situations is assisted greatly by looking at the face of a talker. Previous studies have observed that normal-hearing listeners can benefit from this visual information when a talker's voice is presented in background noise. These benefits have also been observed in quiet listening conditions in cochlear-implant users, whose device does not convey the informative temporal fine structure cues in speech, and when normal-hearing individuals listen to speech processed to remove these informative temporal fine structure cues. The current study (1) characterised the benefits of visual information when listening in background noise; and (2) used sine-wave vocoding to compare the size of the visual benefit when speech is presented with or without informative temporal fine structure. The accuracy with which normal-hearing individuals reported words in spoken sentences was assessed across three experiments. The availability of visual information and informative temporal fine structure cues was varied within and across the experiments. The results showed that visual benefit was observed using open- and closed-set tests of speech perception. The size of the benefit increased when informative temporal fine structure cues were removed. This finding suggests that visual information may play an important role in the ability of cochlear-implant users to understand speech in many everyday situations. Models of audio-visual integration were able to account for the additional benefit of visual information when speech was degraded and suggested that auditory and visual information was being integrated in a similar way in all conditions. The modelling results were consistent with the notion that audio-visual benefit is derived from the optimal combination of auditory and visual sensory cues. PMID:27085797
Mining protein function from text using term-based support vector machines

PubMed Central

Rice, Simon B; Nenadic, Goran; Stapley, Benjamin J

2005-01-01

Background Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. Results The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. Conclusion A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2. PMID:15960835
Interactive Activation Model of Speech Perception.

DTIC Science & Technology

1984-11-01

contract. 0 Elar, .l... & .McC’lelland .1.1. Speech perception a, a cognitive proces,: The interactive act ia- %e., tion model of speech perception. In...attempts to provide a machine solution to the problem of speech perception. A second kind of model, growing out of Cognitive Psychology, attempts to...architectures to cognitive and perceptual problems. We also owe a debt to what we might call the computational connectionists -- those who have applied highly
Unsupervised Decoding of Long-Term, Naturalistic Human Neural Recordings with Automated Video and Audio Annotations

PubMed Central

Wang, Nancy X. R.; Olson, Jared D.; Ojemann, Jeffrey G.; Rao, Rajesh P. N.; Brunton, Bingni W.

2016-01-01

Fully automated decoding of human activities and intentions from direct neural recordings is a tantalizing challenge in brain-computer interfacing. Implementing Brain Computer Interfaces (BCIs) outside carefully controlled experiments in laboratory settings requires adaptive and scalable strategies with minimal supervision. Here we describe an unsupervised approach to decoding neural states from naturalistic human brain recordings. We analyzed continuous, long-term electrocorticography (ECoG) data recorded over many days from the brain of subjects in a hospital room, with simultaneous audio and video recordings. We discovered coherent clusters in high-dimensional ECoG recordings using hierarchical clustering and automatically annotated them using speech and movement labels extracted from audio and video. To our knowledge, this represents the first time techniques from computer vision and speech processing have been used for natural ECoG decoding. Interpretable behaviors were decoded from ECoG data, including moving, speaking and resting; the results were assessed by comparison with manual annotation. Discovered clusters were projected back onto the brain revealing features consistent with known functional areas, opening the door to automated functional brain mapping in natural settings. PMID:27148018
Hearing Lips and Seeing Voices: How Cortical Areas Supporting Speech Production Mediate Audiovisual Speech Perception

PubMed Central

Skipper, Jeremy I.; van Wassenhove, Virginie; Nusbaum, Howard C.; Small, Steven L.

2009-01-01

Observing a speaker’s mouth profoundly influences speech perception. For example, listeners perceive an “illusory” “ta” when the video of a face producing /ka/ is dubbed onto an audio /pa/. Here, we show how cortical areas supporting speech production mediate this illusory percept and audiovisual (AV) speech perception more generally. Specifically, cortical activity during AV speech perception occurs in many of the same areas that are active during speech production. We find that different perceptions of the same syllable and the perception of different syllables are associated with different distributions of activity in frontal motor areas involved in speech production. Activity patterns in these frontal motor areas resulting from the illusory “ta” percept are more similar to the activity patterns evoked by AV/ta/ than they are to patterns evoked by AV/pa/ or AV/ka/. In contrast to the activity in frontal motor areas, stimulus-evoked activity for the illusory “ta” in auditory and somatosensory areas and visual areas initially resembles activity evoked by AV/pa/ and AV/ka/, respectively. Ultimately, though, activity in these regions comes to resemble activity evoked by AV/ta/. Together, these results suggest that AV speech elicits in the listener a motor plan for the production of the phoneme that the speaker might have been attempting to produce, and that feedback in the form of efference copy from the motor system ultimately influences the phonetic interpretation. PMID:17218482
Effects of audio-visual presentation of target words in word translation training

NASA Astrophysics Data System (ADS)

Akahane-Yamada, Reiko; Komaki, Ryo; Kubo, Rieko

2004-05-01

Komaki and Akahane-Yamada (Proc. ICA2004) used 2AFC translation task in vocabulary training, in which the target word is presented visually in orthographic form of one language, and the appropriate meaning in another language has to be chosen between two choices. Present paper examined the effect of audio-visual presentation of target word when native speakers of Japanese learn to translate English words into Japanese. Pairs of English words contrasted in several phonemic distinctions (e.g., /r/-/l/, /b/-/v/, etc.) were used as word materials, and presented in three conditions; visual-only (V), audio-only (A), and audio-visual (AV) presentations. Identification accuracy of those words produced by two talkers was also assessed. During pretest, the accuracy for A stimuli was lowest, implying that insufficient translation ability and listening ability interact with each other when aurally presented word has to be translated. However, there was no difference in accuracy between V and AV stimuli, suggesting that participants translate the words depending on visual information only. The effect of translation training using AV stimuli did not transfer to identification ability, showing that additional audio information during translation does not help improve speech perception. Further examination is necessary to determine the effective L2 training method. [Work supported by TAO, Japan.
Efficient audio signal processing for embedded systems

NASA Astrophysics Data System (ADS)

Chiu, Leung Kin

As mobile platforms continue to pack on more computational power, electronics manufacturers start to differentiate their products by enhancing the audio features. However, consumers also demand smaller devices that could operate for longer time, hence imposing design constraints. In this research, we investigate two design strategies that would allow us to efficiently process audio signals on embedded systems such as mobile phones and portable electronics. In the first strategy, we exploit properties of the human auditory system to process audio signals. We designed a sound enhancement algorithm to make piezoelectric loudspeakers sound ”richer" and "fuller." Piezoelectric speakers have a small form factor but exhibit poor response in the low-frequency region. In the algorithm, we combine psychoacoustic bass extension and dynamic range compression to improve the perceived bass coming out from the tiny speakers. We also developed an audio energy reduction algorithm for loudspeaker power management. The perceptually transparent algorithm extends the battery life of mobile devices and prevents thermal damage in speakers. This method is similar to audio compression algorithms, which encode audio signals in such a ways that the compression artifacts are not easily perceivable. Instead of reducing the storage space, however, we suppress the audio contents that are below the hearing threshold, therefore reducing the signal energy. In the second strategy, we use low-power analog circuits to process the signal before digitizing it. We designed an analog front-end for sound detection and implemented it on a field programmable analog array (FPAA). The system is an example of an analog-to-information converter. The sound classifier front-end can be used in a wide range of applications because programmable floating-gate transistors are employed to store classifier weights. Moreover, we incorporated a feature selection algorithm to simplify the analog front-end. A machine
Nasal and Oral Inspiration during Natural Speech Breathing

ERIC Educational Resources Information Center

Lester, Rosemary A.; Hoit, Jeannette D.

2014-01-01

Purpose: The purpose of this study was to determine the typical pattern for inspiration during speech breathing in healthy adults, as well as the factors that might influence it. Method: Ten healthy adults, 18-45 years of age, performed a variety of speaking tasks while nasal ram pressure, audio, and video recordings were obtained. Inspirations…
Speech-Message Extraction from Interference Introduced by External Distributed Sources

NASA Astrophysics Data System (ADS)

Kanakov, V. A.; Mironov, N. A.

2017-08-01

The problem of this study involves the extraction of a speech signal originating from a certain spatial point and calculation of the intelligibility of the extracted voice message. It is solved by the method of decreasing the influence of interference from the speech-message sources on the extracted signal. This method is based on introducing the time delays, which depend on the spatial coordinates, to the recording channels. Audio records of the voices of eight different people were used as test objects during the studies. It is proved that an increase in the number of microphones improves intelligibility of the speech message which is extracted from interference.
Young Autistic Children's Listening Preferences in Regard to Speech: A Possible Characterization of the Symptom of Social Withdrawal.

ERIC Educational Resources Information Center

Klin, Ami

1991-01-01

Twelve autistic children (ages 4-6) were given a choice between their mothers' speech and the noise of superimposed voices. In contrast to comparison groups of mentally retarded and normally developing children, the autistic children actively preferred the superimposed voices or showed a lack of preference for either audio segment. (Author/JDD)
HomeBank: An Online Repository of Daylong Child-Centered Audio Recordings

PubMed Central

VanDam, Mark; Warlaumont, Anne S.; Bergelson, Elika; Cristia, Alejandrina; Soderstrom, Melanie; De Palma, Paul; MacWhinney, Brian

2017-01-01

HomeBank is introduced here. It is a public, permanent, extensible, online database of daylong audio recorded in naturalistic environments. HomeBank serves two primary purposes. First, it is a repository for raw audio and associated files: one database requires special permissions, and another redacted database allows unrestricted public access. Associated files include metadata such as participant demographics and clinical diagnostics, automated annotations, and human-generated transcriptions and annotations. Many recordings use the child-perspective LENA recorders (LENA Research Foundation, Boulder, Colorado, United States), but various recordings and metadata can be accommodated. The HomeBank database can have both vetted and unvetted recordings, with different levels of accessibility. Additionally, HomeBank is an open repository for processing and analysis tools for HomeBank or similar data sets. HomeBank is flexible for users and contributors, making primary data available to researchers, especially those in child development, linguistics, and audio engineering. HomeBank facilitates researchers’ access to large-scale data and tools, linking the acoustic, auditory, and linguistic characteristics of children’s environments with a variety of variables including socioeconomic status, family characteristics, language trajectories, and disorders. Automated processing applied to daylong home audio recordings is now becoming widely used in early intervention initiatives, helping parents to provide richer speech input to at-risk children. PMID:27111272
Facilities to assist people to research into stammered speech

PubMed Central

Howell, Peter; Huckvale, Mark

2008-01-01

The purpose of this article is to indicate how access can be obtained, through Stammering Research, to audio recordings and transcriptions of spontaneous speech data from speakers who stammer. Selections of the first author’s data are available in several formats. We describe where to obtain free software for manipulation and analysis of the data in their respective formats. Papers reporting analyses of these data are invited as submissions to this section of Stammering Research. It is intended that subsequent analyses that employ these data will be published in Stammering Research on an on-going basis. Plans are outlined to provide similar data from young speakers (ones developing fluently and ones who stammer), follow-up data from speakers who stammer, data from speakers who stammer who do not speak English and from speakers who have other speech disorders, for comparison, all through the pages of Stammering Research. The invitation is extended to those promulgating evidence-based practice approaches (see the Journal of Fluency Disorders, volume 28, number 4 which is a special issue devoted to this topic) and anyone with other interesting data related to stammering to prepare them in a form that can be made accessible to others via Stammering Research. PMID:18418475
One approach to design of speech emotion database

NASA Astrophysics Data System (ADS)

Uhrin, Dominik; Chmelikova, Zdenka; Tovarek, Jaromir; Partila, Pavol; Voznak, Miroslav

2016-05-01

This article describes a system for evaluating the credibility of recordings with emotional character. Sound recordings form Czech language database for training and testing systems of speech emotion recognition. These systems are designed to detect human emotions in his voice. The emotional state of man is useful in the security forces and emergency call service. Man in action (soldier, police officer and firefighter) is often exposed to stress. Information about the emotional state (his voice) will help to dispatch to adapt control commands for procedure intervention. Call agents of emergency call service must recognize the mental state of the caller to adjust the mood of the conversation. In this case, the evaluation of the psychological state is the key factor for successful intervention. A quality database of sound recordings is essential for the creation of the mentioned systems. There are quality databases such as Berlin Database of Emotional Speech or Humaine. The actors have created these databases in an audio studio. It means that the recordings contain simulated emotions, not real. Our research aims at creating a database of the Czech emotional recordings of real human speech. Collecting sound samples to the database is only one of the tasks. Another one, no less important, is to evaluate the significance of recordings from the perspective of emotional states. The design of a methodology for evaluating emotional recordings credibility is described in this article. The results describe the advantages and applicability of the developed method.
Animation, audio, and spatial ability: Optimizing multimedia for scientific explanations

NASA Astrophysics Data System (ADS)

Koroghlanian, Carol May

This study investigated the effects of audio, animation and spatial ability in a computer based instructional program for biology. The program presented instructional material via text or audio with lean text and included eight instructional sequences presented either via static illustrations or animations. High school students enrolled in a biology course were blocked by spatial ability and randomly assigned to one of four treatments (Text-Static Illustration Audio-Static Illustration, Text-Animation, Audio-Animation). The study examined the effects of instructional mode (Text vs. Audio), illustration mode (Static Illustration vs. Animation) and spatial ability (Low vs. High) on practice and posttest achievement, attitude and time. Results for practice achievement indicated that high spatial ability participants achieved more than low spatial ability participants. Similar results for posttest achievement and spatial ability were not found. Participants in the Static Illustration treatments achieved the same as participants in the Animation treatments on both the practice and posttest. Likewise, participants in the Text treatments achieved the same as participants in the Audio treatments on both the practice and posttest. In terms of attitude, participants responded favorably to the computer based instructional program. They found the program interesting, felt the static illustrations or animations made the explanations easier to understand and concentrated on learning the material. Furthermore, participants in the Animation treatments felt the information was easier to understand than participants in the Static Illustration treatments. However, no difference for any attitude item was found for participants in the Text as compared to those in the Audio treatments. Significant differences were found by Spatial Ability for three attitude items concerning concentration and interest. In all three items, the low spatial ability participants responded more positively
Impact of Language on Development of Auditory-Visual Speech Perception

ERIC Educational Resources Information Center

Sekiyama, Kaoru; Burnham, Denis

2008-01-01

The McGurk effect paradigm was used to examine the developmental onset of inter-language differences between Japanese and English in auditory-visual speech perception. Participants were asked to identify syllables in audiovisual (with congruent or discrepant auditory and visual components), audio-only, and video-only presentations at various…
Comparing the Impact of Rates of Text-to-Speech Software on Reading Fluency and Comprehension for Adults with Reading Difficulties

ERIC Educational Resources Information Center

Coleman, Mari Beth; Killdare, Laura K.; Bell, Sherry Mee; Carter, Amanda M.

2014-01-01

The purpose of this study was to determine the impact of text-to-speech software on reading fluency and comprehension for four postsecondary students with below average reading fluency and comprehension including three students diagnosed with learning disabilities and concomitant conditions (e.g., attention deficit hyperactivity disorder, seizure…
MARTI: man-machine animation real-time interface

NASA Astrophysics Data System (ADS)

Jones, Christian M.; Dlay, Satnam S.

1997-05-01

The research introduces MARTI (man-machine animation real-time interface) for the realization of natural human-machine interfacing. The system uses simple vocal sound-tracks of human speakers to provide lip synchronization of computer graphical facial models. We present novel research in a number of engineering disciplines, which include speech recognition, facial modeling, and computer animation. This interdisciplinary research utilizes the latest, hybrid connectionist/hidden Markov model, speech recognition system to provide very accurate phone recognition and timing for speaker independent continuous speech, and expands on knowledge from the animation industry in the development of accurate facial models and automated animation. The research has many real-world applications which include the provision of a highly accurate and 'natural' man-machine interface to assist user interactions with computer systems and communication with one other using human idiosyncrasies; a complete special effects and animation toolbox providing automatic lip synchronization without the normal constraints of head-sets, joysticks, and skilled animators; compression of video data to well below standard telecommunication channel bandwidth for video communications and multi-media systems; assisting speech training and aids for the handicapped; and facilitating player interaction for 'video gaming' and 'virtual worlds.' MARTI has introduced a new level of realism to man-machine interfacing and special effect animation which has been previously unseen.
Perception of audio-visual speech synchrony in Spanish-speaking children with and without specific language impairment

PubMed Central

PONS, FERRAN; ANDREU, LLORENC.; SANZ-TORRENT, MONICA; BUIL-LEGAZ, LUCIA; LEWKOWICZ, DAVID J.

2014-01-01

Speech perception involves the integration of auditory and visual articulatory information and, thus, requires the perception of temporal synchrony between this information. There is evidence that children with Specific Language Impairment (SLI) have difficulty with auditory speech perception but it is not known if this is also true for the integration of auditory and visual speech. Twenty Spanish-speaking children with SLI, twenty typically developing age-matched Spanish-speaking children, and twenty Spanish-speaking children matched for MLU-w participated in an eye-tracking study to investigate the perception of audiovisual speech synchrony. Results revealed that children with typical language development perceived an audiovisual asynchrony of 666ms regardless of whether the auditory or visual speech attribute led the other one. Children with SLI only detected the 666 ms asynchrony when the auditory component followed the visual component. None of the groups perceived an audiovisual asynchrony of 366ms. These results suggest that the difficulty of speech processing by children with SLI would also involve difficulties in integrating auditory and visual aspects of speech perception. PMID:22874648
Perception of audio-visual speech synchrony in Spanish-speaking children with and without specific language impairment.

PubMed

Pons, Ferran; Andreu, Llorenç; Sanz-Torrent, Monica; Buil-Legaz, Lucía; Lewkowicz, David J

2013-06-01

Speech perception involves the integration of auditory and visual articulatory information, and thus requires the perception of temporal synchrony between this information. There is evidence that children with specific language impairment (SLI) have difficulty with auditory speech perception but it is not known if this is also true for the integration of auditory and visual speech. Twenty Spanish-speaking children with SLI, twenty typically developing age-matched Spanish-speaking children, and twenty Spanish-speaking children matched for MLU-w participated in an eye-tracking study to investigate the perception of audiovisual speech synchrony. Results revealed that children with typical language development perceived an audiovisual asynchrony of 666 ms regardless of whether the auditory or visual speech attribute led the other one. Children with SLI only detected the 666 ms asynchrony when the auditory component preceded [corrected] the visual component. None of the groups perceived an audiovisual asynchrony of 366 ms. These results suggest that the difficulty of speech processing by children with SLI would also involve difficulties in integrating auditory and visual aspects of speech perception.
Integrating voice evaluation: correlation between acoustic and audio-perceptual measures.

PubMed

Vaz Freitas, Susana; Melo Pestana, Pedro; Almeida, Vítor; Ferreira, Aníbal

2015-05-01

This article aims to establish correlations between acoustic and audio-perceptual measures using the GRBAS scale with respect to four different voice analysis software programs. Exploratory, transversal. A total of 90 voice records were collected and analyzed with the Dr. Speech (Tiger Electronics, Seattle, WA), Multidimensional Voice Program (Kay Elemetrics, NJ, USA), PRAAT (University of Amsterdam, The Netherlands), and Voice Studio (Seegnal, Oporto, Portugal) software programs. The acoustic measures were correlated to the audio-perceptual parameters of the GRBAS and rated by 10 experts. The predictive value of the acoustic measurements related to the audio-perceptual parameters exhibited magnitudes ranging from weak (R(2)a=0.17) to moderate (R(2)a=0.71). The parameter exhibiting the highest correlation magnitude is B (Breathiness), whereas the weaker correlation magnitudes were found to be for A (Asthenia) and S (Strain). The acoustic measures with stronger predictive values were local Shimmer, harmonics-to-noise ratio, APQ5 shimmer, and PPQ5 jitter, with different magnitudes for each one of the studied software programs. Some acoustic measures are pointed as significant predictors of GRBAS parameters, but they differ among software programs. B (Breathiness) was the parameter exhibiting the highest correlation magnitude. Copyright © 2015 The Voice Foundation. Published by Elsevier Inc. All rights reserved.

Texts as Metaphoric Machines and the Challenge of the Digital

ERIC Educational Resources Information Center

Kouppanou, Anna

2016-01-01

In this essay Anna Kouppanou expands the notion of metaphor from its received meaning to refer to an embodied and material process of connectedness that transforms the domains that it brings together. Because of metaphor's reliance on materiality and exteriority Kouppanou turns to literary texts, which she calls "metaphoric machines." In…
Pitch-Based Segregation of Reverberant Speech

DTIC Science & Technology

2005-02-01

speaker recognition in real environments, audio information retrieval and hearing prosthesis. Second, although binaural listening improves the...intelligibility of target speech under anechoic conditions (Bronkhorst, 2000), this binaural advantage is largely eliminated by reverberation (Plomp, 1976...Brown and Cooke, 1994; Wang and Brown, 1999; Hu and Wang, 2004) as well as in binaural separation (e.g., Roman et al., 2003; Palomaki et al., 2004
A study on nonlinear characteristics of speech sound with reference to some languages of North East region

NASA Astrophysics Data System (ADS)

Dutta, Rashmi

parametric model, which has been most useful in practical applications. Developing a system that can understand natural language has been a continuing goal of speech researchers. Fully automatic high quality machine translation systems are extremely difficult to build. The difficulty arises from the following reasons: In any natural language text, only part of the information to be conveyed is explicitly expressed. It is the human mind which fills up and supplements the details using contextual.
Visual-auditory integration during speech imitation in autism.

PubMed

Williams, Justin H G; Massaro, Dominic W; Peel, Natalie J; Bosseler, Alexis; Suddendorf, Thomas

2004-01-01

Children with autistic spectrum disorder (ASD) may have poor audio-visual integration, possibly reflecting dysfunctional 'mirror neuron' systems which have been hypothesised to be at the core of the condition. In the present study, a computer program, utilizing speech synthesizer software and a 'virtual' head (Baldi), delivered speech stimuli for identification in auditory, visual or bimodal conditions. Children with ASD were poorer than controls at recognizing stimuli in the unimodal conditions, but once performance on this measure was controlled for, no group difference was found in the bimodal condition. A group of participants with ASD were also trained to develop their speech-reading ability. Training improved visual accuracy and this also improved the children's ability to utilize visual information in their processing of speech. Overall results were compared to predictions from mathematical models based on integration and non-integration, and were most consistent with the integration model. We conclude that, whilst they are less accurate in recognizing stimuli in the unimodal condition, children with ASD show normal integration of visual and auditory speech stimuli. Given that training in recognition of visual speech was effective, children with ASD may benefit from multi-modal approaches in imitative therapy and language training.
Scientific bases of human-machine communication by voice.

PubMed Central

Schafer, R W

1995-01-01

The scientific bases for human-machine communication by voice are in the fields of psychology, linguistics, acoustics, signal processing, computer science, and integrated circuit technology. The purpose of this paper is to highlight the basic scientific and technological issues in human-machine communication by voice and to point out areas of future research opportunity. The discussion is organized around the following major issues in implementing human-machine voice communication systems: (i) hardware/software implementation of the system, (ii) speech synthesis for voice output, (iii) speech recognition and understanding for voice input, and (iv) usability factors related to how humans interact with machines. PMID:7479802
A social feedback loop for speech development and its reduction in autism

PubMed Central

Warlaumont, Anne S.; Richards, Jeffrey A.; Gilkerson, Jill; Oller, D. Kimbrough

2014-01-01

We analyze the microstructure of child-adult interaction during naturalistic, daylong, automatically labeled audio recordings (13,836 hours total) of children (8- to 48-month-olds) with and without autism. We find that adult responses are more likely when child vocalizations are speech-related. In turn, a child vocalization is more likely to be speech-related if the previous speech-related child vocalization received an immediate adult response. Taken together, these results are consistent with the idea that there is a social feedback loop between child and caregiver that promotes speech-language development. Although this feedback loop applies in both typical development and autism, children with autism produce proportionally fewer speech-related vocalizations and the responses they receive are less contingent on whether their vocalizations are speech-related. We argue that such differences will diminish the strength of the social feedback loop with cascading effects on speech development over time. Differences related to socioeconomic status are also reported. PMID:24840717
Comparison of audio and audiovisual measures of adult stuttering: Implications for clinical trials.

PubMed

O'Brian, Sue; Jones, Mark; Onslow, Mark; Packman, Ann; Menzies, Ross; Lowe, Robyn

2015-04-15

This study investigated whether measures of percentage syllables stuttered (%SS) and stuttering severity ratings with a 9-point scale differ when made from audiovisual compared with audio-only recordings. Four experienced speech-language pathologists measured %SS and assigned stuttering severity ratings to 10-minute audiovisual and audio-only recordings of 36 adults. There was a mean 18% increase in %SS scores when samples were presented in audiovisual compared with audio-only mode. This result was consistent across both higher and lower %SS scores and was found to be directly attributable to counts of stuttered syllables rather than the total number of syllables. There was no significant difference between stuttering severity ratings made from the two modes. In clinical trials research, when using %SS as the primary outcome measure, audiovisual samples would be preferred as long as clear, good quality, front-on images can be easily captured. Alternatively, stuttering severity ratings may be a more valid measure to use as they correlate well with %SS and values are not influenced by the presentation mode.
The Neural Basis of Speech Perception through Lipreading and Manual Cues: Evidence from Deaf Native Users of Cued Speech

PubMed Central

Aparicio, Mario; Peigneux, Philippe; Charlier, Brigitte; Balériaux, Danielle; Kavec, Martin; Leybaert, Jacqueline

2017-01-01

We present here the first neuroimaging data for perception of Cued Speech (CS) by deaf adults who are native users of CS. CS is a visual mode of communicating a spoken language through a set of manual cues which accompany lipreading and disambiguate it. With CS, sublexical units of the oral language are conveyed clearly and completely through the visual modality without requiring hearing. The comparison of neural processing of CS in deaf individuals with processing of audiovisual (AV) speech in normally hearing individuals represents a unique opportunity to explore the similarities and differences in neural processing of an oral language delivered in a visuo-manual vs. an AV modality. The study included deaf adult participants who were early CS users and native hearing users of French who process speech audiovisually. Words were presented in an event-related fMRI design. Three conditions were presented to each group of participants. The deaf participants saw CS words (manual + lipread), words presented as manual cues alone, and words presented to be lipread without manual cues. The hearing group saw AV spoken words, audio-alone and lipread-alone. Three findings are highlighted. First, the middle and superior temporal gyrus (excluding Heschl’s gyrus) and left inferior frontal gyrus pars triangularis constituted a common, amodal neural basis for AV and CS perception. Second, integration was inferred in posterior parts of superior temporal sulcus for audio and lipread information in AV speech, but in the occipito-temporal junction, including MT/V5, for the manual cues and lipreading in CS. Third, the perception of manual cues showed a much greater overlap with the regions activated by CS (manual + lipreading) than lipreading alone did. This supports the notion that manual cues play a larger role than lipreading for CS processing. The present study contributes to a better understanding of the role of manual cues as support of visual speech perception in the framework
Foreign Subtitles Help but Native-Language Subtitles Harm Foreign Speech Perception

PubMed Central

Mitterer, Holger; McQueen, James M.

2009-01-01

Understanding foreign speech is difficult, in part because of unusual mappings between sounds and words. It is known that listeners in their native language can use lexical knowledge (about how words ought to sound) to learn how to interpret unusual speech-sounds. We therefore investigated whether subtitles, which provide lexical information, support perceptual learning about foreign speech. Dutch participants, unfamiliar with Scottish and Australian regional accents of English, watched Scottish or Australian English videos with Dutch, English or no subtitles, and then repeated audio fragments of both accents. Repetition of novel fragments was worse after Dutch-subtitle exposure but better after English-subtitle exposure. Native-language subtitles appear to create lexical interference, but foreign-language subtitles assist speech learning by indicating which words (and hence sounds) are being spoken. PMID:19918371
Auditory cross-modal reorganization in cochlear implant users indicates audio-visual integration.

PubMed

Stropahl, Maren; Debener, Stefan

2017-01-01

There is clear evidence for cross-modal cortical reorganization in the auditory system of post-lingually deafened cochlear implant (CI) users. A recent report suggests that moderate sensori-neural hearing loss is already sufficient to initiate corresponding cortical changes. To what extend these changes are deprivation-induced or related to sensory recovery is still debated. Moreover, the influence of cross-modal reorganization on CI benefit is also still unclear. While reorganization during deafness may impede speech recovery, reorganization also has beneficial influences on face recognition and lip-reading. As CI users were observed to show differences in multisensory integration, the question arises if cross-modal reorganization is related to audio-visual integration skills. The current electroencephalography study investigated cortical reorganization in experienced post-lingually deafened CI users ( n = 18), untreated mild to moderately hearing impaired individuals (n = 18) and normal hearing controls ( n = 17). Cross-modal activation of the auditory cortex by means of EEG source localization in response to human faces and audio-visual integration, quantified with the McGurk illusion, were measured. CI users revealed stronger cross-modal activations compared to age-matched normal hearing individuals. Furthermore, CI users showed a relationship between cross-modal activation and audio-visual integration strength. This may further support a beneficial relationship between cross-modal activation and daily-life communication skills that may not be fully captured by laboratory-based speech perception tests. Interestingly, hearing impaired individuals showed behavioral and neurophysiological results that were numerically between the other two groups, and they showed a moderate relationship between cross-modal activation and the degree of hearing loss. This further supports the notion that auditory deprivation evokes a reorganization of the auditory system
The Effect of Audio and Animation in Multimedia Instruction

ERIC Educational Resources Information Center

Koroghlanian, Carol; Klein, James D.

2004-01-01

This study investigated the effects of audio, animation, and spatial ability in a multimedia computer program for high school biology. Participants completed a multimedia program that presented content by way of text or audio with lean text. In addition, several instructional sequences were presented either with static illustrations or animations.…
Is talking to an automated teller machine natural and fun?

PubMed

Chan, F Y; Khalid, H M

Usability and affective issues of using automatic speech recognition technology to interact with an automated teller machine (ATM) are investigated in two experiments. The first uncovered dialogue patterns of ATM users for the purpose of designing the user interface for a simulated speech ATM system. Applying the Wizard-of-Oz methodology, multiple mapping and word spotting techniques, the speech driven ATM accommodates bilingual users of Bahasa Melayu and English. The second experiment evaluates the usability of a hybrid speech ATM, comparing it with a simulated manual ATM. The aim is to investigate how natural and fun can talking to a speech ATM be for these first-time users. Subjects performed the withdrawal and balance enquiry tasks. The ANOVA was performed on the usability and affective data. The results showed significant differences between systems in the ability to complete the tasks as well as in transaction errors. Performance was measured on the time taken by subjects to complete the task and the number of speech recognition errors that occurred. On the basis of user emotions, it can be said that the hybrid speech system enabled pleasurable interaction. Despite the limitations of speech recognition technology, users are set to talk to the ATM when it becomes available for public use.
A haptic-inspired audio approach for structural health monitoring decision-making

NASA Astrophysics Data System (ADS)

Mao, Zhu; Todd, Michael; Mascareñas, David

2015-03-01

Haptics is the field at the interface of human touch (tactile sensation) and classification, whereby tactile feedback is used to train and inform a decision-making process. In structural health monitoring (SHM) applications, haptic devices have been introduced and applied in a simplified laboratory scale scenario, in which nonlinearity, representing the presence of damage, was encoded into a vibratory manual interface. In this paper, the "spirit" of haptics is adopted, but here ultrasonic guided wave scattering information is transformed into audio (rather than tactile) range signals. After sufficient training, the structural damage condition, including occurrence and location, can be identified through the encoded audio waveforms. Different algorithms are employed in this paper to generate the transformed audio signals and the performance of each encoding algorithms is compared, and also compared with standard machine learning classifiers. In the long run, the haptic decision-making is aiming to detect and classify structural damages in a more rigorous environment, and approaching a baseline-free fashion with embedded temperature compensation.
Collusion-Resistant Audio Fingerprinting System in the Modulated Complex Lapped Transform Domain

PubMed Central

Garcia-Hernandez, Jose Juan; Feregrino-Uribe, Claudia; Cumplido, Rene

2013-01-01

Collusion-resistant fingerprinting paradigm seems to be a practical solution to the piracy problem as it allows media owners to detect any unauthorized copy and trace it back to the dishonest users. Despite the billionaire losses in the music industry, most of the collusion-resistant fingerprinting systems are devoted to digital images and very few to audio signals. In this paper, state-of-the-art collusion-resistant fingerprinting ideas are extended to audio signals and the corresponding parameters and operation conditions are proposed. Moreover, in order to carry out fingerprint detection using just a fraction of the pirate audio clip, block-based embedding and its corresponding detector is proposed. Extensive simulations show the robustness of the proposed system against average collusion attack. Moreover, by using an efficient Fast Fourier Transform core and standard computer machines it is shown that the proposed system is suitable for real-world scenarios. PMID:23762455
Using machine learning to disentangle homonyms in large text corpora.

PubMed

Roll, Uri; Correia, Ricardo A; Berger-Tal, Oded

2018-06-01

Systematic reviews are an increasingly popular decision-making tool that provides an unbiased summary of evidence to support conservation action. These reviews bridge the gap between researchers and managers by presenting a comprehensive overview of all studies relating to a particular topic and identify specifically where and under which conditions an effect is present. However, several technical challenges can severely hinder the feasibility and applicability of systematic reviews, for example, homonyms (terms that share spelling but differ in meaning). Homonyms add noise to search results and cannot be easily identified or removed. We developed a semiautomated approach that can aid in the classification of homonyms among narratives. We used a combination of automated content analysis and artificial neural networks to quickly and accurately sift through large corpora of academic texts and classify them to distinct topics. As an example, we explored the use of the word reintroduction in academic texts. Reintroduction is used within the conservation context to indicate the release of organisms to their former native habitat; however, a Web of Science search for this word returned thousands of publications in which the term has other meanings and contexts. Using our method, we automatically classified a sample of 3000 of these publications with over 99% accuracy, relative to a manual classification. Our approach can be used easily with other homonyms and can greatly facilitate systematic reviews or similar work in which homonyms hinder the harnessing of large text corpora. Beyond homonyms we see great promise in combining automated content analysis and machine-learning methods to handle and screen big data for relevant information in conservation science. © 2017 Society for Conservation Biology.
Internet Video Telephony Allows Speech Reading by Deaf Individuals and Improves Speech Perception by Cochlear Implant Users

PubMed Central

Mantokoudis, Georgios; Dähler, Claudia; Dubach, Patrick; Kompis, Martin; Caversaccio, Marco D.; Senn, Pascal

2013-01-01

Objective To analyze speech reading through Internet video calls by profoundly hearing-impaired individuals and cochlear implant (CI) users. Methods Speech reading skills of 14 deaf adults and 21 CI users were assessed using the Hochmair Schulz Moser (HSM) sentence test. We presented video simulations using different video resolutions (1280×720, 640×480, 320×240, 160×120 px), frame rates (30, 20, 10, 7, 5 frames per second (fps)), speech velocities (three different speakers), webcameras (Logitech Pro9000, C600 and C500) and image/sound delays (0–500 ms). All video simulations were presented with and without sound and in two screen sizes. Additionally, scores for live Skype™ video connection and live face-to-face communication were assessed. Results Higher frame rate (>7 fps), higher camera resolution (>640×480 px) and shorter picture/sound delay (<100 ms) were associated with increased speech perception scores. Scores were strongly dependent on the speaker but were not influenced by physical properties of the camera optics or the full screen mode. There is a significant median gain of +8.5%pts (p = 0.009) in speech perception for all 21 CI-users if visual cues are additionally shown. CI users with poor open set speech perception scores (n = 11) showed the greatest benefit under combined audio-visual presentation (median speech perception +11.8%pts, p = 0.032). Conclusion Webcameras have the potential to improve telecommunication of hearing-impaired individuals. PMID:23359119
Internet video telephony allows speech reading by deaf individuals and improves speech perception by cochlear implant users.

PubMed

Mantokoudis, Georgios; Dähler, Claudia; Dubach, Patrick; Kompis, Martin; Caversaccio, Marco D; Senn, Pascal

2013-01-01

To analyze speech reading through Internet video calls by profoundly hearing-impaired individuals and cochlear implant (CI) users. Speech reading skills of 14 deaf adults and 21 CI users were assessed using the Hochmair Schulz Moser (HSM) sentence test. We presented video simulations using different video resolutions (1280 × 720, 640 × 480, 320 × 240, 160 × 120 px), frame rates (30, 20, 10, 7, 5 frames per second (fps)), speech velocities (three different speakers), webcameras (Logitech Pro9000, C600 and C500) and image/sound delays (0-500 ms). All video simulations were presented with and without sound and in two screen sizes. Additionally, scores for live Skype™ video connection and live face-to-face communication were assessed. Higher frame rate (>7 fps), higher camera resolution (>640 × 480 px) and shorter picture/sound delay (<100 ms) were associated with increased speech perception scores. Scores were strongly dependent on the speaker but were not influenced by physical properties of the camera optics or the full screen mode. There is a significant median gain of +8.5%pts (p = 0.009) in speech perception for all 21 CI-users if visual cues are additionally shown. CI users with poor open set speech perception scores (n = 11) showed the greatest benefit under combined audio-visual presentation (median speech perception +11.8%pts, p = 0.032). Webcameras have the potential to improve telecommunication of hearing-impaired individuals.
Natural speech algorithm applied to baseline interview data can predict which patients will respond to psilocybin for treatment-resistant depression.

PubMed

Carrillo, Facundo; Sigman, Mariano; Fernández Slezak, Diego; Ashton, Philip; Fitzgerald, Lily; Stroud, Jack; Nutt, David J; Carhart-Harris, Robin L

2018-04-01

Natural speech analytics has seen some improvements over recent years, and this has opened a window for objective and quantitative diagnosis in psychiatry. Here, we used a machine learning algorithm applied to natural speech to ask whether language properties measured before psilocybin for treatment-resistant can predict for which patients it will be effective and for which it will not. A baseline autobiographical memory interview was conducted and transcribed. Patients with treatment-resistant depression received 2 doses of psilocybin, 10 mg and 25 mg, 7 days apart. Psychological support was provided before, during and after all dosing sessions. Quantitative speech measures were applied to the interview data from 17 patients and 18 untreated age-matched healthy control subjects. A machine learning algorithm was used to classify between controls and patients and predict treatment response. Speech analytics and machine learning successfully differentiated depressed patients from healthy controls and identified treatment responders from non-responders with a significant level of 85% of accuracy (75% precision). Automatic natural language analysis was used to predict effective response to treatment with psilocybin, suggesting that these tools offer a highly cost-effective facility for screening individuals for treatment suitability and sensitivity. The sample size was small and replication is required to strengthen inferences on these results. Copyright © 2018 Elsevier B.V. All rights reserved.
Data-driven analysis of functional brain interactions during free listening to music and speech.

PubMed

Fang, Jun; Hu, Xintao; Han, Junwei; Jiang, Xi; Zhu, Dajiang; Guo, Lei; Liu, Tianming

2015-06-01

Natural stimulus functional magnetic resonance imaging (N-fMRI) such as fMRI acquired when participants were watching video streams or listening to audio streams has been increasingly used to investigate functional mechanisms of the human brain in recent years. One of the fundamental challenges in functional brain mapping based on N-fMRI is to model the brain's functional responses to continuous, naturalistic and dynamic natural stimuli. To address this challenge, in this paper we present a data-driven approach to exploring functional interactions in the human brain during free listening to music and speech streams. Specifically, we model the brain responses using N-fMRI by measuring the functional interactions on large-scale brain networks with intrinsically established structural correspondence, and perform music and speech classification tasks to guide the systematic identification of consistent and discriminative functional interactions when multiple subjects were listening music and speech in multiple categories. The underlying premise is that the functional interactions derived from N-fMRI data of multiple subjects should exhibit both consistency and discriminability. Our experimental results show that a variety of brain systems including attention, memory, auditory/language, emotion, and action networks are among the most relevant brain systems involved in classic music, pop music and speech differentiation. Our study provides an alternative approach to investigating the human brain's mechanism in comprehension of complex natural music and speech.
Influence of speech sample on perceptual rating of hypernasality.

PubMed

Medeiros, Maria Natália Leite de; Fukushiro, Ana Paula; Yamashita, Renata Paciello

2016-07-07

To investigate the influence of speech sample of spontaneous conversation or sentences repetition on intra and inter-rater hypernasality reliability. One hundred and twenty audio recorded speech samples (60 containing spontaneous conversation and 60 containing repeated sentences) of individuals with repaired cleft palate±lip, both genders, aged between 6 and 52 years old (mean=21±10) were selected and edited. Three experienced speech and language pathologists rated hypernasality according to their own criteria using 4-point scale: 1=absence of hypernasality, 2=mild hypernasality, 3=moderate hypernasality and 4=severe hypernasality, first in spontaneous speech samples and 30 days after, in sentences repetition samples. Intra- and inter-rater agreements were calculated for both speech samples and were statistically compared by the Z test at a significance level of 5%. Comparison of intra-rater agreements between both speech samples showed an increase of the coefficients obtained in the analysis of sentences repetition compared to those obtained in spontaneous conversation. Comparison between inter-rater agreement showed no significant difference among the three raters for the two speech samples. Sentences repetition improved intra-raters reliability of perceptual judgment of hypernasality. However, the speech sample had no influence on reliability among different raters.

A study of speech interfaces for the vehicle environment.

DOT National Transportation Integrated Search

2013-05-01

Over the past few years, there has been a shift in automotive human machine interfaces from : visual-manual interactions (pushing buttons and rotating knobs) to speech interaction. In terms of : distraction, the industry views speech interaction as a...
Two Stage Data Augmentation for Low Resourced Speech Recognition (Author’s Manuscript)

DTIC Science & Technology

2016-09-12

speech recognition, deep neural networks, data augmentation 1. Introduction When training data is limited—whether it be audio or text—the obvious...Schwartz, and S. Tsakalidis, “Enhancing low resource keyword spotting with au- tomatically retrieved web documents,” in Interspeech, 2015, pp. 839–843. [2...and F. Seide, “Feature learning in deep neural networks - a study on speech recognition tasks,” in International Conference on Learning Representations
LANDMARK-BASED SPEECH RECOGNITION: REPORT OF THE 2004 JOHNS HOPKINS SUMMER WORKSHOP.

PubMed

Hasegawa-Johnson, Mark; Baker, James; Borys, Sarah; Chen, Ken; Coogan, Emily; Greenberg, Steven; Juneja, Amit; Kirchhoff, Katrin; Livescu, Karen; Mohan, Srividya; Muller, Jennifer; Sonmez, Kemal; Wang, Tianyu

2005-01-01

Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perception and phonology (specifically landmark-based speech perception, nonlinear phonology, and articulatory phonology). All three systems begin with a high-dimensional multiframe acoustic-to-distinctive feature transformation, implemented using support vector machines trained to detect and classify acoustic phonetic landmarks. Distinctive feature probabilities estimated by the support vector machines are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.
Digital Audio Application to Short Wave Broadcasting

NASA Technical Reports Server (NTRS)

Chen, Edward Y.

1997-01-01

Digital audio is becoming prevalent not only in consumer electornics, but also in different broadcasting media. Terrestrial analog audio broadcasting in the AM and FM bands will be eventually be replaced by digital systems.
A social feedback loop for speech development and its reduction in autism.

PubMed

Warlaumont, Anne S; Richards, Jeffrey A; Gilkerson, Jill; Oller, D Kimbrough

2014-07-01

We analyzed the microstructure of child-adult interaction during naturalistic, daylong, automatically labeled audio recordings (13,836 hr total) of children (8- to 48-month-olds) with and without autism. We found that an adult was more likely to respond when the child's vocalization was speech related rather than not speech related. In turn, a child's vocalization was more likely to be speech related if the child's previous speech-related vocalization had received an immediate adult response rather than no response. Taken together, these results are consistent with the idea that there is a social feedback loop between child and caregiver that promotes speech development. Although this feedback loop applies in both typical development and autism, children with autism produced proportionally fewer speech-related vocalizations, and the responses they received were less contingent on whether their vocalizations were speech related. We argue that such differences will diminish the strength of the social feedback loop and have cascading effects on speech development over time. Differences related to socioeconomic status are also reported. © The Author(s) 2014.
Can you hear me yet? An intracranial investigation of speech and non-speech audiovisual interactions in human cortex.

PubMed

Rhone, Ariane E; Nourski, Kirill V; Oya, Hiroyuki; Kawasaki, Hiroto; Howard, Matthew A; McMurray, Bob

In everyday conversation, viewing a talker's face can provide information about the timing and content of an upcoming speech signal, resulting in improved intelligibility. Using electrocorticography, we tested whether human auditory cortex in Heschl's gyrus (HG) and on superior temporal gyrus (STG) and motor cortex on precentral gyrus (PreC) were responsive to visual/gestural information prior to the onset of sound and whether early stages of auditory processing were sensitive to the visual content (speech syllable versus non-speech motion). Event-related band power (ERBP) in the high gamma band was content-specific prior to acoustic onset on STG and PreC, and ERBP in the beta band differed in all three areas. Following sound onset, we found with no evidence for content-specificity in HG, evidence for visual specificity in PreC, and specificity for both modalities in STG. These results support models of audio-visual processing in which sensory information is integrated in non-primary cortical areas.
Aircraft noise and speech intelligibility in an outdoor living space.

PubMed

Alvarsson, Jesper J; Nordström, Henrik; Lundén, Peter; Nilsson, Mats E

2014-06-01

Studies of effects on speech intelligibility from aircraft noise in outdoor places are currently lacking. To explore these effects, first-order ambisonic recordings of aircraft noise were reproduced outdoors in a pergola. The average background level was 47 dB LA eq. Lists of phonetically balanced words (LAS max,word = 54 dB) were reproduced simultaneously with aircraft passage noise (LAS max,noise = 72-84 dB). Twenty individually tested listeners wrote down each presented word while seated in the pergola. The main results were (i) aircraft noise negatively affects speech intelligibility at sound pressure levels that exceed those of the speech sound (signal-to-noise ratio, S/N < 0), and (ii) the simple A-weighted S/N ratio was nearly as good an indicator of speech intelligibility as were two more advanced models, the Speech Intelligibility Index and Glasberg and Moore's [J. Audio Eng. Soc. 53, 906-918 (2005)] partial loudness model. This suggests that any of these indicators is applicable for predicting effects of aircraft noise on speech intelligibility outdoors.
Applications of Hilbert Spectral Analysis for Speech and Sound Signals

NASA Technical Reports Server (NTRS)

Huang, Norden E.

2003-01-01

A new method for analyzing nonlinear and nonstationary data has been developed, and the natural applications are to speech and sound signals. The key part of the method is the Empirical Mode Decomposition method with which any complicated data set can be decomposed into a finite and often small number of Intrinsic Mode Functions (IMF). An IMF is defined as any function having the same numbers of zero-crossing and extrema, and also having symmetric envelopes defined by the local maxima and minima respectively. The IMF also admits well-behaved Hilbert transform. This decomposition method is adaptive, and, therefore, highly efficient. Since the decomposition is based on the local characteristic time scale of the data, it is applicable to nonlinear and nonstationary processes. With the Hilbert transform, the Intrinsic Mode Functions yield instantaneous frequencies as functions of time, which give sharp identifications of imbedded structures. This method invention can be used to process all acoustic signals. Specifically, it can process the speech signals for Speech synthesis, Speaker identification and verification, Speech recognition, and Sound signal enhancement and filtering. Additionally, as the acoustical signals from machinery are essentially the way the machines are talking to us. Therefore, the acoustical signals, from the machines, either from sound through air or vibration on the machines, can tell us the operating conditions of the machines. Thus, we can use the acoustic signal to diagnosis the problems of machines.
Musical examination to bridge audio data and sheet music

NASA Astrophysics Data System (ADS)

Pan, Xunyu; Cross, Timothy J.; Xiao, Liangliang; Hei, Xiali

2015-03-01

The digitalization of audio is commonly implemented for the purpose of convenient storage and transmission of music and songs in today's digital age. Analyzing digital audio for an insightful look at a specific musical characteristic, however, can be quite challenging for various types of applications. Many existing musical analysis techniques can examine a particular piece of audio data. For example, the frequency of digital sound can be easily read and identified at a specific section in an audio file. Based on this information, we could determine the musical note being played at that instant, but what if you want to see a list of all the notes played in a song? While most existing methods help to provide information about a single piece of the audio data at a time, few of them can analyze the available audio file on a larger scale. The research conducted in this work considers how to further utilize the examination of audio data by storing more information from the original audio file. In practice, we develop a novel musical analysis system Musicians Aid to process musical representation and examination of audio data. Musicians Aid solves the previous problem by storing and analyzing the audio information as it reads it rather than tossing it aside. The system can provide professional musicians with an insightful look at the music they created and advance their understanding of their work. Amateur musicians could also benefit from using it solely for the purpose of obtaining feedback about a song they were attempting to play. By comparing our system's interpretation of traditional sheet music with their own playing, a musician could ensure what they played was correct. More specifically, the system could show them exactly where they went wrong and how to adjust their mistakes. In addition, the application could be extended over the Internet to allow users to play music with one another and then review the audio data they produced. This would be particularly
Department of Cybernetic Acoustics

NASA Astrophysics Data System (ADS)

The development of the theory, instrumentation and applications of methods and systems for the measurement, analysis, processing and synthesis of acoustic signals within the audio frequency range, particularly of the speech signal and the vibro-acoustic signal emitted by technical and industrial equipments treated as noise and vibration sources was discussed. The research work, both theoretical and experimental, aims at applications in various branches of science, and medicine, such as: acoustical diagnostics and phoniatric rehabilitation of pathological and postoperative states of the speech organ; bilateral ""man-machine'' speech communication based on the analysis, recognition and synthesis of the speech signal; vibro-acoustical diagnostics and continuous monitoring of the state of machines, technical equipments and technological processes.
Reference-free automatic quality assessment of tracheoesophageal speech.

PubMed

Huang, Andy; Falk, Tiago H; Chan, Wai-Yip; Parsa, Vijay; Doyle, Philip

2009-01-01

Evaluation of the quality of tracheoesophageal (TE) speech using machines instead of human experts can enhance the voice rehabilitation process for patients who have undergone total laryngectomy and voice restoration. Towards the goal of devising a reference-free TE speech quality estimation algorithm, we investigate the efficacy of speech signal features that are used in standard telephone-speech quality assessment algorithms, in conjunction with a recently introduced speech modulation spectrum measure. Tests performed on two TE speech databases demonstrate that the modulation spectral measure and a subset of features in the standard ITU-T P.563 algorithm estimate TE speech quality with better correlation (up to 0.9) than previously proposed features.
The influence of visual speech information on the intelligibility of English consonants produced by non-native speakers.

PubMed

Kawase, Saya; Hannah, Beverly; Wang, Yue

2014-09-01

This study examines how visual speech information affects native judgments of the intelligibility of speech sounds produced by non-native (L2) speakers. Native Canadian English perceivers as judges perceived three English phonemic contrasts (/b-v, θ-s, l-ɹ/) produced by native Japanese speakers as well as native Canadian English speakers as controls. These stimuli were presented under audio-visual (AV, with speaker voice and face), audio-only (AO), and visual-only (VO) conditions. The results showed that, across conditions, the overall intelligibility of Japanese productions of the native (Japanese)-like phonemes (/b, s, l/) was significantly higher than the non-Japanese phonemes (/v, θ, ɹ/). In terms of visual effects, the more visually salient non-Japanese phonemes /v, θ/ were perceived as significantly more intelligible when presented in the AV compared to the AO condition, indicating enhanced intelligibility when visual speech information is available. However, the non-Japanese phoneme /ɹ/ was perceived as less intelligible in the AV compared to the AO condition. Further analysis revealed that, unlike the native English productions, the Japanese speakers produced /ɹ/ without visible lip-rounding, indicating that non-native speakers' incorrect articulatory configurations may decrease the degree of intelligibility. These results suggest that visual speech information may either positively or negatively affect L2 speech intelligibility.
Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.

PubMed

Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina

2015-07-01

It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. These results are in line
Action Unit Models of Facial Expression of Emotion in the Presence of Speech

PubMed Central

Shah, Miraj; Cooper, David G.; Cao, Houwei; Gur, Ruben C.; Nenkova, Ani; Verma, Ragini

2014-01-01

Automatic recognition of emotion using facial expressions in the presence of speech poses a unique challenge because talking reveals clues for the affective state of the speaker but distorts the canonical expression of emotion on the face. We introduce a corpus of acted emotion expression where speech is either present (talking) or absent (silent). The corpus is uniquely suited for analysis of the interplay between the two conditions. We use a multimodal decision level fusion classifier to combine models of emotion from talking and silent faces as well as from audio to recognize five basic emotions: anger, disgust, fear, happy and sad. Our results strongly indicate that emotion prediction in the presence of speech from action unit facial features is less accurate when the person is talking. Modeling talking and silent expressions separately and fusing the two models greatly improves accuracy of prediction in the talking setting. The advantages are most pronounced when silent and talking face models are fused with predictions from audio features. In this multi-modal prediction both the combination of modalities and the separate models of talking and silent facial expression of emotion contribute to the improvement. PMID:25525561
Nasal and Oral Inspiration During Natural Speech Breathing

PubMed Central

Lester, Rosemary A.; Hoit, Jeannette D.

2015-01-01

Purpose The purpose of this study was to determine the typical pattern for inspiration during speech breathing in healthy adults, as well as the factors that might influence it. Method Ten healthy adults, 18–45 years of age, performed a variety of speaking tasks while nasal ram pressure, audio, and video recordings were obtained. Inspirations were categorized as a nasal only, oral only, simultaneous nasal and oral, or alternating nasal and oral inspiration. The method was validated using nasal airflow, oral airflow, audio, and video recordings for two participants. Results The predominant pattern was simultaneous nasal and oral inspirations for all speaking tasks. This pattern was not affected by the nature of the speaking task or by the phonetic context surrounding the inspiration. The validation procedure confirmed that nearly all inspirations during counting and paragraph reading were simultaneous nasal and oral inspirations; whereas for sentence reading, the predominant pattern was alternating nasal and oral inspirations across the three phonetic contexts. Conclusions Healthy adults inspire through both the nose and mouth during natural speech breathing. This pattern of inspiration is likely beneficial in reducing pathway resistance while preserving some of the benefits of nasal breathing. PMID:24129013
Advances to the development of a basic Mexican sign-to-speech and text language translator

NASA Astrophysics Data System (ADS)

Garcia-Bautista, G.; Trujillo-Romero, F.; Diaz-Gonzalez, G.

2016-09-01

Sign Language (SL) is the basic alternative communication method between deaf people. However, most of the hearing people have trouble understanding the SL, making communication with deaf people almost impossible and taking them apart from daily activities. In this work we present an automatic basic real-time sign language translator capable of recognize a basic list of Mexican Sign Language (MSL) signs of 10 meaningful words, letters (A-Z) and numbers (1-10) and translate them into speech and text. The signs were collected from a group of 35 MSL signers executed in front of a Microsoft Kinect™ Sensor. The hand gesture recognition system use the RGB-D camera to build and storage data point clouds, color and skeleton tracking information. In this work we propose a method to obtain the representative hand trajectory pattern information. We use Euclidean Segmentation method to obtain the hand shape and Hierarchical Centroid as feature extraction method for images of numbers and letters. A pattern recognition method based on a Back Propagation Artificial Neural Network (ANN) is used to interpret the hand gestures. Finally, we use K-Fold Cross Validation method for training and testing stages. Our results achieve an accuracy of 95.71% on words, 98.57% on numbers and 79.71% on letters. In addition, an interactive user interface was designed to present the results in voice and text format.
A Prospectus for the Future Development of a Speech Lab: Hypertext Applications.

ERIC Educational Resources Information Center

Berube, David M.

This paper presents a plan for the next generation of speech laboratories which integrates technologies of modern communication in order to improve and modernize the instructional process. The paper first examines the application of intermediate technologies including audio-video recording and playback, computer assisted instruction and testing…
Timing of Gestures: Gestures Anticipating or Simultaneous with Speech as Indexes of Text Comprehension in Children and Adults

ERIC Educational Resources Information Center

Ianì, Francesco; Cutica, Ilaria; Bucciarelli, Monica

2017-01-01

The deep comprehension of a text is tantamount to the construction of an articulated mental model of that text. The number of correct recollections is an index of a learner's mental model of a text. We assume that another index of comprehension is the timing of the gestures produced during text recall; gestures are simultaneous with speech when…
Speech emotion recognition methods: A literature review

NASA Astrophysics Data System (ADS)

Basharirad, Babak; Moradhaseli, Mohammadreza

2017-10-01

Recently, attention of the emotional speech signals research has been boosted in human machine interfaces due to availability of high computation capability. There are many systems proposed in the literature to identify the emotional state through speech. Selection of suitable feature sets, design of a proper classifications methods and prepare an appropriate dataset are the main key issues of speech emotion recognition systems. This paper critically analyzed the current available approaches of speech emotion recognition methods based on the three evaluating parameters (feature set, classification of features, accurately usage). In addition, this paper also evaluates the performance and limitations of available methods. Furthermore, it highlights the current promising direction for improvement of speech emotion recognition systems.
The Automation System Censor Speech for the Indonesian Rude Swear Words Based on Support Vector Machine and Pitch Analysis

NASA Astrophysics Data System (ADS)

Endah, S. N.; Nugraheni, D. M. K.; Adhy, S.; Sutikno

2017-04-01

According to Law No. 32 of 2002 and the Indonesian Broadcasting Commission Regulation No. 02/P/KPI/12/2009 & No. 03/P/KPI/12/2009, stated that broadcast programs should not scold with harsh words, not harass, insult or demean minorities and marginalized groups. However, there are no suitable tools to censor those words automatically. Therefore, researches to develop a system of intelligent software to censor the words automatically are needed. To conduct censor, the system must be able to recognize the words in question. This research proposes the classification of speech divide into two classes using Support Vector Machine (SVM), first class is set of rude words and the second class is set of properly words. The speech pitch values as an input in SVM, it used for the development of the system for the Indonesian rude swear word. The results of the experiment show that SVM is good for this system.

Calibration of Clinical Audio Recording and Analysis Systems for Sound Intensity Measurement.

PubMed

Maryn, Youri; Zarowski, Andrzej

2015-11-01

Sound intensity is an important acoustic feature of voice/speech signals. Yet recordings are performed with different microphone, amplifier, and computer configurations, and it is therefore crucial to calibrate sound intensity measures of clinical audio recording and analysis systems on the basis of output of a sound-level meter. This study was designed to evaluate feasibility, validity, and accuracy of calibration methods, including audiometric speech noise signals and human voice signals under typical speech conditions. Calibration consisted of 3 comparisons between data from 29 measurement microphone-and-computer systems and data from the sound-level meter: signal-specific comparison with audiometric speech noise at 5 levels, signal-specific comparison with natural voice at 3 levels, and cross-signal comparison with natural voice at 3 levels. Intensity measures from recording systems were then linearly converted into calibrated data on the basis of these comparisons, and validity and accuracy of calibrated sound intensity were investigated. Very strong correlations and quasisimilarity were found between calibrated data and sound-level meter data across calibration methods and recording systems. Calibration of clinical sound intensity measures according to this method is feasible, valid, accurate, and representative for a heterogeneous set of microphones and data acquisition systems in real-life circumstances with distinct noise contexts.
Design Foundations for Content-Rich Acoustic Interfaces: Investigating Audemes as Referential Non-Speech Audio Cues

ERIC Educational Resources Information Center

Ferati, Mexhid Adem

2012-01-01

To access interactive systems, blind and visually impaired users can leverage their auditory senses by using non-speech sounds. The current structure of non-speech sounds, however, is geared toward conveying user interface operations (e.g., opening a file) rather than large theme-based information (e.g., a history passage) and, thus, is ill-suited…
The development of co-speech gesture in the communication of children with autism spectrum disorders.

PubMed

Sowden, Hannah; Clegg, Judy; Perkins, Michael

2013-12-01

Co-speech gestures have a close semantic relationship to speech in adult conversation. In typically developing children co-speech gestures which give additional information to speech facilitate the emergence of multi-word speech. A difficulty with integrating audio-visual information is known to exist for individuals with Autism Spectrum Disorder (ASD), which may affect development of the speech-gesture system. A longitudinal observational study was conducted with four children with ASD, aged 2;4 to 3;5 years. Participants were video-recorded for 20 min every 2 weeks during their attendance on an intervention programme. Recording continued for up to 8 months, thus affording a rich analysis of gestural practices from pre-verbal to multi-word speech across the group. All participants combined gesture with either speech or vocalisations. Co-speech gestures providing additional information to speech were observed to be either absent or rare. Findings suggest that children with ASD do not make use of the facilitating communicative effects of gesture in the same way as typically developing children.
Hierarchical vs non-hierarchical audio indexation and classification for video genres

NASA Astrophysics Data System (ADS)

Dammak, Nouha; BenAyed, Yassine

2018-04-01

In this paper, Support Vector Machines (SVMs) are used for segmenting and indexing video genres based on only audio features extracted at block level, which has a prominent asset by capturing local temporal information. The main contribution of our study is to show the wide effect on the classification accuracies while using an hierarchical categorization structure based on Mel Frequency Cepstral Coefficients (MFCC) audio descriptor. In fact, the classification consists in three common video genres: sports videos, music clips and news scenes. The sub-classification may divide each genre into several multi-speaker and multi-dialect sub-genres. The validation of this approach was carried out on over 360 minutes of video span yielding a classification accuracy of over 99%.
Audio-Vision: Audio-Visual Interaction in Desktop Multimedia.

ERIC Educational Resources Information Center

Daniels, Lee

Although sophisticated multimedia authoring applications are now available to amateur programmers, the use of audio in of these programs has been inadequate. Due to the lack of research in the use of audio in instruction, there are few resources to assist the multimedia producer in using sound effectively and efficiently. This paper addresses the…
Techniques for decoding speech phonemes and sounds: A concept

NASA Technical Reports Server (NTRS)

Lokerson, D. C.; Holby, H. G.

1975-01-01

Techniques studied involve conversion of speech sounds into machine-compatible pulse trains. (1) Voltage-level quantizer produces number of output pulses proportional to amplitude characteristics of vowel-type phoneme waveforms. (2) Pulses produced by quantizer of first speech formants are compared with pulses produced by second formants.
Characteristics of audio and sub-audio telluric signals

DOE Office of Scientific and Technical Information (OSTI.GOV)

Telford, W.M.

1977-06-01

Telluric current measurements in the audio and sub-audio frequency range, made in various parts of Canada and South America over the past four years, indicate that the signal amplitude is relatively uniform over 6 to 8 midday hours (LMT) except in Chile and that the signal anisotropy is reasonably constant in azimuth.
Individual Differences in Audio-Vocal Speech Imitation Aptitude in Late Bilinguals: Functional Neuro-Imaging and Brain Morphology

PubMed Central

Reiterer, Susanne Maria; Hu, Xiaochen; Erb, Michael; Rota, Giuseppina; Nardo, Davide; Grodd, Wolfgang; Winkler, Susanne; Ackermann, Hermann

2011-01-01

An unanswered question in adult language learning or late bi and multilingualism is why individuals show marked differences in their ability to imitate foreign accents. While recent research acknowledges that more adults than previously assumed can still acquire a “native” foreign accent, very little is known about the neuro-cognitive correlates of this special ability. We investigated 140 German-speaking individuals displaying varying degrees of “mimicking” capacity, based on natural language text, sentence, and word imitations either in their second language English or in Hindi and Tamil, languages they had never been exposed to. The large subject pool was strictly controlled for previous language experience prior to magnetic resonance imaging. The late-onset (around 10 years) bilinguals showed significant individual differences as to how they employed their left-hemisphere speech areas: higher hemodynamic activation in a distinct fronto-parietal network accompanied low ability, while high ability paralleled enhanced gray matter volume in these areas concomitant with decreased hemodynamic responses. Finally and unexpectedly, males were found to be more talented foreign speech mimics. PMID:22059077
Space Shuttle Orbiter audio subsystem. [to communication and tracking system

NASA Technical Reports Server (NTRS)

Stewart, C. H.

1978-01-01

The selection of the audio multiplex control configuration for the Space Shuttle Orbiter audio subsystem is discussed and special attention is given to the evaluation criteria of cost, weight and complexity. The specifications and design of the subsystem are described and detail is given to configurations of the audio terminal and audio central control unit (ATU, ACCU). The audio input from the ACCU, at a signal level of -12.2 to 14.8 dBV, nominal range, at 1 kHz, was found to have balanced source impedance and a balanced local impedance of 6000 + or - 600 ohms at 1 kHz, dc isolated. The Lyndon B. Johnson Space Center (JSC) electroacoustic test laboratory, an audio engineering facility consisting of a collection of acoustic test chambers, analyzed problems of speaker and headset performance, multiplexed control data coupled with audio channels, and the Orbiter cabin acoustic effects on the operational performance of voice communications. This system allows technical management and project engineering to address key constraining issues, such as identifying design deficiencies of the headset interface unit and the assessment of the Orbiter cabin performance of voice communications, which affect the subsystem development.
The Cleft Care UK study. Part 4: perceptual speech outcomes.

PubMed

Sell, D; Mildinhall, S; Albery, L; Wills, A K; Sandy, J R; Ness, A R

2015-11-01

To describe the perceptual speech outcomes from the Cleft Care UK (CCUK) study and compare them to the 1998 Clinical Standards Advisory Group (CSAG) audit. A cross-sectional study of 248 children born with complete unilateral cleft lip and palate, between 1 April 2005 and 31 March 2007 who underwent speech assessment. Centre-based specialist speech and language therapists (SLT) took speech audio-video recordings according to nationally agreed guidelines. Two independent listeners undertook the perceptual analysis using the CAPS-A Audit tool. Intra- and inter-rater reliability were tested. For each speech parameter of intelligibility/distinctiveness, hypernasality, palatal/palatalization, backed to velar/uvular, glottal, weak and nasalized consonants, and nasal realizations, there was strong evidence that speech outcomes were better in the CCUK children compared to CSAG children. The parameters which did not show improvement were nasal emission, nasal turbulence, hyponasality and lateral/lateralization. These results suggest that centralization of cleft care into high volume centres has resulted in improvements in UK speech outcomes in five-year-olds with unilateral cleft lip and palate. This may be associated with the development of a specialized workforce. Nevertheless, there still remains a group of children with significant difficulties at school entry. © The Authors. Orthodontics & Craniofacial Research Published by John Wiley & Sons Ltd.
Collaborative human-machine analysis to disambiguate entities in unstructured text and structured datasets

NASA Astrophysics Data System (ADS)

Davenport, Jack H.

2016-05-01

Intelligence analysts demand rapid information fusion capabilities to develop and maintain accurate situational awareness and understanding of dynamic enemy threats in asymmetric military operations. The ability to extract relationships between people, groups, and locations from a variety of text datasets is critical to proactive decision making. The derived network of entities must be automatically created and presented to analysts to assist in decision making. DECISIVE ANALYTICS Corporation (DAC) provides capabilities to automatically extract entities, relationships between entities, semantic concepts about entities, and network models of entities from text and multi-source datasets. DAC's Natural Language Processing (NLP) Entity Analytics model entities as complex systems of attributes and interrelationships which are extracted from unstructured text via NLP algorithms. The extracted entities are automatically disambiguated via machine learning algorithms, and resolution recommendations are presented to the analyst for validation; the analyst's expertise is leveraged in this hybrid human/computer collaborative model. Military capability is enhanced by these NLP Entity Analytics because analysts can now create/update an entity profile with intelligence automatically extracted from unstructured text, thereby fusing entity knowledge from structured and unstructured data sources. Operational and sustainment costs are reduced since analysts do not have to manually tag and resolve entities.
Implementation of support vector machine for classification of speech marked hijaiyah letters based on Mel frequency cepstrum coefficient feature extraction

NASA Astrophysics Data System (ADS)

Adhi Pradana, Wisnu; Adiwijaya; Novia Wisesty, Untari

2018-03-01

Support Vector Machine or commonly called SVM is one method that can be used to process the classification of a data. SVM classifies data from 2 different classes with hyperplane. In this study, the system was built using SVM to develop Arabic Speech Recognition. In the development of the system, there are 2 kinds of speakers that have been tested that is dependent speakers and independent speakers. The results from this system is an accuracy of 85.32% for speaker dependent and 61.16% for independent speakers.
Auditory and audio-vocal responses of single neurons in the monkey ventral premotor cortex.

PubMed

Hage, Steffen R

2018-03-20

Monkey vocalization is a complex behavioral pattern, which is flexibly used in audio-vocal communication. A recently proposed dual neural network model suggests that cognitive control might be involved in this behavior, originating from a frontal cortical network in the prefrontal cortex and mediated via projections from the rostral portion of the ventral premotor cortex (PMvr) and motor cortex to the primary vocal motor network in the brainstem. For the rapid adjustment of vocal output to external acoustic events, strong interconnections between vocal motor and auditory sites are needed, which are present at cortical and subcortical levels. However, the role of the PMvr in audio-vocal integration processes remains unclear. In the present study, single neurons in the PMvr were recorded in rhesus monkeys (Macaca mulatta) while volitionally producing vocalizations in a visual detection task or passively listening to monkey vocalizations. Ten percent of randomly selected neurons in the PMvr modulated their discharge rate in response to acoustic stimulation with species-specific calls. More than four-fifths of these auditory neurons showed an additional modulation of their discharge rates either before and/or during the monkeys' motor production of the vocalization. Based on these audio-vocal interactions, the PMvr might be well positioned to mediate higher order auditory processing with cognitive control of the vocal motor output to the primary vocal motor network. Such audio-vocal integration processes in the premotor cortex might constitute a precursor for the evolution of complex learned audio-vocal integration systems, ultimately giving rise to human speech. Copyright © 2018 Elsevier B.V. All rights reserved.
Atypical audio-visual speech perception and McGurk effects in children with specific language impairment

PubMed Central

Leybaert, Jacqueline; Macchi, Lucie; Huyse, Aurélie; Champoux, François; Bayard, Clémence; Colin, Cécile; Berthommier, Frédéric

2014-01-01

Audiovisual speech perception of children with specific language impairment (SLI) and children with typical language development (TLD) was compared in two experiments using /aCa/ syllables presented in the context of a masking release paradigm. Children had to repeat syllables presented in auditory alone, visual alone (speechreading), audiovisual congruent and incongruent (McGurk) conditions. Stimuli were masked by either stationary (ST) or amplitude modulated (AM) noise. Although children with SLI were less accurate in auditory and audiovisual speech perception, they showed similar auditory masking release effect than children with TLD. Children with SLI also had less correct responses in speechreading than children with TLD, indicating impairment in phonemic processing of visual speech information. In response to McGurk stimuli, children with TLD showed more fusions in AM noise than in ST noise, a consequence of the auditory masking release effect and of the influence of visual information. Children with SLI did not show this effect systematically, suggesting they were less influenced by visual speech. However, when the visual cues were easily identified, the profile of responses to McGurk stimuli was similar in both groups, suggesting that children with SLI do not suffer from an impairment of audiovisual integration. An analysis of percent of information transmitted revealed a deficit in the children with SLI, particularly for the place of articulation feature. Taken together, the data support the hypothesis of an intact peripheral processing of auditory speech information, coupled with a supra modal deficit of phonemic categorization in children with SLI. Clinical implications are discussed. PMID:24904454
Atypical audio-visual speech perception and McGurk effects in children with specific language impairment.

PubMed

Leybaert, Jacqueline; Macchi, Lucie; Huyse, Aurélie; Champoux, François; Bayard, Clémence; Colin, Cécile; Berthommier, Frédéric

2014-01-01

Audiovisual speech perception of children with specific language impairment (SLI) and children with typical language development (TLD) was compared in two experiments using /aCa/ syllables presented in the context of a masking release paradigm. Children had to repeat syllables presented in auditory alone, visual alone (speechreading), audiovisual congruent and incongruent (McGurk) conditions. Stimuli were masked by either stationary (ST) or amplitude modulated (AM) noise. Although children with SLI were less accurate in auditory and audiovisual speech perception, they showed similar auditory masking release effect than children with TLD. Children with SLI also had less correct responses in speechreading than children with TLD, indicating impairment in phonemic processing of visual speech information. In response to McGurk stimuli, children with TLD showed more fusions in AM noise than in ST noise, a consequence of the auditory masking release effect and of the influence of visual information. Children with SLI did not show this effect systematically, suggesting they were less influenced by visual speech. However, when the visual cues were easily identified, the profile of responses to McGurk stimuli was similar in both groups, suggesting that children with SLI do not suffer from an impairment of audiovisual integration. An analysis of percent of information transmitted revealed a deficit in the children with SLI, particularly for the place of articulation feature. Taken together, the data support the hypothesis of an intact peripheral processing of auditory speech information, coupled with a supra modal deficit of phonemic categorization in children with SLI. Clinical implications are discussed.
Combinatorial Markov Random Fields and Their Applications to Information Organization

DTIC Science & Technology

2008-02-01

titles, part-of- speech tags; • Image processing: images, colors, texture, blobs, interest points, caption words; • Video processing: video signal, audio...McGurk and MacDonald published their pioneering work [80] that revealed the multi-modal nature of speech perception: sound and moving lips compose one... Speech (POS) n-grams (that correspond to the syntactic structure of text). POS n-grams are extracted from sentences in an incremental manner: the first n
Three-Dimensional Audio Client Library

NASA Technical Reports Server (NTRS)

Rizzi, Stephen A.

2005-01-01

The Three-Dimensional Audio Client Library (3DAudio library) is a group of software routines written to facilitate development of both stand-alone (audio only) and immersive virtual-reality application programs that utilize three-dimensional audio displays. The library is intended to enable the development of three-dimensional audio client application programs by use of a code base common to multiple audio server computers. The 3DAudio library calls vendor-specific audio client libraries and currently supports the AuSIM Gold-Server and Lake Huron audio servers. 3DAudio library routines contain common functions for (1) initiation and termination of a client/audio server session, (2) configuration-file input, (3) positioning functions, (4) coordinate transformations, (5) audio transport functions, (6) rendering functions, (7) debugging functions, and (8) event-list-sequencing functions. The 3DAudio software is written in the C++ programming language and currently operates under the Linux, IRIX, and Windows operating systems.
Speech watermarking: an approach for the forensic analysis of digital telephonic recordings.

PubMed

Faundez-Zanuy, Marcos; Lucena-Molina, Jose J; Hagmüller, Martin

2010-07-01

In this article, the authors discuss the problem of forensic authentication of digital audio recordings. Although forensic audio has been addressed in several articles, the existing approaches are focused on analog magnetic recordings, which are less prevalent because of the large amount of digital recorders available on the market (optical, solid state, hard disks, etc.). An approach based on digital signal processing that consists of spread spectrum techniques for speech watermarking is presented. This approach presents the advantage that the authentication is based on the signal itself rather than the recording format. Thus, it is valid for usual recording devices in police-controlled telephone intercepts. In addition, our proposal allows for the introduction of relevant information such as the recording date and time and all the relevant data (this is not always possible with classical systems). Our experimental results reveal that the speech watermarking procedure does not interfere in a significant way with the posterior forensic speaker identification.
A Speech Recognition-based Solution for the Automatic Detection of Mild Cognitive Impairment from Spontaneous Speech

PubMed Central

Tóth, László; Hoffmann, Ildikó; Gosztolya, Gábor; Vincze, Veronika; Szatlóczki, Gréta; Bánréti, Zoltán; Pákáski, Magdolna; Kálmán, János

2018-01-01

Background: Even today the reliable diagnosis of the prodromal stages of Alzheimer’s disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive de-cline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method which is based on the analysis of spontaneous speech production during performing a memory task. In the future, this can form the basis of an Internet-based interactive screening software for the recognition of MCI. Methods: Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. The provoked spontaneous speech by asking the patients to recall the content of 2 short black and white films (one direct, one delayed), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech sig-nals, first manually (using the Praat software), and then automatically, with an automatic speech recogni-tion (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. Results: The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process – that is, using the ASR-based features in combination with machine learning - was able to separate the two classes with an F1-score of 78
A Speech Recognition-based Solution for the Automatic Detection of Mild Cognitive Impairment from Spontaneous Speech.

PubMed

Toth, Laszlo; Hoffmann, Ildiko; Gosztolya, Gabor; Vincze, Veronika; Szatloczki, Greta; Banreti, Zoltan; Pakaski, Magdolna; Kalman, Janos

2018-01-01

Even today the reliable diagnosis of the prodromal stages of Alzheimer's disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method which is based on the analysis of spontaneous speech production during performing a memory task. In the future, this can form the basis of an Internet-based interactive screening software for the recognition of MCI. Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. The provoked spontaneous speech by asking the patients to recall the content of 2 short black and white films (one direct, one delayed), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process - that is, using the ASR-based features in combination with machine learning - was able to separate the two classes with an F1-score of 78.8%. The temporal analysis of spontaneous speech

Spatio-temporal distribution of brain activity associated with audio-visually congruent and incongruent speech and the McGurk Effect.

PubMed

Pratt, Hillel; Bleich, Naomi; Mittelman, Nomi

2015-11-01

Spatio-temporal distributions of cortical activity to audio-visual presentations of meaningless vowel-consonant-vowels and the effects of audio-visual congruence/incongruence, with emphasis on the McGurk effect, were studied. The McGurk effect occurs when a clearly audible syllable with one consonant, is presented simultaneously with a visual presentation of a face articulating a syllable with a different consonant and the resulting percept is a syllable with a consonant other than the auditorily presented one. Twenty subjects listened to pairs of audio-visually congruent or incongruent utterances and indicated whether pair members were the same or not. Source current densities of event-related potentials to the first utterance in the pair were estimated and effects of stimulus-response combinations, brain area, hemisphere, and clarity of visual articulation were assessed. Auditory cortex, superior parietal cortex, and middle temporal cortex were the most consistently involved areas across experimental conditions. Early (<200 msec) processing of the consonant was overall prominent in the left hemisphere, except right hemisphere prominence in superior parietal cortex and secondary visual cortex. Clarity of visual articulation impacted activity in secondary visual cortex and Wernicke's area. McGurk perception was associated with decreased activity in primary and secondary auditory cortices and Wernicke's area before 100 msec, increased activity around 100 msec which decreased again around 180 msec. Activity in Broca's area was unaffected by McGurk perception and was only increased to congruent audio-visual stimuli 30-70 msec following consonant onset. The results suggest left hemisphere prominence in the effects of stimulus and response conditions on eight brain areas involved in dynamically distributed parallel processing of audio-visual integration. Initially (30-70 msec) subcortical contributions to auditory cortex, superior parietal cortex, and middle temporal
Speech-to-Speech Relay Service

MedlinePlus

... are specifically trained in understanding a variety of speech disorders, which enables them to repeat what the caller says in a manner that makes the caller’s words clear and understandable to the ... people with speech disabilities cannot communicate by telephone because the parties ...
Authenticity examination of compressed audio recordings using detection of multiple compression and encoders' identification.

PubMed

Korycki, Rafal

2014-05-01

Since the appearance of digital audio recordings, audio authentication has been becoming increasingly difficult. The currently available technologies and free editing software allow a forger to cut or paste any single word without audible artifacts. Nowadays, the only method referring to digital audio files commonly approved by forensic experts is the ENF criterion. It consists in fluctuation analysis of the mains frequency induced in electronic circuits of recording devices. Therefore, its effectiveness is strictly dependent on the presence of mains signal in the recording, which is a rare occurrence. Recently, much attention has been paid to authenticity analysis of compressed multimedia files and several solutions were proposed for detection of double compression in both digital video and digital audio. This paper addresses the problem of tampering detection in compressed audio files and discusses new methods that can be used for authenticity analysis of digital recordings. Presented approaches consist in evaluation of statistical features extracted from the MDCT coefficients as well as other parameters that may be obtained from compressed audio files. Calculated feature vectors are used for training selected machine learning algorithms. The detection of multiple compression covers up tampering activities as well as identification of traces of montage in digital audio recordings. To enhance the methods' robustness an encoder identification algorithm was developed and applied based on analysis of inherent parameters of compression. The effectiveness of tampering detection algorithms is tested on a predefined large music database consisting of nearly one million of compressed audio files. The influence of compression algorithms' parameters on the classification performance is discussed, based on the results of the current study. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Development of Learning Modules for Machine Shop Occupations. Final Report.

ERIC Educational Resources Information Center

Kent, Randall

This final report contains an eight-page narrative and materials/products of a program to produce (the final) sixty-eight individualized machine shop skill tasks modules (and fifty-two master audio tapes for students with serious reading disabilities). The narrative also describes the determination of the vital few skills used by machine tool…
A voyage to Mars: A challenge to collaboration between man and machines

NASA Technical Reports Server (NTRS)

Statler, Irving C.

1991-01-01

A speech addressing the design of man machine systems for exploration of space beyond Earth orbit from the human factors perspective is presented. Concerns relative to the design of automated and intelligent systems for the NASA Space Exploration Initiative (SEI) missions are largely based on experiences with integrating humans and comparable systems in aviation. The history, present status, and future prospect, of human factors in machine design are discussed in relation to a manned voyage to Mars. Three different cases for design philosophy are presented. The use of simulation is discussed. Recommendations for required research are given.
The Use of Asynchronous Audio Feedback with Online RN-BSN Students

ERIC Educational Resources Information Center

London, Julie E.

2013-01-01

The use of audio technology by online nursing educators is a recent phenomenon. Research has been conducted in the area of audio technology in different domains and populations, but very few researchers have focused on nursing. Preliminary results have indicated that using audio in place of text can increase student cognition and socialization.…
Comprehension of synthetic speech and digitized natural speech by adults with aphasia.

PubMed

Hux, Karen; Knollman-Porter, Kelly; Brown, Jessica; Wallace, Sarah E

2017-09-01

Using text-to-speech technology to provide simultaneous written and auditory content presentation may help compensate for chronic reading challenges if people with aphasia can understand synthetic speech output; however, inherent auditory comprehension challenges experienced by people with aphasia may make understanding synthetic speech difficult. This study's purpose was to compare the preferences and auditory comprehension accuracy of people with aphasia when listening to sentences generated with digitized natural speech, Alex synthetic speech (i.e., Macintosh platform), or David synthetic speech (i.e., Windows platform). The methodology required each of 20 participants with aphasia to select one of four images corresponding in meaning to each of 60 sentences comprising three stimulus sets. Results revealed significantly better accuracy given digitized natural speech than either synthetic speech option; however, individual participant performance analyses revealed three patterns: (a) comparable accuracy regardless of speech condition for 30% of participants, (b) comparable accuracy between digitized natural speech and one, but not both, synthetic speech option for 45% of participants, and (c) greater accuracy with digitized natural speech than with either synthetic speech option for remaining participants. Ranking and Likert-scale rating data revealed a preference for digitized natural speech and David synthetic speech over Alex synthetic speech. Results suggest many individuals with aphasia can comprehend synthetic speech options available on popular operating systems. Further examination of synthetic speech use to support reading comprehension through text-to-speech technology is thus warranted. Copyright © 2017 Elsevier Inc. All rights reserved.
A token centric part-of-speech tagger for biomedical text.

PubMed

Barrett, Neil; Weber-Jahnke, Jens

2014-05-01

Difficulties with part-of-speech (POS) tagging of biomedical text is accessing and annotating appropriate training corpora. These difficulties may result in POS taggers trained on corpora that differ from the tagger's target biomedical text (cross-domain tagging). In such cases where training and target corpora differ tagging accuracy decreases. This paper presents a POS tagger for cross-domain tagging called TcT. TcT estimates a tag's likelihood for a given token by combining token collocation probabilities and the token's tag probabilities calculated using a Naive Bayes classifier. We compared TcT to three POS taggers used in the biomedical domain (mxpost, Brill and TnT). We trained each tagger on a non-biomedical corpus and evaluated it on biomedical corpora. TcT was more accurate in cross-domain tagging than mxpost, Brill and TnT (respective averages 83.9, 81.0, 79.5 and 78.8). Our analysis of tagger performance suggests that lexical differences between corpora have more effect on tagging accuracy than originally considered by previous research work. Biomedical POS tagging algorithms may be modified to improve their cross-domain tagging accuracy without requiring extra training or large training data sets. Future work should reexamine POS tagging methods for biomedical text. This differs from the work to date that has focused on retraining existing POS taggers. Copyright © 2014 Elsevier B.V. All rights reserved.
A Tool for Assessing the Text Legibility of Digital Human Machine Interfaces

DOE Office of Scientific and Technical Information (OSTI.GOV)

Roger Lew; Ronald L. Boring; Thomas A. Ulrich

2015-08-01

A tool intended to aid qualified professionals in the assessment of the legibility of text presented on a digital display is described. The assessment of legibility is primarily for the purposes of designing and analyzing human machine interfaces in accordance with NUREG-0700 and MIL-STD 1472G. The tool addresses shortcomings of existing guidelines by providing more accurate metrics of text legibility with greater sensitivity to design alternatives.
Basic to Applied Research: The Benefits of Audio-Visual Speech Perception Research in Teaching Foreign Languages

ERIC Educational Resources Information Center

Erdener, Dogu

2016-01-01

Traditionally, second language (L2) instruction has emphasised auditory-based instruction methods. However, this approach is restrictive in the sense that speech perception by humans is not just an auditory phenomenon but a multimodal one, and specifically, a visual one as well. In the past decade, experimental studies have shown that the…
Quantum Neural Network Based Machine Translator for Hindi to English

PubMed Central

Singh, V. P.; Chakraverty, S.

2014-01-01

This paper presents the machine learning based machine translation system for Hindi to English, which learns the semantically correct corpus. The quantum neural based pattern recognizer is used to recognize and learn the pattern of corpus, using the information of part of speech of individual word in the corpus, like a human. The system performs the machine translation using its knowledge gained during the learning by inputting the pair of sentences of Devnagri-Hindi and English. To analyze the effectiveness of the proposed approach, 2600 sentences have been evaluated during simulation and evaluation. The accuracy achieved on BLEU score is 0.7502, on NIST score is 6.5773, on ROUGE-L score is 0.9233, and on METEOR score is 0.5456, which is significantly higher in comparison with Google Translation and Bing Translation for Hindi to English Machine Translation. PMID:24977198
Quantum neural network based machine translator for Hindi to English.

PubMed

Narayan, Ravi; Singh, V P; Chakraverty, S

2014-01-01

This paper presents the machine learning based machine translation system for Hindi to English, which learns the semantically correct corpus. The quantum neural based pattern recognizer is used to recognize and learn the pattern of corpus, using the information of part of speech of individual word in the corpus, like a human. The system performs the machine translation using its knowledge gained during the learning by inputting the pair of sentences of Devnagri-Hindi and English. To analyze the effectiveness of the proposed approach, 2600 sentences have been evaluated during simulation and evaluation. The accuracy achieved on BLEU score is 0.7502, on NIST score is 6.5773, on ROUGE-L score is 0.9233, and on METEOR score is 0.5456, which is significantly higher in comparison with Google Translation and Bing Translation for Hindi to English Machine Translation.
A Robust Approach For Acoustic Noise Suppression In Speech Using ANFIS

NASA Astrophysics Data System (ADS)

Martinek, Radek; Kelnar, Michal; Vanus, Jan; Bilik, Petr; Zidek, Jan

2015-11-01

The authors of this article deals with the implementation of a combination of techniques of the fuzzy system and artificial intelligence in the application area of non-linear noise and interference suppression. This structure used is called an Adaptive Neuro Fuzzy Inference System (ANFIS). This system finds practical use mainly in audio telephone (mobile) communication in a noisy environment (transport, production halls, sports matches, etc). Experimental methods based on the two-input adaptive noise cancellation concept was clearly outlined. Within the experiments carried out, the authors created, based on the ANFIS structure, a comprehensive system for adaptive suppression of unwanted background interference that occurs in audio communication and degrades the audio signal. The system designed has been tested on real voice signals. This article presents the investigation and comparison amongst three distinct approaches to noise cancellation in speech; they are LMS (least mean squares) and RLS (recursive least squares) adaptive filtering and ANFIS. A careful review of literatures indicated the importance of non-linear adaptive algorithms over linear ones in noise cancellation. It was concluded that the ANFIS approach had the overall best performance as it efficiently cancelled noise even in highly noise-degraded speech. Results were drawn from the successful experimentation, subjective-based tests were used to analyse their comparative performance while objective tests were used to validate them. Implementation of algorithms was experimentally carried out in Matlab to justify the claims and determine their relative performances.
Reviewing the connection between speech and obstructive sleep apnea.

PubMed

Espinoza-Cuadros, Fernando; Fernández-Pozo, Rubén; Toledano, Doroteo T; Alcázar-Ramírez, José D; López-Gonzalo, Eduardo; Hernández-Gómez, Luis A

2016-02-20

Sleep apnea (OSA) is a common sleep disorder characterized by recurring breathing pauses during sleep caused by a blockage of the upper airway (UA). The altered UA structure or function in OSA speakers has led to hypothesize the automatic analysis of speech for OSA assessment. In this paper we critically review several approaches using speech analysis and machine learning techniques for OSA detection, and discuss the limitations that can arise when using machine learning techniques for diagnostic applications. A large speech database including 426 male Spanish speakers suspected to suffer OSA and derived to a sleep disorders unit was used to study the clinical validity of several proposals using machine learning techniques to predict the apnea-hypopnea index (AHI) or classify individuals according to their OSA severity. AHI describes the severity of patients' condition. We first evaluate AHI prediction using state-of-the-art speaker recognition technologies: speech spectral information is modelled using supervectors or i-vectors techniques, and AHI is predicted through support vector regression (SVR). Using the same database we then critically review several OSA classification approaches previously proposed. The influence and possible interference of other clinical variables or characteristics available for our OSA population: age, height, weight, body mass index, and cervical perimeter, are also studied. The poor results obtained when estimating AHI using supervectors or i-vectors followed by SVR contrast with the positive results reported by previous research. This fact prompted us to a careful review of these approaches, also testing some reported results over our database. Several methodological limitations and deficiencies were detected that may have led to overoptimistic results. The methodological deficiencies observed after critically reviewing previous research can be relevant examples of potential pitfalls when using machine learning techniques for
2009 PEPNet Postsecondary Interpreting and Speech-to-Text Survey Summary. Advancing Educational Opportunities for People Who Are Deaf or Hard of Hearing

ERIC Educational Resources Information Center

Riehl, Bambi

2010-01-01

PEPNet gets frequent requests for information about interpreter/speech-to-text position development. Determining or analyzing a salary is often part of this challenge and the people at PEPNet trust the information in this paper will be helpful to postsecondary programs. This is the sixth survey produced by PEPNet-Midwest and the University of…
Online EEG Classification of Covert Speech for Brain-Computer Interfacing.

PubMed

Sereshkeh, Alborz Rezazadeh; Trott, Robert; Bricout, Aurélien; Chau, Tom

2017-12-01

Brain-computer interfaces (BCIs) for communication can be nonintuitive, often requiring the performance of hand motor imagery or some other conversation-irrelevant task. In this paper, electroencephalography (EEG) was used to develop two intuitive online BCIs based solely on covert speech. The goal of the first BCI was to differentiate between 10[Formula: see text]s of mental repetitions of the word "no" and an equivalent duration of unconstrained rest. The second BCI was designed to discern between 10[Formula: see text]s each of covert repetition of the words "yes" and "no". Twelve participants used these two BCIs to answer yes or no questions. Each participant completed four sessions, comprising two offline training sessions and two online sessions, one for testing each of the BCIs. With a support vector machine and a combination of spectral and time-frequency features, an average accuracy of [Formula: see text] was reached across participants in the online classification of no versus rest, with 10 out of 12 participants surpassing the chance level (60.0% for [Formula: see text]). The online classification of yes versus no yielded an average accuracy of [Formula: see text], with eight participants exceeding the chance level. Task-specific changes in EEG beta and gamma power in language-related brain areas tended to provide discriminatory information. To our knowledge, this is the first report of online EEG classification of covert speech. Our findings support further study of covert speech as a BCI activation task, potentially leading to the development of more intuitive BCIs for communication.
Comparing live to recorded speech in training the perception of spectrally shifted noise-vocoded speech.

PubMed

Faulkner, Andrew; Rosen, Stuart; Green, Tim

2012-10-01

Two experimental groups were trained for 2 h with live or recorded speech that was noise-vocoded and spectrally shifted and was from the same text and talker. These two groups showed equivalent improvements in performance for vocoded and shifted sentences, and the group trained with recorded speech showed consistently greater improvements than untrained controls. Another group trained with unshifted noise-vocoded speech improved no more than untrained controls. Computer-based training thus appears at least as effective as labor-intensive live-voice training for improving the perception of spectrally shifted noise-vocoded speech, and by implication, for training of users of cochlear implants.
Improved Open-Microphone Speech Recognition

NASA Astrophysics Data System (ADS)

Abrash, Victor

2002-12-01

Many current and future NASA missions make extreme demands on mission personnel both in terms of work load and in performing under difficult environmental conditions. In situations where hands are impeded or needed for other tasks, eyes are busy attending to the environment, or tasks are sufficiently complex that ease of use of the interface becomes critical, spoken natural language dialog systems offer unique input and output modalities that can improve efficiency and safety. They also offer new capabilities that would not otherwise be available. For example, many NASA applications require astronauts to use computers in micro-gravity or while wearing space suits. Under these circumstances, command and control systems that allow users to issue commands or enter data in hands-and eyes-busy situations become critical. Speech recognition technology designed for current commercial applications limits the performance of the open-ended state-of-the-art dialog systems being developed at NASA. For example, today's recognition systems typically listen to user input only during short segments of the dialog, and user input outside of these short time windows is lost. Mistakes detecting the start and end times of user utterances can lead to mistakes in the recognition output, and the dialog system as a whole has no way to recover from this, or any other, recognition error. Systems also often require the user to signal when that user is going to speak, which is impractical in a hands-free environment, or only allow a system-initiated dialog requiring the user to speak immediately following a system prompt. In this project, SRI has developed software to enable speech recognition in a hands-free, open-microphone environment, eliminating the need for a push-to-talk button or other signaling mechanism. The software continuously captures a user's speech and makes it available to one or more recognizers. By constantly monitoring and storing the audio stream, it provides the spoken
Improved Open-Microphone Speech Recognition

NASA Technical Reports Server (NTRS)

Abrash, Victor

2002-01-01

Many current and future NASA missions make extreme demands on mission personnel both in terms of work load and in performing under difficult environmental conditions. In situations where hands are impeded or needed for other tasks, eyes are busy attending to the environment, or tasks are sufficiently complex that ease of use of the interface becomes critical, spoken natural language dialog systems offer unique input and output modalities that can improve efficiency and safety. They also offer new capabilities that would not otherwise be available. For example, many NASA applications require astronauts to use computers in micro-gravity or while wearing space suits. Under these circumstances, command and control systems that allow users to issue commands or enter data in hands-and eyes-busy situations become critical. Speech recognition technology designed for current commercial applications limits the performance of the open-ended state-of-the-art dialog systems being developed at NASA. For example, today's recognition systems typically listen to user input only during short segments of the dialog, and user input outside of these short time windows is lost. Mistakes detecting the start and end times of user utterances can lead to mistakes in the recognition output, and the dialog system as a whole has no way to recover from this, or any other, recognition error. Systems also often require the user to signal when that user is going to speak, which is impractical in a hands-free environment, or only allow a system-initiated dialog requiring the user to speak immediately following a system prompt. In this project, SRI has developed software to enable speech recognition in a hands-free, open-microphone environment, eliminating the need for a push-to-talk button or other signaling mechanism. The software continuously captures a user's speech and makes it available to one or more recognizers. By constantly monitoring and storing the audio stream, it provides the spoken
User Evaluation of a Communication System That Automatically Generates Captions to Improve Telephone Communication

PubMed Central

Zekveld, Adriana A.; Kramer, Sophia E.; Kessens, Judith M.; Vlaming, Marcel S. M. G.; Houtgast, Tammo

2009-01-01

This study examined the subjective benefit obtained from automatically generated captions during telephone-speech comprehension in the presence of babble noise. Short stories were presented by telephone either with or without captions that were generated offline by an automatic speech recognition (ASR) system. To simulate online ASR, the word accuracy (WA) level of the captions was 60% or 70% and the text was presented delayed to the speech. After each test, the hearing impaired participants (n = 20) completed the NASA-Task Load Index and several rating scales evaluating the support from the captions. Participants indicated that using the erroneous text in speech comprehension was difficult and the reported task load did not differ between the audio + text and audio-only conditions. In a follow-up experiment (n = 10), the perceived benefit of presenting captions increased with an increase of WA levels to 80% and 90%, and elimination of the text delay. However, in general, the task load did not decrease when captions were presented. These results suggest that the extra effort required to process the text could have been compensated for by less effort required to comprehend the speech. Future research should aim at reducing the complexity of the task to increase the willingness of hearing impaired persons to use an assistive communication system automatically providing captions. The current results underline the need for obtaining both objective and subjective measures of benefit when evaluating assistive communication systems. PMID:19126551

Joint Spatial-Spectral Feature Space Clustering for Speech Activity Detection from ECoG Signals

PubMed Central

Kanas, Vasileios G.; Mporas, Iosif; Benz, Heather L.; Sgarbas, Kyriakos N.; Bezerianos, Anastasios; Crone, Nathan E.

2014-01-01

Brain machine interfaces for speech restoration have been extensively studied for more than two decades. The success of such a system will depend in part on selecting the best brain recording sites and signal features corresponding to speech production. The purpose of this study was to detect speech activity automatically from electrocorticographic signals based on joint spatial-frequency clustering of the ECoG feature space. For this study, the ECoG signals were recorded while a subject performed two different syllable repetition tasks. We found that the optimal frequency resolution to detect speech activity from ECoG signals was 8 Hz, achieving 98.8% accuracy by employing support vector machines (SVM) as a classifier. We also defined the cortical areas that held the most information about the discrimination of speech and non-speech time intervals. Additionally, the results shed light on the distinct cortical areas associated with the two syllable repetition tasks and may contribute to the development of portable ECoG-based communication. PMID:24658248
Design of an efficient music-speech discriminator.

PubMed

Tardón, Lorenzo J; Sammartino, Simone; Barbancho, Isabel

2010-01-01

In this paper, the problem of the design of a simple and efficient music-speech discriminator for large audio data sets in which advanced music playing techniques are taught and voice and music are intrinsically interleaved is addressed. In the process, a number of features used in speech-music discrimination are defined and evaluated over the available data set. Specifically, the data set contains pieces of classical music played with different and unspecified instruments (or even lyrics) and the voice of a teacher (a top music performer) or even the overlapped voice of the translator and other persons. After an initial test of the performance of the features implemented, a selection process is started, which takes into account the type of classifier selected beforehand, to achieve good discrimination performance and computational efficiency, as shown in the experiments. The discrimination application has been defined and tested on a large data set supplied by Fundacion Albeniz, containing a large variety of classical music pieces played with different instrument, which include comments and speeches of famous performers.
Resource Guide for Persons with Speech or Language Impairments.

ERIC Educational Resources Information Center

IBM, Atlanta, GA. National Support Center for Persons with Disabilities.

The resource guide identifies products which assist speech or language impaired individuals in accessing IBM (International Business Machine) Personal Computers or the IBM Personal System/2 family of products. An introduction provides a general overview of ways computers can help persons with speech or language handicaps. The document then…
Applications of Speech-to-Text Recognition and Computer-Aided Translation for Facilitating Cross-Cultural Learning through a Learning Activity: Issues and Their Solutions

ERIC Educational Resources Information Center

Shadiev, Rustam; Wu, Ting-Ting; Sun, Ai; Huang, Yueh-Min

2018-01-01

In this study, 21 university students, who represented thirteen nationalities, participated in an online cross-cultural learning activity. The participants were engaged in interactions and exchanges carried out on Facebook® and Skype® platforms, and their multilingual communications were supported by speech-to-text recognition (STR) and…
An Evaluation of Output Signal to Noise Ratio as a Predictor of Cochlear Implant Speech Intelligibility.

PubMed

Watkins, Greg D; Swanson, Brett A; Suaning, Gregg J

2018-02-22

Cochlear implant (CI) sound processing strategies are usually evaluated in clinical studies involving experienced implant recipients. Metrics which estimate the capacity to perceive speech for a given set of audio and processing conditions provide an alternative means to assess the effectiveness of processing strategies. The aim of this research was to assess the ability of the output signal to noise ratio (OSNR) to accurately predict speech perception. It was hypothesized that compared with the other metrics evaluated in this study (1) OSNR would have equivalent or better accuracy and (2) OSNR would be the most accurate in the presence of variable levels of speech presentation. For the first time, the accuracy of OSNR as a metric which predicts speech intelligibility was compared, in a retrospective study, with that of the input signal to noise ratio (ISNR) and the short-term objective intelligibility (STOI) metric. Because STOI measured audio quality at the input to a CI sound processor, a vocoder was applied to the sound processor output and STOI was also calculated for the reconstructed audio signal (vocoder short-term objective intelligibility [VSTOI] metric). The figures of merit calculated for each metric were Pearson correlation of the metric and a psychometric function fitted to sentence scores at each predictor value (Pearson sigmoidal correlation [PSIG]), epsilon insensitive root mean square error (RMSE*) of the psychometric function and the sentence scores, and the statistical deviance of the fitted curve to the sentence scores (D). Sentence scores were taken from three existing data sets of Australian Sentence Tests in Noise results. The AuSTIN tests were conducted with experienced users of the Nucleus CI system. The score for each sentence was the proportion of morphemes the participant correctly repeated. In data set 1, all sentences were presented at 65 dB sound pressure level (SPL) in the presence of four-talker Babble noise. Each block of
Automatic lip reading by using multimodal visual features

NASA Astrophysics Data System (ADS)

Takahashi, Shohei; Ohya, Jun

2013-12-01

Since long time ago, speech recognition has been researched, though it does not work well in noisy places such as in the car or in the train. In addition, people with hearing-impaired or difficulties in hearing cannot receive benefits from speech recognition. To recognize the speech automatically, visual information is also important. People understand speeches from not only audio information, but also visual information such as temporal changes in the lip shape. A vision based speech recognition method could work well in noisy places, and could be useful also for people with hearing disabilities. In this paper, we propose an automatic lip-reading method for recognizing the speech by using multimodal visual information without using any audio information such as speech recognition. First, the ASM (Active Shape Model) is used to track and detect the face and lip in a video sequence. Second, the shape, optical flow and spatial frequencies of the lip features are extracted from the lip detected by ASM. Next, the extracted multimodal features are ordered chronologically so that Support Vector Machine is performed in order to learn and classify the spoken words. Experiments for classifying several words show promising results of this proposed method.
Towards Artificial Speech Therapy: A Neural System for Impaired Speech Segmentation.

PubMed

Iliya, Sunday; Neri, Ferrante

2016-09-01

This paper presents a neural system-based technique for segmenting short impaired speech utterances into silent, unvoiced, and voiced sections. Moreover, the proposed technique identifies those points of the (voiced) speech where the spectrum becomes steady. The resulting technique thus aims at detecting that limited section of the speech which contains the information about the potential impairment of the speech. This section is of interest to the speech therapist as it corresponds to the possibly incorrect movements of speech organs (lower lip and tongue with respect to the vocal tract). Two segmentation models to detect and identify the various sections of the disordered (impaired) speech signals have been developed and compared. The first makes use of a combination of four artificial neural networks. The second is based on a support vector machine (SVM). The SVM has been trained by means of an ad hoc nested algorithm whose outer layer is a metaheuristic while the inner layer is a convex optimization algorithm. Several metaheuristics have been tested and compared leading to the conclusion that some variants of the compact differential evolution (CDE) algorithm appears to be well-suited to address this problem. Numerical results show that the SVM model with a radial basis function is capable of effective detection of the portion of speech that is of interest to a therapist. The best performance has been achieved when the system is trained by the nested algorithm whose outer layer is hybrid-population-based/CDE. A population-based approach displays the best performance for the isolation of silence/noise sections, and the detection of unvoiced sections. On the other hand, a compact approach appears to be clearly well-suited to detect the beginning of the steady state of the voiced signal. Both the proposed segmentation models display outperformed two modern segmentation techniques based on Gaussian mixture model and deep learning.
Vector excitation speech or audio coder for transmission or storage

NASA Technical Reports Server (NTRS)

Davidson, Grant (Inventor); Gersho, Allen (Inventor)

1989-01-01

A vector excitation coder compresses vectors by using an optimum codebook designed off line, using an initial arbitrary codebook and a set of speech training vectors exploiting codevector sparsity (i.e., by making zero all but a selected number of samples of lowest amplitude in each of N codebook vectors). A fast-search method selects a number N.sub.c of good excitation vectors from the codebook, where N.sub.c is much smaller tha ORIGIN OF INVENTION The invention described herein was made in the performance of work under a NASA contract, and is subject to the provisions of Public Law 96-517 (35 USC 202) under which the inventors were granted a request to retain title.
Audio-based bolt-loosening detection technique of bolt joint

NASA Astrophysics Data System (ADS)

Zhang, Yang; Zhao, Xuefeng; Su, Wensheng; Xue, Zhigang

2018-03-01

Bolt joint, as the commonest coupling structure, is widely used in electro-mechanical system. However, it is the weakest part of the whole system. The increase of preload tension force can raise the reliability and strength of the bolt joint. Therefore, the pretension force is one of the most important factors to ensure the stability of bolt joint. According to the way of generating pretension force, the pretension force can be monitored by bolt torque, degrees and elongation. The existing bolt-loosening monitoring methods all require expensive equipment, which greatly restricts the practicality of the bolt-loosening monitoring. In this paper, a new method of bolt-loosening detection technique based on audio is proposed. The sound that bolt is hit by a hammer is recorded on the Smartphone, and the collected audio signal is classified and identified by support vector machine algorithm. First, a verification test was designed and the results show that this new method can identify the damage of bolt looseness accurately. Second, a variety of bolt-loosening was identified. The results indicate that this method has a high accuracy in multiclass classification of the bolt looseness. This bolt-loosening detection technique based on audio not only can reduce the requirements of technical and professional experience, but also make bolt-loosening monitoring simpler and easier.
Teaching mindfulness meditation to adults with severe speech and physical impairments: An exploratory study.

PubMed

Goodrich, Elena; Wahbeh, Helané; Mooney, Aimee; Miller, Meghan; Oken, Barry S

2015-01-01

People with severe speech and physical impairments may benefit from mindfulness meditation training because it has the potential to enhance their ability to cope with anxiety, depression and pain and improve their attentional capacity to use brain-computer interface systems. Seven adults with severe speech and physical impairments (SSPI) - defined as speech that is understood less than 25% of the time and/or severely reduced hand function for writing/typing - participated in this exploratory, uncontrolled intervention study. The objectives were to describe the development and implementation of a six-week mindfulness meditation intervention and to identify feasible outcome measures in this population. The weekly intervention was delivered by an instructor in the participant's home, and participants were encouraged to practise daily using audio recordings. The objective adherence to home practice was 10.2 minutes per day. Exploratory outcome measures were an n-back working memory task, the Attention Process Training-II Attention Questionnaire, the Pittsburgh Sleep Quality Index, the Perceived Stress Scale, the Positive and Negative Affect Schedule, and a qualitative feedback survey. There were no statistically significant pre-post results in this small sample, yet administration of the measures proved feasible, and qualitative reports were overall positive. Obstacles to teaching mindfulness meditation to persons with SSPI are reported, and solutions are proposed.
Multimedia Classifier

NASA Astrophysics Data System (ADS)

Costache, G. N.; Gavat, I.

2004-09-01

Along with the aggressive growing of the amount of digital data available (text, audio samples, digital photos and digital movies joined all in the multimedia domain) the need for classification, recognition and retrieval of this kind of data became very important. In this paper will be presented a system structure to handle multimedia data based on a recognition perspective. The main processing steps realized for the interesting multimedia objects are: first, the parameterization, by analysis, in order to obtain a description based on features, forming the parameter vector; second, a classification, generally with a hierarchical structure to make the necessary decisions. For audio signals, both speech and music, the derived perceptual features are the melcepstral (MFCC) and the perceptual linear predictive (PLP) coefficients. For images, the derived features are the geometric parameters of the speaker mouth. The hierarchical classifier consists generally in a clustering stage, based on the Kohonnen Self-Organizing Maps (SOM) and a final stage, based on a powerful classification algorithm called Support Vector Machines (SVM). The system, in specific variants, is applied with good results in two tasks: the first, is a bimodal speech recognition which uses features obtained from speech signal fused to features obtained from speaker's image and the second is a music retrieval from large music database.
Neural networks supporting audiovisual integration for speech: A large-scale lesion study.

PubMed

Hickok, Gregory; Rogalsky, Corianne; Matchin, William; Basilakos, Alexandra; Cai, Julia; Pillay, Sara; Ferrill, Michelle; Mickelsen, Soren; Anderson, Steven W; Love, Tracy; Binder, Jeffrey; Fridriksson, Julius

2018-06-01

Auditory and visual speech information are often strongly integrated resulting in perceptual enhancements for audiovisual (AV) speech over audio alone and sometimes yielding compelling illusory fusion percepts when AV cues are mismatched, the McGurk-MacDonald effect. Previous research has identified three candidate regions thought to be critical for AV speech integration: the posterior superior temporal sulcus (STS), early auditory cortex, and the posterior inferior frontal gyrus. We assess the causal involvement of these regions (and others) in the first large-scale (N = 100) lesion-based study of AV speech integration. Two primary findings emerged. First, behavioral performance and lesion maps for AV enhancement and illusory fusion measures indicate that classic metrics of AV speech integration are not necessarily measuring the same process. Second, lesions involving superior temporal auditory, lateral occipital visual, and multisensory zones in the STS are the most disruptive to AV speech integration. Further, when AV speech integration fails, the nature of the failure-auditory vs visual capture-can be predicted from the location of the lesions. These findings show that AV speech processing is supported by unimodal auditory and visual cortices as well as multimodal regions such as the STS at their boundary. Motor related frontal regions do not appear to play a role in AV speech integration. Copyright © 2018 Elsevier Ltd. All rights reserved.
How should a speech recognizer work?

PubMed

Scharenborg, Odette; Norris, Dennis; Bosch, Louis; McQueen, James M

2005-11-12

Although researchers studying human speech recognition (HSR) and automatic speech recognition (ASR) share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of communication follows largely from the fact that research in these related fields has focused on the mechanics of how speech can be recognized. In Marr's (1982) terms, emphasis has been on the algorithmic and implementational levels rather than on the computational level. In this article, we provide a computational-level analysis of the task of speech recognition, which reveals the close parallels between research concerned with HSR and ASR. We illustrate this relation by presenting a new computational model of human spoken-word recognition, built using techniques from the field of ASR that, in contrast to current existing models of HSR, recognizes words from real speech input. 2005 Lawrence Erlbaum Associates, Inc.
The Bilingual Language Interaction Network for Comprehension of Speech*

PubMed Central

Marian, Viorica

2013-01-01

During speech comprehension, bilinguals co-activate both of their languages, resulting in cross-linguistic interaction at various levels of processing. This interaction has important consequences for both the structure of the language system and the mechanisms by which the system processes spoken language. Using computational modeling, we can examine how cross-linguistic interaction affects language processing in a controlled, simulated environment. Here we present a connectionist model of bilingual language processing, the Bilingual Language Interaction Network for Comprehension of Speech (BLINCS), wherein interconnected levels of processing are created using dynamic, self-organizing maps. BLINCS can account for a variety of psycholinguistic phenomena, including cross-linguistic interaction at and across multiple levels of processing, cognate facilitation effects, and audio-visual integration during speech comprehension. The model also provides a way to separate two languages without requiring a global language-identification system. We conclude that BLINCS serves as a promising new model of bilingual spoken language comprehension. PMID:24363602
Audio-visual biofeedback for respiratory-gated radiotherapy: Impact of audio instruction and audio-visual biofeedback on respiratory-gated radiotherapy

DOE Office of Scientific and Technical Information (OSTI.GOV)

George, Rohini; Department of Biomedical Engineering, Virginia Commonwealth University, Richmond, VA; Chung, Theodore D.

2006-07-01

Purpose: Respiratory gating is a commercially available technology for reducing the deleterious effects of motion during imaging and treatment. The efficacy of gating is dependent on the reproducibility within and between respiratory cycles during imaging and treatment. The aim of this study was to determine whether audio-visual biofeedback can improve respiratory reproducibility by decreasing residual motion and therefore increasing the accuracy of gated radiotherapy. Methods and Materials: A total of 331 respiratory traces were collected from 24 lung cancer patients. The protocol consisted of five breathing training sessions spaced about a week apart. Within each session the patients initially breathedmore » without any instruction (free breathing), with audio instructions and with audio-visual biofeedback. Residual motion was quantified by the standard deviation of the respiratory signal within the gating window. Results: Audio-visual biofeedback significantly reduced residual motion compared with free breathing and audio instruction. Displacement-based gating has lower residual motion than phase-based gating. Little reduction in residual motion was found for duty cycles less than 30%; for duty cycles above 50% there was a sharp increase in residual motion. Conclusions: The efficiency and reproducibility of gating can be improved by: incorporating audio-visual biofeedback, using a 30-50% duty cycle, gating during exhalation, and using displacement-based gating.« less
The Use of Audio and Animation in Computer Based Instruction.

ERIC Educational Resources Information Center

Koroghlanian, Carol; Klein, James D.

This study investigated the effects of audio, animation, and spatial ability in a computer-based instructional program for biology. The program presented instructional material via test or audio with lean text and included eight instructional sequences presented either via static illustrations or animations. High school students enrolled in a…
Current trends in small vocabulary speech recognition for equipment control

NASA Astrophysics Data System (ADS)

Doukas, Nikolaos; Bardis, Nikolaos G.

2017-09-01

Speech recognition systems allow human - machine communication to acquire an intuitive nature that approaches the simplicity of inter - human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit by the use of robust voice operated control components, as they would facilitate the interaction with their users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
Perception of Audio-Visual Speech Synchrony in Spanish-Speaking Children with and without Specific Language Impairment

ERIC Educational Resources Information Center

Pons, Ferran; Andreu, Llorenc; Sanz-Torrent, Monica; Buil-Legaz, Lucia; Lewkowicz, David J.

2013-01-01

Speech perception involves the integration of auditory and visual articulatory information, and thus requires the perception of temporal synchrony between this information. There is evidence that children with specific language impairment (SLI) have difficulty with auditory speech perception but it is not known if this is also true for the…
Breath Group Analysis for Reading and Spontaneous Speech in Healthy Adults

PubMed Central

Wang, Yu-Tsai; Green, Jordan R.; Nip, Ignatius S.B.; Kent, Ray D.; Kent, Jane Finley

2010-01-01

Aims The breath group can serve as a functional unit to define temporal and fundamental frequency (f0) features in continuous speech. These features of the breath group are determined by the physiologic, linguistic, and cognitive demands of communication. Reading and spontaneous speech are two speaking tasks that vary in these demands and are commonly used to evaluate speech performance for research and clinical applications. The purpose of this study is to examine differences between reading and spontaneous speech in the temporal and f0 aspects of their breath groups. Methods Sixteen participants read two passages and answered six questions while wearing a circumferentially vented mask connected to a pneumotach. The aerodynamic signal was used to identify inspiratory locations. The audio signal was used to analyze task differences in breath group structure, including temporal and f0 components. Results The main findings were that spontaneous speech task exhibited significantly more grammatically inappropriate breath group locations and longer breath group duration than did the passage reading task. Conclusion The task differences in the percentage of grammatically inadequate breath group locations and in breath group duration for healthy adult speakers partly explain the differences in cognitive-linguistic load between the passage reading and spontaneous speech. PMID:20588052
Breath group analysis for reading and spontaneous speech in healthy adults.

PubMed

Wang, Yu-Tsai; Green, Jordan R; Nip, Ignatius S B; Kent, Ray D; Kent, Jane Finley

2010-01-01

The breath group can serve as a functional unit to define temporal and fundamental frequency (f0) features in continuous speech. These features of the breath group are determined by the physiologic, linguistic, and cognitive demands of communication. Reading and spontaneous speech are two speaking tasks that vary in these demands and are commonly used to evaluate speech performance for research and clinical applications. The purpose of this study is to examine differences between reading and spontaneous speech in the temporal and f0 aspects of their breath groups. Sixteen participants read two passages and answered six questions while wearing a circumferentially vented mask connected to a pneumotach. The aerodynamic signal was used to identify inspiratory locations. The audio signal was used to analyze task differences in breath group structure, including temporal and f0 components. The main findings were that spontaneous speech task exhibited significantly more grammatically inappropriate breath group locations and longer breath group duration than did the passage reading task. The task differences in the percentage of grammatically inadequate breath group locations and in breath group duration for healthy adult speakers partly explain the differences in cognitive-linguistic load between the passage reading and spontaneous speech. Copyright © 2010 S. Karger AG, Basel.

Audio Tracking in Noisy Environments by Acoustic Map and Spectral Signature.

PubMed

Crocco, Marco; Martelli, Samuele; Trucco, Andrea; Zunino, Andrea; Murino, Vittorio

2018-05-01

A novel method is proposed for generic target tracking by audio measurements from a microphone array. To cope with noisy environments characterized by persistent and high energy interfering sources, a classification map (CM) based on spectral signatures is calculated by means of a machine learning algorithm. Next, the CM is combined with the acoustic map, describing the spatial distribution of sound energy, in order to obtain a cleaned joint map in which contributions from the disturbing sources are removed. A likelihood function is derived from this map and fed to a particle filter yielding the target location estimation on the acoustic image. The method is tested on two real environments, addressing both speaker and vehicle tracking. The comparison with a couple of trackers, relying on the acoustic map only, shows a sharp improvement in performance, paving the way to the application of audio tracking in real challenging environments.
Re-Presenting Subversive Songs: Applying Strategies for Invention and Arrangement to Nontraditional Speech Texts

ERIC Educational Resources Information Center

Charlesworth, Dacia

2010-01-01

Invention deals with the content of a speech, arrangement involves placing the content in an order that is most strategic, style focuses on selecting linguistic devices, such as metaphor, to make the message more appealing, memory assists the speaker in delivering the message correctly, and delivery ideally enables great reception of the message.…
Bringing text display digital radio to consumers with hearing loss.

PubMed

Sheffield, Ellyn G; Starling, Michael; Schwab, Daniel

2011-01-01

Radio is migrating to digital transmission, expanding its offerings to include captioning for individuals with hearing loss. Text display radio requires a large amount of word throughput with minimal screen display area, making good user interface design crucial to its success. In two experiments, we presented hearing, hard-of-hearing, and deaf consumers with National Public Radio stories converted to text and examined their preferences for and reactions to midsized and small radio text displays. We focused on physical display attributes such as text color, font style, line length, and scrolling type as well as emergency alert messages and emergency prompts for drivers, announcer identification schemes, and synchronization of audio and text. Results suggest that midsized, Global Positioning System (GPS)-style displays were well liked, synchronization of audio and text was important to comprehension and retrieval of story details, identification of announcers was served best with a combination of name change in parenthesis and color change, and a mixture of color and flashing symbols was preferred for emergency alerting.
Hands-free human-machine interaction with voice

NASA Astrophysics Data System (ADS)

Juang, B. H.

2004-05-01

Voice is natural communication interface between a human and a machine. The machine, when placed in today's communication networks, may be configured to provide automation to save substantial operating cost, as demonstrated in AT&T's VRCP (Voice Recognition Call Processing), or to facilitate intelligent services, such as virtual personal assistants, to enhance individual productivity. These intelligent services often need to be accessible anytime, anywhere (e.g., in cars when the user is in a hands-busy-eyes-busy situation or during meetings where constantly talking to a microphone is either undersirable or impossible), and thus call for advanced signal processing and automatic speech recognition techniques which support what we call ``hands-free'' human-machine communication. These techniques entail a broad spectrum of technical ideas, ranging from use of directional microphones and acoustic echo cancellatiion to robust speech recognition. In this talk, we highlight a number of key techniques that were developed for hands-free human-machine communication in the mid-1990s after Bell Labs became a unit of Lucent Technologies. A video clip will be played to demonstrate the accomplishement.
Audio-visual speech perception in infants and toddlers with Down syndrome, fragile X syndrome, and Williams syndrome.

PubMed

D'Souza, Dean; D'Souza, Hana; Johnson, Mark H; Karmiloff-Smith, Annette

2016-08-01

Typically-developing (TD) infants can construct unified cross-modal percepts, such as a speaking face, by integrating auditory-visual (AV) information. This skill is a key building block upon which higher-level skills, such as word learning, are built. Because word learning is seriously delayed in most children with neurodevelopmental disorders, we assessed the hypothesis that this delay partly results from a deficit in integrating AV speech cues. AV speech integration has rarely been investigated in neurodevelopmental disorders, and never previously in infants. We probed for the McGurk effect, which occurs when the auditory component of one sound (/ba/) is paired with the visual component of another sound (/ga/), leading to the perception of an illusory third sound (/da/ or /tha/). We measured AV integration in 95 infants/toddlers with Down, fragile X, or Williams syndrome, whom we matched on Chronological and Mental Age to 25 TD infants. We also assessed a more basic AV perceptual ability: sensitivity to matching vs. mismatching AV speech stimuli. Infants with Williams syndrome failed to demonstrate a McGurk effect, indicating poor AV speech integration. Moreover, while the TD children discriminated between matching and mismatching AV stimuli, none of the other groups did, hinting at a basic deficit or delay in AV speech processing, which is likely to constrain subsequent language development. Copyright © 2016 Elsevier Inc. All rights reserved.
Effects of text-to-speech software use on the reading proficiency of high school struggling readers.

PubMed

Park, Hye Jin; Takahashi, Kiriko; Roberts, Kelly D; Delise, Danielle

2017-01-01

The literature highlights the benefits of text-to-speech (TTS) software when used as an assistive technology facilitating struggling readers' access to print. However, the effects of TTS software use, upon students' unassisted reading proficiency, have remained relatively unexplored. The researchers utilized an experimental design to investigate whether 9th grade struggling readers who use TTS software to read course materials demonstrate significant improvements in unassisted reading performance. A total of 164 students of 30 teachers in Hawaii participated in the study. Analyses of covariance results indicated that the TTS intervention had a significant, positive effect on student reading vocabulary and reading comprehension after 10 weeks of TTS software use (average 582 minutes). There are several limitations to the study; however, the current study opens up for discussions and need for further studies investigating TTS software as a viable reading intervention for adolescent struggling readers.
International Collegium of Rehabilitative Audiology (ICRA) recommendations for the construction of multilingual speech tests. ICRA Working Group on Multilingual Speech Tests.

PubMed

Akeroyd, Michael A; Arlinger, Stig; Bentler, Ruth A; Boothroyd, Arthur; Dillier, Norbert; Dreschler, Wouter A; Gagné, Jean-Pierre; Lutman, Mark; Wouters, Jan; Wong, Lena; Kollmeier, Birger

2015-01-01

To provide guidelines for the development of two types of closed-set speech-perception tests that can be applied and interpreted in the same way across languages. The guidelines cover the digit triplet and the matrix sentence tests that are most commonly used to test speech recognition in noise. They were developed by a working group on Multilingual Speech Tests of the International Collegium of Rehabilitative Audiology (ICRA). The recommendations are based on reviews of existing evaluations of the digit triplet and matrix tests as well as on the research experience of members of the ICRA Working Group. They represent the results of a consensus process. The resulting recommendations deal with: Test design and word selection; Talker characteristics; Audio recording and stimulus preparation; Masking noise; Test administration; and Test validation. By following these guidelines for the development of any new test of this kind, clinicians and researchers working in any language will be able to perform tests whose results can be compared and combined in cross-language studies.
Speech and pause characteristics in multiple sclerosis: A preliminary study of speakers with high and low neuropsychological test performance

PubMed Central

FEENAUGHTY, LYNDA; TJADEN, KRIS; BENEDICT, RALPH H.B.; WEINSTOCK-GUTTMAN, BIANCA

2017-01-01

This preliminary study investigated how cognitive-linguistic status in multiple sclerosis (MS) is reflected in two speech tasks (i.e. oral reading, narrative) that differ in cognitive-linguistic demand. Twenty individuals with MS were selected to comprise High and Low performance groups based on clinical tests of executive function and information processing speed and efficiency. Ten healthy controls were included for comparison. Speech samples were audio-recorded and measures of global speech timing were obtained. Results indicated predicted differences in global speech timing (i.e. speech rate and pause characteristics) for speech tasks differing in cognitive-linguistic demand, but the magnitude of these task-related differences was similar for all speaker groups. Findings suggest that assumptions concerning the cognitive-linguistic demands of reading aloud as compared to spontaneous speech may need to be re-considered for individuals with cognitive impairment. Qualitative trends suggest that additional studies investigating the association between cognitive-linguistic and speech motor variables in MS are warranted. PMID:23294227
High-Fidelity Piezoelectric Audio Device

NASA Technical Reports Server (NTRS)

Woodward, Stanley E.; Fox, Robert L.; Bryant, Robert G.

2003-01-01

ModalMax is a very innovative means of harnessing the vibration of a piezoelectric actuator to produce an energy efficient low-profile device with high-bandwidth high-fidelity audio response. The piezoelectric audio device outperforms many commercially available speakers made using speaker cones. The piezoelectric device weighs substantially less (4 g) than the speaker cones which use magnets (10 g). ModalMax devices have extreme fabrication simplicity. The entire audio device is fabricated by lamination. The simplicity of the design lends itself to lower cost. The piezoelectric audio device can be used without its acoustic chambers and thereby resulting in a very low thickness of 0.023 in. (0.58 mm). The piezoelectric audio device can be completely encapsulated, which makes it very attractive for use in wet environments. Encapsulation does not significantly alter the audio response. Its small size (see Figure 1) is applicable to many consumer electronic products, such as pagers, portable radios, headphones, laptop computers, computer monitors, toys, and electronic games. The audio device can also be used in automobile or aircraft sound systems.
Quantitative characterisation of audio data by ordinal symbolic dynamics

NASA Astrophysics Data System (ADS)

Aschenbrenner, T.; Monetti, R.; Amigó, J. M.; Bunk, W.

2013-06-01

Ordinal symbolic dynamics has developed into a valuable method to describe complex systems. Recently, using the concept of transcripts, the coupling behaviour of systems was assessed, combining the properties of the symmetric group with information theoretic ideas. In this contribution, methods from the field of ordinal symbolic dynamics are applied to the characterisation of audio data. Coupling complexity between frequency bands of solo violin music, as a fingerprint of the instrument, is used for classification purposes within a support vector machine scheme. Our results suggest that coupling complexity is able to capture essential characteristics, sufficient to distinguish among different violins.
Smartphone Application for the Analysis of Prosodic Features in Running Speech with a Focus on Bipolar Disorders: System Performance Evaluation and Case Study.

PubMed

Guidi, Andrea; Salvi, Sergio; Ottaviano, Manuel; Gentili, Claudio; Bertschy, Gilles; de Rossi, Danilo; Scilingo, Enzo Pasquale; Vanello, Nicola

2015-11-06

Bipolar disorder is one of the most common mood disorders characterized by large and invalidating mood swings. Several projects focus on the development of decision support systems that monitor and advise patients, as well as clinicians. Voice monitoring and speech signal analysis can be exploited to reach this goal. In this study, an Android application was designed for analyzing running speech using a smartphone device. The application can record audio samples and estimate speech fundamental frequency, F0, and its changes. F0-related features are estimated locally on the smartphone, with some advantages with respect to remote processing approaches in terms of privacy protection and reduced upload costs. The raw features can be sent to a central server and further processed. The quality of the audio recordings, algorithm reliability and performance of the overall system were evaluated in terms of voiced segment detection and features estimation. The results demonstrate that mean F0 from each voiced segment can be reliably estimated, thus describing prosodic features across the speech sample. Instead, features related to F0 variability within each voiced segment performed poorly. A case study performed on a bipolar patient is presented.
Smartphone Application for the Analysis of Prosodic Features in Running Speech with a Focus on Bipolar Disorders: System Performance Evaluation and Case Study

PubMed Central

Guidi, Andrea; Salvi, Sergio; Ottaviano, Manuel; Gentili, Claudio; Bertschy, Gilles; de Rossi, Danilo; Scilingo, Enzo Pasquale; Vanello, Nicola

2015-01-01

Bipolar disorder is one of the most common mood disorders characterized by large and invalidating mood swings. Several projects focus on the development of decision support systems that monitor and advise patients, as well as clinicians. Voice monitoring and speech signal analysis can be exploited to reach this goal. In this study, an Android application was designed for analyzing running speech using a smartphone device. The application can record audio samples and estimate speech fundamental frequency, F0, and its changes. F0-related features are estimated locally on the smartphone, with some advantages with respect to remote processing approaches in terms of privacy protection and reduced upload costs. The raw features can be sent to a central server and further processed. The quality of the audio recordings, algorithm reliability and performance of the overall system were evaluated in terms of voiced segment detection and features estimation. The results demonstrate that mean F0 from each voiced segment can be reliably estimated, thus describing prosodic features across the speech sample. Instead, features related to F0 variability within each voiced segment performed poorly. A case study performed on a bipolar patient is presented. PMID:26561811
[A modified speech enhancement algorithm for electronic cochlear implant and its digital signal processing realization].

PubMed

Wang, Yulin; Tian, Xuelong

2014-08-01

In order to improve the speech quality and auditory perceptiveness of electronic cochlear implant under strong noise background, a speech enhancement system used for electronic cochlear implant front-end was constructed. Taking digital signal processing (DSP) as the core, the system combines its multi-channel buffered serial port (McBSP) data transmission channel with extended audio interface chip TLV320AIC10, so speech signal acquisition and output with high speed are realized. Meanwhile, due to the traditional speech enhancement method which has the problems as bad adaptability, slow convergence speed and big steady-state error, versiera function and de-correlation principle were used to improve the existing adaptive filtering algorithm, which effectively enhanced the quality of voice communications. Test results verified the stability of the system and the de-noising performance of the algorithm, and it also proved that they could provide clearer speech signals for the deaf or tinnitus patients.
Audio-visual interactions in environment assessment.

PubMed

Preis, Anna; Kociński, Jędrzej; Hafke-Dys, Honorata; Wrzosek, Małgorzata

2015-08-01

The aim of the study was to examine how visual and audio information influences audio-visual environment assessment. Original audio-visual recordings were made at seven different places in the city of Poznań. Participants of the psychophysical experiments were asked to rate, on a numerical standardized scale, the degree of comfort they would feel if they were in such an environment. The assessments of audio-visual comfort were carried out in a laboratory in four different conditions: (a) audio samples only, (b) original audio-visual samples, (c) video samples only, and (d) mixed audio-visual samples. The general results of this experiment showed a significant difference between the investigated conditions, but not for all the investigated samples. There was a significant improvement in comfort assessment when visual information was added (in only three out of 7 cases), when conditions (a) and (b) were compared. On the other hand, the results show that the comfort assessment of audio-visual samples could be changed by manipulating the audio rather than the video part of the audio-visual sample. Finally, it seems, that people could differentiate audio-visual representations of a given place in the environment based rather of on the sound sources' compositions than on the sound level. Object identification is responsible for both landscape and soundscape grouping. Copyright © 2015. Published by Elsevier B.V.
Benefits of Music Training for Perception of Emotional Speech Prosody in Deaf Children With Cochlear Implants.

PubMed

Good, Arla; Gordon, Karen A; Papsin, Blake C; Nespoli, Gabe; Hopyan, Talar; Peretz, Isabelle; Russo, Frank A

Children who use cochlear implants (CIs) have characteristic pitch processing deficits leading to impairments in music perception and in understanding emotional intention in spoken language. Music training for normal-hearing children has previously been shown to benefit perception of emotional prosody. The purpose of the present study was to assess whether deaf children who use CIs obtain similar benefits from music training. We hypothesized that music training would lead to gains in auditory processing and that these gains would transfer to emotional speech prosody perception. Study participants were 18 child CI users (ages 6 to 15). Participants received either 6 months of music training (i.e., individualized piano lessons) or 6 months of visual art training (i.e., individualized painting lessons). Measures of music perception and emotional speech prosody perception were obtained pre-, mid-, and post-training. The Montreal Battery for Evaluation of Musical Abilities was used to measure five different aspects of music perception (scale, contour, interval, rhythm, and incidental memory). The emotional speech prosody task required participants to identify the emotional intention of a semantically neutral sentence under audio-only and audiovisual conditions. Music training led to improved performance on tasks requiring the discrimination of melodic contour and rhythm, as well as incidental memory for melodies. These improvements were predominantly found from mid- to post-training. Critically, music training also improved emotional speech prosody perception. Music training was most advantageous in audio-only conditions. Art training did not lead to the same improvements. Music training can lead to improvements in perception of music and emotional speech prosody, and thus may be an effective supplementary technique for supporting auditory rehabilitation following cochlear implantation.
Music and speech listening enhance the recovery of early sensory processing after stroke.

PubMed

Särkämö, Teppo; Pihko, Elina; Laitinen, Sari; Forsblom, Anita; Soinila, Seppo; Mikkonen, Mikko; Autti, Taina; Silvennoinen, Heli M; Erkkilä, Jaakko; Laine, Matti; Peretz, Isabelle; Hietanen, Marja; Tervaniemi, Mari

2010-12-01

Our surrounding auditory environment has a dramatic influence on the development of basic auditory and cognitive skills, but little is known about how it influences the recovery of these skills after neural damage. Here, we studied the long-term effects of daily music and speech listening on auditory sensory memory after middle cerebral artery (MCA) stroke. In the acute recovery phase, 60 patients who had middle cerebral artery stroke were randomly assigned to a music listening group, an audio book listening group, or a control group. Auditory sensory memory, as indexed by the magnetic MMN (MMNm) response to changes in sound frequency and duration, was measured 1 week (baseline), 3 months, and 6 months after the stroke with whole-head magnetoencephalography recordings. Fifty-four patients completed the study. Results showed that the amplitude of the frequency MMNm increased significantly more in both music and audio book groups than in the control group during the 6-month poststroke period. In contrast, the duration MMNm amplitude increased more in the audio book group than in the other groups. Moreover, changes in the frequency MMNm amplitude correlated significantly with the behavioral improvement of verbal memory and focused attention induced by music listening. These findings demonstrate that merely listening to music and speech after neural damage can induce long-term plastic changes in early sensory processing, which, in turn, may facilitate the recovery of higher cognitive functions. The neural mechanisms potentially underlying this effect are discussed.
Recognition of Speech from the Television with Use of a Wireless Technology Designed for Cochlear Implants.

PubMed

Duke, Mila Morais; Wolfe, Jace; Schafer, Erin

2016-05-01

Cochlear implant (CI) recipients often experience difficulty understanding speech in noise and speech that originates from a distance. Many CI recipients also experience difficulty understanding speech originating from a television. Use of hearing assistance technology (HAT) may improve speech recognition in noise and for signals that originate from more than a few feet from the listener; however, there are no published studies evaluating the potential benefits of a wireless HAT designed to deliver audio signals from a television directly to a CI sound processor. The objective of this study was to compare speech recognition in quiet and in noise of CI recipients with the use of their CI alone and with the use of their CI and a wireless HAT (Cochlear Wireless TV Streamer). A two-way repeated measures design was used to evaluate performance differences obtained in quiet and in competing noise (65 dBA) with the CI sound processor alone and with the sound processor coupled to the Cochlear Wireless TV Streamer. Sixteen users of Cochlear Nucleus 24 Freedom, CI512, and CI422 implants were included in the study. Participants were evaluated in four conditions including use of the sound processor alone and use of the sound processor with the wireless streamer in quiet and in the presence of competing noise at 65 dBA. Speech recognition was evaluated in each condition with two full lists of Computer-Assisted Speech Perception Testing and Training Sentence-Level Test sentences presented from a light-emitting diode television. Speech recognition in noise was significantly better with use of the wireless streamer compared to participants' performance with their CI sound processor alone. There was also a nonsignificant trend toward better performance in quiet with use of the TV Streamer. Performance was significantly poorer when evaluated in noise compared to performance in quiet when the TV Streamer was not used. Use of the Cochlear Wireless TV Streamer designed to stream audio
Prevalence of Vocal Problems: Speech-Language Pathologists' Evaluation of Music and Non-Music Teacher Recordings

ERIC Educational Resources Information Center

Hackworth, Rhonda S.

2013-01-01

The current study, a preliminary examination of whether music teachers are more susceptible to vocal problems than teachers of other subjects, asked for expert evaluation of audio recordings from licensed speech-language pathologists. Participants (N = 41) taught music (n = 23) or another subject (n = 18) in either elementary (n = 21), middle (n =…
Recording high quality speech during tagged cine-MRI studies using a fiber optic microphone.

PubMed

NessAiver, Moriel S; Stone, Maureen; Parthasarathy, Vijay; Kahana, Yuvi; Paritsky, Alexander; Paritsky, Alex

2006-01-01

To investigate the feasibility of obtaining high quality speech recordings during cine imaging of tongue movement using a fiber optic microphone. A Complementary Spatial Modulation of Magnetization (C-SPAMM) tagged cine sequence triggered by an electrocardiogram (ECG) simulator was used to image a volunteer while speaking the syllable pairs /a/-/u/, /i/-/u/, and the words "golly" and "Tamil" in sync with the imaging sequence. A noise-canceling, optical microphone was fastened approximately 1-2 inches above the mouth of the volunteer. The microphone was attached via optical fiber to a laptop computer, where the speech was sampled at 44.1 kHz. A reference recording of gradient activity with no speech was subtracted from target recordings. Good quality speech was discernible above the background gradient sound using the fiber optic microphone without reference subtraction. The audio waveform of gradient activity was extremely stable and reproducible. Subtraction of the reference gradient recording further reduced gradient noise by roughly 21 dB, resulting in exceptionally high quality speech waveforms. It is possible to obtain high quality speech recordings using an optical microphone even during exceptionally loud cine imaging sequences. This opens up the possibility of more elaborate MRI studies of speech including spectral analysis of the speech signal in all types of MRI.
Real-Time Control of an Articulatory-Based Speech Synthesizer for Brain Computer Interfaces

PubMed Central

Bocquelet, Florent; Hueber, Thomas; Girin, Laurent; Savariaux, Christophe; Yvert, Blaise

2016-01-01

Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real-time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real-time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real-time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the position of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained as assessed by perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open to future speech BCI applications using such articulatory-based speech synthesizer. PMID:27880768

Influence of Telecommunication Modality, Internet Transmission Quality, and Accessories on Speech Perception in Cochlear Implant Users

PubMed Central

Koller, Roger; Guignard, Jérémie; Caversaccio, Marco; Kompis, Martin; Senn, Pascal

2017-01-01

Background Telecommunication is limited or even impossible for more than one-thirds of all cochlear implant (CI) users. Objective We sought therefore to study the impact of voice quality on speech perception with voice over Internet protocol (VoIP) under real and adverse network conditions. Methods Telephone speech perception was assessed in 19 CI users (15-69 years, average 42 years), using the German HSM (Hochmair-Schulz-Moser) sentence test comparing Skype and conventional telephone (public switched telephone networks, PSTN) transmission using a personal computer (PC) and a digital enhanced cordless telecommunications (DECT) telephone dual device. Five different Internet transmission quality modes and four accessories (PC speakers, headphones, 3.5 mm jack audio cable, and induction loop) were compared. As a secondary outcome, the subjective perceived voice quality was assessed using the mean opinion score (MOS). Results Speech telephone perception was significantly better (median 91.6%, P<.001) with Skype compared with PSTN (median 42.5%) under optimal conditions. Skype calls under adverse network conditions (data packet loss > 15%) were not superior to conventional telephony. In addition, there were no significant differences between the tested accessories (P>.05) using a PC. Coupling a Skype DECT phone device with an audio cable to the CI, however, resulted in higher speech perception (median 65%) and subjective MOS scores (3.2) than using PSTN (median 7.5%, P<.001). Conclusions Skype calls significantly improve speech perception for CI users compared with conventional telephony under real network conditions. Listening accessories do not further improve listening experience. Current Skype DECT telephone devices do not fully offer technical advantages in voice quality. PMID:28438727
Vocal Tract Representation in the Recognition of Cerebral Palsied Speech

ERIC Educational Resources Information Center

Rudzicz, Frank; Hirst, Graeme; van Lieshout, Pascal

2012-01-01

Purpose: In this study, the authors explored articulatory information as a means of improving the recognition of dysarthric speech by machine. Method: Data were derived chiefly from the TORGO database of dysarthric articulation (Rudzicz, Namasivayam, & Wolff, 2011) in which motions of various points in the vocal tract are measured during speech.…
Robotics control using isolated word recognition of voice input

NASA Technical Reports Server (NTRS)

Weiner, J. M.

1977-01-01

A speech input/output system is presented that can be used to communicate with a task oriented system. Human speech commands and synthesized voice output extend conventional information exchange capabilities between man and machine by utilizing audio input and output channels. The speech input facility is comprised of a hardware feature extractor and a microprocessor implemented isolated word or phrase recognition system. The recognizer offers a medium sized (100 commands), syntactically constrained vocabulary, and exhibits close to real time performance. The major portion of the recognition processing required is accomplished through software, minimizing the complexity of the hardware feature extractor.
Machine aided indexing from natural language text

NASA Technical Reports Server (NTRS)

Silvester, June P.; Genuardi, Michael T.; Klingbiel, Paul H.

1993-01-01

The NASA Lexical Dictionary (NLD) Machine Aided Indexing (MAI) system was designed to (1) reuse the indexing of the Defense Technical Information Center (DTIC); (2) reuse the indexing of the Department of Energy (DOE); and (3) reduce the time required for original indexing. This was done by automatically generating appropriate NASA thesaurus terms from either the other agency's index terms, or, for original indexing, from document titles and abstracts. The NASA STI Program staff devised two different ways to generate thesaurus terms from text. The first group of programs identified noun phrases by a parsing method that allowed for conjunctions and certain prepositions, on the assumption that indexable concepts are found in such phrases. Results were not always satisfactory, and it was noted that indexable concepts often occurred outside of noun phrases. The first method also proved to be too slow for the ultimate goal of interactive (online) MAI. The second group of programs used the knowledge base (KB), word proximity, and frequency of word and phrase occurrence to identify indexable concepts. Both methods are described and illustrated. Online MAI has been achieved, as well as several spinoff benefits, which are also described.
Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

NASA Astrophysics Data System (ADS)

Heracleous, Panikos; Kaino, Tomomi; Saruwatari, Hiroshi; Shikano, Kiyohiro

2006-12-01

We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20 k dictation task a[InlineEquation not available: see fulltext.] word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.
Audio-Visual, Visuo-Tactile and Audio-Tactile Correspondences in Preschoolers.

PubMed

Nava, Elena; Grassi, Massimo; Turati, Chiara

2016-01-01

Interest in crossmodal correspondences has recently seen a renaissance thanks to numerous studies in human adults. Yet, still very little is known about crossmodal correspondences in children, particularly in sensory pairings other than audition and vision. In the current study, we investigated whether 4-5-year-old children match auditory pitch to the spatial motion of visual objects (audio-visual condition). In addition, we investigated whether this correspondence extends to touch, i.e., whether children also match auditory pitch to the spatial motion of touch (audio-tactile condition) and the spatial motion of visual objects to touch (visuo-tactile condition). In two experiments, two different groups of children were asked to indicate which of two stimuli fitted best with a centrally located third stimulus (Experiment 1), or to report whether two presented stimuli fitted together well (Experiment 2). We found sensitivity to the congruency of all of the sensory pairings only in Experiment 2, suggesting that only under specific circumstances can these correspondences be observed. Our results suggest that pitch-height correspondences for audio-visual and audio-tactile combinations may still be weak in preschool children, and speculate that this could be due to immature linguistic and auditory cues that are still developing at age five.
Creating speech-synchronized animation.

PubMed

King, Scott A; Parent, Richard E

2005-01-01

We present a facial model designed primarily to support animated speech. Our facial model takes facial geometry as input and transforms it into a parametric deformable model. The facial model uses a muscle-based parameterization, allowing for easier integration between speech synchrony and facial expressions. Our facial model has a highly deformable lip model that is grafted onto the input facial geometry to provide the necessary geometric complexity needed for creating lip shapes and high-quality renderings. Our facial model also includes a highly deformable tongue model that can represent the shapes the tongue undergoes during speech. We add teeth, gums, and upper palate geometry to complete the inner mouth. To decrease the processing time, we hierarchically deform the facial surface. We also present a method to animate the facial model over time to create animated speech using a model of coarticulation that blends visemes together using dominance functions. We treat visemes as a dynamic shaping of the vocal tract by describing visemes as curves instead of keyframes. We show the utility of the techniques described in this paper by implementing them in a text-to-audiovisual-speech system that creates animation of speech from unrestricted text. The facial and coarticulation models must first be interactively initialized. The system then automatically creates accurate real-time animated speech from the input text. It is capable of cheaply producing tremendous amounts of animated speech with very low resource requirements.
The effect of bone conduction microphone placement on intensity and spectrum of transmitted speech items.

PubMed

Tran, Phuong K; Letowski, Tomasz R; McBride, Maranda E

2013-06-01

Speech signals can be converted into electrical audio signals using either conventional air conduction (AC) microphone or a contact bone conduction (BC) microphone. The goal of this study was to investigate the effects of the location of a BC microphone on the intensity and frequency spectrum of the recorded speech. Twelve locations, 11 on the talker's head and 1 on the collar bone, were investigated. The speech sounds were three vowels (/u/, /a/, /i/) and two consonants (/m/, /∫/). The sounds were produced by 12 talkers. Each sound was recorded simultaneously with two BC microphones and an AC microphone. Analyzed spectral data showed that the BC recordings made at the forehead of the talker were the most similar to the AC recordings, whereas the collar bone recordings were most different. Comparison of the spectral data with speech intelligibility data collected in another study revealed a strong negative relationship between BC speech intelligibility and the degree of deviation of the BC speech spectrum from the AC spectrum. In addition, the head locations that resulted in the highest speech intelligibility were associated with the lowest output signals among all tested locations. Implications of these findings for BC communication are discussed.
Speech Analysis and Synthesis and Man-Machine Speech Communications for Air Operations. (Synthese et Analyse de la Parole et Liaisons Vocales Homme- Machine dans les Operations Aeriennes)

DTIC Science & Technology

1990-05-01

speech produced by these systems. Finally, perhaps the greatest recent impetus in advancing digital Finally, in the area of speech and speaker recognitio ...XX) Ilz and logarithmic beyond I(XX) Hz (91. ts(n) *n) n)mW0) SWS BNLP LOGO *) -KQfl1 BANoPASS FILTER LOWPASS FILTER 0 fLi fHl f 0 fLP f FIgure 2
Random Deep Belief Networks for Recognizing Emotions from Speech Signals.

PubMed

Wen, Guihua; Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang

2017-01-01

Now the human emotions can be recognized from speech signals using machine learning methods; however, they are challenged by the lower recognition accuracies in real applications due to lack of the rich representation ability. Deep belief networks (DBN) can automatically discover the multiple levels of representations in speech signals. To make full of its advantages, this paper presents an ensemble of random deep belief networks (RDBN) method for speech emotion recognition. It firstly extracts the low level features of the input speech signal and then applies them to construct lots of random subspaces. Each random subspace is then provided for DBN to yield the higher level features as the input of the classifier to output an emotion label. All outputted emotion labels are then fused through the majority voting to decide the final emotion label for the input speech signal. The conducted experimental results on benchmark speech emotion databases show that RDBN has better accuracy than the compared methods for speech emotion recognition.
Random Deep Belief Networks for Recognizing Emotions from Speech Signals

PubMed Central

Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang

2017-01-01

Now the human emotions can be recognized from speech signals using machine learning methods; however, they are challenged by the lower recognition accuracies in real applications due to lack of the rich representation ability. Deep belief networks (DBN) can automatically discover the multiple levels of representations in speech signals. To make full of its advantages, this paper presents an ensemble of random deep belief networks (RDBN) method for speech emotion recognition. It firstly extracts the low level features of the input speech signal and then applies them to construct lots of random subspaces. Each random subspace is then provided for DBN to yield the higher level features as the input of the classifier to output an emotion label. All outputted emotion labels are then fused through the majority voting to decide the final emotion label for the input speech signal. The conducted experimental results on benchmark speech emotion databases show that RDBN has better accuracy than the compared methods for speech emotion recognition. PMID:28356908
Sparse gammatone signal model optimized for English speech does not match the human auditory filters.

PubMed

Strahl, Stefan; Mertins, Alfred

2008-07-18

Evidence that neurosensory systems use sparse signal representations as well as improved performance of signal processing algorithms using sparse signal models raised interest in sparse signal coding in the last years. For natural audio signals like speech and environmental sounds, gammatone atoms have been derived as expansion functions that generate a nearly optimal sparse signal model (Smith, E., Lewicki, M., 2006. Efficient auditory coding. Nature 439, 978-982). Furthermore, gammatone functions are established models for the human auditory filters. Thus far, a practical application of a sparse gammatone signal model has been prevented by the fact that deriving the sparsest representation is, in general, computationally intractable. In this paper, we applied an accelerated version of the matching pursuit algorithm for gammatone dictionaries allowing real-time and large data set applications. We show that a sparse signal model in general has advantages in audio coding and that a sparse gammatone signal model encodes speech more efficiently in terms of sparseness than a sparse modified discrete cosine transform (MDCT) signal model. We also show that the optimal gammatone parameters derived for English speech do not match the human auditory filters, suggesting for signal processing applications to derive the parameters individually for each applied signal class instead of using psychometrically derived parameters. For brain research, it means that care should be taken with directly transferring findings of optimality for technical to biological systems.
Audio distribution and Monitoring Circuit

NASA Technical Reports Server (NTRS)

Kirkland, J. M.

1983-01-01

Versatile circuit accepts and distributes TV audio signals. Three-meter audio distribution and monitoring circuit provides flexibility in monitoring, mixing, and distributing audio inputs and outputs at various signal and impedance levels. Program material is simultaneously monitored on three channels, or single-channel version built to monitor transmitted or received signal levels, drive speakers, interface to building communications, and drive long-line circuits.
"The Seesaw Is a Machine that Goes up and down": Young Children's Narrative Responses to Science-Related Informational Text

ERIC Educational Resources Information Center

Mantzicopoulos, Panayota; Patrick, Helen

2010-01-01

Research Findings: We report on an assessment developed to document young children's narrative production after listening to short segments of science-related informational text (SciT) on life science, earth and space, and simple machines. We examine differences between kindergarten boys (n = 39) and girls (n = 29) on several indices of narrative…
Benefits of Music Training for Perception of Emotional Speech Prosody in Deaf Children With Cochlear Implants

PubMed Central

Gordon, Karen A.; Papsin, Blake C.; Nespoli, Gabe; Hopyan, Talar; Peretz, Isabelle; Russo, Frank A.

2017-01-01

Objectives: Children who use cochlear implants (CIs) have characteristic pitch processing deficits leading to impairments in music perception and in understanding emotional intention in spoken language. Music training for normal-hearing children has previously been shown to benefit perception of emotional prosody. The purpose of the present study was to assess whether deaf children who use CIs obtain similar benefits from music training. We hypothesized that music training would lead to gains in auditory processing and that these gains would transfer to emotional speech prosody perception. Design: Study participants were 18 child CI users (ages 6 to 15). Participants received either 6 months of music training (i.e., individualized piano lessons) or 6 months of visual art training (i.e., individualized painting lessons). Measures of music perception and emotional speech prosody perception were obtained pre-, mid-, and post-training. The Montreal Battery for Evaluation of Musical Abilities was used to measure five different aspects of music perception (scale, contour, interval, rhythm, and incidental memory). The emotional speech prosody task required participants to identify the emotional intention of a semantically neutral sentence under audio-only and audiovisual conditions. Results: Music training led to improved performance on tasks requiring the discrimination of melodic contour and rhythm, as well as incidental memory for melodies. These improvements were predominantly found from mid- to post-training. Critically, music training also improved emotional speech prosody perception. Music training was most advantageous in audio-only conditions. Art training did not lead to the same improvements. Conclusions: Music training can lead to improvements in perception of music and emotional speech prosody, and thus may be an effective supplementary technique for supporting auditory rehabilitation following cochlear implantation. PMID:28085739
Accurate visible speech synthesis based on concatenating variable length motion capture data.

PubMed

Ma, Jiyong; Cole, Ron; Pellom, Bryan; Ward, Wayne; Wise, Barbara

2006-01-01

We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is desrcribed. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective evaluation and objective evaluation are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
Audio 2008: Audio Fixation

ERIC Educational Resources Information Center

Kaye, Alan L.

2008-01-01

Take a look around the bus or subway and see just how many people are bumping along to an iPod or an MP3 player. What they are listening to is their secret, but the many signature earbuds in sight should give one a real sense of just how pervasive digital audio has become. This article describes how that popularity is mirrored in library audio…
Cluster: Metals. Course: Machine Shop. Research Project.

ERIC Educational Resources Information Center

Sanford - Lee County Schools, NC.

The set of 13 units is designed for use with an instructor in actual machine shop practice and is also keyed to audio visual and textual materials. Each unit contains a series of task packages which: specify prerequisites within the series (minimum is Unit 1); provide a narrative rationale for learning; list both general and specific objectives in…
Audio-visual integration during speech perception in prelingually deafened Japanese children revealed by the McGurk effect.

PubMed

Tona, Risa; Naito, Yasushi; Moroto, Saburo; Yamamoto, Rinko; Fujiwara, Keizo; Yamazaki, Hiroshi; Shinohara, Shogo; Kikuchi, Masahiro

2015-12-01

To investigate the McGurk effect in profoundly deafened Japanese children with cochlear implants (CI) and in normal-hearing children. This was done to identify how children with profound deafness using CI established audiovisual integration during the speech acquisition period. Twenty-four prelingually deafened children with CI and 12 age-matched normal-hearing children participated in this study. Responses to audiovisual stimuli were compared between deafened and normal-hearing controls. Additionally, responses of the children with CI younger than 6 years of age were compared with those of the children with CI at least 6 years of age at the time of the test. Responses to stimuli combining auditory labials and visual non-labials were significantly different between deafened children with CI and normal-hearing controls (p<0.05). Additionally, the McGurk effect tended to be more induced in deafened children older than 6 years of age than in their younger counterparts. The McGurk effect was more significantly induced in prelingually deafened Japanese children with CI than in normal-hearing, age-matched Japanese children. Despite having good speech-perception skills and auditory input through their CI, from early childhood, deafened children may use more visual information in speech perception than normal-hearing children. As children using CI need to communicate based on insufficient speech signals coded by CI, additional activities of higher-order brain function may be necessary to compensate for the incomplete auditory input. This study provided information on the influence of deafness on the development of audiovisual integration related to speech, which could contribute to our further understanding of the strategies used in spoken language communication by prelingually deafened children. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
TRAINING TYPISTS IN THE INDUSTRIAL ENVIRONMENT--PRELIMINARY REPORT OF A PROTOTYPE SYSTEM OF SIMULTANEOUS, MULTILEVEL, MULTIPHASIC AUDIO PROGRAMMING.

ERIC Educational Resources Information Center

ADAMS, CHARLES F.

IN 1965 TEN NEGRO AND PUERTO RICAN GIRLS BEGAN CLERICAL TRAINING IN THE NATIONAL ASSOCIATION OF MANUFACTURERS (NAM) TYPING LABORATORY I (TEELAB-I), A PILOT PROJECT TO DEVELOP A SYSTEM OF TRAINING TYPISTS WITHIN THE INDUSTRIAL ENVIRONMENT. THE INITIAL SYSTEM, AN ADAPTATION OF GREGG AUDIO MATERIALS TO A MACHINE TECHNOLOGY, TAUGHT ACCURACY, SPEED…

Brain networks engaged in audiovisual integration during speech perception revealed by persistent homology-based network filtration.

PubMed

Kim, Heejung; Hahm, Jarang; Lee, Hyekyoung; Kang, Eunjoo; Kang, Hyejin; Lee, Dong Soo

2015-05-01

The human brain naturally integrates audiovisual information to improve speech perception. However, in noisy environments, understanding speech is difficult and may require much effort. Although the brain network is supposed to be engaged in speech perception, it is unclear how speech-related brain regions are connected during natural bimodal audiovisual or unimodal speech perception with counterpart irrelevant noise. To investigate the topological changes of speech-related brain networks at all possible thresholds, we used a persistent homological framework through hierarchical clustering, such as single linkage distance, to analyze the connected component of the functional network during speech perception using functional magnetic resonance imaging. For speech perception, bimodal (audio-visual speech cue) or unimodal speech cues with counterpart irrelevant noise (auditory white-noise or visual gum-chewing) were delivered to 15 subjects. In terms of positive relationship, similar connected components were observed in bimodal and unimodal speech conditions during filtration. However, during speech perception by congruent audiovisual stimuli, the tighter couplings of left anterior temporal gyrus-anterior insula component and right premotor-visual components were observed than auditory or visual speech cue conditions, respectively. Interestingly, visual speech is perceived under white noise by tight negative coupling in the left inferior frontal region-right anterior cingulate, left anterior insula, and bilateral visual regions, including right middle temporal gyrus, right fusiform components. In conclusion, the speech brain network is tightly positively or negatively connected, and can reflect efficient or effortful processes during natural audiovisual integration or lip-reading, respectively, in speech perception.
Real-Time Speech-to-Text Services. [A Report of the] National Task Force on Quality of Services in the Postsecondary Education of Deaf and Hard of Hearing Students.

ERIC Educational Resources Information Center

Stinson, Michael; Eisenberg, Sandy; Horn, Christy; Larson, Judy; Levitt, Harry; Stuckless, Ross

This report describes and discusses several applications of new computer-based technologies which enable postsecondary students with deafness or hearing impairments to read the text of the language being spoken by the instructor and fellow students virtually in real time. Two current speech-to-text options are described: (1) steno-based systems in…
Instrumental Landing Using Audio Indication

NASA Astrophysics Data System (ADS)

Burlak, E. A.; Nabatchikov, A. M.; Korsun, O. N.

2018-02-01

The paper proposes an audio indication method for presenting to a pilot the information regarding the relative positions of an aircraft in the tasks of precision piloting. The implementation of the method is presented, the use of such parameters of audio signal as loudness, frequency and modulation are discussed. To confirm the operability of the audio indication channel the experiments using modern aircraft simulation facility were carried out. The simulated performed the instrument landing using the proposed audio method to indicate the aircraft deviations in relation to the slide path. The results proved compatible with the simulated instrumental landings using the traditional glidescope pointers. It inspires to develop the method in order to solve other precision piloting tasks.
Influence of Telecommunication Modality, Internet Transmission Quality, and Accessories on Speech Perception in Cochlear Implant Users.

PubMed

Mantokoudis, Georgios; Koller, Roger; Guignard, Jérémie; Caversaccio, Marco; Kompis, Martin; Senn, Pascal

2017-04-24

Telecommunication is limited or even impossible for more than one-thirds of all cochlear implant (CI) users. We sought therefore to study the impact of voice quality on speech perception with voice over Internet protocol (VoIP) under real and adverse network conditions. Telephone speech perception was assessed in 19 CI users (15-69 years, average 42 years), using the German HSM (Hochmair-Schulz-Moser) sentence test comparing Skype and conventional telephone (public switched telephone networks, PSTN) transmission using a personal computer (PC) and a digital enhanced cordless telecommunications (DECT) telephone dual device. Five different Internet transmission quality modes and four accessories (PC speakers, headphones, 3.5 mm jack audio cable, and induction loop) were compared. As a secondary outcome, the subjective perceived voice quality was assessed using the mean opinion score (MOS). Speech telephone perception was significantly better (median 91.6%, P<.001) with Skype compared with PSTN (median 42.5%) under optimal conditions. Skype calls under adverse network conditions (data packet loss > 15%) were not superior to conventional telephony. In addition, there were no significant differences between the tested accessories (P>.05) using a PC. Coupling a Skype DECT phone device with an audio cable to the CI, however, resulted in higher speech perception (median 65%) and subjective MOS scores (3.2) than using PSTN (median 7.5%, P<.001). Skype calls significantly improve speech perception for CI users compared with conventional telephony under real network conditions. Listening accessories do not further improve listening experience. Current Skype DECT telephone devices do not fully offer technical advantages in voice quality. ©Georgios Mantokoudis, Roger Koller, Jérémie Guignard, Marco Caversaccio, Martin Kompis, Pascal Senn. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 24.04.2017.
Evaluation of selected speech parameters after prosthesis supply in patients with maxillary or mandibular defects.

PubMed

Müller, Rainer; Höhlein, Andreas; Wolf, Annette; Markwardt, Jutta; Schulz, Matthias C; Range, Ursula; Reitemeier, Bernd

2013-01-01

Ablative surgery of oropharyngeal tumors frequently leads to defects in the speech organs, resulting in impairment of speech up to the point of unintelligibility. The aim of the present study was the assessment of selected parameters of speech with and without resection prostheses. The speech sounds of 22 patients suffering from maxillary and mandibular defects were recorded using a digital audio tape (DAT) recorder with and without resection prostheses. Evaluation of the resonance and the production of the sounds /s/, /sch/, and /ch/ was performed by 2 experienced speech therapists. Additionally, the patients completed a non-standardized questionnaire containing a linguistic self-assessment. After prosthesis supply, the number of patients with rhinophonia aperta decreased from 7 to 2 while the number of patients with intelligible speech increased from 2 to 20. Correct production of the sounds /s/, /sch/, and /ch/ increased from 2 to 13 patients. A significant improvement of the evaluated parameters could be observed only in patients with maxillary defects. The linguistic self-assessment showed a higher satisfaction in patients with maxillary defects. In patients with maxillary defects due to ablative tumor surgery, an increase in speech performance and intelligibility is possible by supplying resection prostheses. © 2013 S. Karger GmbH, Freiburg.
Holographic disk with high data transfer rate: its application to an audio response memory.

PubMed

Kubota, K; Ono, Y; Kondo, M; Sugama, S; Nishida, N; Sakaguchi, M

1980-03-15

This paper describes a memory realized with a high data transfer rate using the holographic parallel-processing function and its application to an audio response system that supplies many audio messages to many terminals simultaneously. Digitalized audio messages are recorded as tiny 1-D Fourier transform holograms on a holographic disk. A hologram recorder and a hologram reader were constructed to test and demonstrate the holographic audio response memory feasibility. Experimental results indicate the potentiality of an audio response system with a 2000-word vocabulary and 250-Mbit/sec bit transfer rate.
47 CFR Figure 2 to Subpart N of... - Typical Audio Wave

Code of Federal Regulations, 2011 CFR

2011-10-01

... 47 Telecommunication 1 2011-10-01 2011-10-01 false Typical Audio Wave 2 Figure 2 to Subpart N of Part 2 Telecommunication FEDERAL COMMUNICATIONS COMMISSION GENERAL FREQUENCY ALLOCATIONS AND RADIO... Audio Wave EC03JN91.006 ...
ENERGY STAR Certified Audio Video

EPA Pesticide Factsheets

Certified models meet all ENERGY STAR requirements as listed in the Version 3.0 ENERGY STAR Program Requirements for Audio Video Equipment that are effective as of May 1, 2013. A detailed listing of key efficiency criteria are available at http://www.energystar.gov/index.cfm?c=audio_dvd.pr_crit_audio_dvd
The priming function of in-car audio instruction.

PubMed

Keyes, Helen; Whitmore, Antony; Naneva, Stanislava; McDermott, Daragh

2018-05-01

Studies to date have focused on the priming power of visual road signs, but not the priming potential of audio road scene instruction. Here, the relative priming power of visual, audio, and multisensory road scene instructions was assessed. In a lab-based study, participants responded to target road scene turns following visual, audio, or multisensory road turn primes which were congruent or incongruent to the primes in direction, or control primes. All types of instruction (visual, audio, and multisensory) were successful in priming responses to a road scene. Responses to multisensory-primed targets (both audio and visual) were faster than responses to either audio or visual primes alone. Incongruent audio primes did not affect performance negatively in the manner of incongruent visual or multisensory primes. Results suggest that audio instructions have the potential to prime drivers to respond quickly and safely to their road environment. Peak performance will be observed if audio and visual road instruction primes can be timed to co-occur.
Paradox in AI - AI 2.0: The Way to Machine Consciousness

NASA Astrophysics Data System (ADS)

Palensky, Peter; Bruckner, Dietmar; Tmej, Anna; Deutsch, Tobias

Artificial Intelligence, the big promise of the last millennium, has apparently made its way into our daily lives. Cell phones with speech control, evolutionary computing in data mining or power grids, optimized via neural network, show its applicability in industrial environments. The original expectation of true intelligence and thinking machines lies still ahead of us. Researchers are, however, optimistic as never before. This paper tries to compare the views, challenges and approaches of several disciplines: engineering, psychology, neuroscience, philosophy. It gives a short introduction to Psychoanalysis, discusses the term consciousness, social implications of intelligent machines, related theories, and expectations and shall serve as a starting point for first attempts of combining these diverse thoughts.
Monaural room acoustic parameters from music and speech.

PubMed

Kendrick, Paul; Cox, Trevor J; Li, Francis F; Zhang, Yonggang; Chambers, Jonathon A

2008-07-01

This paper compares two methods for extracting room acoustic parameters from reverberated speech and music. An approach which uses statistical machine learning, previously developed for speech, is extended to work with music. For speech, reverberation time estimations are within a perceptual difference limen of the true value. For music, virtually all early decay time estimations are within a difference limen of the true value. The estimation accuracy is not good enough in other cases due to differences between the simulated data set used to develop the empirical model and real rooms. The second method carries out a maximum likelihood estimation on decay phases at the end of notes or speech utterances. This paper extends the method to estimate parameters relating to the balance of early and late energies in the impulse response. For reverberation time and speech, the method provides estimations which are within the perceptual difference limen of the true value. For other parameters such as clarity, the estimations are not sufficiently accurate due to the natural reverberance of the excitation signals. Speech is a better test signal than music because of the greater periods of silence in the signal, although music is needed for low frequency measurement.
Text messaging as a strategy to address the limits of audio-based communication during mass-gathering events with high ambient noise.

PubMed

Lund, Adam; Wong, Daniel; Lewis, Kerrie; Turris, Sheila A; Vaisler, Sean; Gutman, Samuel

2013-02-01

The provision of medical care in environments with high levels of ambient noise (HLAN), such as concerts or sporting events, presents unique communication challenges. Audio transmissions can be incomprehensible to the receivers. Text-based communications may be a valuable primary and/or secondary means of communication in this type of setting. To evaluate the usability of text-based communications in parallel with standard two-way radio communications during mass-gathering (MG) events in the context of HLAN. This Canadian study used outcome survey methods to evaluate the performance of communication devices during MG events. Ten standard commercially available handheld smart phones loaded with basic voice and data plans were assigned to health care providers (HCPs) for use as an adjunct to the medical team's typical radio-based communication. Common text messaging and chat platforms were trialed. Both efficacy and provider satisfaction were evaluated. During a 23-month period, the smart phones were deployed at 17 events with HLAN for a total of 40 event days or approximately 460 hours of active use. Survey responses from health care providers (177) and dispatchers (26) were analyzed. The response rate was unknown due to the method of recruitment. Of the 155 HCP responses to the question measuring difficulty of communication in environments with HLAN, 68.4% agreed that they "occasionally" or "frequently" found it difficult to clearly understand voice communications via two-way radio. Similarly, of the 23 dispatcher responses to the same item, 65.2% of the responses indicated that "occasionally" or "frequently" HLAN negatively affected the ability to communicate clearly with team members. Of the 168 HCP responses to the item assessing whether text-based communication improved the ability to understand and respond to calls when compared to radio alone, 86.3% "agreed" or "strongly agreed" that this was the case. The dispatcher responses (n = 21) to the same item also
"Rate My Therapist": Automated Detection of Empathy in Drug and Alcohol Counseling via Speech and Language Processing.

PubMed

Xiao, Bo; Imel, Zac E; Georgiou, Panayiotis G; Atkins, David C; Narayanan, Shrikanth S

2015-01-01

The technology for evaluating patient-provider interactions in psychotherapy-observational coding-has not changed in 70 years. It is labor-intensive, error prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks including ASR (1200 transcripts from heterogeneous psychotherapy sessions and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally-derived empathy ratings were evaluated against human ratings for each provider. Computationally-derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracies, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies.
Real-time magnetic resonance imaging and electromagnetic articulography database for speech production research (TC)

PubMed Central

Narayanan, Shrikanth; Toutios, Asterios; Ramanarayanan, Vikram; Lammert, Adam; Kim, Jangwon; Lee, Sungbok; Nayak, Krishna; Kim, Yoon-Chul; Zhu, Yinghua; Goldstein, Louis; Byrd, Dani; Bresch, Erik; Ghosh, Prasanta; Katsamanis, Athanasios; Proctor, Michael

2014-01-01

USC-TIMIT is an extensive database of multimodal speech production data, developed to complement existing resources available to the speech research community and with the intention of being continuously refined and augmented. The database currently includes real-time magnetic resonance imaging data from five male and five female speakers of American English. Electromagnetic articulography data have also been presently collected from four of these speakers. The two modalities were recorded in two independent sessions while the subjects produced the same 460 sentence corpus used previously in the MOCHA-TIMIT database. In both cases the audio signal was recorded and synchronized with the articulatory data. The database and companion software are freely available to the research community. PMID:25190403
Translation Analysis on Civil Engineering Text Produced by Machine Translator

NASA Astrophysics Data System (ADS)

Sutopo, Anam

2018-02-01

Translation is extremely needed in communication since people have serious problem in the language used. Translation activity is done by the person in charge for translating the material. Translation activity is also able to be done by machine. It is called machine translation, reflected in the programs developed by programmer. One of them is Transtool. Many people used Transtool for helping them in solving the problem related with translation activities. This paper wants to deliver how important is the Transtool program, how effective is Transtool program and how is the function of Transtool for human business. This study applies qualitative research. The sources of data were document and informant. This study used documentation and in dept-interviewing as the techniques for collecting data. The collected data were analyzed by using interactive analysis. The results of the study show that, first; Transtool program is helpful for people in translating the civil engineering text and it functions as the aid or helper, second; the working of Transtool software program is effective enough and third; the result of translation produced by Transtool is good for short and simple sentences and not readable, not understandable and not accurate for long sentences (compound, complex and compound complex) thought the result is informative. The translated material must be edited by the professional translator.
Word pair classification during imagined speech using direct brain recordings

NASA Astrophysics Data System (ADS)

Martin, Stephanie; Brunner, Peter; Iturrate, Iñaki; Millán, José Del R.; Schalk, Gerwin; Knight, Robert T.; Pasley, Brian N.

2016-05-01

People that cannot communicate due to neurological disorders would benefit from an internal speech decoder. Here, we showed the ability to classify individual words during imagined speech from electrocorticographic signals. In a word imagery task, we used high gamma (70-150 Hz) time features with a support vector machine model to classify individual words from a pair of words. To account for temporal irregularities during speech production, we introduced a non-linear time alignment into the SVM kernel. Classification accuracy reached 88% in a two-class classification framework (50% chance level), and average classification accuracy across fifteen word-pairs was significant across five subjects (mean = 58% p < 0.05). We also compared classification accuracy between imagined speech, overt speech and listening. As predicted, higher classification accuracy was obtained in the listening and overt speech conditions (mean = 89% and 86%, respectively; p < 0.0001), where speech stimuli were directly presented. The results provide evidence for a neural representation for imagined words in the temporal lobe, frontal lobe and sensorimotor cortex, consistent with previous findings in speech perception and production. These data represent a proof of concept study for basic decoding of speech imagery, and delineate a number of key challenges to usage of speech imagery neural representations for clinical applications.
Adaptation to spectrally-rotated speech.

PubMed

Green, Tim; Rosen, Stuart; Faulkner, Andrew; Paterson, Ruth

2013-08-01

Much recent interest surrounds listeners' abilities to adapt to various transformations that distort speech. An extreme example is spectral rotation, in which the spectrum of low-pass filtered speech is inverted around a center frequency (2 kHz here). Spectral shape and its dynamics are completely altered, rendering speech virtually unintelligible initially. However, intonation, rhythm, and contrasts in periodicity and aperiodicity are largely unaffected. Four normal hearing adults underwent 6 h of training with spectrally-rotated speech using Continuous Discourse Tracking. They and an untrained control group completed pre- and post-training speech perception tests, for which talkers differed from the training talker. Significantly improved recognition of spectrally-rotated sentences was observed for trained, but not untrained, participants. However, there were no significant improvements in the identification of medial vowels in /bVd/ syllables or intervocalic consonants. Additional tests were performed with speech materials manipulated so as to isolate the contribution of various speech features. These showed that preserving intonational contrasts did not contribute to the comprehension of spectrally-rotated speech after training, and suggested that improvements involved adaptation to altered spectral shape and dynamics, rather than just learning to focus on speech features relatively unaffected by the transformation.
Responding Effectively to Composition Students: Comparing Student Perceptions of Written and Audio Feedback

ERIC Educational Resources Information Center

Bilbro, J.; Iluzada, C.; Clark, D. E.

2013-01-01

The authors compared student perceptions of audio and written feedback in order to assess what types of students may benefit from receiving audio feedback on their essays rather than written feedback. Many instructors previously have reported the advantages they see in audio feedback, but little quantitative research has been done on how the…
Distracted While Reading? Changing to a Hard-to-Read Font Shields against the Effects of Environmental Noise and Speech on Text Memory

PubMed Central

Halin, Niklas

2016-01-01

The purpose of this study was to investigate the distractive effects of background speech, aircraft noise and road traffic noise on text memory and particularly to examine if displaying the texts in a hard-to-read font can shield against the detrimental effects of these types of background sounds. This issue was addressed in an experiment where 56 students read shorter texts about different classes of fictitious creatures (i.e., animals, fishes, birds, and dinosaurs) against a background of the aforementioned background sounds respectively and silence. For half of the participants the texts were displayed in an easy-to-read font (i.e., Times New Roman) and for the other half in a hard-to-read font (i.e., Haettenschweiler). The dependent measure was the proportion correct answers on the multiple-choice tests that followed each sound condition. Participants’ performance in the easy-to-read font condition was significantly impaired by all three background sound conditions compared to silence. In contrast, there were no effects of the three background sound conditions compared to silence in the hard-to-read font condition. These results suggest that an increase in task demand—by displaying the text in a hard-to-read font—shields against various types of distracting background sounds by promoting a more steadfast locus-of-attention and by reducing the processing of background sound. PMID:27555834
Could Audio-Described Films Benefit from Audio Introductions? An Audience Response Study

ERIC Educational Resources Information Center

Romero-Fresco, Pablo; Fryer, Louise

2013-01-01

Introduction: Time constraints limit the quantity and type of information conveyed in audio description (AD) for films, in particular the cinematic aspects. Inspired by introductory notes for theatre AD, this study developed audio introductions (AIs) for "Slumdog Millionaire" and "Man on Wire." Each AI comprised 10 minutes of…

Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.

PubMed

Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger

2011-01-01

The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.
Focal versus distributed temporal cortex activity for speech sound category assignment

PubMed Central

Bouton, Sophie; Chambon, Valérian; Tyrand, Rémi; Seeck, Margitta; Karkar, Sami; van de Ville, Dimitri; Giraud, Anne-Lise

2018-01-01

Percepts and words can be decoded from distributed neural activity measures. However, the existence of widespread representations might conflict with the more classical notions of hierarchical processing and efficient coding, which are especially relevant in speech processing. Using fMRI and magnetoencephalography during syllable identification, we show that sensory and decisional activity colocalize to a restricted part of the posterior superior temporal gyrus (pSTG). Next, using intracortical recordings, we demonstrate that early and focal neural activity in this region distinguishes correct from incorrect decisions and can be machine-decoded to classify syllables. Crucially, significant machine decoding was possible from neuronal activity sampled across different regions of the temporal and frontal lobes, despite weak or absent sensory or decision-related responses. These findings show that speech-sound categorization relies on an efficient readout of focal pSTG neural activity, while more distributed activity patterns, although classifiable by machine learning, instead reflect collateral processes of sensory perception and decision. PMID:29363598
Human phoneme recognition depending on speech-intrinsic variability.

PubMed

Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger

2010-11-01

The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).
An acoustic feature-based similarity scoring system for speech rehabilitation assistance.

PubMed

Syauqy, Dahnial; Wu, Chao-Min; Setyawati, Onny

2016-08-01

The purpose of this study is to develop a tool to assist speech therapy and rehabilitation, which focused on automatic scoring based on the comparison of the patient's speech with another normal speech on several aspects including pitch, vowel, voiced-unvoiced segments, strident fricative and sound intensity. The pitch estimation employed the use of cepstrum-based algorithm for its robustness; the vowel classification used multilayer perceptron (MLP) to classify vowel from pitch and formants; and the strident fricative detection was based on the major peak spectral intensity, location and the pitch existence in the segment. In order to evaluate the performance of the system, this study analyzed eight patient's speech recordings (four males, four females; 4-58-years-old), which had been recorded in previous study in cooperation with Taipei Veterans General Hospital and Taoyuan General Hospital. The experiment result on pitch algorithm showed that the cepstrum method had 5.3% of gross pitch error from a total of 2086 frames. On the vowel classification algorithm, MLP method provided 93% accuracy (men), 87% (women) and 84% (children). In total, the overall results showed that 156 tool's grading results (81%) were consistent compared to 192 audio and visual observations done by four experienced respondents. Implication for Rehabilitation Difficulties in communication may limit the ability of a person to transfer and exchange information. The fact that speech is one of the primary means of communication has encouraged the needs of speech diagnosis and rehabilitation. The advances of technology in computer-assisted speech therapy (CAST) improve the quality, time efficiency of the diagnosis and treatment of the disorders. The present study attempted to develop tool to assist speech therapy and rehabilitation, which provided simple interface to let the assessment be done even by the patient himself without the need of particular knowledge of speech processing while at the
Instructional Audio Guidelines: Four Design Principles to Consider for Every Instructional Audio Design Effort

ERIC Educational Resources Information Center

Carter, Curtis W.

2012-01-01

This article contends that instructional designers and developers should attend to four particular design principles when creating instructional audio. Support for this view is presented by referencing the limited research that has been done in this area, and by indicating how and why each of the four principles is important to the design process.…
What makes an automated teller machine usable by blind users?

PubMed

Manzke, J M; Egan, D H; Felix, D; Krueger, H

1998-07-01

Fifteen blind and sighted subjects, who featured as a control group for acceptance, were asked for their requirements for automated teller machines (ATMs). Both groups also tested the usability of a partially operational ATM mock-up. This machine was based on an existing cash dispenser, providing natural speech output, different function menus and different key arrangements. Performance and subjective evaluation data of blind and sighted subjects were collected. All blind subjects were able to operate the ATM successfully. The implemented speech output was the main usability factor for them. The different interface designs did not significantly affect performance and subjective evaluation. Nevertheless, design recommendations can be derived from the requirement assessment. The sighted subjects were rather open for design modifications, especially the implementation of speech output. However, there was also a mismatch of the requirements of the two subject groups, mainly concerning the key arrangement.
Digital Multicasting of Multiple Audio Streams

NASA Technical Reports Server (NTRS)

Macha, Mitchell; Bullock, John

2007-01-01

The Mission Control Center Voice Over Internet Protocol (MCC VOIP) system (see figure) comprises hardware and software that effect simultaneous, nearly real-time transmission of as many as 14 different audio streams to authorized listeners via the MCC intranet and/or the Internet. The original version of the MCC VOIP system was conceived to enable flight-support personnel located in offices outside a spacecraft mission control center to monitor audio loops within the mission control center. Different versions of the MCC VOIP system could be used for a variety of public and commercial purposes - for example, to enable members of the general public to monitor one or more NASA audio streams through their home computers, to enable air-traffic supervisors to monitor communication between airline pilots and air-traffic controllers in training, and to monitor conferences among brokers in a stock exchange. At the transmitting end, the audio-distribution process begins with feeding the audio signals to analog-to-digital converters. The resulting digital streams are sent through the MCC intranet, using a user datagram protocol (UDP), to a server that converts them to encrypted data packets. The encrypted data packets are then routed to the personal computers of authorized users by use of multicasting techniques. The total data-processing load on the portion of the system upstream of and including the encryption server is the total load imposed by all of the audio streams being encoded, regardless of the number of the listeners or the number of streams being monitored concurrently by the listeners. The personal computer of a user authorized to listen is equipped with special- purpose MCC audio-player software. When the user launches the program, the user is prompted to provide identification and a password. In one of two access- control provisions, the program is hard-coded to validate the user s identity and password against a list maintained on a domain-controller computer
An integrated analysis of speech and gestural characteristics in conversational child-computer interactions

NASA Astrophysics Data System (ADS)

Yildirim, Serdar; Montanari, Simona; Andersen, Elaine; Narayanan, Shrikanth S.

2003-10-01

Understanding the fine details of children's speech and gestural characteristics helps, among other things, in creating natural computer interfaces. We analyze the acoustic, lexical/non-lexical and spoken/gestural discourse characteristics of young children's speech using audio-video data gathered using a Wizard of Oz technique from 4 to 6 year old children engaged in resolving a series of age-appropriate cognitive challenges. Fundamental and formant frequencies exhibited greater variations between subjects consistent with previous results on read speech [Lee et al., J. Acoust. Soc. Am. 105, 1455-1468 (1999)]. Also, our analysis showed that, in a given bandwidth, phonemic information contained in the speech of young child is significantly less than that of older ones and adults. To enable an integrated analysis, a multi-track annotation board was constructed using the ANVIL tool kit [M. Kipp, Eurospeech 1367-1370 (2001)]. Along with speech transcriptions and acoustic analysis, non-lexical and discourse characteristics, and child's gesture (facial expressions, body movements, hand/head movements) were annotated in a synchronized multilayer system. Initial results showed that younger children rely more on gestures to emphasize their verbal assertions. Younger children use non-lexical speech (e.g., um, huh) associated with frustration and pondering/reflecting more frequently than older ones. Younger children also repair more with humans than with computer.
Supervised Machine Learning Algorithms Can Classify Open-Text Feedback of Doctor Performance With Human-Level Accuracy.

PubMed

Gibbons, Chris; Richards, Suzanne; Valderas, Jose Maria; Campbell, John

2017-03-15

Machine learning techniques may be an effective and efficient way to classify open-text reports on doctor's activity for the purposes of quality assurance, safety, and continuing professional development. The objective of the study was to evaluate the accuracy of machine learning algorithms trained to classify open-text reports of doctor performance and to assess the potential for classifications to identify significant differences in doctors' professional performance in the United Kingdom. We used 1636 open-text comments (34,283 words) relating to the performance of 548 doctors collected from a survey of clinicians' colleagues using the General Medical Council Colleague Questionnaire (GMC-CQ). We coded 77.75% (1272/1636) of the comments into 5 global themes (innovation, interpersonal skills, popularity, professionalism, and respect) using a qualitative framework. We trained 8 machine learning algorithms to classify comments and assessed their performance using several training samples. We evaluated doctor performance using the GMC-CQ and compared scores between doctors with different classifications using t tests. Individual algorithm performance was high (range F score=.68 to .83). Interrater agreement between the algorithms and the human coder was highest for codes relating to "popular" (recall=.97), "innovator" (recall=.98), and "respected" (recall=.87) codes and was lower for the "interpersonal" (recall=.80) and "professional" (recall=.82) codes. A 10-fold cross-validation demonstrated similar performance in each analysis. When combined together into an ensemble of multiple algorithms, mean human-computer interrater agreement was .88. Comments that were classified as "respected," "professional," and "interpersonal" related to higher doctor scores on the GMC-CQ compared with comments that were not classified (P<.05). Scores did not vary between doctors who were rated as popular or innovative and those who were not rated at all (P>.05). Machine learning
Supervised Machine Learning Algorithms Can Classify Open-Text Feedback of Doctor Performance With Human-Level Accuracy

PubMed Central

2017-01-01

Background Machine learning techniques may be an effective and efficient way to classify open-text reports on doctor’s activity for the purposes of quality assurance, safety, and continuing professional development. Objective The objective of the study was to evaluate the accuracy of machine learning algorithms trained to classify open-text reports of doctor performance and to assess the potential for classifications to identify significant differences in doctors’ professional performance in the United Kingdom. Methods We used 1636 open-text comments (34,283 words) relating to the performance of 548 doctors collected from a survey of clinicians’ colleagues using the General Medical Council Colleague Questionnaire (GMC-CQ). We coded 77.75% (1272/1636) of the comments into 5 global themes (innovation, interpersonal skills, popularity, professionalism, and respect) using a qualitative framework. We trained 8 machine learning algorithms to classify comments and assessed their performance using several training samples. We evaluated doctor performance using the GMC-CQ and compared scores between doctors with different classifications using t tests. Results Individual algorithm performance was high (range F score=.68 to .83). Interrater agreement between the algorithms and the human coder was highest for codes relating to “popular” (recall=.97), “innovator” (recall=.98), and “respected” (recall=.87) codes and was lower for the “interpersonal” (recall=.80) and “professional” (recall=.82) codes. A 10-fold cross-validation demonstrated similar performance in each analysis. When combined together into an ensemble of multiple algorithms, mean human-computer interrater agreement was .88. Comments that were classified as “respected,” “professional,” and “interpersonal” related to higher doctor scores on the GMC-CQ compared with comments that were not classified (P<.05). Scores did not vary between doctors who were rated as popular or
Effect of a Bluetooth-implemented hearing aid on speech recognition performance: subjective and objective measurement.

PubMed

Kim, Min-Beom; Chung, Won-Ho; Choi, Jeesun; Hong, Sung Hwa; Cho, Yang-Sun; Park, Gyuseok; Lee, Sangmin

2014-06-01

The object was to evaluate speech perception improvement through Bluetooth-implemented hearing aids in hearing-impaired adults. Thirty subjects with bilateral symmetric moderate sensorineural hearing loss participated in this study. A Bluetooth-implemented hearing aid was fitted unilaterally in all study subjects. Objective speech recognition score and subjective satisfaction were measured with a Bluetooth-implemented hearing aid to replace the acoustic connection from either a cellular phone or a loudspeaker system. In each system, participants were assigned to 4 conditions: wireless speech signal transmission into hearing aid (wireless mode) in quiet or noisy environment and conventional speech signal transmission using external microphone of hearing aid (conventional mode) in quiet or noisy environment. Also, participants completed questionnaires to investigate subjective satisfaction. Both cellular phone and loudspeaker system situation, participants showed improvements in sentence and word recognition scores with wireless mode compared to conventional mode in both quiet and noise conditions (P < .001). Participants also reported subjective improvements, including better sound quality, less noise interference, and better accuracy naturalness, when using the wireless mode (P < .001). Bluetooth-implemented hearing aids helped to improve subjective and objective speech recognition performances in quiet and noisy environments during the use of electronic audio devices.
Audiomotor Perceptual Training Enhances Speech Intelligibility in Background Noise.

PubMed

Whitton, Jonathon P; Hancock, Kenneth E; Shannon, Jeffrey M; Polley, Daniel B

2017-11-06

Sensory and motor skills can be improved with training, but learning is often restricted to practice stimuli. As an exception, training on closed-loop (CL) sensorimotor interfaces, such as action video games and musical instruments, can impart a broad spectrum of perceptual benefits. Here we ask whether computerized CL auditory training can enhance speech understanding in levels of background noise that approximate a crowded restaurant. Elderly hearing-impaired subjects trained for 8 weeks on a CL game that, like a musical instrument, challenged them to monitor subtle deviations between predicted and actual auditory feedback as they moved their fingertip through a virtual soundscape. We performed our study as a randomized, double-blind, placebo-controlled trial by training other subjects in an auditory working-memory (WM) task. Subjects in both groups improved at their respective auditory tasks and reported comparable expectations for improved speech processing, thereby controlling for placebo effects. Whereas speech intelligibility was unchanged after WM training, subjects in the CL training group could correctly identify 25% more words in spoken sentences or digit sequences presented in high levels of background noise. Numerically, CL audiomotor training provided more than three times the benefit of our subjects' hearing aids for speech processing in noisy listening conditions. Gains in speech intelligibility could be predicted from gameplay accuracy and baseline inhibitory control. However, benefits did not persist in the absence of continuing practice. These studies employ stringent clinical standards to demonstrate that perceptual learning on a computerized audio game can transfer to "real-world" communication challenges. Copyright © 2017 Elsevier Ltd. All rights reserved.
The contrast between alveolar and velar stops with typical speech data: acoustic and articulatory analyses.

PubMed

Melo, Roberta Michelon; Mota, Helena Bolli; Berti, Larissa Cristina

2017-06-08

This study used acoustic and articulatory analyses to characterize the contrast between alveolar and velar stops with typical speech data, comparing the parameters (acoustic and articulatory) of adults and children with typical speech development. The sample consisted of 20 adults and 15 children with typical speech development. The analyzed corpus was organized through five repetitions of each target-word (/'kap ə/, /'tapə/, /'galo/ e /'daɾə/). These words were inserted into a carrier phrase and the participant was asked to name them spontaneously. Simultaneous audio and video data were recorded (tongue ultrasound images). The data was submitted to acoustic analyses (voice onset time; spectral peak and burst spectral moments; vowel/consonant transition and relative duration measures) and articulatory analyses (proportion of significant axes of the anterior and posterior tongue regions and description of tongue curves). Acoustic and articulatory parameters were effective to indicate the contrast between alveolar and velar stops, mainly in the adult group. Both speech analyses showed statistically significant differences between the two groups. The acoustic and articulatory parameters provided signals to characterize the phonic contrast of speech. One of the main findings in the comparison between adult and child speech was evidence of articulatory refinement/maturation even after the period of segment acquisition.
Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.

ERIC Educational Resources Information Center

Makhoul, John I.

The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. A fixed vocabulary of 55 syllables is used in the recognition system which contains 11 stops and fricatives and five tense vowels. The results of an experiment on speaker…
TEACHER'S GUIDE TO HIGH SCHOOL SPEECH.

ERIC Educational Resources Information Center

JENKINSON, EDWARD B., ED.

THIS GUIDE TO HIGH SCHOOL SPEECH FOCUSES ON SPEECH AS ORAL COMPOSITION, STRESSING THE IMPORTANCE OF CLEAR THINKING AND COMMUNICATION. THE PROPOSED 1-SEMESTER BASIC COURSE IN SPEECH ATTEMPTS TO IMPROVE THE STUDENT'S ABILITY TO COMPOSE AND DELIVER SPEECHES, TO THINK AND LISTEN CRITICALLY, AND TO UNDERSTAND THE SOCIAL FUNCTION OF SPEECH. IN ADDITION…
Social Robotics in Therapy of Apraxia of Speech

PubMed Central

Alonso-Martín, Fernando

2018-01-01

Apraxia of speech is a motor speech disorder in which messages from the brain to the mouth are disrupted, resulting in an inability for moving lips or tongue to the right place to pronounce sounds correctly. Current therapies for this condition involve a therapist that in one-on-one sessions conducts the exercises. Our aim is to work in the line of robotic therapies in which a robot is able to perform partially or autonomously a therapy session, endowing a social robot with the ability of assisting therapists in apraxia of speech rehabilitation exercises. Therefore, we integrate computer vision and machine learning techniques to detect the mouth pose of the user and, on top of that, our social robot performs autonomously the different steps of the therapy using multimodal interaction. PMID:29713440
Web Audio/Video Streaming Tool

NASA Technical Reports Server (NTRS)

Guruvadoo, Eranna K.

2003-01-01

In order to promote NASA-wide educational outreach program to educate and inform the public of space exploration, NASA, at Kennedy Space Center, is seeking efficient ways to add more contents to the web by streaming audio/video files. This project proposes a high level overview of a framework for the creation, management, and scheduling of audio/video assets over the web. To support short-term goals, the prototype of a web-based tool is designed and demonstrated to automate the process of streaming audio/video files. The tool provides web-enabled users interfaces to manage video assets, create publishable schedules of video assets for streaming, and schedule the streaming events. These operations are performed on user-defined and system-derived metadata of audio/video assets stored in a relational database while the assets reside on separate repository. The prototype tool is designed using ColdFusion 5.0.
An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hogden, J.

Hidden Markov models (HMM`s) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMM`s typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMM`s better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values aremore » constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMM`s, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMM`s. This will allow him to highlight the similarities and differences between HMM`s and the proposed technique.« less
Short-term variation of subglottal pressure for expressive purposes in singing and stage speech: a preliminary investigation.

PubMed

Sundberg, J; Elliot, N; Gramming, P; Nord, L

1993-09-01

According to previous investigations, subglottal pressure in singing is adapted not only to loudness but also to fundamental frequency. Here the significance of musical expression to subglottal pressure is analyzed in terms of alternations between stressed and unstressed bar positions. Esophageal pressure was recorded together with the audio signal in a male and a female professional singer using a paranasally introduced pressure transducer while the subjects performed vocal exercises. Also, the subjects gave examples of actors' speech by reading poetry aloud. The results show that subglottal pressure can be used for stressing the first beat in bars and also for increasing the sound level in voiced consonants in actor's speech.
366-AAA_audio

NASA Image and Video Library

1969-11-17

Apollo 12 Public Affairs Officer (PAO) Mission Commentary, November 17, 1969. This is an hour of audio covering communications occurring between 64 hours, 38 minutes into the mission, through 79 hours, 2 minutes which was on November 17, 1969, from 0300-17:09 CST. Transcript of attached audio is available at http://www.jsc.nasa.gov/history/mission_trans/AS12_PAO.PDF, on pages 207-224 of the 979-page document.

Word pair classification during imagined speech using direct brain recordings

PubMed Central

Martin, Stephanie; Brunner, Peter; Iturrate, Iñaki; Millán, José del R.; Schalk, Gerwin; Knight, Robert T.; Pasley, Brian N.

2016-01-01

People that cannot communicate due to neurological disorders would benefit from an internal speech decoder. Here, we showed the ability to classify individual words during imagined speech from electrocorticographic signals. In a word imagery task, we used high gamma (70–150 Hz) time features with a support vector machine model to classify individual words from a pair of words. To account for temporal irregularities during speech production, we introduced a non-linear time alignment into the SVM kernel. Classification accuracy reached 88% in a two-class classification framework (50% chance level), and average classification accuracy across fifteen word-pairs was significant across five subjects (mean = 58%; p < 0.05). We also compared classification accuracy between imagined speech, overt speech and listening. As predicted, higher classification accuracy was obtained in the listening and overt speech conditions (mean = 89% and 86%, respectively; p < 0.0001), where speech stimuli were directly presented. The results provide evidence for a neural representation for imagined words in the temporal lobe, frontal lobe and sensorimotor cortex, consistent with previous findings in speech perception and production. These data represent a proof of concept study for basic decoding of speech imagery, and delineate a number of key challenges to usage of speech imagery neural representations for clinical applications. PMID:27165452
Discriminative analysis of lip motion features for speaker identification and speech-reading.

PubMed

Cetingül, H Ertan; Yemez, Yücel; Erzin, Engin; Tekalp, A Murat

2006-10-01

There have been several studies that jointly use audio, lip intensity, and lip geometry information for speaker identification and speech-reading applications. This paper proposes using explicit lip motion information, instead of or in addition to lip intensity and/or geometry information, for speaker identification and speech-reading within a unified feature selection and discrimination analysis framework, and addresses two important issues: 1) Is using explicit lip motion information useful, and, 2) if so, what are the best lip motion features for these two applications? The best lip motion features for speaker identification are considered to be those that result in the highest discrimination of individual speakers in a population, whereas for speech-reading, the best features are those providing the highest phoneme/word/phrase recognition rate. Several lip motion feature candidates have been considered including dense motion features within a bounding box about the lip, lip contour motion features, and combination of these with lip shape features. Furthermore, a novel two-stage, spatial, and temporal discrimination analysis is introduced to select the best lip motion features for speaker identification and speech-reading applications. Experimental results using an hidden-Markov-model-based recognition system indicate that using explicit lip motion information provides additional performance gains in both applications, and lip motion features prove more valuable in the case of speech-reading application.
Effect of Cartoon Illustrations on the Comprehension and Evaluation of Information Presented in the Print and Audio Mode.

ERIC Educational Resources Information Center

Sewell, Edward H., Jr.

This study investigates the effects of cartoon illustrations on female and male college student comprehension and evaluation of information presented in several combinations of print, audio, and visual formats. Subjects were assigned to one of five treatment groups: printed text, printed text with cartoons, audiovisual presentations, audio only…
The Audio Description as a Physics Teaching Tool

ERIC Educational Resources Information Center

Cozendey, Sabrina; Costa, Maria da Piedade

2016-01-01

This study analyses the use of audio description in teaching physics concepts, aiming to determine the variables that influence the understanding of the concept. One education resource was audio described. For make the audio description the screen was freezing. The video with and without audio description should be presented to students, so that…
36 CFR 1002.12 - Audio disturbances.

Code of Federal Regulations, 2014 CFR

2014-07-01

... 36 Parks, Forests, and Public Property 3 2014-07-01 2014-07-01 false Audio disturbances. 1002.12... RECREATION § 1002.12 Audio disturbances. (a) The following are prohibited: (1) Operating motorized equipment or machinery such as an electric generating plant, motor vehicle, motorized toy, or an audio device...
36 CFR 1002.12 - Audio disturbances.

Code of Federal Regulations, 2012 CFR

2012-07-01

... 36 Parks, Forests, and Public Property 3 2012-07-01 2012-07-01 false Audio disturbances. 1002.12... RECREATION § 1002.12 Audio disturbances. (a) The following are prohibited: (1) Operating motorized equipment or machinery such as an electric generating plant, motor vehicle, motorized toy, or an audio device...
50 CFR 27.72 - Audio equipment.

Code of Federal Regulations, 2010 CFR

2010-10-01

... 50 Wildlife and Fisheries 6 2010-10-01 2010-10-01 false Audio equipment. 27.72 Section 27.72 Wildlife and Fisheries UNITED STATES FISH AND WILDLIFE SERVICE, DEPARTMENT OF THE INTERIOR (CONTINUED) THE... Audio equipment. The operation or use of audio devices including radios, recording and playback devices...
36 CFR 1002.12 - Audio disturbances.

Code of Federal Regulations, 2011 CFR

2011-07-01

... 36 Parks, Forests, and Public Property 3 2011-07-01 2011-07-01 false Audio disturbances. 1002.12... RECREATION § 1002.12 Audio disturbances. (a) The following are prohibited: (1) Operating motorized equipment or machinery such as an electric generating plant, motor vehicle, motorized toy, or an audio device...
36 CFR 1002.12 - Audio disturbances.

Code of Federal Regulations, 2010 CFR

2010-07-01

... 36 Parks, Forests, and Public Property 3 2010-07-01 2010-07-01 false Audio disturbances. 1002.12... RECREATION § 1002.12 Audio disturbances. (a) The following are prohibited: (1) Operating motorized equipment or machinery such as an electric generating plant, motor vehicle, motorized toy, or an audio device...
50 CFR 27.72 - Audio equipment.

Code of Federal Regulations, 2011 CFR

2011-10-01

... 50 Wildlife and Fisheries 8 2011-10-01 2011-10-01 false Audio equipment. 27.72 Section 27.72 Wildlife and Fisheries UNITED STATES FISH AND WILDLIFE SERVICE, DEPARTMENT OF THE INTERIOR (CONTINUED) THE... Audio equipment. The operation or use of audio devices including radios, recording and playback devices...
50 CFR 27.72 - Audio equipment.

Code of Federal Regulations, 2012 CFR

2012-10-01

... 50 Wildlife and Fisheries 9 2012-10-01 2012-10-01 false Audio equipment. 27.72 Section 27.72 Wildlife and Fisheries UNITED STATES FISH AND WILDLIFE SERVICE, DEPARTMENT OF THE INTERIOR (CONTINUED) THE... Audio equipment. The operation or use of audio devices including radios, recording and playback devices...
Developing the Alphabetic Principle to Aid Text-Based Augmentative and Alternative Communication Use by Adults With Low Speech Intelligibility and Intellectual Disabilities.

PubMed

Schmidt-Naylor, Anna C; Saunders, Kathryn J; Brady, Nancy C

2017-05-17

We explored alphabet supplementation as an augmentative and alternative communication strategy for adults with minimal literacy. Study 1's goal was to teach onset-letter selection with spoken words and assess generalization to untaught words, demonstrating the alphabetic principle. Study 2 incorporated alphabet supplementation within a naming task and then assessed effects on speech intelligibility. Three men with intellectual disabilities (ID) and low speech intelligibility participated. Study 1 used a multiple-probe design, across three 20-word sets, to show that our computer-based training improved onset-letter selection. We also probed generalization to untrained words. Study 2 taught onset-letter selection for 30 new words chosen for functionality. Five listeners transcribed speech samples of the 30 words in 2 conditions: speech only and speech with alphabet supplementation. Across studies 1 and 2, participants demonstrated onset-letter selection for at least 90 words. Study 1 showed evidence of the alphabetic principle for some but not all word sets. In study 2, participants readily used alphabet supplementation, enabling listeners to understand twice as many words. This is the first demonstration of alphabet supplementation in individuals with ID and minimal literacy. The large number of words learned holds promise both for improving communication and providing a foundation for improved literacy.
Developing the Alphabetic Principle to Aid Text-Based Augmentative and Alternative Communication Use by Adults With Low Speech Intelligibility and Intellectual Disabilities

PubMed Central

Schmidt-Naylor, Anna C.; Brady, Nancy C.

2017-01-01

Purpose We explored alphabet supplementation as an augmentative and alternative communication strategy for adults with minimal literacy. Study 1's goal was to teach onset-letter selection with spoken words and assess generalization to untaught words, demonstrating the alphabetic principle. Study 2 incorporated alphabet supplementation within a naming task and then assessed effects on speech intelligibility. Method Three men with intellectual disabilities (ID) and low speech intelligibility participated. Study 1 used a multiple-probe design, across three 20-word sets, to show that our computer-based training improved onset-letter selection. We also probed generalization to untrained words. Study 2 taught onset-letter selection for 30 new words chosen for functionality. Five listeners transcribed speech samples of the 30 words in 2 conditions: speech only and speech with alphabet supplementation. Results Across studies 1 and 2, participants demonstrated onset-letter selection for at least 90 words. Study 1 showed evidence of the alphabetic principle for some but not all word sets. In study 2, participants readily used alphabet supplementation, enabling listeners to understand twice as many words. Conclusions This is the first demonstration of alphabet supplementation in individuals with ID and minimal literacy. The large number of words learned holds promise both for improving communication and providing a foundation for improved literacy. PMID:28474087
"Listen to This!" Utilizing Audio Recordings to Improve Instructor Feedback on Writing in Mathematics

ERIC Educational Resources Information Center

Weld, Christopher

2014-01-01

Providing audio files in lieu of written remarks on graded assignments is arguably a more effective means of feedback, allowing students to better process and understand the critique and improve their future work. With emerging technologies and software, this audio feedback alternative to the traditional paradigm of providing written comments…
The Auditory-Brainstem Response to Continuous, Non-repetitive Speech Is Modulated by the Speech Envelope and Reflects Speech Processing

PubMed Central

Reichenbach, Chagit S.; Braiman, Chananel; Schiff, Nicholas D.; Hudspeth, A. J.; Reichenbach, Tobias

2016-01-01

The auditory-brainstem response (ABR) to short and simple acoustical signals is an important clinical tool used to diagnose the integrity of the brainstem. The ABR is also employed to investigate the auditory brainstem in a multitude of tasks related to hearing, such as processing speech or selectively focusing on one speaker in a noisy environment. Such research measures the response of the brainstem to short speech signals such as vowels or words. Because the voltage signal of the ABR has a tiny amplitude, several hundred to a thousand repetitions of the acoustic signal are needed to obtain a reliable response. The large number of repetitions poses a challenge to assessing cognitive functions due to neural adaptation. Here we show that continuous, non-repetitive speech, lasting several minutes, may be employed to measure the ABR. Because the speech is not repeated during the experiment, the precise temporal form of the ABR cannot be determined. We show, however, that important structural features of the ABR can nevertheless be inferred. In particular, the brainstem responds at the fundamental frequency of the speech signal, and this response is modulated by the envelope of the voiced parts of speech. We accordingly introduce a novel measure that assesses the ABR as modulated by the speech envelope, at the fundamental frequency of speech and at the characteristic latency of the response. This measure has a high signal-to-noise ratio and can hence be employed effectively to measure the ABR to continuous speech. We use this novel measure to show that the ABR is weaker to intelligible speech than to unintelligible, time-reversed speech. The methods presented here can be employed for further research on speech processing in the auditory brainstem and can lead to the development of future clinical diagnosis of brainstem function. PMID:27303286
Speech Recognition of Bimodal Cochlear Implant Recipients Using a Wireless Audio Streaming Accessory for the Telephone.

PubMed

Wolfe, Jace; Morais, Mila; Schafer, Erin

2016-02-01

The goals of the present investigation were (1) to evaluate recognition of recorded speech presented over a mobile telephone for a group of adult bimodal cochlear implant users, and (2) to measure the potential benefits of wireless hearing assistance technology (HAT) for mobile telephone speech recognition using bimodal stimulation (i.e., a cochlear implant in one ear and a hearing aid on the other ear). A three-by-two-way repeated measures design was used to evaluate mobile telephone sentence-recognition performance differences obtained in quiet and in noise with and without the wireless HAT accessory coupled to the hearing aid alone, CI sound processor alone, and in the bimodal condition. Outpatient cochlear implant clinic. Sixteen bimodal users with Nucleus 24, Freedom, CI512, or CI422 cochlear implants participated in this study. Performance was measured with and without the use of a wireless HAT for the telephone used with the hearing aid alone, CI alone, and bimodal condition. CNC word recognition in quiet and in noise with and without the use of a wireless HAT telephone accessory in the hearing aid alone, CI alone, and bimodal conditions. Results suggested that the bimodal condition gave significantly better speech recognition on the mobile telephone with the wireless HAT. A wireless HAT for the mobile telephone provides bimodal users with significant improvement in word recognition in quiet and in noise over the mobile telephone.
36 CFR 2.12 - Audio disturbances.

Code of Federal Regulations, 2012 CFR

2012-07-01

... 36 Parks, Forests, and Public Property 1 2012-07-01 2012-07-01 false Audio disturbances. 2.12... RESOURCE PROTECTION, PUBLIC USE AND RECREATION § 2.12 Audio disturbances. (a) The following are prohibited..., motorized toy, or an audio device, such as a radio, television set, tape deck or musical instrument, in a...
36 CFR 2.12 - Audio disturbances.

Code of Federal Regulations, 2010 CFR

2010-07-01

... 36 Parks, Forests, and Public Property 1 2010-07-01 2010-07-01 false Audio disturbances. 2.12... RESOURCE PROTECTION, PUBLIC USE AND RECREATION § 2.12 Audio disturbances. (a) The following are prohibited..., motorized toy, or an audio device, such as a radio, television set, tape deck or musical instrument, in a...
36 CFR 2.12 - Audio disturbances.

Code of Federal Regulations, 2013 CFR

2013-07-01

... 36 Parks, Forests, and Public Property 1 2013-07-01 2013-07-01 false Audio disturbances. 2.12... RESOURCE PROTECTION, PUBLIC USE AND RECREATION § 2.12 Audio disturbances. (a) The following are prohibited..., motorized toy, or an audio device, such as a radio, television set, tape deck or musical instrument, in a...
36 CFR 2.12 - Audio disturbances.

Code of Federal Regulations, 2014 CFR

2014-07-01

... 36 Parks, Forests, and Public Property 1 2014-07-01 2014-07-01 false Audio disturbances. 2.12... RESOURCE PROTECTION, PUBLIC USE AND RECREATION § 2.12 Audio disturbances. (a) The following are prohibited..., motorized toy, or an audio device, such as a radio, television set, tape deck or musical instrument, in a...

Some links on this page may take you to non-federal websites. Their policies may differ from this site.