Nursing acceptance of a speech-input interface: a preliminary investigation.
Dillon, T W; McDowell, D; Norcio, A F; DeHaemer, M J
1994-01-01
Many new technologies are being developed to improve the efficiency and productivity of nursing staffs. User acceptance is a key to the success of these technologies. In this article, the authors present a discussion of nursing acceptance of computer systems, review the basic design issues for creating a speech-input interface, and report preliminary findings of a study of nursing acceptance of a prototype speech-input interface. Results of the study showed that the 19 nursing subjects expressed acceptance of the prototype speech-input interface.
Natural Language Based Multimodal Interface for UAV Mission Planning
NASA Technical Reports Server (NTRS)
Chandarana, Meghan; Meszaros, Erica L.; Trujillo, Anna; Allen, B. Danette
2017-01-01
As the number of viable applications for unmanned aerial vehicle (UAV) systems increases at an exponential rate, interfaces that reduce the reliance on highly skilled engineers and pilots must be developed. Recent work aims to make use of common human communication modalities such as speech and gesture. This paper explores a multimodal natural language interface that uses a combination of speech and gesture input modalities to build complex UAV flight paths by defining trajectory segment primitives. Gesture inputs are used to define the general shape of a segment while speech inputs provide additional geometric information needed to fully characterize a trajectory segment. A user study is conducted in order to evaluate the efficacy of the multimodal interface.
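To make the division of labor concrete, the sketch below illustrates in Python how a gesture-derived segment shape and speech-derived geometric parameters could be combined into trajectory segment primitives and expanded into waypoints. The shape names, parameter fields, and step size are illustrative assumptions, not the interface described above.

```python
# Illustrative sketch (not the authors' implementation): combining a gesture-derived
# segment shape with speech-derived geometry to build a UAV flight path from
# trajectory-segment primitives. Shape names and parameters are assumptions.
from dataclasses import dataclass
import math

@dataclass
class Segment:
    shape: str    # e.g. "line", "arc" (from gesture recognition)
    params: dict  # e.g. {"length": 30.0} or {"radius": 10.0, "angle": 90.0} (from speech)

def segment_waypoints(start, heading, seg, step=1.0):
    """Expand one segment primitive into a list of (x, y) waypoints."""
    x, y = start
    pts = []
    if seg.shape == "line":
        n = max(1, int(seg.params["length"] / step))
        for i in range(1, n + 1):
            pts.append((x + math.cos(heading) * step * i,
                        y + math.sin(heading) * step * i))
        return pts, heading
    if seg.shape == "arc":
        r = seg.params["radius"]
        sweep = math.radians(seg.params["angle"])
        cx, cy = x - r * math.sin(heading), y + r * math.cos(heading)  # center to the left
        n = max(1, int(abs(sweep) * r / step))
        for i in range(1, n + 1):
            a = heading - math.pi / 2 + sweep * i / n
            pts.append((cx + r * math.cos(a), cy + r * math.sin(a)))
        return pts, heading + sweep
    raise ValueError(f"unknown shape {seg.shape}")

# Example: gesture says "line" then "arc"; speech supplies the numbers.
path, heading = [(0.0, 0.0)], 0.0
for seg in [Segment("line", {"length": 30.0}), Segment("arc", {"radius": 10.0, "angle": 90.0})]:
    pts, heading = segment_waypoints(path[-1], heading, seg)
    path.extend(pts)
```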
Fels, S S; Hinton, G E
1997-01-01
Glove-Talk II is a system which translates hand gestures to speech through an adaptive interface. Hand gestures are mapped continuously to ten control parameters of a parallel formant speech synthesizer. The mapping allows the hand to act as an artificial vocal tract that produces speech in real time. This gives an unlimited vocabulary in addition to direct control of fundamental frequency and volume. Currently, the best version of Glove-Talk II uses several input devices, a parallel formant speech synthesizer, and three neural networks. The gesture-to-speech task is divided into vowel and consonant production by using a gating network to weight the outputs of a vowel and a consonant neural network. The gating network and the consonant network are trained with examples from the user. The vowel network implements a fixed user-defined relationship between hand position and vowel sound and does not require any training examples from the user. Volume, fundamental frequency, and stop consonants are produced with a fixed mapping from the input devices. With Glove-Talk II, the subject can speak slowly but with far more natural sounding pitch variations than a text-to-speech synthesizer.
Voice Response Systems Technology.
ERIC Educational Resources Information Center
Gerald, Jeanette
1984-01-01
Examines two methods of generating synthetic speech in voice response systems, which allow computers to communicate in human terms (speech), using human interface devices (ears): phoneme and reconstructed voice systems. Considerations prior to implementation, current and potential applications, glossary, directory, and introduction to Input Output…
A video, text, and speech-driven realistic 3-d virtual head for human-machine interface.
Yu, Jun; Wang, Zeng-Fu
2015-05-01
A multiple inputs-driven realistic facial animation system based on a 3-D virtual head for human-machine interface is proposed. The system can be driven independently by video, text, and speech, and thus can interact with humans through diverse interfaces. The combination of a parameterized model and a muscular model is used to obtain a tradeoff between computational efficiency and high realism of 3-D facial animation. The online appearance model is used to track 3-D facial motion from video in the framework of particle filtering, and multiple measurements, i.e., the pixel color values of the input image and the Gabor wavelet coefficients of the illumination ratio image, are fused to reduce the influence of lighting and person dependence in the construction of the online appearance model. The tri-phone model is used to reduce the computational cost of visual co-articulation in speech-synchronized viseme synthesis without sacrificing performance. Objective and subjective experiments show that the system is suitable for human-machine interaction.
Fels, S S; Hinton, G E
1998-01-01
Glove-TalkII is a system which translates hand gestures to speech through an adaptive interface. Hand gestures are mapped continuously to ten control parameters of a parallel formant speech synthesizer. The mapping allows the hand to act as an artificial vocal tract that produces speech in real time. This gives an unlimited vocabulary in addition to direct control of fundamental frequency and volume. Currently, the best version of Glove-TalkII uses several input devices (including a Cyberglove, a ContactGlove, a three-space tracker, and a foot pedal), a parallel formant speech synthesizer, and three neural networks. The gesture-to-speech task is divided into vowel and consonant production by using a gating network to weight the outputs of a vowel and a consonant neural network. The gating network and the consonant network are trained with examples from the user. The vowel network implements a fixed user-defined relationship between hand position and vowel sound and does not require any training examples from the user. Volume, fundamental frequency, and stop consonants are produced with a fixed mapping from the input devices. One subject has trained to speak intelligibly with Glove-TalkII. He speaks slowly but with far more natural sounding pitch variations than a text-to-speech synthesizer.
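The gating arrangement described above can be pictured with a small numerical sketch: a gating network produces a scalar weight that blends the outputs of separate vowel and consonant networks into the synthesizer control vector. The layer sizes, random weights, and feature dimension below are placeholders, not the Glove-TalkII networks.

```python
# Minimal sketch of the gating idea described above (not the Glove-TalkII code):
# a gating network blends the outputs of a vowel network and a consonant network
# to produce the 10 formant-synthesizer control parameters. Shapes are assumed.
import numpy as np

rng = np.random.default_rng(0)

def mlp(weights, x):
    """One-hidden-layer network: tanh hidden units, linear output."""
    w1, w2 = weights
    return np.tanh(x @ w1) @ w2

d_in, d_hid, d_out = 16, 32, 10          # hand-feature dim, hidden units, synth params
vowel_net = (rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, d_out)))
cons_net  = (rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, d_out)))
gate_net  = (rng.normal(size=(d_in, d_hid)), rng.normal(size=(d_hid, 1)))

def controls(hand_features):
    """Blend vowel and consonant outputs with a scalar gate in [0, 1]."""
    g = 1.0 / (1.0 + np.exp(-mlp(gate_net, hand_features)))   # sigmoid gate
    return g * mlp(vowel_net, hand_features) + (1.0 - g) * mlp(cons_net, hand_features)

print(controls(rng.normal(size=d_in)).shape)   # (10,) control parameters per frame
```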
Multimodal interfaces with voice and gesture input
DOE Office of Scientific and Technical Information (OSTI.GOV)
Milota, A.D.; Blattner, M.M.
1995-07-20
The modalities of speech and gesture have different strengths and weaknesses, but combined they create synergy where each modality corrects the weaknesses of the other. We believe that a multimodal system such as one intertwining speech and gesture must start from a different foundation than ones which are based solely on pen input. In order to provide a basis for the design of a speech and gesture system, we have examined the research in other disciplines such as anthropology and linguistics. The result of this investigation was a taxonomy that gave us material for the incorporation of gestures whose meanings are largely transparent to the users. This study describes the taxonomy and gives examples of applications to pen input systems.
Intentional Voice Command Detection for Trigger-Free Speech Interface
NASA Astrophysics Data System (ADS)
Obuchi, Yasunari; Sumiyoshi, Takashi
In this paper we introduce a new framework of audio processing, which is essential to achieve a trigger-free speech interface for home appliances. If the speech interface works continually in real environments, it must extract occasional voice commands and reject everything else. It is extremely important to reduce the number of false alarms because the number of irrelevant inputs is much larger than the number of voice commands even for heavy users of appliances. The framework, called Intentional Voice Command Detection, is based on voice activity detection, but enhanced by various speech/audio processing techniques such as emotion recognition. The effectiveness of the proposed framework is evaluated using a newly-collected large-scale corpus. The advantages of combining various features were tested and confirmed, and the simple LDA-based classifier demonstrated acceptable performance. The effectiveness of various methods of user adaptation is also discussed.
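As a rough illustration of the detection stage, the hedged sketch below trains a linear discriminant analysis (LDA) classifier on synthetic utterance-level features and then sets the decision threshold on the non-command class to keep false alarms rare, mirroring the emphasis above on rejecting irrelevant inputs. The feature set, dimensions, and threshold are assumptions.

```python
# Hedged sketch of the classification stage described above: an LDA classifier over
# concatenated utterance-level features decides "intentional command" vs "other audio".
# Feature names and dimensions are illustrative, not the paper's actual feature set.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Stand-in features per utterance: e.g. VAD statistics, prosodic/"emotion" features, SNR.
X_command = rng.normal(loc=+0.5, size=(200, 12))   # intentional voice commands
X_other   = rng.normal(loc=-0.5, size=(2000, 12))  # everything else (far more frequent)
X = np.vstack([X_command, X_other])
y = np.concatenate([np.ones(200), np.zeros(2000)])

clf = LinearDiscriminantAnalysis().fit(X, y)

# Bias the decision toward rejecting non-commands to keep false alarms low.
scores = clf.decision_function(X)
threshold = np.quantile(scores[y == 0], 0.99)      # accept only the top 1% of non-command scores
accepted = scores > threshold
print("false-alarm rate:", accepted[y == 0].mean(), "hit rate:", accepted[y == 1].mean())
```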
Multimodal Interaction with Speech, Gestures and Haptic Feedback in a Media Center Application
NASA Astrophysics Data System (ADS)
Turunen, Markku; Hakulinen, Jaakko; Hella, Juho; Rajaniemi, Juha-Pekka; Melto, Aleksi; Mäkinen, Erno; Rantala, Jussi; Heimonen, Tomi; Laivo, Tuuli; Soronen, Hannu; Hansen, Mervi; Valkama, Pellervo; Miettinen, Toni; Raisamo, Roope
We demonstrate interaction with a multimodal media center application. The mobile phone-based interface includes speech and gesture input and haptic feedback. The setup resembles our long-term public pilot study, in which a living room environment containing the application was constructed inside a local media museum, allowing visitors to freely test the system.
Rojas, Mario; Ponce, Pedro; Molina, Arturo
2016-08-01
This paper presents the evaluation, under standardized metrics, of alternative input methods to steer and maneuver a semi-autonomous electric wheelchair. The Human-Machine Interface (HMI), which includes a virtual joystick, head movements and speech recognition controls, was designed to facilitate mobility skills for severely disabled people. Thirteen tasks, common to all wheelchair users, were attempted five times each by disabled and non-disabled people, controlling the wheelchair with the virtual joystick and the hands-free interfaces in different areas. Even though the prototype has an intelligent navigation control based on fuzzy logic and ultrasonic sensors, the evaluation was done without assistance. The scores showed that the head-movement and virtual-joystick controls have similar capabilities, 92.3% and 100%, respectively. However, the 54.6% capacity score obtained for the speech control interface indicates the need for navigation assistance to accomplish some of the goals. Furthermore, the evaluation times indicate which skills require more user training with the interface and which specifications would improve the overall performance of the wheelchair.
Van Ackeren, Markus Johannes; Barbero, Francesca M; Mattioni, Stefania; Bottini, Roberto
2018-01-01
The occipital cortex of early blind individuals (EB) activates during speech processing, challenging the notion of a hard-wired neurobiology of language. But, at what stage of speech processing do occipital regions participate in EB? Here we demonstrate that parieto-occipital regions in EB enhance their synchronization to acoustic fluctuations in human speech in the theta-range (corresponding to syllabic rate), irrespective of speech intelligibility. Crucially, enhanced synchronization to the intelligibility of speech was selectively observed in primary visual cortex in EB, suggesting that this region is at the interface between speech perception and comprehension. Moreover, EB showed overall enhanced functional connectivity between temporal and occipital cortices that are sensitive to speech intelligibility and altered directionality when compared to the sighted group. These findings suggest that the occipital cortex of the blind adopts an architecture that allows the tracking of speech material, and therefore does not fully abstract from the reorganized sensory inputs it receives. PMID:29338838
Real time speech formant analyzer and display
Holland, George E.; Struve, Walter S.; Homer, John F.
1987-01-01
A speech analyzer for interpretation of sound includes a sound input which converts the sound into a signal representing the sound. The signal is passed through a plurality of frequency pass filters to derive a plurality of frequency formants. These formants are converted to voltage signals by frequency-to-voltage converters and then are prepared for visual display in continuous real time. Parameters from the inputted sound are also derived and displayed. The display may then be interpreted by the user. The preferred embodiment includes a microprocessor which is interfaced with a television set for displaying of the sound formants. The microprocessor software enables the sound analyzer to present a variety of display modes for interpretive and therapeutic use by the user.
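A software analogue of the filter-bank idea in this patent can be sketched as follows: band-pass filters isolate formant regions and an envelope detector plays the role of the frequency-to-voltage converters. The band edges, filter order, and synthetic test signal are assumptions for illustration only.

```python
# Illustrative digital analogue of the analyzer described above (not the patented circuit):
# a bank of band-pass filters followed by envelope detection, giving one energy trace per
# formant band. Band edges and filter order are assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 16000
bands = [(250, 900), (900, 2500), (2500, 3500)]   # rough F1/F2/F3 regions

def formant_envelopes(signal):
    """Return one smoothed envelope per band, sampled at the audio rate."""
    envs = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, signal)
        envs.append(np.abs(hilbert(band)))        # analytic-signal envelope
    return np.stack(envs)

# Example with a synthetic vowel-like signal (formants near 500, 1500, 2800 Hz).
t = np.arange(0, 0.5, 1 / fs)
x = sum(np.sin(2 * np.pi * f * t) for f in (500, 1500, 2800))
print(formant_envelopes(x).shape)   # (3, n_samples)
```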
Real time speech formant analyzer and display
Holland, G.E.; Struve, W.S.; Homer, J.F.
1987-02-03
A speech analyzer for interpretation of sound includes a sound input which converts the sound into a signal representing the sound. The signal is passed through a plurality of frequency pass filters to derive a plurality of frequency formants. These formants are converted to voltage signals by frequency-to-voltage converters and then are prepared for visual display in continuous real time. Parameters from the inputted sound are also derived and displayed. The display may then be interpreted by the user. The preferred embodiment includes a microprocessor which is interfaced with a television set for displaying of the sound formants. The microprocessor software enables the sound analyzer to present a variety of display modes for interpretive and therapeutic use by the user. 19 figs.
Using Natural Language to Enable Mission Managers to Control Multiple Heterogeneous UAVs
NASA Technical Reports Server (NTRS)
Trujillo, Anna C.; Puig-Navarro, Javier; Mehdi, S. Bilal; Mcquarry, A. Kyle
2016-01-01
The availability of highly capable, yet relatively cheap, unmanned aerial vehicles (UAVs) is opening up new areas of use for hobbyists and for commercial activities. This research is developing methods beyond classical control-stick pilot inputs, to allow operators to manage complex missions without in-depth vehicle expertise. These missions may entail several heterogeneous UAVs flying coordinated patterns or flying multiple trajectories deconflicted in time or space to predefined locations. This paper describes the functionality and preliminary usability measures of an interface that allows an operator to define a mission using speech inputs. With a defined and simple vocabulary, operators can input the vast majority of mission parameters using simple, intuitive voice commands. Although the operator interface is simple, it is based upon autonomous algorithms that allow the mission to proceed with minimal input from the operator. This paper also describes these underlying algorithms that allow an operator to manage several UAVs.
Dietrich, Susanne; Hertrich, Ingo; Ackermann, Hermann
2015-01-01
In many functional magnetic resonance imaging (fMRI) studies blind humans were found to show cross-modal reorganization engaging the visual system in non-visual tasks. For example, blind people can manage to understand (synthetic) spoken language at very high speaking rates up to ca. 20 syllables/s (syl/s). FMRI data showed that hemodynamic activation within right-hemispheric primary visual cortex (V1), bilateral pulvinar (Pv), and left-hemispheric supplementary motor area (pre-SMA) covaried with their capability of ultra-fast speech (16 syllables/s) comprehension. It has been suggested that right V1 plays an important role with respect to the perception of ultra-fast speech features, particularly the detection of syllable onsets. Furthermore, left pre-SMA seems to be an interface between these syllabic representations and the frontal speech processing and working memory network. So far, little is known about the networks linking V1 to Pv, auditory cortex (A1), and (mesio-) frontal areas. Dynamic causal modeling (DCM) was applied to investigate (i) the input structure from A1 and Pv toward right V1 and (ii) output from right V1 and A1 to left pre-SMA. Regarding the input, Pv was significantly connected to V1, in addition to A1, in blind participants, but not in sighted controls. Regarding the output, V1 was significantly connected to pre-SMA in blind individuals, and the strength of V1-SMA connectivity correlated with the performance of ultra-fast speech comprehension. By contrast, in sighted controls, who did not understand ultra-fast speech, pre-SMA received input from neither A1 nor V1. Taken together, right V1 might facilitate the “parsing” of the ultra-fast speech stream in blind subjects by receiving subcortical auditory input via the Pv (= secondary visual pathway) and transmitting this information toward contralateral pre-SMA. PMID:26148062
Design and Evaluation of Fusion Approach for Combining Brain and Gaze Inputs for Target Selection
Évain, Andéol; Argelaguet, Ferran; Casiez, Géry; Roussel, Nicolas; Lécuyer, Anatole
2016-01-01
Gaze-based interfaces and Brain-Computer Interfaces (BCIs) allow for hands-free human–computer interaction. In this paper, we investigate the combination of gaze and BCIs. We propose a novel selection technique for 2D target acquisition based on input fusion. This new approach combines the probabilistic models for each input, in order to better estimate the intent of the user. We evaluated its performance against the existing gaze and brain–computer interaction techniques. Twelve participants took part in our study, in which they had to search and select 2D targets with each of the evaluated techniques. Our fusion-based hybrid interaction technique was found to be more reliable than the previous gaze and BCI hybrid interaction techniques for 10 out of 12 participants, while being 29% faster on average. However, similarly to what has been observed in hybrid gaze-and-speech interaction, the gaze-only interaction technique still provides the best performance. Our results should encourage the use of input fusion, as opposed to sequential interaction, in order to design better hybrid interfaces. PMID:27774048
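The fusion principle described above can be reduced to a few lines: if each modality yields a probability over candidate targets, a naive-Bayes-style fusion multiplies them and renormalizes. The example probabilities below are invented to show how an ambiguous gaze estimate can be disambiguated by weak BCI evidence; this is an illustration of the idea, not the authors' model.

```python
# Minimal sketch of probabilistic input fusion for 2-D target selection (illustration only):
# each modality provides a probability over candidate targets; the fused estimate is their
# normalized product, assuming conditional independence of the two inputs.
import numpy as np

def fuse(p_gaze, p_bci):
    """Combine per-target probabilities from gaze and BCI into one distribution."""
    p = np.asarray(p_gaze) * np.asarray(p_bci)
    return p / p.sum()

# Example: gaze is ambiguous between targets 1 and 2, BCI weakly favours target 2.
p_gaze = [0.05, 0.45, 0.45, 0.05]
p_bci  = [0.20, 0.25, 0.35, 0.20]
print(fuse(p_gaze, p_bci))   # target 2 becomes the clear winner
```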
Speech perception at the interface of neurobiology and linguistics.
Poeppel, David; Idsardi, William J; van Wassenhove, Virginie
2008-03-12
Speech perception consists of a set of computations that take continuously varying acoustic waveforms as input and generate discrete representations that make contact with the lexical representations stored in long-term memory as output. Because the perceptual objects that are recognized by speech perception enter into subsequent linguistic computation, the format that is used for lexical representation and processing fundamentally constrains the speech perceptual processes. Consequently, theories of speech perception must, at some level, be tightly linked to theories of lexical representation. Minimally, speech perception must yield representations that smoothly and rapidly interface with stored lexical items. Adopting the perspective of Marr, we argue and provide neurobiological and psychophysical evidence for the following research programme. First, at the implementational level, speech perception is a multi-time resolution process, with perceptual analyses occurring concurrently on at least two time scales (approx. 20-80 ms, approx. 150-300 ms), commensurate with (sub)segmental and syllabic analyses, respectively. Second, at the algorithmic level, we suggest that perception proceeds on the basis of internal forward models, or uses an 'analysis-by-synthesis' approach. Third, at the computational level (in the sense of Marr), the theory of lexical representation that we adopt is principally informed by phonological research and assumes that words are represented in the mental lexicon in terms of sequences of discrete segments composed of distinctive features. One important goal of the research programme is to develop linking hypotheses between putative neurobiological primitives (e.g. temporal primitives) and those primitives derived from linguistic inquiry, to arrive ultimately at a biologically sensible and theoretically satisfying model of representation and computation in speech.
Stated Preferences for Components of a Personal Guidance System for Nonvisual Navigation
ERIC Educational Resources Information Center
Golledge, Reginald G.; Marston, James R.; Loomis, Jack M.; Klatzky, Roberta L.
2004-01-01
This article reports on a survey of the preferences of visually impaired persons for a possible personal navigation device. The results showed that the majority of participants preferred speech input and output interfaces, were willing to use such a product, thought that they would make more trips with such a device, and had some concerns about…
Speech-based E-mail and driver behavior: effects of an in-vehicle message system interface.
Jamson, A Hamish; Westerman, Stephen J; Hockey, G Robert J; Carsten, Oliver M J
2004-01-01
As mobile office technology becomes more advanced, drivers have increased opportunity to process information "on the move." Although speech-based interfaces can minimize direct interference with driving, the cognitive demands associated with such systems may still cause distraction. We studied the effects on driving performance of an in-vehicle simulated "E-mail" message system; E-mails were either system controlled or driver controlled. A high-fidelity, fixed-base driving simulator was used to test 19 participants on a car-following task in virtual traffic scenarios varying in driving demand. Drivers compensated for the secondary task by adopting longer headways but showed reduced anticipation of braking requirements and shorter time to collision. Drivers were also less reactive when processing E-mails, demonstrated by a reduction in steering wheel inputs. In most circumstances, there were advantages in providing drivers with control over when E-mails were opened. However, during periods without E-mail interaction in demanding traffic scenarios, drivers showed reduced braking anticipation. This may be a result of increased cognitive costs associated with the decision-making process when using a driver-controlled interface, when the task of scheduling E-mail acceptance is added to those of driving and E-mail response. Actual or potential applications of this research include the design of speech-based in-vehicle messaging systems.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
These proceedings discuss human factor issues related to aerospace systems, aging, communications, computer systems, consumer products, education and forensic topics, environmental design, industrial ergonomics, international technology transfer, organizational design and management, personality and individual differences in human performance, safety, system development, test and evaluation, training, and visual performance. Particular attention is given to HUDs, attitude indicators, and sensor displays; human factors of space exploration; behavior and aging; the design and evaluation of phone-based interfaces; knowledge acquisition and expert systems; handwriting, speech, and other input techniques; interface design for text, numerics, and speech; and human factor issues in medicine. Also discussed are cumulative trauma disorders, industrial safety, evaluative techniques for automation impacts on the human operators, visual issues in training, and interpreting and organizing human factor concepts and information.
Automatic Speech Recognition in Air Traffic Control: a Human Factors Perspective
NASA Technical Reports Server (NTRS)
Karlsson, Joakim
1990-01-01
The introduction of Automatic Speech Recognition (ASR) technology into the Air Traffic Control (ATC) system has the potential to improve overall safety and efficiency. However, because ASR technology is inherently a part of the man-machine interface between the user and the system, the human factors issues involved must be addressed. Here, some of the human factors problems are identified and related methods of investigation are presented. Research at M.I.T.'s Flight Transportation Laboratory is being conducted from a human factors perspective, focusing on intelligent parser design, presentation of feedback, error correction strategy design, and optimal choice of input modalities.
Speech-recognition interfaces for music information retrieval
NASA Astrophysics Data System (ADS)
Goto, Masataka
2005-09-01
This paper describes two hands-free music information retrieval (MIR) systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. Our MIR-based jukebox systems employ two different speech-recognition interfaces for MIR, speech completion and speech spotter, which exploit intentionally controlled nonverbal speech information in original ways. The first is a music retrieval system with the speech-completion interface that is suitable for music stores and car-driving situations. When a user only remembers part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface that can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces. (Video clips: http://staff.aist.go.jp/m.goto/MIR/speech-if.html)
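The speech-completion behaviour can be caricatured with a toy matcher that returns catalogue entries containing a recognized fragment; the catalogue, matching rule, and result limit below are purely illustrative and ignore the recognition step itself.

```python
# Toy sketch of the "speech completion" behaviour described above: given a recognized
# fragment of a title or artist name, offer full catalogue entries that contain it.
catalogue = [
    "Bohemian Rhapsody - Queen",
    "Rhapsody in Blue - George Gershwin",
    "Clair de Lune - Claude Debussy",
]

def complete(fragment: str, entries=catalogue, limit=5):
    """Return catalogue entries containing the spoken fragment (case-insensitive)."""
    frag = fragment.lower()
    return [e for e in entries if frag in e.lower()][:limit]

print(complete("rhapsody"))   # both Rhapsody entries are offered as completions
```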
Decoding Speech With Integrated Hybrid Signals Recorded From the Human Ventral Motor Cortex.
Ibayashi, Kenji; Kunii, Naoto; Matsuo, Takeshi; Ishishita, Yohei; Shimada, Seijiro; Kawai, Kensuke; Saito, Nobuhito
2018-01-01
Restoration of speech communication for locked-in patients by means of brain computer interfaces (BCIs) is currently an important area of active research. Among the neural signals obtained from intracranial recordings, single/multi-unit activity (SUA/MUA), local field potential (LFP), and electrocorticography (ECoG) are good candidates for an input signal for BCIs. However, the question of which signal or which combination of the three signal modalities is best suited for decoding speech production remains unverified. In order to record SUA, LFP, and ECoG simultaneously from a highly localized area of human ventral sensorimotor cortex (vSMC), we fabricated an electrode the size of which was 7 by 13 mm containing sparsely arranged microneedle and conventional macro contacts. We determined which signal modality is the most capable of decoding speech production, and tested if the combination of these signals could improve the decoding accuracy of spoken phonemes. Feature vectors were constructed from spike frequency obtained from SUAs and event-related spectral perturbation derived from ECoG and LFP signals, then input to the decoder. The results showed that the decoding accuracy for five spoken vowels was highest when features from multiple signals were combined and optimized for each subject, and reached 59% when averaged across all six subjects. This result suggests that multi-scale signals convey complementary information for speech articulation. The current study demonstrated that simultaneous recording of multi-scale neuronal activities could raise decoding accuracy even though the recording area is limited to a small portion of cortex, which is advantageous for future implementation of speech-assisting BCIs.
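The feature-combination step reported above can be sketched schematically: spike-rate features and spectral-perturbation features are concatenated and fed to one classifier, and cross-validated accuracy is compared against either modality alone. The data, dimensions, and classifier below are synthetic stand-ins, not the study's pipeline.

```python
# Hedged sketch of the feature-combination idea described above (not the study's pipeline):
# concatenate spike-rate features (SUA) with spectral-perturbation features (LFP/ECoG)
# and train one classifier on the combined vector. Data are synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_trials, n_vowels = 300, 5
labels = rng.integers(0, n_vowels, size=n_trials)

spike_rates = rng.poisson(5, size=(n_trials, 20)) + labels[:, None]        # SUA features
ersp        = rng.normal(size=(n_trials, 64)) + 0.3 * labels[:, None]      # ECoG/LFP features

for name, X in [("SUA only", spike_rates),
                ("ECoG/LFP only", ersp),
                ("combined", np.hstack([spike_rates, ersp]))]:
    acc = cross_val_score(SVC(kernel="linear"), X, labels, cv=5).mean()
    print(f"{name}: {acc:.2f}")
```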
Projection Mapping User Interface for Disabled People.
Gelšvartas, Julius; Simutis, Rimvydas; Maskeliūnas, Rytis
2018-01-01
Difficulty in communicating is one of the key challenges for people suffering from severe motor and speech disabilities. Often such a person can communicate and interact with the environment only using assistive technologies. This paper presents a multifunctional user interface designed to improve communication efficiency and personal independence. The main component of this interface is a projection mapping technique used to highlight objects in the environment. Projection mapping makes it possible to create a natural augmented reality information presentation method. The user interface combines a depth sensor and a projector to create a camera-projector system. We provide a detailed description of the camera-projector system calibration procedure. The described system performs tabletop object detection and automatic projection mapping. Multiple user input modalities have been integrated into the multifunctional user interface. Such a system can be adapted to the needs of people with various disabilities.
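One ingredient of camera-projector calibration for a tabletop setup can be illustrated with a planar homography: given correspondences between projected calibration points and their detected camera positions, detected object locations can be mapped into projector coordinates for highlighting. The point coordinates and resolution below are assumptions, and the full calibration procedure in the paper is more involved.

```python
# Simplified sketch of one piece of camera-projector calibration (not the paper's full
# procedure): estimate a planar homography from camera pixels to projector pixels, then
# use it to place highlights on detected tabletop objects.
import numpy as np
import cv2

# Assumed correspondences: points seen in the camera and the projector pixels that produced them.
camera_pts = np.array([[100, 120], [520, 130], [510, 400], [110, 390]], dtype=np.float32)
projector_pts = np.array([[0, 0], [1280, 0], [1280, 800], [0, 800]], dtype=np.float32)

H, _ = cv2.findHomography(camera_pts, projector_pts, cv2.RANSAC)

def to_projector(cam_xy):
    """Map a detected object location (camera pixels) into projector pixels for highlighting."""
    p = cv2.perspectiveTransform(np.array([[cam_xy]], dtype=np.float32), H)
    return p[0, 0]

print(to_projector((300, 260)))   # where to draw the highlight for an object seen here
```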
Automated speech understanding: the next generation
NASA Astrophysics Data System (ADS)
Picone, J.; Ebel, W. J.; Deshmukh, N.
1995-04-01
Modern speech understanding systems merge interdisciplinary technologies from Signal Processing, Pattern Recognition, Natural Language, and Linguistics into a unified statistical framework. These systems, which have applications in a wide range of signal processing problems, represent a revolution in Digital Signal Processing (DSP). Once a field dominated by vector-oriented processors and linear algebra-based mathematics, the current generation of DSP-based systems relies on sophisticated statistical models implemented using a complex software paradigm. Such systems are now capable of understanding continuous speech input for vocabularies of several thousand words in operational environments. The current generation of deployed systems, based on small vocabularies of isolated words, will soon be replaced by a new technology offering natural language access to vast information resources such as the Internet, and providing completely automated voice interfaces for mundane tasks such as travel planning and directory assistance.
Development of Speech Input/Output Interfaces for Tactical Aircraft
1983-07-01
Strieb, I.; Canyon Research Group, Inc., 741 Lakefield Road, Suite B, Westlake Village, CA 91361. July 1983. Report for period: December 1981 - July 1983. Approved for public release; distribution unlimited.
Boster, Jamie B; McCarthy, John W
2018-05-01
The purpose of this study was to gain insight from speech-language pathologists (SLPs) and parents of children with autism spectrum disorder (ASD) regarding appealing features of augmentative and alternative communication (AAC) applications. Two separate 1-hour focus groups were conducted with 8 SLPs and 5 parents of children with ASD to identify appealing design features of AAC apps, their benefits and potential concerns. Participants were shown novel interface designs for communication mode, play mode and incentive systems. Participants responded to poll questions and provided benefits and drawbacks of the features as part of a structured discussion. SLPs and parents identified a range of appealing features in communication mode (customization, animation and colour-coding) as well as in play mode (games and videos). SLPs preferred interfaces that supported motor planning and instruction, while parents preferred features such as character assistants that would appeal to their child. Overall, SLPs and parents agreed on features for future AAC apps. SLPs and parents have valuable input regarding future AAC app design, informed by their experiences with children with ASD. Both groups are key stakeholders in the design process and should be included in future design and research endeavors. Implications for Rehabilitation: AAC applications for the iPad are often designed based on previous devices without consideration of new features. Ensuring that the design of new interfaces is appealing and beneficial for children with ASD can potentially further support their communication. This study demonstrates how key stakeholders in AAC, including speech-language pathologists and parents, can provide information to support the development of future AAC interface designs. Key stakeholders may be an untapped resource in the development of future AAC interfaces for children with ASD.
Recognizing speech under a processing load: dissociating energetic from informational factors.
Mattys, Sven L; Brooks, Joanna; Cooke, Martin
2009-11-01
Effects of perceptual and cognitive loads on spoken-word recognition have so far largely escaped investigation. This study lays the foundations of a psycholinguistic approach to speech recognition in adverse conditions that draws upon the distinction between energetic masking, i.e., listening environments leading to signal degradation, and informational masking, i.e., listening environments leading to depletion of higher-order, domain-general processing resources, independent of signal degradation. We show that severe energetic masking, such as that produced by background speech or noise, curtails reliance on lexical-semantic knowledge and increases relative reliance on salient acoustic detail. In contrast, informational masking, induced by a resource-depleting competing task (divided attention or a memory load), results in the opposite pattern. Based on this clear dissociation, we propose a model of speech recognition that addresses not only the mapping between sensory input and lexical representations, as traditionally advocated, but also the way in which this mapping interfaces with general cognition and non-linguistic processes.
Real-time classification of auditory sentences using evoked cortical activity in humans
NASA Astrophysics Data System (ADS)
Moses, David A.; Leonard, Matthew K.; Chang, Edward F.
2018-06-01
Objective. Recent research has characterized the anatomical and functional basis of speech perception in the human auditory cortex. These advances have made it possible to decode speech information from activity in brain regions like the superior temporal gyrus, but no published work has demonstrated this ability in real-time, which is necessary for neuroprosthetic brain-computer interfaces. Approach. Here, we introduce a real-time neural speech recognition (rtNSR) software package, which was used to classify spoken input from high-resolution electrocorticography signals in real-time. We tested the system with two human subjects implanted with electrode arrays over the lateral brain surface. Subjects listened to multiple repetitions of ten sentences, and rtNSR classified what was heard in real-time from neural activity patterns using direct sentence-level and HMM-based phoneme-level classification schemes. Main results. We observed single-trial sentence classification accuracies of 90% or higher for each subject with less than 7 minutes of training data, demonstrating the ability of rtNSR to use cortical recordings to perform accurate real-time speech decoding in a limited vocabulary setting. Significance. Further development and testing of the package with different speech paradigms could influence the design of future speech neuroprosthetic applications.
The role of voice input for human-machine communication.
Cohen, P R; Oviatt, S L
1995-01-01
Optimism is growing that the near future will witness rapid growth in human-computer interaction using voice. System prototypes have recently been built that demonstrate speaker-independent real-time speech recognition, and understanding of naturally spoken utterances with vocabularies of 1000 to 2000 words, and larger. Already, computer manufacturers are building speech recognition subsystems into their new product lines. However, before this technology can be broadly useful, a substantial knowledge base is needed about human spoken language and performance during computer-based spoken interaction. This paper reviews application areas in which spoken interaction can play a significant role, assesses potential benefits of spoken interaction with machines, and compares voice with other modalities of human-computer interaction. It also discusses information that will be needed to build a firm empirical foundation for the design of future spoken and multimodal interfaces. Finally, it argues for a more systematic and scientific approach to investigating spoken input and performance with future language technology. PMID:7479803
Visual and Auditory Input in Second-Language Speech Processing
ERIC Educational Resources Information Center
Hardison, Debra M.
2010-01-01
The majority of studies in second-language (L2) speech processing have involved unimodal (i.e., auditory) input; however, in many instances, speech communication involves both visual and auditory sources of information. Some researchers have argued that multimodal speech is the primary mode of speech perception (e.g., Rosenblum 2005). Research on…
Speech versus manual control of camera functions during a telerobotic task
NASA Technical Reports Server (NTRS)
Bierschwale, John M.; Sampaio, Carlos E.; Stuart, Mark A.; Smith, Randy L.
1989-01-01
Voice input for control of camera functions was investigated in this study. Objectives were to (1) assess the feasibility of a voice-commanded camera control system, and (2) identify factors that differ between voice and manual control of camera functions. Subjects participated in a remote manipulation task that required extensive camera-aided viewing. Each subject was exposed to two conditions, voice and manual input, with a counterbalanced administration order. Voice input was found to be significantly slower than manual input for this task. However, in terms of remote manipulator performance errors and subject preference, there was no difference between modalities. Voice control of continuous camera functions is not recommended. It is believed that the use of voice input for discrete functions, such as multiplexing or camera switching, could aid performance. Hybrid mixes of voice and manual input may provide the best use of both modalities. This report contributes to a better understanding of the issues that affect the design of an efficient human/telerobot interface.
ERIC Educational Resources Information Center
Ramírez-Esparza, Nairán; García-Sierra, Adrián; Kuhl, Patricia K.
2014-01-01
Language input is necessary for language learning, yet little is known about whether, in natural environments, the speech style and social context of language input to children impacts language development. In the present study we investigated the relationship between language input and language development, examining both the style of parental…
Studies in automatic speech recognition and its application in aerospace
NASA Astrophysics Data System (ADS)
Taylor, Michael Robinson
Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.
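Since the recognition strategies above rest on Dynamic Time Warping, a textbook DTW distance and nearest-template digit recognizer are sketched below; the random "MFCC" templates are placeholders, not the thesis data.

```python
# Minimal Dynamic Time Warping sketch (textbook formulation, not the thesis code):
# DTW distance between feature sequences, used for nearest-template word recognition.
import numpy as np

def dtw_distance(a, b):
    """DTW cost between two feature sequences (frames x dims), Euclidean local cost."""
    a, b = np.atleast_2d(a), np.atleast_2d(b)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Recognition = nearest template: compare an utterance against stored digit templates.
templates = {"one": np.random.rand(40, 12), "two": np.random.rand(55, 12)}  # e.g. MFCC frames
utterance = np.random.rand(47, 12)
print(min(templates, key=lambda w: dtw_distance(utterance, templates[w])))
```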
Mapping the cortical representation of speech sounds in a syllable repetition task.
Markiewicz, Christopher J; Bohland, Jason W
2016-11-01
Speech repetition relies on a series of distributed cortical representations and functional pathways. A speaker must map auditory representations of incoming sounds onto learned speech items, maintain an accurate representation of those items in short-term memory, interface that representation with the motor output system, and fluently articulate the target sequence. A "dorsal stream" consisting of posterior temporal, inferior parietal and premotor regions is thought to mediate auditory-motor representations and transformations, but the nature and activation of these representations for different portions of speech repetition tasks remains unclear. Here we mapped the correlates of phonetic and/or phonological information related to the specific phonemes and syllables that were heard, remembered, and produced using a series of cortical searchlight multi-voxel pattern analyses trained on estimates of BOLD responses from individual trials. Based on responses linked to input events (auditory syllable presentation), predictive vowel-level information was found in the left inferior frontal sulcus, while syllable prediction revealed significant clusters in the left ventral premotor cortex and central sulcus and the left mid superior temporal sulcus. Responses linked to output events (the GO signal cueing overt production) revealed strong clusters of vowel-related information bilaterally in the mid to posterior superior temporal sulcus. For the prediction of onset and coda consonants, input-linked responses yielded distributed clusters in the superior temporal cortices, which were further informative for classifiers trained on output-linked responses. Output-linked responses in the Rolandic cortex made strong predictions for the syllables and consonants produced, but their predictive power was reduced for vowels. The results of this study provide a systematic survey of how cortical response patterns covary with the identity of speech sounds, which will help to constrain and guide theoretical models of speech perception, speech production, and phonological working memory.
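The searchlight logic can be summarized in a schematic loop: for every voxel, a classifier is trained on the response patterns of its spatial neighbours and the cross-validated accuracy is stored at that location. Everything below (coordinates, labels, radius, classifier) is synthetic and illustrative rather than the study's actual pipeline.

```python
# Schematic searchlight sketch (illustrative, not the study's pipeline): for each voxel,
# classify stimulus identity from the response pattern in its local neighbourhood and
# store the cross-validated accuracy there. Data are synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n_trials, n_voxels = 120, 500
coords = rng.uniform(0, 50, size=(n_voxels, 3))          # voxel coordinates in mm
labels = rng.integers(0, 5, size=n_trials)                # e.g. syllable identity per trial
data = rng.normal(size=(n_trials, n_voxels))
data[:, :30] += 0.8 * labels[:, None]                     # plant an "informative" patch

radius = 6.0
accuracy_map = np.zeros(n_voxels)
for v in range(n_voxels):
    neighbours = np.where(np.linalg.norm(coords - coords[v], axis=1) <= radius)[0]
    accuracy_map[v] = cross_val_score(SVC(kernel="linear"),
                                      data[:, neighbours], labels, cv=4).mean()
print("peak searchlight accuracy:", accuracy_map.max())
```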
Designing a Humane Multimedia Interface for the Visually Impaired.
ERIC Educational Resources Information Center
Ghaoui, Claude; Mann, M.; Ng, Eng Huat
2001-01-01
Promotes the provision of interfaces that allow users to access most of the functionality of existing graphical user interfaces (GUI) using speech. Uses the design of a speech control tool that incorporates speech recognition and synthesis into existing packaged software such as Teletext, the Internet, or a word processor. (Contains 22…
A Human Machine Interface for EVA
NASA Astrophysics Data System (ADS)
Hartmann, L.
EVA astronauts work in a challenging environment that includes high rate of muscle fatigue, haptic and proprioception impairment, lack of dexterity and interaction with robotic equipment. Currently they are heavily dependent on support from on-board crew and ground station staff for information and robotics operation. They are limited to the operation of simple controls on the suit exterior and external robot controls that are difficult to operate because of the heavy gloves that are part of the EVA suit. A wearable human machine interface (HMI) inside the suit provides a powerful alternative for robot teleoperation, procedure checklist access, generic equipment operation via virtual control panels and general information retrieval and presentation. The HMI proposed here includes speech input and output, a simple 6 degree of freedom (dof) pointing device and a heads up display (HUD). The essential characteristic of this interface is that it offers an alternative to the standard keyboard and mouse interface of a desktop computer. The astronaut's speech is used as input to command mode changes, execute arbitrary computer commands and generate text. The HMI can respond with speech also in order to confirm selections, provide status and feedback and present text output. A candidate 6 dof pointing device is Measurand's Shapetape, a flexible "tape" substrate to which is attached an optic fiber with embedded sensors. Measurement of the modulation of the light passing through the fiber can be used to compute the shape of the tape and, in particular, the position and orientation of the end of the Shapetape. It can be used to provide any kind of 3d geometric information including robot teleoperation control. The HUD can overlay graphical information onto the astronaut's visual field including robot joint torques, end effector configuration, procedure checklists and virtual control panels. With suitable tracking information about the position and orientation of the EVA suit, the overlaid graphical information can be registered with the external world. For example, information about an object can be positioned on or beside the object. This wearable HMI supports many applications during EVA including robot teleoperation, procedure checklist usage, operation of virtual control panels and general information or documentation retrieval and presentation. Whether the robot end effector is a mobile platform for the EVA astronaut or is an assistant to the astronaut in an assembly or repair task, the astronaut can control the robot via a direct manipulation interface. Embedded in the suit or the astronaut's clothing, Shapetape can measure the user's arm/hand position and orientation which can be directly mapped into the workspace coordinate system of the robot. Motion of the user's hand can generate corresponding motion of the robot end effector in order to reposition the EVA platform or to manipulate objects in the robot's grasp. Speech input can be used to execute commands and mode changes without the astronaut having to withdraw from the teleoperation task. Speech output from the system can provide feedback without affecting the user's visual attention. The procedure checklist guiding the astronaut's detailed activities can be presented on the HUD and manipulated (e.g., move, scale, annotate, mark tasks as done, consult prerequisite tasks) by spoken command.
Virtual control panels for suit equipment, equipment being repaired or arbitrary equipment on the space station can be displayed on the HUD and can be operated by speech commands or by hand gestures. For example, an antenna being repaired could be pointed under the control of the EVA astronaut. Additionally, arbitrary computer activities such as information retrieval and presentation can be carried out using similar interface techniques. Considering the risks, expense and physical challenges of EVA work, it is appropriate that EVA astronauts have considerable support from station crew and ground station staff. Reducing their dependence on such personnel may, however, improve performance and reduce risk under many circumstances. For example, the EVA astronaut is likely to have the best viewpoint at a robotic worksite. Direct access to the procedure checklist can help provide temporal context and continuity throughout an EVA. Access to station facilities through an HMI such as the one described here could be invaluable during an emergency or in a situation in which a fault occurs. The full paper will describe the HMI operation and applications in the EVA context in more detail and will describe current laboratory prototyping activities.
Perceptual Learning of Noise Vocoded Words: Effects of Feedback and Lexicality
ERIC Educational Resources Information Center
Hervais-Adelman, Alexis; Davis, Matthew H.; Johnsrude, Ingrid S.; Carlyon, Robert P.
2008-01-01
Speech comprehension is resistant to acoustic distortion in the input, reflecting listeners' ability to adjust perceptual processes to match the speech input. This adjustment is reflected in improved comprehension of distorted speech with experience. For noise vocoding, a manipulation that removes spectral detail from speech, listeners' word…
NASA Technical Reports Server (NTRS)
Jones, Denise R.
1990-01-01
A piloted simulation study was conducted comparing three different input methods for interfacing to a large-screen, multiwindow, whole-flight-deck display for management of transport aircraft systems. The thumball concept utilized a miniature trackball embedded in a conventional side-arm controller. The touch screen concept provided data entry through a capacitive touch screen. The voice concept utilized a speech recognition system with input through a head-worn microphone. No single input concept emerged as the most desirable method of interacting with the display. Subjective results, however, indicate that the voice concept was the most preferred method of data entry and had the most potential for future applications. The objective results indicate that, overall, the touch screen concept was the most effective input method. There were also significant differences in the time required to perform specific tasks depending on the input concept employed, with each concept providing better performance for a specific task. These results suggest that a system combining all three input concepts might provide the most effective method of interaction.
Truong, Son Ngoc; Ham, Seok-Jin; Min, Kyeong-Sik
2014-01-01
In this paper, a neuromorphic crossbar circuit with binary memristors is proposed for speech recognition. Binary memristors, which are based on a filamentary-switching mechanism, are more widely available and easier to fabricate than analog memristors, which require rarer materials and a more complicated fabrication process. Thus, we develop a neuromorphic crossbar circuit using filamentary-switching binary memristors rather than interface-switching analog memristors. The proposed binary memristor crossbar can recognize five vowels with 4-bit, 64 input channels. The proposed crossbar is tested with 2,500 speech samples and verified to recognize 89.2% of them. Statistical simulation estimates that the recognition rate of the binary memristor crossbar degrades only slightly, from 89.2% to 80%, as the percentage variation in memristance increases from 0% to 15%. In contrast, the analog memristor crossbar loses recognition rate dramatically, from 96% to 9%, for the same percentage variation in memristance.
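The crossbar readout described above can be approximated in software: each output column stores a binary conductance pattern, the 64-channel binary input drives the rows, and the column with the largest summed current selects the vowel. The random weight pattern and the Gaussian conductance-variation model below are simplifying assumptions.

```python
# Conceptual sketch of a binary-memristor crossbar classifier (an illustration of the idea,
# not the fabricated circuit): binary weights per column, winner-take-all current readout,
# with an optional per-device conductance variation.
import numpy as np

rng = np.random.default_rng(4)
n_inputs, n_vowels = 64, 5

# Binary conductance pattern per output column (high = 1, low = 0).
weights = rng.integers(0, 2, size=(n_inputs, n_vowels)).astype(float)

def recognize(x_binary, variation=0.0):
    """Winner-take-all readout; `variation` models per-device conductance spread."""
    g = weights * (1.0 + variation * rng.normal(size=weights.shape))
    currents = x_binary @ g           # column currents = dot products
    return int(np.argmax(currents))

x = rng.integers(0, 2, size=n_inputs).astype(float)
print(recognize(x), recognize(x, variation=0.15))   # ideally the same class despite variation
```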
A study of speech interfaces for the vehicle environment.
DOT National Transportation Integrated Search
2013-05-01
Over the past few years, there has been a shift in automotive human machine interfaces from visual-manual interactions (pushing buttons and rotating knobs) to speech interaction. In terms of distraction, the industry views speech interaction as a...
Using Natural Language to Enhance Mission Effectiveness
NASA Technical Reports Server (NTRS)
Trujillo, Anna C.; Meszaros, Erica
2016-01-01
The availability of highly capable, yet relatively cheap, unmanned aerial vehicles (UAVs) is opening up new areas of use for hobbyists and for professional-related activities. The driving function of this research is allowing a non-UAV pilot, an operator, to define and manage a mission. This paper describes the preliminary usability measures of an interface that allows an operator to define the mission using speech to make inputs. An experiment was conducted to begin to enumerate the efficacy and user acceptance of using voice commands to define a multi-UAV mission and to provide high-level vehicle control commands such as "takeoff." The primary independent variable was input type - voice or mouse. The primary dependent variables consisted of the correctness of the mission parameter inputs and the time needed to make all inputs. Other dependent variables included NASA-TLX workload ratings and subjective ratings on a final questionnaire. The experiment required each subject to fill in an online form that contained comparable required information that would be needed for a package dispatcher to deliver packages. For each run, subjects typed in a simple numeric code for the package code. They then defined the initial starting position, the delivery location, and the return location using either pull-down menus or voice input. Voice input was accomplished using CMU Sphinx4-5prealpha for speech recognition. They then inputted the length of the package. These were the option fields. The subject had the system "Calculate Trajectory" and then "Takeoff" once the trajectory was calculated. Later, the subject used "Land" to finish the run. After the voice and mouse input blocked runs, subjects completed a NASA-TLX. At the conclusion of all runs, subjects completed a questionnaire asking them about their experience in inputting the mission parameters, and starting and stopping the mission using mouse and voice input. In general, the usability of voice commands is acceptable. With a relatively well-defined and simple vocabulary, the operator can input the vast majority of the mission parameters using simple, intuitive voice commands. However, voice input may be more applicable to initial mission specification rather than for critical commands such as the need to land immediately due to time and feedback constraints. It would also be convenient to retrieve relevant mission information using voice input. Therefore, further on-going research is looking at using intent from operator utterances to provide the relevant mission information to the operator. The information displayed will be inferred from the operator's utterances just before key phrases are spoken. Linguistic analysis of the context of verbal communication provides insight into the intended meaning of commonly heard phrases such as "What's it doing now?" Analyzing the semantic sphere surrounding these common phrases enables us to predict the operator's intent and supply the operator's desired information to the interface. This paper also describes preliminary investigations into the generation of the semantic space of UAV operation and the success at providing information to the interface based on the operator's utterances.
NASA Astrophysics Data System (ADS)
Tanioka, Toshimasa; Egashira, Hiroyuki; Takata, Mayumi; Okazaki, Yasuhisa; Watanabe, Kenzi; Kondo, Hiroki
We have designed and implemented a voice-operated PC operation support system for a physically disabled person with a speech impediment. Voice operation is an effective method for a physically disabled person with involuntary movement of the limbs and head. For practical purposes, we built our system on a commercial speech recognition engine; adopting a commercial engine reduces development cost and helps make the system useful to other people with speech impediments. We customized the commercial speech recognition engine so that it can recognize the utterances of a person with a speech impediment. We restricted the words that the recognition engine recognizes and separated target words from words with similar pronunciations to avoid misrecognition. The huge number of words registered in commercial speech recognition engines causes frequent misrecognition of utterances by people with speech impediments, because their utterances are unclear and unstable. We solved this problem by narrowing the input choices down to a small number and by registering ambiguous pronunciations in addition to the original ones. To realize full character input and full PC operation with a small vocabulary, we designed multiple input modes with categorized dictionaries and introduced two-step input in each mode except numeral input, enabling correct operation with a small number of words. The system we have developed is at a practical level. The first author of this paper is physically disabled and has a speech impediment. Using this system, he is able not only to input characters into the PC but also to operate the Windows system smoothly, and he uses it in his daily life. This paper was written by him with this system. At present, the speech recognition is customized to him; it is, however, possible to customize it for other users by changing the words and registering new pronunciations according to each user's utterances.
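The core ideas described above, a small mode-specific vocabulary, registered alternate pronunciations, and two-step selection, can be illustrated with a minimal sketch. Everything in it (mode names, word lists, alternates) is invented for illustration and is not the actual system's dictionary.

```python
# Minimal sketch of mode-restricted, two-step voice input with alternate
# pronunciations registered for easily confused words. Mode names, word
# lists, and alternates are invented for illustration only.

DICTIONARIES = {
    # each mode exposes only a handful of words; alternates cover the
    # user's own (possibly unclear) pronunciations of the same word
    "letters":  {"alpha-group": ["alpha", "alba"], "kilo-group": ["kilo", "keelo"]},
    "commands": {"open": ["open", "oben"], "close": ["close", "clothe"]},
}

GROUPS = {"alpha-group": ["a", "b", "c", "d", "e"],
          "kilo-group":  ["k", "l", "m", "n", "o"]}

def recognize(mode, utterance):
    """Match an utterance against the small, mode-specific dictionary."""
    for word, variants in DICTIONARIES[mode].items():
        if utterance.lower() in variants:
            return word
    return None

def two_step_character(first, second):
    """Step 1 selects a five-letter group, step 2 an index within it."""
    group = recognize("letters", first)
    return GROUPS[group][int(second) - 1] if group else None

print(two_step_character("keelo", "3"))  # -> "m"
```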
Adaptive multimodal interaction in mobile augmented reality: A conceptual framework
NASA Astrophysics Data System (ADS)
Abidin, Rimaniza Zainal; Arshad, Haslina; Shukri, Saidatul A'isyah Ahmad
2017-10-01
Recently, Augmented Reality (AR) has become an emerging technology in many mobile applications. Mobile AR is defined as a medium for displaying information merged with the real-world environment, mapped to the augmented reality surroundings in a single view. There are four main types of mobile augmented reality interfaces, and one of them is the multimodal interface. A multimodal interface processes two or more combined user input modes (such as speech, pen, touch, manual gesture, gaze, and head and body movements) in a coordinated manner with multimedia system output. Many frameworks have been proposed to guide designers in developing multimodal applications, including in augmented reality environments, but there has been little work reviewing frameworks for adaptive multimodal interfaces in mobile augmented reality. The main goal of this study is to propose a conceptual framework that illustrates the adaptive multimodal interface in mobile augmented reality. We reviewed several frameworks that have been proposed in the fields of multimodal interfaces, adaptive interfaces, and augmented reality. We analyzed the components of the previous frameworks and assessed which of them can be applied on mobile devices. Our framework can be used as a guide for designers and developers building a mobile AR application with an adaptive multimodal interface.
Random Deep Belief Networks for Recognizing Emotions from Speech Signals.
Wen, Guihua; Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang
2017-01-01
Human emotions can now be recognized from speech signals using machine learning methods; however, these methods are challenged by low recognition accuracies in real applications due to a lack of rich representation ability. Deep belief networks (DBN) can automatically discover multiple levels of representation in speech signals. To make full use of this advantage, this paper presents an ensemble of random deep belief networks (RDBN) for speech emotion recognition. The method first extracts low-level features of the input speech signal and then uses them to construct many random subspaces. Each random subspace is fed to a DBN to yield higher-level features, which are passed to a classifier that outputs an emotion label. All output emotion labels are then fused through majority voting to decide the final emotion label for the input speech signal. Experimental results on benchmark speech emotion databases show that RDBN achieves better accuracy than the compared methods for speech emotion recognition.
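The ensemble structure described above (random feature subspaces, one deep network per subspace, majority voting) can be sketched as follows. This is only an assumption-laden illustration: an MLP classifier stands in for each deep belief network, and random numbers stand in for the extracted speech features.

```python
# Sketch of a random-subspace ensemble with majority voting. An MLPClassifier
# stands in for each deep belief network, and the features are random
# placeholders; both are assumptions for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 120))           # low-level features per utterance
y = rng.integers(0, 4, size=200)          # four emotion labels

n_models, subspace_dim = 15, 40
subspaces, models = [], []
for _ in range(n_models):
    idx = rng.choice(X.shape[1], size=subspace_dim, replace=False)
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300)
    clf.fit(X[:, idx], y)                 # one "DBN" per random subspace
    subspaces.append(idx)
    models.append(clf)

def predict(x):
    """Majority vote over the per-subspace classifiers."""
    votes = [m.predict(x[idx].reshape(1, -1))[0] for m, idx in zip(models, subspaces)]
    return np.bincount(votes).argmax()

print(predict(rng.normal(size=120)))
```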
Houston, Derek M.; Bergeson, Tonya R.
2013-01-01
The advent of cochlear implantation has provided thousands of deaf infants and children access to speech and the opportunity to learn spoken language. Whether or not deaf infants successfully learn spoken language after implantation may depend in part on the extent to which they listen to speech rather than just hear it. We explore this question by examining the role that attention to speech plays in early language development according to a prominent model of infant speech perception – Jusczyk’s WRAPSA model – and by reviewing the kinds of speech input that maintains normal-hearing infants’ attention. We then review recent findings suggesting that cochlear-implanted infants’ attention to speech is reduced compared to normal-hearing infants and that speech input to these infants differs from input to infants with normal hearing. Finally, we discuss possible roles attention to speech may play on deaf children’s language acquisition after cochlear implantation in light of these findings and predictions from Jusczyk’s WRAPSA model. PMID:24729634
Noise-immune multisensor transduction of speech
NASA Astrophysics Data System (ADS)
Viswanathan, Vishu R.; Henry, Claudia M.; Derr, Alan G.; Roucos, Salim; Schwartz, Richard M.
1986-08-01
Two types of configurations of multiple sensors were developed, tested, and evaluated in speech recognition applications for robust performance in high levels of acoustic background noise: one type combines the individual sensor signals to provide a single speech signal input, and the other provides several parallel inputs. For single-input systems, several configurations of multiple sensors were developed and tested. Results from formal speech intelligibility and quality tests in simulated fighter-aircraft cockpit noise show that each of the two-sensor configurations tested outperforms its constituent individual sensors in high noise. Also presented are results comparing the performance of two-sensor configurations and individual sensors in speaker-dependent, isolated-word speech recognition tests performed using a commercial recognizer (Verbex 4000) in simulated fighter-aircraft cockpit noise.
Natural interaction for unmanned systems
NASA Astrophysics Data System (ADS)
Taylor, Glenn; Purman, Ben; Schermerhorn, Paul; Garcia-Sampedro, Guillermo; Lanting, Matt; Quist, Michael; Kawatsu, Chris
2015-05-01
Military unmanned systems today are typically controlled by two methods: tele-operation or menu-based, search-and-click interfaces. Both approaches require the operator's constant vigilance: tele-operation requires constant input to drive the vehicle inch by inch; a menu-based interface requires eyes on the screen in order to search through alternatives and select the right menu item. In both cases, operators spend most of their time and attention driving and minding the unmanned systems rather than on being warfighters. With these approaches, the platform and interface become more of a burden than a benefit. The availability of inexpensive sensor systems in products such as Microsoft Kinect™ or Nintendo Wii™ has resulted in new ways of interacting with computing systems, but new sensors alone are not enough. Developing useful and usable human-system interfaces requires understanding users and interaction in context: not just what new sensors afford in terms of interaction, but how users want to interact with these systems, for what purpose, and how sensors might enable those interactions. Additionally, the system needs to reliably make sense of the user's inputs in context, translate that interpretation into commands for the unmanned system, and give feedback to the user. In this paper, we describe an example natural interface for unmanned systems, called the Smart Interaction Device (SID), which enables natural two-way interaction with unmanned systems including the use of speech, sketch, and gestures. We present a few example applications of SID to different types of unmanned systems and different kinds of interactions.
Visual input enhances selective speech envelope tracking in auditory cortex at a "cocktail party".
Zion Golumbic, Elana; Cogan, Gregory B; Schroeder, Charles E; Poeppel, David
2013-01-23
Our ability to selectively attend to one auditory signal amid competing input streams, epitomized by the "Cocktail Party" problem, continues to stimulate research from various approaches. How this demanding perceptual feat is achieved from a neural systems perspective remains unclear and controversial. It is well established that neural responses to attended stimuli are enhanced compared with responses to ignored ones, but responses to ignored stimuli are nonetheless highly significant, leading to interference in performance. We investigated whether congruent visual input of an attended speaker enhances cortical selectivity in auditory cortex, leading to diminished representation of ignored stimuli. We recorded magnetoencephalographic signals from human participants as they attended to segments of natural continuous speech. Using two complementary methods of quantifying the neural response to speech, we found that viewing a speaker's face enhances the capacity of auditory cortex to track the temporal speech envelope of that speaker. This mechanism was most effective in a Cocktail Party setting, promoting preferential tracking of the attended speaker, whereas without visual input no significant attentional modulation was observed. These neurophysiological results underscore the importance of visual input in resolving perceptual ambiguity in a noisy environment. Since visual cues in speech precede the associated auditory signals, they likely serve a predictive role in facilitating auditory processing of speech, perhaps by directing attentional resources to appropriate points in time when to-be-attended acoustic input is expected to arrive.
Parental Numeric Language Input to Mandarin Chinese and English Speaking Preschool Children
ERIC Educational Resources Information Center
Chang, Alicia; Sandhofer, Catherine M.; Adelchanow, Lauren; Rottman, Benjamin
2011-01-01
The present study examined the number-specific parental language input to Mandarin- and English-speaking preschool-aged children. Mandarin and English transcripts from the CHILDES database were examined for amount of numeric speech, specific types of numeric speech and syntactic frames in which numeric speech appeared. The results showed that…
Robotics control using isolated word recognition of voice input
NASA Technical Reports Server (NTRS)
Weiner, J. M.
1977-01-01
A speech input/output system is presented that can be used to communicate with a task oriented system. Human speech commands and synthesized voice output extend conventional information exchange capabilities between man and machine by utilizing audio input and output channels. The speech input facility is comprised of a hardware feature extractor and a microprocessor implemented isolated word or phrase recognition system. The recognizer offers a medium sized (100 commands), syntactically constrained vocabulary, and exhibits close to real time performance. The major portion of the recognition processing required is accomplished through software, minimizing the complexity of the hardware feature extractor.
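A small isolated-word recognizer of the kind described above is often built from template matching with dynamic time warping. The sketch below shows that idea in general terms; the feature vectors and vocabulary are placeholders and do not reproduce the paper's hardware feature extractor or command set.

```python
# Sketch of isolated-word recognition by template matching with dynamic
# time warping (DTW). The frame "features" are placeholders standing in
# for a feature extractor's output; the vocabulary is illustrative.
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two feature sequences (n_frames x n_dims)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(utterance, templates):
    """Return the vocabulary word whose stored template is nearest."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

rng = np.random.default_rng(1)
templates = {w: rng.normal(size=(20, 8)) for w in ["stop", "go", "left", "right"]}
test = templates["left"] + 0.05 * rng.normal(size=(20, 8))   # noisy repeat
print(recognize(test, templates))   # -> "left"
```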
Speech input system for meat inspection and pathological coding used thereby
NASA Astrophysics Data System (ADS)
Abe, Shozo
Meat inspection is one of the exclusive and important jobs of veterinarians, though it is not well known in general. As the inspection must be conducted skillfully during a series of continuous operations in a slaughterhouse, the development of automatic inspection systems has long been required. We employed a hands-free speech input system to record the inspection data, because inspectors have to use both hands to handle the internal organs of cattle and check their health condition with the naked eye. The data collected by the inspectors are transferred to a speech recognizer and then stored as controllable data for each animal inspected. Control of the terms to be input, such as pathological conditions, and their coding are also important in this speech input system, and practical examples are shown.
Designing speech-based interfaces for telepresence robots for people with disabilities.
Tsui, Katherine M; Flynn, Kelsey; McHugh, Amelia; Yanco, Holly A; Kontak, David
2013-06-01
People with cognitive and/or motor impairments may benefit from using telepresence robots to engage in social activities. To date, these robots, their user interfaces, and their navigation behaviors have not been designed for operation by people with disabilities. We conducted an experiment in which participants (n=12) used a telepresence robot in a scavenger hunt task to determine how they would use speech to command the robot. Based upon the results, we present design guidelines for speech-based interfaces for telepresence robots.
Automatic Speech Recognition from Neural Signals: A Focused Review.
Herff, Christian; Schultz, Tanja
2016-01-01
Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might not be possible because of loud environments, concerns about disturbing bystanders, or an inability to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable to not speak but to simply envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to their low temporal resolution, but are very useful for investigating the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data, with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques applied to neural signals, we discuss the Brain-to-text system.
Asynchronous brain-computer interface for cognitive assessment in people with cerebral palsy
NASA Astrophysics Data System (ADS)
Alcaide-Aguirre, R. E.; Warschausky, S. A.; Brown, D.; Aref, A.; Huggins, J. E.
2017-12-01
Objective. Typically, clinical measures of cognition require motor or speech responses. Thus, a significant percentage of people with disabilities are not able to complete standardized assessments. This situation could be resolved by employing a more accessible test administration method, such as a brain-computer interface (BCI). A BCI can circumvent motor and speech requirements by translating brain activity to identify a subject's response. By eliminating the need for motor or speech input, one could use a BCI to assess an individual who previously did not have access to clinical tests. Approach. We developed an asynchronous, event-related potential BCI-facilitated administration procedure for the Peabody Picture Vocabulary Test (PPVT-IV). We then tested our system in typically developing individuals (N = 11), as well as people with cerebral palsy (N = 19), to compare results to the standardized PPVT-IV format and administration. Main results. Standard scores on the BCI-facilitated PPVT-IV and the standard PPVT-IV were highly correlated (r = 0.95, p < 0.001), with a mean difference of 2.0 ± 6.4 points, which is within the standard error of the PPVT-IV. Significance. Thus, our BCI-facilitated PPVT-IV provided comparable results to the standard PPVT-IV, suggesting that populations for whom standardized cognitive tests are not accessible could benefit from our BCI-facilitated approach.
ERIC Educational Resources Information Center
Pons, Ferran; Biesanz, Jeremy C.; Kajikawa, Sachiyo; Fais, Laurel; Narayan, Chandan R.; Amano, Shigeaki; Werker, Janet F.
2012-01-01
Using an artificial language learning manipulation, Maye, Werker, and Gerken (2002) demonstrated that infants' speech sound categories change as a function of the distributional properties of the input. In a recent study, Werker et al. (2007) showed that Infant-directed Speech (IDS) input contains reliable acoustic cues that support distributional…
Visual Input Enhances Selective Speech Envelope Tracking in Auditory Cortex at a ‘Cocktail Party’
Golumbic, Elana Zion; Cogan, Gregory B.; Schroeder, Charles E.; Poeppel, David
2013-01-01
Our ability to selectively attend to one auditory signal amidst competing input streams, epitomized by the ‘Cocktail Party’ problem, continues to stimulate research from various approaches. How this demanding perceptual feat is achieved from a neural systems perspective remains unclear and controversial. It is well established that neural responses to attended stimuli are enhanced compared to responses to ignored ones, but responses to ignored stimuli are nonetheless highly significant, leading to interference in performance. We investigated whether congruent visual input of an attended speaker enhances cortical selectivity in auditory cortex, leading to diminished representation of ignored stimuli. We recorded magnetoencephalographic (MEG) signals from human participants as they attended to segments of natural continuous speech. Using two complementary methods of quantifying the neural response to speech, we found that viewing a speaker’s face enhances the capacity of auditory cortex to track the temporal speech envelope of that speaker. This mechanism was most effective in a ‘Cocktail Party’ setting, promoting preferential tracking of the attended speaker, whereas without visual input no significant attentional modulation was observed. These neurophysiological results underscore the importance of visual input in resolving perceptual ambiguity in a noisy environment. Since visual cues in speech precede the associated auditory signals, they likely serve a predictive role in facilitating auditory processing of speech, perhaps by directing attentional resources to appropriate points in time when to-be-attended acoustic input is expected to arrive. PMID:23345218
ERIC Educational Resources Information Center
Ferati, Mexhid Adem
2012-01-01
To access interactive systems, blind and visually impaired users can leverage their auditory senses by using non-speech sounds. The current structure of non-speech sounds, however, is geared toward conveying user interface operations (e.g., opening a file) rather than large theme-based information (e.g., a history passage) and, thus, is ill-suited…
Lee, J D; Caven, B; Haake, S; Brown, T L
2001-01-01
As computer applications for cars emerge, a speech-based interface offers an appealing alternative to the visually demanding direct manipulation interface. However, speech-based systems may pose cognitive demands that could undermine driving safety. This study used a car-following task to evaluate how a speech-based e-mail system affects drivers' response to the periodic braking of a lead vehicle. The study included 24 drivers between the ages of 18 and 24 years. A baseline condition with no e-mail system was compared with a simple and a complex e-mail system in both simple and complex driving environments. The results show a 30% (310 ms) increase in reaction time when the speech-based system is used. Subjective workload ratings and probe questions also indicate that speech-based interaction introduces a significant cognitive load, which was highest for the complex e-mail system. These data show that a speech-based interface is not a panacea that eliminates the potential distraction of in-vehicle computers. Actual or potential applications of this research include design of in-vehicle information systems and evaluation of their contributions to driver distraction.
Improved Open-Microphone Speech Recognition
NASA Astrophysics Data System (ADS)
Abrash, Victor
2002-12-01
Many current and future NASA missions make extreme demands on mission personnel both in terms of work load and in performing under difficult environmental conditions. In situations where hands are impeded or needed for other tasks, eyes are busy attending to the environment, or tasks are sufficiently complex that ease of use of the interface becomes critical, spoken natural language dialog systems offer unique input and output modalities that can improve efficiency and safety. They also offer new capabilities that would not otherwise be available. For example, many NASA applications require astronauts to use computers in micro-gravity or while wearing space suits. Under these circumstances, command and control systems that allow users to issue commands or enter data in hands-and eyes-busy situations become critical. Speech recognition technology designed for current commercial applications limits the performance of the open-ended state-of-the-art dialog systems being developed at NASA. For example, today's recognition systems typically listen to user input only during short segments of the dialog, and user input outside of these short time windows is lost. Mistakes detecting the start and end times of user utterances can lead to mistakes in the recognition output, and the dialog system as a whole has no way to recover from this, or any other, recognition error. Systems also often require the user to signal when that user is going to speak, which is impractical in a hands-free environment, or only allow a system-initiated dialog requiring the user to speak immediately following a system prompt. In this project, SRI has developed software to enable speech recognition in a hands-free, open-microphone environment, eliminating the need for a push-to-talk button or other signaling mechanism. The software continuously captures a user's speech and makes it available to one or more recognizers. By constantly monitoring and storing the audio stream, it provides the spoken dialog manager extra flexibility to recognize the signal with no audio gaps between recognition requests, as well as to rerecognize portions of the signal, or to rerecognize speech with different grammars, acoustic models, recognizers, start times, and so on. SRI expects that this new open-mic functionality will enable NASA to develop better error-correction mechanisms for spoken dialog systems, and may also enable new interaction strategies.
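The continuous-capture idea can be illustrated with a simple ring buffer that keeps recent audio frames addressable by index, so a dialog manager could re-request recognition of any stored span, possibly with a different grammar. This is a generic sketch under assumed interfaces, not SRI's implementation.

```python
# Illustrative ring buffer for open-microphone capture: audio frames are
# stored continuously so a dialog manager can (re)recognize any stored
# span later, with different grammars if needed. The frame source and any
# downstream recognizer are placeholders, not SRI's actual software.
from collections import deque

class AudioRingBuffer:
    def __init__(self, max_frames):
        self.frames = deque(maxlen=max_frames)   # oldest frames drop off
        self.next_index = 0                      # index of the next frame

    def push(self, frame):
        self.frames.append((self.next_index, frame))
        self.next_index += 1

    def span(self, start, end):
        """Return stored frames with indices in [start, end)."""
        return [f for i, f in self.frames if start <= i < end]

buf = AudioRingBuffer(max_frames=500)
for t in range(600):                  # simulated continuous capture
    buf.push(f"frame-{t}")

segment = buf.span(550, 560)          # re-recognize a past segment
print(segment[0], "...", segment[-1])
```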
[Prosody, speech input and language acquisition].
Jungheim, M; Miller, S; Kühn, D; Ptok, M
2014-04-01
In order to acquire language, children require speech input. The prosody of the speech input plays an important role. In most cultures adults modify their code when communicating with children. Compared to normal speech, this code differs especially with regard to prosody. For this review a selective literature search in PubMed and Scopus was performed. Prosodic characteristics are a key feature of spoken language. By analysing prosodic features, children gain knowledge about underlying grammatical structures. Child-directed speech (CDS) is modified in a way that meaningful sequences are highlighted acoustically, so that important information can be extracted from the continuous speech flow more easily. CDS is said to enhance the representation of linguistic signs. Taking into consideration what has previously been described in the literature regarding the perception of suprasegmentals, CDS seems to be able to support language acquisition due to the correspondence of prosodic and syntactic units. However, no findings have been reported indicating that the linguistically reduced CDS hinders first-language acquisition.
A voice-input voice-output communication aid for people with severe speech impairment.
Hawley, Mark S; Cunningham, Stuart P; Green, Phil D; Enderby, Pam; Palmer, Rebecca; Sehgal, Siddharth; O'Neill, Peter
2013-01-01
A new form of augmentative and alternative communication (AAC) device for people with severe speech impairment-the voice-input voice-output communication aid (VIVOCA)-is described. The VIVOCA recognizes the disordered speech of the user and builds messages, which are converted into synthetic speech. System development was carried out employing user-centered design and development methods, which identified and refined key requirements for the device. A novel methodology for building small vocabulary, speaker-dependent automatic speech recognizers with reduced amounts of training data, was applied. Experiments showed that this method is successful in generating good recognition performance (mean accuracy 96%) on highly disordered speech, even when recognition perplexity is increased. The selected message-building technique traded off various factors including speed of message construction and range of available message outputs. The VIVOCA was evaluated in a field trial by individuals with moderate to severe dysarthria and confirmed that they can make use of the device to produce intelligible speech output from disordered speech input. The trial highlighted some issues which limit the performance and usability of the device when applied in real usage situations, with mean recognition accuracy of 67% in these circumstances. These limitations will be addressed in future work.
Speech processing using conditional observable maximum likelihood continuity mapping
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hogden, John; Nix, David
A computer implemented method enables the recognition of speech and speech characteristics. Parameters are initialized of first probability density functions that map between the symbols in the vocabulary of one or more sequences of speech codes that represent speech sounds and a continuity map. Parameters are also initialized of second probability density functions that map between the elements in the vocabulary of one or more desired sequences of speech transcription symbols and the continuity map. The parameters of the probability density functions are then trained to maximize the probabilities of the desired sequences of speech-transcription symbols. A new sequence of speech codes is then input to the continuity map having the trained first and second probability function parameters. A smooth path is identified on the continuity map that has the maximum probability for the new sequence of speech codes. The probability of each speech transcription symbol for each input speech code can then be output.
Sperry Univac speech communications technology
NASA Technical Reports Server (NTRS)
Medress, Mark F.
1977-01-01
Technology and systems for effective verbal communication with computers were developed. A continuous speech recognition system for verbal input, a word spotting system to locate key words in conversational speech, prosodic tools to aid speech analysis, and a prerecorded voice response system for speech output are described.
Halliwell, Emily R; Jones, Linor L; Fraser, Matthew; Lockley, Morag; Hill-Feltham, Penelope; McKay, Colette M
2015-06-01
A study was conducted to determine whether modifications to input compression and input frequency response characteristics can improve music-listening satisfaction in cochlear implant users. Experiment 1 compared three pre-processed versions of music and speech stimuli in a laboratory setting: original, compressed, and flattened frequency response. Music excerpts comprised three music genres (classical, country, and jazz), and a running speech excerpt was compared. Experiment 2 implemented a flattened input frequency response in the speech processor program. In a take-home trial, participants compared unaltered and flattened frequency responses. Ten and twelve adult Nucleus Freedom cochlear implant users participated in Experiments 1 and 2, respectively. Experiment 1 revealed a significant preference for music stimuli with a flattened frequency response compared to both original and compressed stimuli, whereas there was a significant preference for the original (rising) frequency response for speech stimuli. Experiment 2 revealed no significant mean preference for the flattened frequency response, with 9 of 11 subjects preferring the rising frequency response. Input compression did not alter music enjoyment. Comparison of the two experiments indicated that individual frequency response preferences may depend on the genre or familiarity, and particularly whether the music contained lyrics.
Intelligent interfaces for expert systems
NASA Technical Reports Server (NTRS)
Villarreal, James A.; Wang, Lui
1988-01-01
Vital to the success of an expert system is an interface to the user which performs intelligently. A generic intelligent interface is being developed for expert systems. This intelligent interface was developed around the in-house developed Expert System for the Flight Analysis System (ESFAS). The Flight Analysis System (FAS) comprises 84 configuration-controlled FORTRAN subroutines that are used in the preflight analysis of the space shuttle. In order to use FAS proficiently, a person must be knowledgeable in flight mechanics and the procedures involved in deploying a certain payload, and must have an overall understanding of the FAS. ESFAS, still in its developmental stage, is taking into account much of this knowledge. The generic intelligent interface involves the integration of a speech recognizer and synthesizer, a preparser, and a natural language parser with ESFAS. The speech recognizer being used is capable of recognizing 1000 words of connected speech. The natural language parser is a commercial software package which uses caseframe instantiation in processing the streams of words from the speech recognizer or the keyboard. The system's configuration is described along with its capabilities and drawbacks.
NASA Technical Reports Server (NTRS)
Arthur, Jarvis J., III; Shelton, Kevin J.; Prinzel, Lawrence J., III; Bailey, Randall E.
2016-01-01
During the flight trials known as the Gulfstream-V Synthetic Vision Systems Integrated Technology Evaluation (GV-SITE), a Speech Recognition System (SRS) was used by the evaluation pilots. The SRS was intended to be an intuitive interface for display control (rather than knobs, buttons, etc.). This paper describes the performance of the current "state of the art" SRS. The commercially available technology was evaluated as an application for possible inclusion in commercial aircraft flight decks as a crew-to-vehicle interface. Specifically, the technology is to be used as an interface from the aircrew to the onboard displays, controls, and flight management tasks. Both a flight test and a laboratory test of the SRS were conducted.
Auditory-motor interaction revealed by fMRI: speech, music, and working memory in area Spt.
Hickok, Gregory; Buchsbaum, Bradley; Humphries, Colin; Muftuler, Tugan
2003-07-01
The concept of auditory-motor interaction pervades speech science research, yet the cortical systems supporting this interface have not been elucidated. Drawing on experimental designs used in recent work in sensory-motor integration in the cortical visual system, we used fMRI in an effort to identify human auditory regions with both sensory and motor response properties, analogous to single-unit responses in known visuomotor integration areas. The sensory phase of the task involved listening to speech (nonsense sentences) or music (novel piano melodies); the "motor" phase of the task involved covert rehearsal/humming of the auditory stimuli. A small set of areas in the superior temporal and temporal-parietal cortex responded both during the listening phase and the rehearsal/humming phase. A left lateralized region in the posterior Sylvian fissure at the parietal-temporal boundary, area Spt, showed particularly robust responses to both phases of the task. Frontal areas also showed combined auditory + rehearsal responsivity consistent with the claim that the posterior activations are part of a larger auditory-motor integration circuit. We hypothesize that this circuit plays an important role in speech development as part of the network that enables acoustic-phonetic input to guide the acquisition of language-specific articulatory-phonetic gestures; this circuit may play a role in analogous musical abilities. In the adult, this system continues to support aspects of speech production, and, we suggest, supports verbal working memory.
Walenski, Matthew; Swinney, David
2009-01-01
The central question underlying this study revolves around how children process co-reference relationships—such as those evidenced by pronouns (him) and reflexives (himself)—and how a slowed rate of speech input may critically affect this process. Previous studies of child language processing have demonstrated that typical language developing (TLD) children as young as 4 years of age process co-reference relations in a manner similar to adults on-line. In contrast, off-line measures of pronoun comprehension suggest a developmental delay for pronouns (relative to reflexives). The present study examines dependency relations in TLD children (ages 5–13) and investigates how a slowed rate of speech input affects the unconscious (on-line) and conscious (off-line) parsing of these constructions. For the on-line investigations (using a cross-modal picture priming paradigm), results indicate that at a normal rate of speech TLD children demonstrate adult-like syntactic reflexes. At a slowed rate of speech the typical language developing children displayed a breakdown in automatic syntactic parsing (again, similar to the pattern seen in unimpaired adults). As demonstrated in the literature, our off-line investigations (sentence/picture matching task) revealed that these children performed much better on reflexives than on pronouns at a regular speech rate. However, at the slow speech rate, performance on pronouns was substantially improved, whereas performance on reflexives was not different than at the regular speech rate. We interpret these results in light of a distinction between fast automatic processes (relied upon for on-line processing in real time) and conscious reflective processes (relied upon for off-line processing), such that slowed speech input disrupts the former, yet improves the latter. PMID:19343495
Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review
NASA Astrophysics Data System (ADS)
Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH
2017-09-01
This paper reviews the state of the art in automatic speech recognition (ASR) based approaches for speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from a speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving symptoms at an early stage, ASR-based solutions are increasingly being researched for speech and language therapy. ASR is a technology that transcribes human speech into text by matching it against the system's library. This is particularly useful in speech rehabilitation therapies, as it provides accurate, real-time evaluation of speech input from an individual with a speech disorder. ASR-based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback on their mistakes. However, the accuracy of ASR depends on many factors such as phoneme recognition, speech continuity, and speaker and environmental differences, as well as the depth of our knowledge of human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.
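One simple way such systems can turn a recognition result into therapy feedback is to align the recognized words against the target prompt and report the differences. The sketch below uses a generic word-level alignment; it is illustrative only and does not correspond to any specific system reviewed here.

```python
# Simple sketch of the feedback step in ASR-based therapy: align the
# recognized word sequence against the target prompt and report the
# mismatches. The alignment method and prompts are illustrative only.
import difflib

def feedback(target, recognized):
    """List insertions, deletions, and substitutions word by word."""
    sm = difflib.SequenceMatcher(a=target.split(), b=recognized.split())
    notes = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace":
            notes.append(f"said {' '.join(recognized.split()[j1:j2])!r} "
                         f"instead of {' '.join(target.split()[i1:i2])!r}")
        elif op == "delete":
            notes.append(f"missed {' '.join(target.split()[i1:i2])!r}")
        elif op == "insert":
            notes.append(f"added {' '.join(recognized.split()[j1:j2])!r}")
    return notes or ["correct"]

print(feedback("the cat sat on the mat", "the cat sit on mat"))
```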
Creating speech-synchronized animation.
King, Scott A; Parent, Richard E
2005-01-01
We present a facial model designed primarily to support animated speech. Our facial model takes facial geometry as input and transforms it into a parametric deformable model. The facial model uses a muscle-based parameterization, allowing for easier integration between speech synchrony and facial expressions. Our facial model has a highly deformable lip model that is grafted onto the input facial geometry to provide the necessary geometric complexity needed for creating lip shapes and high-quality renderings. Our facial model also includes a highly deformable tongue model that can represent the shapes the tongue undergoes during speech. We add teeth, gums, and upper palate geometry to complete the inner mouth. To decrease the processing time, we hierarchically deform the facial surface. We also present a method to animate the facial model over time to create animated speech using a model of coarticulation that blends visemes together using dominance functions. We treat visemes as a dynamic shaping of the vocal tract by describing visemes as curves instead of keyframes. We show the utility of the techniques described in this paper by implementing them in a text-to-audiovisual-speech system that creates animation of speech from unrestricted text. The facial and coarticulation models must first be interactively initialized. The system then automatically creates accurate real-time animated speech from the input text. It is capable of cheaply producing tremendous amounts of animated speech with very low resource requirements.
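Dominance-based coarticulation is commonly formulated with an exponentially decaying dominance function per viseme and a dominance-weighted average of viseme targets (in the spirit of Cohen and Massaro). The sketch below uses that common formulation with assumed parameter values; it is not necessarily the exact model used in this work.

```python
# Hedged sketch of dominance-based coarticulation: each viseme exerts an
# exponentially decaying dominance around its center time, and the lip
# parameter at time t is the dominance-weighted average of viseme targets.
# The exponential form and all parameter values are assumptions.
import numpy as np

def dominance(t, center, alpha=1.0, theta=8.0):
    """Exponential dominance of a viseme centered at `center` seconds."""
    return alpha * np.exp(-theta * np.abs(t - center))

def blended_parameter(t, visemes):
    """visemes: list of (center_time, target_value) for one articulator."""
    d = np.array([dominance(t, c) for c, _ in visemes])
    targets = np.array([v for _, v in visemes])
    return float(np.sum(d * targets) / np.sum(d))

# lip-opening targets for an assumed /m a m/ sequence
visemes = [(0.00, 0.05), (0.12, 0.90), (0.24, 0.05)]
for t in np.linspace(0.0, 0.24, 7):
    print(f"t={t:.2f}s  lip opening={blended_parameter(t, visemes):.2f}")
```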
Van der Haegen, Lise; Acke, Frederic; Vingerhoets, Guy; Dhooge, Ingeborg; De Leenheer, Els; Cai, Qing; Brysbaert, Marc
2016-12-01
Auditory speech perception, speech production and reading lateralize to the left hemisphere in the majority of healthy right-handers. In this study, we investigated to what extent sensory input underlies the side of language dominance. We measured the lateralization of the three core subprocesses of language in patients who had profound hearing loss in the right ear from birth and in matched control subjects. They took part in a semantic decision listening task involving speech and sound stimuli (auditory perception), a word generation task (speech production) and a passive reading task (reading). The results show that a lack of sensory auditory input on the right side, which is strongly connected to the contralateral left hemisphere, does not lead to atypical lateralization of speech perception. Speech production and reading were also typically left lateralized in all but one patient, contradicting previous small scale studies. Other factors such as genetic constraints presumably overrule the role of sensory input in the development of (a)typical language lateralization. Copyright © 2015 Elsevier Ltd. All rights reserved.
Speech Recognition for A Digital Video Library.
ERIC Educational Resources Information Center
Witbrock, Michael J.; Hauptmann, Alexander G.
1998-01-01
Production of the meta-data supporting the Informedia Digital Video Library interface is automated using techniques derived from artificial intelligence research. Speech recognition and natural-language processing, information retrieval, and image analysis are applied to produce an interface that helps users locate information and navigate more…
Wireless and acoustic hearing with bone-anchored hearing devices.
Bosman, Arjan J; Mylanus, Emmanuel A M; Hol, Myrthe K S; Snik, Ad F M
2015-07-01
The efficacy of wireless connectivity in bone-anchored hearing was studied by comparing the wireless and acoustic performance of the Ponto Plus sound processor from Oticon Medical relative to the acoustic performance of its predecessor, the Ponto Pro. Nineteen subjects with more than two years' experience with a bone-anchored hearing device were included. Thirteen subjects were fitted unilaterally and six bilaterally. Subjects served as their own controls. First, subjects were tested with the Ponto Pro processor. After a four-week acclimatization period, performance with the Ponto Plus processor was measured. In the laboratory, wireless and acoustic input levels were made equal. In daily life, equal settings of wireless and acoustic input were used when watching TV; however, when using the telephone the acoustic input was reduced by 9 dB relative to the wireless input. Speech scores for the microphone with the Ponto Pro and for both input modes of the Ponto Plus processor were essentially equal when equal input levels of wireless and microphone inputs were used. Only the TV condition showed a statistically significant (p < 5%) lower speech reception threshold for wireless relative to microphone input. In real life, evaluations of speech quality, speech intelligibility in quiet and noise, and annoyance by ambient noise when using a landline phone, a mobile telephone, and watching TV showed a clear preference (p < 1%) for the Ponto Plus system with streamer over the microphone input. Due to the small number of respondents with a landline phone (N = 7), the result for noise annoyance was only significant at the 5% level. Equal input levels for acoustic and wireless inputs result in equal speech scores, showing (near) equivalence of acoustic and wireless sound transmission with the Ponto Pro and Ponto Plus. The default 9-dB difference between microphone and wireless input when using the telephone results in a substantial wireless benefit when using the telephone. The preference for wirelessly transmitted audio when watching TV can be attributed to the relatively poor sound quality of backward-facing loudspeakers in flat-screen TVs. The ratio of wireless and acoustic input can easily be set to the user's preference with the streamer's volume control.
Hurtado, Nereyda; Marchman, Virginia A.; Fernald, Anne
2010-01-01
It is well established that variation in caregivers' speech is associated with language outcomes, yet little is known about the learning principles that mediate these effects. This longitudinal study (n = 27) explores whether Spanish-learning children's early experiences with language predict efficiency in real-time comprehension and vocabulary learning. Measures of mothers' speech at 18 months were examined in relation to children's speech processing efficiency and reported vocabulary at 18 and 24 months. Children of mothers who provided more input at 18 months knew more words and were faster in word recognition at 24 months. Moreover, multiple regression analyses indicated that the influences of caregiver speech on speed of word recognition and vocabulary were largely overlapping. This study provides the first evidence that input shapes children's lexical processing efficiency and that vocabulary growth and increasing facility in spoken word comprehension work together to support the uptake of the information that rich input affords the young language learner. PMID:19046145
Key considerations in designing a speech brain-computer interface.
Bocquelet, Florent; Hueber, Thomas; Girin, Laurent; Chabardès, Stéphan; Yvert, Blaise
2016-11-01
Restoring communication in case of aphasia is a key challenge for neurotechnologies. To this end, brain-computer strategies can be envisioned to allow artificial speech synthesis from the continuous decoding of neural signals underlying speech imagination. Such speech brain-computer interfaces do not exist yet and their design should consider three key choices that need to be made: the choice of appropriate brain regions to record neural activity from, the choice of an appropriate recording technique, and the choice of a neural decoding scheme in association with an appropriate speech synthesis method. These key considerations are discussed here in light of (1) the current understanding of the functional neuroanatomy of cortical areas underlying overt and covert speech production, (2) the available literature making use of a variety of brain recording techniques to better characterize and address the challenge of decoding cortical speech signals, and (3) the different speech synthesis approaches that can be considered depending on the level of speech representation (phonetic, acoustic or articulatory) envisioned to be decoded at the core of a speech BCI paradigm. Copyright © 2017 The Author(s). Published by Elsevier Ltd.. All rights reserved.
ERIC Educational Resources Information Center
Newman, Rochelle S.; Rowe, Meredith L.; Ratner, Nan Bernstein
2016-01-01
Both the input directed to the child, and the child's ability to process that input, are likely to impact the child's language acquisition. We explore how these factors inter-relate by tracking the relationships among: (a) lexical properties of maternal child-directed speech to prelinguistic (7-month-old) infants (N = 121); (b) these infants'…
Atcherson, Samuel R; Mendel, Lisa Lucks; Baltimore, Wesley J; Patro, Chhayakanta; Lee, Sungmin; Pousson, Monique; Spann, M Joshua
2017-01-01
It is generally well known that speech perception is often improved with integrated audiovisual input whether in quiet or in noise. In many health-care environments, however, conventional surgical masks block visual access to the mouth and obscure other potential facial cues. In addition, these environments can be noisy. Although these masks may not alter the acoustic properties, the presence of noise in addition to the lack of visual input can have a deleterious effect on speech understanding. A transparent ("see-through") surgical mask may help to overcome this issue. To compare the effect of noise and various visual input conditions on speech understanding for listeners with normal hearing (NH) and hearing impairment using different surgical masks. Participants were assigned to one of three groups based on hearing sensitivity in this quasi-experimental, cross-sectional study. A total of 31 adults participated in this study: one talker, ten listeners with NH, ten listeners with moderate sensorineural hearing loss, and ten listeners with severe-to-profound hearing loss. Selected lists from the Connected Speech Test were digitally recorded with and without surgical masks and then presented to the listeners at 65 dB HL in five conditions against a background of four-talker babble (+10 dB SNR): without a mask (auditory only), without a mask (auditory and visual), with a transparent mask (auditory only), with a transparent mask (auditory and visual), and with a paper mask (auditory only). A significant difference was found in the spectral analyses of the speech stimuli with and without the masks; however, no more than ∼2 dB root mean square. Listeners with NH performed consistently well across all conditions. Both groups of listeners with hearing impairment benefitted from visual input from the transparent mask. The magnitude of improvement in speech perception in noise was greatest for the severe-to-profound group. Findings confirm improved speech perception performance in noise for listeners with hearing impairment when visual input is provided using a transparent surgical mask. Most importantly, the use of the transparent mask did not negatively affect speech perception performance in noise. American Academy of Audiology
Simulation of talking faces in the human brain improves auditory speech recognition
von Kriegstein, Katharina; Dogan, Özgür; Grüter, Martina; Giraud, Anne-Lise; Kell, Christian A.; Grüter, Thomas; Kleinschmidt, Andreas; Kiebel, Stefan J.
2008-01-01
Human face-to-face communication is essentially audiovisual. Typically, people talk to us face-to-face, providing concurrent auditory and visual input. Understanding someone is easier when there is visual input, because visual cues like mouth and tongue movements provide complementary information about speech content. Here, we hypothesized that, even in the absence of visual input, the brain optimizes both auditory-only speech and speaker recognition by harvesting speaker-specific predictions and constraints from distinct visual face-processing areas. To test this hypothesis, we performed behavioral and neuroimaging experiments in two groups: subjects with a face recognition deficit (prosopagnosia) and matched controls. The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for this person. In both prosopagnosics and controls, behavioral improvement in auditory-only speech recognition was based on an area typically involved in face-movement processing. Improvement in speaker recognition was only present in controls and was based on an area involved in face-identity processing. These findings challenge current unisensory models of speech processing, because they show that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. We suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face. PMID:18436648
ERIC Educational Resources Information Center
Megnin-Viggars, Odette; Goswami, Usha
2013-01-01
Visual speech inputs can enhance auditory speech information, particularly in noisy or degraded conditions. The natural statistics of audiovisual speech highlight the temporal correspondence between visual and auditory prosody, with lip, jaw, cheek and head movements conveying information about the speech envelope. Low-frequency spatial and…
Preliminary Analysis of Automatic Speech Recognition and Synthesis Technology.
1983-05-01
INDUSTRIAL/MILITARY SPEECH SYNTHESIS PRODUCTS ... The SC-01 Speech Synthesizer contains 64 different phonemes which are accessed by a 6-bit code; in the proper sequential combinations of those ... connected speech input with widely differing emotional states, diverse accents, and substantial nonperiodic background noise input. As noted previously
Use of Computer Speech Technologies To Enhance Learning.
ERIC Educational Resources Information Center
Ferrell, Joe
1999-01-01
Discusses the design of an innovative learning system that uses new technologies for the man-machine interface, incorporating a combination of Automatic Speech Recognition (ASR) and Text To Speech (TTS) synthesis. Highlights include using speech technologies to mimic the attributes of the ideal tutor and design features. (AEF)
MediLink: a wearable telemedicine system for emergency and mobile applications.
Koval, T; Dudziak, M
1999-01-01
The practical needs of the medical professional faced with critical care or emergency situations differ from those working in many environments where telemedicine and mobile computing have been introduced and tested. One constructive criticism of the telemedicine initiative has been to question what positive benefits are gained from videoconferencing, paperless transactions, and online access to patient record. With a goal of producing a positive answer to such questions an architecture for multipurpose mobile telemedicine applications has been developed. The core technology is based upon a wearable personal computer with a smart-card interface coupled with speech, pen, video input and wireless intranet connectivity. The TransPAC system with the MedLink software system is designed to provide an integrated solution for a broad range of health care functions where mobile and hands-free or limited-access systems are preferred or necessary and where the capabilities of other mobile devices are insufficient or inappropriate. Structured and noise-resistant speech-to-text interfacing plus the use of a web browser-like display, accessible through either a flatpanel, standard, or headset monitor, gives the beltpack TransPAC computer the functions of a complete desktop including PCMCIA card interfaces for internet connectivity and a secure smartcard with 16-bit microprocessor and upwards of 64K memory. The card acts to provide user access control for security, user custom configuration of applications and display and vocabulary, and memory to diminish the need for PC-server communications while in an active session. TransPAC is being implemented for EMT and ER staff usage.
Learning Vowel Categories from Maternal Speech in Gurindji Kriol
ERIC Educational Resources Information Center
Jones, Caroline; Meakins, Felicity; Muawiyath, Shujau
2012-01-01
Distributional learning is a proposal for how infants might learn early speech sound categories from acoustic input before they know many words. When categories in the input differ greatly in relative frequency and overlap in acoustic space, research in bilingual development suggests that this affects the course of development. In the present…
Speech Recognition Technology for Disabilities Education
ERIC Educational Resources Information Center
Tang, K. Wendy; Kamoua, Ridha; Sutan, Victor; Farooq, Omer; Eng, Gilbert; Chu, Wei Chern; Hou, Guofeng
2005-01-01
Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace or reduce the reliance on standard keyboard and mouse input. This can especially assist dyslexic students who have problems with character or word use and manipulation in a textual…
Callahan, Sarah M.; Walenski, Matthew; Love, Tracy
2013-01-01
Purpose To examine children’s comprehension of verb phrase (VP) ellipsis constructions in light of their automatic, online structural processing abilities and conscious, metalinguistic reflective skill. Method Forty-two children ages 5 through 12 years listened to VP ellipsis constructions involving the strict/sloppy ambiguity (e.g., “The janitor untangled himself from the rope and the fireman in the elementary school did too after the accident.”) in which the ellipsis phrase (“did too”) had 2 interpretations: (a) strict (“untangled the janitor”) and (b) sloppy (“untangled the fireman”). We examined these sentences at a normal speech rate with an online cross-modal picture priming task (n = 14) and an offline sentence–picture matching task (n = 11). Both tasks were also given with slowed speech input (n = 17). Results Children showed priming for both the strict and sloppy interpretations at a normal speech rate but only for the strict interpretation with slowed input. Offline, children displayed an adultlike preference for the sloppy interpretation with normal-rate input but a divergent pattern with slowed speech. Conclusions Our results suggest that children and adults rely on a hybrid syntax-discourse model for the online comprehension and offline interpretation of VP ellipsis constructions. This model incorporates a temporally sensitive syntactic process of VP reconstruction (disrupted with slow input) and a temporally protracted discourse effect attributed to parallelism (preserved with slow input). PMID:22223886
Zion Golumbic, Elana M.; Poeppel, David; Schroeder, Charles E.
2012-01-01
The human capacity for processing speech is remarkable, especially given that information in speech unfolds over multiple time scales concurrently. Similarly notable is our ability to filter out extraneous sounds and focus our attention on one conversation, epitomized by the ‘Cocktail Party’ effect. Yet, the neural mechanisms underlying on-line speech decoding and attentional stream selection are not well understood. We review findings from behavioral and neurophysiological investigations that underscore the importance of the temporal structure of speech for achieving these perceptual feats. We discuss the hypothesis that entrainment of ambient neuronal oscillations to speech’s temporal structure, across multiple time-scales, serves to facilitate its decoding and underlies the selection of an attended speech stream over other competing input. In this regard, speech decoding and attentional stream selection are examples of ‘active sensing’, emphasizing an interaction between proactive and predictive top-down modulation of neuronal dynamics and bottom-up sensory input. PMID:22285024
NASA Astrophysics Data System (ADS)
Kattoju, Ravi Kiran; Barber, Daniel J.; Abich, Julian; Harris, Jonathan
2016-05-01
With increasing necessity for intuitive Soldier-robot communication in military operations and advancements in interactive technologies, autonomous robots have transitioned from assistance tools to functional and operational teammates able to service an array of military operations. Despite improvements in gesture and speech recognition technologies, their effectiveness in supporting Soldier-robot communication is still uncertain. The purpose of the present study was to evaluate the performance of gesture and speech interface technologies to facilitate Soldier-robot communication during a spatial-navigation task with an autonomous robot. Semantically based gesture and speech spatial-navigation commands leveraged existing lexicons for visual and verbal communication from the U.S. Army field manual for visual signaling and a previously established Squad Level Vocabulary (SLV). Speech commands were recorded by a lapel microphone and a Microsoft Kinect, and classified by commercial off-the-shelf automatic speech recognition (ASR) software. Visual signals were captured and classified using a custom wireless gesture glove and software. Participants in the experiment commanded a robot to complete a simulated ISR mission in a scaled-down urban scenario by delivering a sequence of gesture and speech commands, both individually and simultaneously, to the robot. Performance and reliability of gesture and speech hardware interfaces and recognition tools were analyzed and reported. Analysis of experimental results demonstrated that the employed gesture technology has significant potential for enabling bidirectional Soldier-robot team dialogue, based on the high classification accuracy and minimal training required to perform gesture commands.
Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems
NASA Technical Reports Server (NTRS)
Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan
2010-01-01
A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exist in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers a fast rate of data/text entry, a small overall size, and light weight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks when faced with constraints on computational resources.
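As a rough illustration of the multichannel stage described in the abstract above, the following NumPy sketch implements a simple delay-and-sum beamformer; the array geometry, steering delays, and synthetic signals are assumptions for illustration, not the NASA system's actual design, which combines beamforming with several further processing stages.

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Align multichannel recordings by integer sample delays and average them.

    channels: 2-D array of shape (n_mics, n_samples)
    delays_samples: per-microphone steering delays (integer samples), chosen so
    the desired talker's wavefront lines up across channels.
    """
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        d = delays_samples[m]
        shifted = np.roll(channels[m], -d)   # advance channel m by d samples
        if d > 0:
            shifted[-d:] = 0.0               # zero the wrapped-around tail
        out += shifted
    return out / n_mics                      # coherent speech adds up, diffuse noise averages down

# Illustrative use with synthetic data (4 microphones, 1 s at 16 kHz):
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
mics = np.stack([np.roll(clean, d) + 0.5 * rng.standard_normal(16000) for d in (0, 3, 5, 8)])
enhanced = delay_and_sum(mics, [0, 3, 5, 8])
```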
MARTI: man-machine animation real-time interface
NASA Astrophysics Data System (ADS)
Jones, Christian M.; Dlay, Satnam S.
1997-05-01
The research introduces MARTI (man-machine animation real-time interface) for the realization of natural human-machine interfacing. The system uses simple vocal sound-tracks of human speakers to provide lip synchronization of computer graphical facial models. We present novel research in a number of engineering disciplines, which include speech recognition, facial modeling, and computer animation. This interdisciplinary research utilizes the latest hybrid connectionist/hidden Markov model speech recognition system to provide very accurate phone recognition and timing for speaker-independent continuous speech, and expands on knowledge from the animation industry in the development of accurate facial models and automated animation. The research has many real-world applications, which include the provision of a highly accurate and 'natural' man-machine interface to assist user interactions with computer systems and communication with one another using human idiosyncrasies; a complete special effects and animation toolbox providing automatic lip synchronization without the normal constraints of head-sets, joysticks, and skilled animators; compression of video data to well below standard telecommunication channel bandwidth for video communications and multi-media systems; assisting speech training and aids for the handicapped; and facilitating player interaction for 'video gaming' and 'virtual worlds.' MARTI has introduced a new level of realism to man-machine interfacing and special effect animation which has been previously unseen.
ERIC Educational Resources Information Center
Brown, Carrie; And Others
This final report describes activities and outcomes of a research project on a sound-to-speech translation system utilizing a graphic mediation interface for students with severe disabilities. The STS/Graphics system is a voice recognition, computer-based system designed to allow individuals with mental retardation and/or severe physical…
The Use of Spatialized Speech in Auditory Interfaces for Computer Users Who Are Visually Impaired
ERIC Educational Resources Information Center
Sodnik, Jaka; Jakus, Grega; Tomazic, Saso
2012-01-01
Introduction: This article reports on a study that explored the benefits and drawbacks of using spatially positioned synthesized speech in auditory interfaces for computer users who are visually impaired (that is, are blind or have low vision). The study was a practical application of such systems--an enhanced word processing application compared…
Speaking Math--A Voice Input, Speech Output Calculator for Students with Visual Impairments
ERIC Educational Resources Information Center
Bouck, Emily C.; Flanagan, Sara; Joshi, Gauri S.; Sheikh, Waseem; Schleppenbach, Dave
2011-01-01
This project explored a newly developed computer-based voice input, speech output (VISO) calculator. Three high school students with visual impairments educated at a state school for the blind and visually impaired participated in the study. The time they took to complete assessments and the average number of attempts per problem were recorded…
ERIC Educational Resources Information Center
Chung, King; Killion, Mead C.; Christensen, Laurel A.
2007-01-01
Purpose: To determine the rankings of 6 input-output functions for understanding low-level, conversational, and high-level speech in multitalker babble without manipulating volume control for listeners with normal hearing, flat sensorineural hearing loss, and mildly sloping sensorineural hearing loss. Method: Peak clipping, compression limiting,…
ERIC Educational Resources Information Center
Love, Tracy; Walenski, Matthew; Swinney, David
2009-01-01
The central question underlying this study revolves around how children process co-reference relationships--such as those evidenced by pronouns ("him") and reflexives ("himself")--and how a slowed rate of speech input may critically affect this process. Previous studies of child language processing have demonstrated that typical language…
How Salient Are Onomatopoeia in the Early Input? A Prosodic Analysis of Infant-Directed Speech
ERIC Educational Resources Information Center
Laing, Catherine E.; Vihman, Marilyn; Keren-Portnoy, Tamar
2017-01-01
Onomatopoeia are frequently identified amongst infants' earliest words (Menn & Vihman, 2011), yet few authors have considered why this might be, and even fewer have explored this phenomenon empirically. Here we analyze mothers' production of onomatopoeia in infant-directed speech (IDS) to provide an input-based perspective on these forms.…
A Deep Ensemble Learning Method for Monaural Speech Separation.
Zhang, Xiao-Lei; Wang, DeLiang
2016-03-01
Monaural speech separation is a fundamental problem in robust speech processing. Recently, deep neural network (DNN)-based speech separation methods, which predict either clean speech or an ideal time-frequency mask, have demonstrated remarkable performance improvement. However, a single DNN with a given window length does not leverage contextual information sufficiently, and the differences between the two optimization objectives are not well understood. In this paper, we propose a deep ensemble method, named multicontext networks, to address monaural speech separation. The first multicontext network averages the outputs of multiple DNNs whose inputs employ different window lengths. The second multicontext network is a stack of multiple DNNs. Each DNN in a module of the stack takes the concatenation of original acoustic features and expansion of the soft output of the lower module as its input, and predicts the ratio mask of the target speaker; the DNNs in the same module employ different contexts. We have conducted extensive experiments with three speech corpora. The results demonstrate the effectiveness of the proposed method. We have also compared the two optimization objectives systematically and found that predicting the ideal time-frequency mask is more efficient in utilizing clean training speech, while predicting clean speech is less sensitive to SNR variations.
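As a rough sketch of the first multicontext scheme described above (averaging the mask estimates of DNNs whose inputs use different context window lengths), the NumPy code below shows the context splicing and the ensemble average; the model objects are placeholders with a `.predict` method, not the authors' trained networks.

```python
import numpy as np

def stack_context(frames, w):
    """Concatenate each spectrogram frame with its w left and w right neighbours,
    giving one DNN a (2w+1)-frame input window."""
    padded = np.pad(frames, ((w, w), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * w + 1)])

def multicontext_average(models, noisy_frames, windows):
    """First multicontext scheme: average the ratio-mask estimates of several
    DNNs whose inputs employ different context window lengths."""
    masks = [m.predict(stack_context(noisy_frames, w)) for m, w in zip(models, windows)]
    return np.mean(masks, axis=0)   # ensemble mask, applied point-wise to the mixture
```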
Performing speech recognition research with hypercard
NASA Technical Reports Server (NTRS)
Shepherd, Chip
1993-01-01
The purpose of this paper is to describe a HyperCard-based system for performing speech recognition research and to instruct Human Factors professionals on how to use the system to obtain detailed data about the user interface of a prototype speech recognition application.
System and methods for reducing harmonic distortion in electrical converters
Kajouke, Lateef A; Perisic, Milun; Ransom, Ray M
2013-12-03
Systems and methods are provided for delivering energy using an energy conversion module. An exemplary method for delivering energy from an input interface to an output interface using an energy conversion module coupled between the input interface and the output interface comprises the steps of determining an input voltage reference for the input interface based on a desired output voltage and a measured voltage at the output interface, determining a duty cycle control value based on a ratio of the input voltage reference and the measured voltage, and operating one or more switching elements of the energy conversion module to deliver energy from the input interface to the output interface with a duty cycle influenced by the duty cycle control value.
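The ratio-based control value in the claim can be illustrated in a few lines of Python; the clamping limits and the purely feedforward form below are assumptions for illustration, not taken from the patent.

```python
def duty_cycle_from_ratio(v_in_reference, v_measured, d_min=0.05, d_max=0.95):
    """Derive a duty-cycle control value from the ratio of the input-voltage
    reference to the measured voltage, clamped to a safe switching range."""
    if v_measured <= 0.0:
        return d_min                      # avoid division by zero at start-up
    d = v_in_reference / v_measured
    return min(max(d, d_min), d_max)

# Example: a 200 V input-voltage reference against a measured 400 V -> 0.5 duty cycle
print(duty_cycle_from_ratio(200.0, 400.0))
```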
ERIC Educational Resources Information Center
Jones, Tom; Di Salvo, Vince
A computerized content analysis of the "theory input" for a basic speech course was conducted. The questions to be answered were (1) What does the inexperienced basic speech student hold as a conceptual perspective of the "speech to inform" prior to his being subjected to a college speech class? and (2) How does that inexperienced student's…
Computer-Mediated Input, Output and Feedback in the Development of L2 Word Recognition from Speech
ERIC Educational Resources Information Center
Matthews, Joshua; Cheng, Junyu; O'Toole, John Mitchell
2015-01-01
This paper reports on the impact of computer-mediated input, output and feedback on the development of second language (L2) word recognition from speech (WRS). A quasi-experimental pre-test/treatment/post-test research design was used involving three intact tertiary level English as a Second Language (ESL) classes. Classes were either assigned to…
APEX/SPIN: a free test platform to measure speech intelligibility.
Francart, Tom; Hofmann, Michael; Vanthornhout, Jonas; Van Deun, Lieselot; van Wieringen, Astrid; Wouters, Jan
2017-02-01
Measuring speech intelligibility in quiet and noise is important in clinical practice and research. An easy-to-use free software platform for conducting speech tests is presented, called APEX/SPIN. The APEX/SPIN platform allows the use of any speech material in combination with any noise. A graphical user interface provides control over a large range of parameters, such as number of loudspeakers, signal-to-noise ratio and parameters of the procedure. An easy-to-use graphical interface is provided for calibration and storage of calibration values. To validate the platform, perception of words in quiet and sentences in noise was measured both with APEX/SPIN and with an audiometer and CD player, which is a conventional setup in current clinical practice. Five normal-hearing listeners participated in the experimental evaluation. Speech perception results were similar for the APEX/SPIN platform and conventional procedures. APEX/SPIN is a freely available and open source platform that allows the administration of all kinds of custom speech perception tests and procedures.
Analyzing Distributional Learning of Phonemic Categories in Unsupervised Deep Neural Networks
Räsänen, Okko; Nagamine, Tasha; Mesgarani, Nima
2017-01-01
Infants’ speech perception adapts to the phonemic categories of their native language, a process assumed to be driven by the distributional properties of speech. This study investigates whether deep neural networks (DNNs), the current state-of-the-art in distributional feature learning, are capable of learning phoneme-like representations of speech in an unsupervised manner. We trained DNNs with unlabeled and labeled speech and analyzed the activations of each layer with respect to the phones in the input segments. The analyses reveal that the emergence of phonemic invariance in DNNs is dependent on the availability of phonemic labeling of the input during the training. No increased phonemic selectivity of the hidden layers was observed in the purely unsupervised networks despite successful learning of low-dimensional representations for speech. This suggests that additional learning constraints or more sophisticated models are needed to account for the emergence of phone-like categories in distributional learning operating on natural speech. PMID:29359204
Chen, Chien-Hsu; Wang, Chuan-Po; Lee, I-Jui; Su, Chris Chun-Chin
2016-01-01
We analyzed the efficacy of the interface design of speech-generating devices for three non-verbal adolescents with autism spectrum disorder (ASD), in hopes of improving their on-campus communication and compensating for their cognitive disability. The intervention program was created based on their social and communication needs in school. Two operating interfaces were designed and compared: the Hierarchical Relating Menu and the Pie Abbreviation-Expansion Menu. The experiment used the ABCACB multiple-treatment reversal design. The test items included: (1) accuracy of operating identification; (2) interface operation in response to questions; (3) degree of independent completion. Each of these three items improved with both intervention interfaces. The children were able to operate the interfaces skillfully and respond to questions accurately, which evidenced the effectiveness of the interfaces. We conclude that both interfaces are efficacious enough to help nonverbal children with ASD at different levels.
Self-organizing map classifier for stressed speech recognition
NASA Astrophysics Data System (ADS)
Partila, Pavol; Tovarek, Jaromir; Voznak, Miroslav
2016-05-01
This paper presents a method for detecting speech under stress using self-organizing maps. Most people who are exposed to stressful situations cannot respond adequately to stimuli. The army, police, and fire departments operate largely in environments typified by an increased number of stressful situations. The role of personnel in action is controlled by the control center, and control commands should be adapted to the psychological state of the person in action. It is known that psychological changes in the human body are also reflected physiologically, which means that stress affects speech. Therefore, it is clear that a system for recognizing stress in speech is needed by the security forces. One possible classifier, popular for its flexibility, is the self-organizing map (SOM), a type of artificial neural network. Flexibility here means that the classifier is independent of the character of the input data, a feature that is well suited to speech processing. Human stress can be seen as a kind of emotional state. Mel-frequency cepstral coefficients, LPC coefficients, and prosody features were selected as input data because of their sensitivity to emotional changes. The parameters were calculated from speech recordings that can be divided into two classes, namely stress-state recordings and normal-state recordings. The contribution of the experiment is a method using an SOM classifier for stressed-speech detection. The results showed the advantage of this method, which is its flexibility with respect to the input data.
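A minimal self-organizing map training loop in NumPy, to illustrate the classifier family used above; the map size, learning-rate schedule, and random "feature" vectors stand in for the MFCC/LPC/prosody vectors of the study and are not the authors' settings.

```python
import numpy as np

def train_som(data, rows=8, cols=8, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a small self-organizing map on feature vectors (one vector per row of data)."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    weights = rng.standard_normal((rows, cols, dim))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(epochs):
        lr = lr0 * np.exp(-t / epochs)          # decaying learning rate
        sigma = sigma0 * np.exp(-t / epochs)    # shrinking neighbourhood radius
        for x in data[rng.permutation(n)]:
            dist = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dist), (rows, cols))   # best-matching unit
            h = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=-1) / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
    # after training, map nodes would be labelled (stressed / normal) by majority vote
    return weights

features = np.random.default_rng(1).standard_normal((200, 13))  # e.g. 13 MFCCs per utterance
som_weights = train_som(features)
```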
Signal Prediction With Input Identification
NASA Technical Reports Server (NTRS)
Juang, Jer-Nan; Chen, Ya-Chin
1999-01-01
A novel coding technique is presented for signal prediction with applications including speech coding, system identification, and estimation of input excitation. The approach is based on the blind equalization method for speech signal processing in conjunction with the geometric subspace projection theory to formulate the basic prediction equation. The speech-coding problem is often divided into two parts, a linear prediction model and excitation input. The parameter coefficients of the linear predictor and the input excitation are solved simultaneously and recursively by a conventional recursive least-squares algorithm. The excitation input is computed by coding all possible outcomes into a binary codebook. The coefficients of the linear predictor and excitation, and the index of the codebook can then be used to represent the signal. In addition, a variable-frame concept is proposed to block the same excitation signal in sequence in order to reduce the storage size and increase the transmission rate. The results of this work can be easily extended to the problem of disturbance identification. The basic principles are outlined in this report and differences from other existing methods are discussed. Simulations are included to demonstrate the proposed method.
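A sketch of the recursive least-squares update used to estimate linear-prediction coefficients sample by sample, in NumPy; the model order, forgetting factor, and synthetic test signal are illustrative, and the binary codebook search for the excitation described above is omitted.

```python
import numpy as np

def rls_linear_predictor(signal, order=10, lam=0.99, delta=100.0):
    """Recursively estimate linear-prediction coefficients with exponential forgetting.

    Returns the final coefficient vector and the per-sample prediction residual
    (which, in the coding scheme, would be matched against a codebook).
    """
    w = np.zeros(order)                 # predictor coefficients
    P = delta * np.eye(order)           # inverse correlation matrix
    residual = np.zeros_like(signal)
    for t in range(order, len(signal)):
        x = signal[t - order:t][::-1]   # most recent 'order' samples
        e = signal[t] - w @ x           # prediction error
        k = P @ x / (lam + x @ P @ x)   # gain vector
        w = w + k * e
        P = (P - np.outer(k, x @ P)) / lam
        residual[t] = e
    return w, residual

# Example on a synthetic AR(2) "speech-like" signal:
rng = np.random.default_rng(0)
s = np.zeros(2000)
for t in range(2, 2000):
    s[t] = 1.6 * s[t - 1] - 0.8 * s[t - 2] + 0.1 * rng.standard_normal()
coeffs, excitation = rls_linear_predictor(s)
```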
Systems and methods for compensating for electrical converter nonlinearities
Perisic, Milun; Ransom, Ray M.; Kajouke, Lateef A.
2013-06-18
Systems and methods are provided for delivering energy from an input interface to an output interface. An electrical system includes an input interface, an output interface, an energy conversion module coupled between the input interface and the output interface, and a control module. The control module determines a duty cycle control value for operating the energy conversion module to produce a desired voltage at the output interface. The control module determines an input power error at the input interface and adjusts the duty cycle control value in a manner that is influenced by the input power error, resulting in a compensated duty cycle control value. The control module operates switching elements of the energy conversion module to deliver energy to the output interface with a duty cycle that is influenced by the compensated duty cycle control value.
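A hedged sketch of the compensation idea above (nudging the duty-cycle command according to the input-power error); the proportional-integral form and the gain values are assumptions made for illustration, not the patent's specific control law.

```python
class DutyCycleCompensator:
    """Adjust a nominal duty-cycle command using the error between the desired
    and the measured input power (a simple PI correction, assumed here)."""

    def __init__(self, kp=1e-4, ki=1e-2, d_min=0.05, d_max=0.95):
        self.kp, self.ki = kp, ki
        self.d_min, self.d_max = d_min, d_max
        self.integral = 0.0

    def update(self, d_nominal, p_desired, p_measured, dt):
        error = p_desired - p_measured          # input power error
        self.integral += error * dt
        d = d_nominal + self.kp * error + self.ki * self.integral
        return min(max(d, self.d_min), self.d_max)   # compensated duty-cycle value

# One control step: nominal 0.5 duty cycle, 1 kW requested, 0.9 kW measured
comp = DutyCycleCompensator()
print(comp.update(0.5, 1000.0, 900.0, dt=1e-4))
```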
Finke, Mareike; Sandmann, Pascale; Bönitz, Hanna; Kral, Andrej; Büchner, Andreas
2016-01-01
Single-sided deaf subjects with a cochlear implant (CI) provide the unique opportunity to compare central auditory processing of the electrical input (CI ear) and the acoustic input (normal-hearing, NH, ear) within the same individual. In these individuals, sensory processing differs between their two ears, while cognitive abilities are the same irrespective of the sensory input. To better understand perceptual-cognitive factors modulating speech intelligibility with a CI, this electroencephalography study examined the central-auditory processing of words, the cognitive abilities, and the speech intelligibility in 10 postlingually single-sided deaf CI users. We found lower hit rates and prolonged response times for word classification during an oddball task for the CI ear when compared with the NH ear. Also, event-related potentials reflecting sensory (N1) and higher-order processing (N2/N4) were prolonged for word classification (targets versus nontargets) with the CI ear compared with the NH ear. Our results suggest that speech processing via the CI ear and the NH ear differs both at sensory (N1) and cognitive (N2/N4) processing stages, thereby affecting the behavioral performance for speech discrimination. These results provide objective evidence for cognition to be a key factor for speech perception under adverse listening conditions, such as the degraded speech signal provided from the CI. © 2016 S. Karger AG, Basel.
An ALE meta-analysis on the audiovisual integration of speech signals.
Erickson, Laura C; Heeg, Elizabeth; Rauschecker, Josef P; Turkeltaub, Peter E
2014-11-01
The brain improves speech processing through the integration of audiovisual (AV) signals. Situations involving AV speech integration may be crudely dichotomized into those where auditory and visual inputs contain (1) equivalent, complementary signals (validating AV speech) or (2) inconsistent, different signals (conflicting AV speech). This simple framework may allow the systematic examination of broad commonalities and differences between AV neural processes engaged by various experimental paradigms frequently used to study AV speech integration. We conducted an activation likelihood estimation meta-analysis of 22 functional imaging studies comprising 33 experiments, 311 subjects, and 347 foci examining "conflicting" versus "validating" AV speech. Experimental paradigms included content congruency, timing synchrony, and perceptual measures, such as the McGurk effect or synchrony judgments, across AV speech stimulus types (sublexical to sentence). Colocalization of conflicting AV speech experiments revealed consistency across at least two contrast types (e.g., synchrony and congruency) in a network of dorsal stream regions in the frontal, parietal, and temporal lobes. There was consistency across all contrast types (synchrony, congruency, and percept) in the bilateral posterior superior/middle temporal cortex. Although fewer studies were available, validating AV speech experiments were localized to other regions, such as ventral stream visual areas in the occipital and inferior temporal cortex. These results suggest that while equivalent, complementary AV speech signals may evoke activity in regions related to the corroboration of sensory input, conflicting AV speech signals recruit widespread dorsal stream areas likely involved in the resolution of conflicting sensory signals. Copyright © 2014 Wiley Periodicals, Inc.
How Speech Communication Training Interfaces with Public Relations Training.
ERIC Educational Resources Information Center
Bosley, Phyllis B.
Speech communication training is a valuable asset for those entering the public relations (PR) field. This notion is reinforced by the 1987 "Design for Undergraduate Public Relations Education," a guide for implementing speech communication courses within a public relations curriculum, and also in the incorporation of oral communication training…
Davis, Matthew H.
2016-01-01
Successful perception depends on combining sensory input with prior knowledge. However, the underlying mechanism by which these two sources of information are combined is unknown. In speech perception, as in other domains, two functionally distinct coding schemes have been proposed for how expectations influence representation of sensory evidence. Traditional models suggest that expected features of the speech input are enhanced or sharpened via interactive activation (Sharpened Signals). Conversely, Predictive Coding suggests that expected features are suppressed so that unexpected features of the speech input (Prediction Errors) are processed further. The present work is aimed at distinguishing between these two accounts of how prior knowledge influences speech perception. By combining behavioural, univariate, and multivariate fMRI measures of how sensory detail and prior expectations influence speech perception with computational modelling, we provide evidence in favour of Prediction Error computations. Increased sensory detail and informative expectations have additive behavioural and univariate neural effects because they both improve the accuracy of word report and reduce the BOLD signal in lateral temporal lobe regions. However, sensory detail and informative expectations have interacting effects on speech representations shown by multivariate fMRI in the posterior superior temporal sulcus. When prior knowledge was absent, increased sensory detail enhanced the amount of speech information measured in superior temporal multivoxel patterns, but with informative expectations, increased sensory detail reduced the amount of measured information. Computational simulations of Sharpened Signals and Prediction Errors during speech perception could both explain these behavioural and univariate fMRI observations. However, the multivariate fMRI observations were uniquely simulated by a Prediction Error and not a Sharpened Signal model. The interaction between prior expectation and sensory detail provides evidence for a Predictive Coding account of speech perception. Our work establishes methods that can be used to distinguish representations of Prediction Error and Sharpened Signals in other perceptual domains. PMID:27846209
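A toy NumPy contrast between the two coding schemes discussed above, using made-up "expected" and "heard" feature vectors; this only illustrates how the two representations diverge when the input matches the prior, and does not reproduce the authors' simulations.

```python
import numpy as np

prior = np.array([0.7, 0.2, 0.1])   # expectation over three candidate speech features
heard = np.array([0.6, 0.3, 0.1])   # normalized sensory evidence

# Sharpened-signal scheme: expected features are enhanced (prior multiplies the input)
sharpened = prior * heard
sharpened /= sharpened.sum()

# Predictive-coding scheme: expected features are subtracted, leaving prediction error
prediction_error = heard - prior

print("sharpened representation:", sharpened)   # peaks where prior and input agree
print("prediction error:", prediction_error)    # near zero when the input matches the prior
```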
Imitation of Para-Phonological Detail Following Left Hemisphere Lesions
ERIC Educational Resources Information Center
Kappes, Juliane; Baumgaertner, Annette; Peschke, Claudia; Goldenberg, Georg; Ziegler, Wolfram
2010-01-01
Imitation in speech refers to the unintentional transfer of phonologically irrelevant acoustic-phonetic information of auditory input into speech motor output. Evidence for such imitation effects has been explained within the framework of episodic theories. However, it is largely unclear, which neural structures mediate speech imitation and how…
Real-Time Control of an Articulatory-Based Speech Synthesizer for Brain Computer Interfaces
Bocquelet, Florent; Hueber, Thomas; Girin, Laurent; Savariaux, Christophe; Yvert, Blaise
2016-01-01
Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real-time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real-time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real-time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the position of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained as assessed by perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open to future speech BCI applications using such articulatory-based speech synthesizer. PMID:27880768
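A stand-in for the articulatory-to-acoustic mapping stage described above, using scikit-learn's MLPRegressor in place of the authors' DNN and random arrays in place of the EMA and vocoder features; the feature dimensions are invented and the vocoder synthesis step is not shown.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
ema_frames = rng.standard_normal((2000, 18))       # e.g. x/y/z of six articulator sensors per frame
acoustic_frames = rng.standard_normal((2000, 25))  # e.g. spectral parameters for a vocoder

# Train a feed-forward network to map articulatory frames to acoustic parameters.
dnn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=100, random_state=0)
dnn.fit(ema_frames, acoustic_frames)

# At run time, each incoming EMA frame is mapped to acoustic parameters,
# which a vocoder would then turn into an audio block.
new_frame = rng.standard_normal((1, 18))
predicted_params = dnn.predict(new_frame)
```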
A model of serial order problems in fluent, stuttered and agrammatic speech.
Howell, Peter
2007-10-01
Many models of speech production have attempted to explain dysfluent speech. Most models assume that the disruptions that occur when speech is dysfluent arise because the speakers make errors while planning an utterance. In this contribution, a model of the serial order of speech is described that does not make this assumption. It involves the coordination or 'interlocking' of linguistic planning and execution stages at the language-speech interface. The model is examined to determine whether it can distinguish two forms of dysfluent speech (stuttered and agrammatic speech) that are characterized by iteration and omission of whole words and parts of words.
Parental numeric language input to Mandarin Chinese and English speaking preschool children.
Chang, Alicia; Sandhofer, Catherine M; Adelchanow, Lauren; Rottman, Benjamin
2011-03-01
The present study examined the number-specific parental language input to Mandarin- and English-speaking preschool-aged children. Mandarin and English transcripts from the CHILDES database were examined for amount of numeric speech, specific types of numeric speech and syntactic frames in which numeric speech appeared. The results showed that Mandarin-speaking parents talked about number more frequently than English-speaking parents. Further, the ways in which parents talked about number terms in the two languages was more supportive of a cardinal interpretation in Mandarin than in English. We discuss these results in terms of their implications for numerical understanding and later mathematical performance.
Why talk with children matters: clinical implications of infant- and child-directed speech research.
Ratner, Nan Bernstein
2013-11-01
This article reviews basic features of infant- or child-directed speech, with particular attention to those aspects of the register that have been shown to impact profiles of child language development. It then discusses concerns that arise when describing adult input to children with language delay or disorder, or children at risk for depressed language skills. The article concludes with some recommendations for parent counseling in such cases, as well as methods that speech-language pathologists can use to improve the quality and quantity of language input to language-learning children.
Automatic Speech Acquisition and Recognition for Spacesuit Audio Systems
NASA Technical Reports Server (NTRS)
Ye, Sherry
2015-01-01
NASA has a widely recognized but unmet need for novel human-machine interface technologies that can facilitate communication during astronaut extravehicular activities (EVAs), when loud noises and strong reverberations inside spacesuits make communication challenging. WeVoice, Inc., has developed a multichannel signal-processing method for speech acquisition in noisy and reverberant environments that enables automatic speech recognition (ASR) technology inside spacesuits. The technology reduces noise by exploiting differences between the statistical nature of signals (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, ASR accuracy can be improved to the level at which crewmembers will find the speech interface useful. System components and features include beam forming/multichannel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, and ASR decoding. Arithmetic complexity models were developed and will help designers of real-time ASR systems select proper tasks when confronted with constraints in computational resources. In Phase I of the project, WeVoice validated the technology. The company further refined the technology in Phase II and developed a prototype for testing and use by suited astronauts.
Working Papers in Experimental Speech-Language Pathology and Audiology.
ERIC Educational Resources Information Center
Jablon, Ann, Ed.; And Others
Five papers describe clinical research completed by graduate students in speech-language pathology and audiology. The first study examines the linguistic input of three adults (a mother, teacher, and clinician) to a language impaired 8-year-old. The clinician's approach, less directive than that of the other two, facilitated spontaneous speech and…
Modulation, Adaptation, and Control of Orofacial Pathways in Healthy Adults
ERIC Educational Resources Information Center
Estep, Meredith E.
2009-01-01
Although the healthy adult possesses a large repertoire of coordinative strategies for oromotor behaviors, a range of nonverbal, speech-like movements can be observed during speech. The extent of overlap among sensorimotor speech and nonspeech neural correlates and the role of neuromodulatory inputs generated during oromotor behaviors are unknown.…
Massively-Parallel Architectures for Automatic Recognition of Visual Speech Signals
1988-10-12
Terrence J. … characteristics of speech from the visual speech signals. Neural networks have been trained on a database of vowels. The raw images of faces, aligned and preprocessed, were used as input to these networks, which were trained to estimate the corresponding envelope of the
Incorporating Speech Recognition into a Natural User Interface
NASA Technical Reports Server (NTRS)
Chapa, Nicholas
2017-01-01
The Augmented/Virtual Reality (AVR) Lab has been working to study the applicability of recent virtual and augmented reality hardware and software to KSC operations. This includes the Oculus Rift, HTC Vive, Microsoft HoloLens, and Unity game engine. My project in this lab is to integrate voice recognition and voice commands into an easy-to-modify system that can be added to an existing portion of a Natural User Interface (NUI). A NUI is an intuitive and simple-to-use interface incorporating visual, touch, and speech recognition. The inclusion of speech recognition capability will allow users to perform actions or make inquiries using only their voice. The simplicity of needing only to speak to control an on-screen object or enact some digital action means that any user can quickly become accustomed to using this system. Multiple programs were tested for use in a speech command and recognition system. Sphinx4 translates speech to text using a Hidden Markov Model (HMM) based language model, an acoustic model, and a word dictionary, running on Java. PocketSphinx had similar functionality to Sphinx4 but was written in C. However, neither of these programs was ideal, as building a Java or C wrapper slowed performance. The most suitable speech recognition system tested was the Unity Engine Grammar Recognizer. A Context Free Grammar (CFG) structure is written in an XML file to specify the structure of phrases and words that will be recognized by the Unity Grammar Recognizer. Using Speech Recognition Grammar Specification (SRGS) 1.0 makes modifying the recognized combinations of words and phrases very simple and quick. With SRGS 1.0, semantic information can also be added to the XML file, which allows for even more control over how spoken words and phrases are interpreted by Unity. Additionally, using a CFG with SRGS 1.0 produces Finite State Machine (FSM) functionality, limiting the potential for incorrectly heard words or phrases. The purpose of my project was to investigate options for a speech recognition system. To that end, I attempted to integrate Sphinx4 into a user interface. Sphinx4 had great accuracy and is the only free program able to perform offline speech dictation. However, it had a limited dictionary of words that could be recognized, single-syllable words were almost impossible for it to hear, and since it ran on Java it could not be integrated into the Unity-based NUI. PocketSphinx ran much faster than Sphinx4, which would have made it ideal as a plugin to the Unity NUI; unfortunately, creating a C# wrapper for the C code made the program unusable with Unity because the wrapper slowed code execution and class files became unreachable. The Unity Grammar Recognizer is the ideal speech recognition interface: it is flexible in recognizing multiple variations of the same command. It is also the most accurate program at recognizing speech because it uses an XML grammar to specify speech structure instead of relying solely on a dictionary and language model. The Unity Grammar Recognizer will be used with the NUI for these reasons, as well as being written in C#, which further simplifies its incorporation.
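As a language-agnostic illustration of why a small grammar keeps recognition tractable (the Unity recognizer itself consumes SRGS 1.0 XML, which is not reproduced here), the Python sketch below enumerates the phrases a toy command grammar licenses and checks a hypothesis against them; the verbs and object names are invented.

```python
from itertools import product

# Toy grammar: <command> ::= <verb> "the" <object>
VERBS = ["open", "close", "rotate", "hide"]
OBJECTS = ["hatch", "panel", "schematic"]

ALLOWED_PHRASES = {f"{v} the {o}" for v, o in product(VERBS, OBJECTS)}

def recognize(hypothesis: str):
    """Accept a spoken hypothesis only if the grammar licenses it."""
    phrase = hypothesis.lower().strip()
    return phrase if phrase in ALLOWED_PHRASES else None

print(recognize("Open the hatch"))   # -> "open the hatch"
print(recognize("open a hatch"))     # -> None (not in the grammar)
```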
ERIC Educational Resources Information Center
Brown-Schmidt, Sarah; Konopka, Agnieszka E.
2008-01-01
During unscripted speech, speakers coordinate the formulation of pre-linguistic messages with the linguistic processes that implement those messages into speech. We examine the process of constructing a contextually appropriate message and interfacing that message with utterance planning in English ("the small butterfly") and Spanish ("la mariposa…
A Wavelet Model for Vocalic Speech Coarticulation
1994-10-01
… transformation from a control speech state (input) to an effected speech state (output). Specifically, a vowel produced in isolation is transformed into an … the wavelet transform of the effected vowel's signal, using the control vowel's signal as the mother wavelet. A practical experiment is conducted to evaluate the coarticulation channel using samples of real speech.
Input Devices for Young Handicapped Children.
ERIC Educational Resources Information Center
Morris, Karen
The versatility of the computer can be expanded considerably for young handicapped children by using input devices other than the typewriter-style keyboard. Input devices appropriate for young children can be classified into four categories: alternative keyboards, contact switches, speech input devices, and cursor control devices. Described are…
ERIC Educational Resources Information Center
Dorman, Michael F.; Natale, Sarah; Spahr, Anthony; Castioni, Erin
2017-01-01
Purpose: The aim of this experiment was to compare, for patients with cochlear implants (CIs), the improvement for speech understanding in noise provided by a monaural adaptive beamformer and for two interventions that produced bilateral input (i.e., bilateral CIs and hearing preservation [HP] surgery). Method: Speech understanding scores for…
The Acquisition of Relative Clauses in Spontaneous Child Speech in Mandarin Chinese
ERIC Educational Resources Information Center
Chen, Jidong; Shirai, Yasuhiro
2015-01-01
This study investigates the developmental trajectory of relative clauses (RCs) in Mandarin-learning children's speech. We analyze the spontaneous production of RCs by four monolingual Mandarin-learning children (0;11 to 3;5) and their input from a longitudinal naturalistic speech corpus (Min, 1994). The results reveal that in terms of the…
Evaluation of Speech Perception via the Use of Hearing Loops and Telecoils
Holmes, Alice E.; Kricos, Patricia B.; Gaeta, Laura; Martin, Sheridan
2015-01-01
A cross-sectional, experimental, and randomized repeated-measures design study was used to examine the objective and subjective value of telecoil and hearing loop systems. Word recognition and speech perception were tested in 12 older adult hearing aid users using the telecoil and microphone inputs in quiet and noise conditions. Participants were asked to subjectively rate cognitive listening effort and self-confidence for each condition. Significant improvement in speech perception with the telecoil over microphone input in both quiet and noise was found along with significantly less reported cognitive listening effort and high self-confidence. The use of telecoils with hearing aids should be recommended for older adults with hearing loss. PMID:28138458
Deep Learning Based Binaural Speech Separation in Reverberant Environments.
Zhang, Xueliang; Wang, DeLiang
2017-05-01
Speech signal is usually degraded by room reverberation and additive noises in real environments. This paper focuses on separating target speech signal in reverberant conditions from binaural inputs. Binaural separation is formulated as a supervised learning problem, and we employ deep learning to map from both spatial and spectral features to a training target. With binaural inputs, we first apply a fixed beamformer and then extract several spectral features. A new spatial feature is proposed and extracted to complement the spectral features. The training target is the recently suggested ideal ratio mask. Systematic evaluations and comparisons show that the proposed system achieves very good separation performance and substantially outperforms related algorithms under challenging multi-source and reverberant environments.
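For reference, one common form of the ideal ratio mask used above as the training target can be written in a few lines, assuming magnitude spectrograms of the premixed speech and noise are available (which is only true for training data); the square-root exponent is a frequently used convention, not necessarily the exact definition in the paper.

```python
import numpy as np

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-8):
    """Ideal ratio mask per time-frequency unit: speech energy over total energy.

    speech_mag, noise_mag: magnitude spectrograms of identical shape (freq x time).
    """
    s2, n2 = speech_mag ** 2, noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))

def apply_mask(mixture_mag, mask):
    """Estimate the target speech magnitude by point-wise masking of the mixture."""
    return mixture_mag * mask
```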
Chasin, Marshall; Russo, Frank A.
2004-01-01
Historically, the primary concern for hearing aid design and fitting is optimization for speech inputs. However, increasingly other types of inputs are being investigated and this is certainly the case for music. Whether the hearing aid wearer is a musician or merely someone who likes to listen to music, the electronic and electro-acoustic parameters described can be optimized for music as well as for speech. That is, a hearing aid optimally set for music can be optimally set for speech, even though the converse is not necessarily true. Similarities and differences between speech and music as inputs to a hearing aid are described. Many of these lead to the specification of a set of optimal electro-acoustic characteristics. Parameters such as the peak input-limiting level, compression issues—both compression ratio and knee-points—and number of channels all can deleteriously affect music perception through hearing aids. In other cases, it is not clear how to set other parameters such as noise reduction and feedback control mechanisms. Regardless of the existence of a “music program,” unless the various electro-acoustic parameters are available in a hearing aid, music fidelity will almost always be less than optimal. There are many unanswered questions and hypotheses in this area. Future research by engineers, researchers, clinicians, and musicians will aid in the clarification of these questions and their ultimate solutions. PMID:15497032
Texting while driving: is speech-based text entry less risky than handheld text entry?
He, J; Chaparro, A; Nguyen, B; Burge, R J; Crandall, J; Chaparro, B; Ni, R; Cao, S
2014-11-01
Research indicates that using a cell phone to talk or text while maneuvering a vehicle impairs driving performance. However, few published studies directly compare the distracting effects of texting using a hands-free (i.e., speech-based interface) versus handheld cell phone, which is an important issue for legislation, automotive interface design and driving safety training. This study compared the effect of speech-based versus handheld text entries on simulated driving performance by asking participants to perform a car following task while controlling the duration of a secondary text-entry task. Results showed that both speech-based and handheld text entries impaired driving performance relative to the drive-only condition by causing more variation in speed and lane position. Handheld text entry also increased the brake response time and increased variation in headway distance. Text entry using a speech-based cell phone was less detrimental to driving performance than handheld text entry. Nevertheless, the speech-based text entry task still significantly impaired driving compared to the drive-only condition. These results suggest that speech-based text entry disrupts driving, but reduces the level of performance interference compared to text entry with a handheld device. In addition, the difference in the distraction effect caused by speech-based and handheld text entry is not simply due to the difference in task duration. Copyright © 2014 Elsevier Ltd. All rights reserved.
Research on oral test modeling based on multi-feature fusion
NASA Astrophysics Data System (ADS)
Shi, Yuliang; Tao, Yiyue; Lei, Jun
2018-04-01
In this paper, the spectrogram of the speech signal is taken as the input for feature extraction. The advantages of the PCNN in image segmentation and related processing are exploited to process the speech spectrogram and extract features, exploring a new method that combines speech signal processing with image processing. In addition to the spectrogram features, MFCCs are extracted as spectral features and fused with the spectrogram features to further improve the accuracy of spoken-language assessment. Because the resulting input features are relatively complex and discriminative, a Support Vector Machine (SVM) is used to construct the classifier, and the extracted test-voice features are then compared with standard-voice features to detect how standard the spoken input is. Experiments show that extracting features from spectrograms using the PCNN is feasible, and that fusing image features with spectral features can improve detection accuracy.
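A compact sketch of the MFCC-plus-SVM part of such a pipeline, assuming the librosa and scikit-learn packages; the PCNN spectrogram features are omitted, and the synthetic signals and 0/1 labels stand in for real recordings and pronunciation ratings.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_features(y, sr, n_mfcc=13):
    """Mean MFCC vector for one utterance; in a fused system the PCNN-derived
    spectrogram features would be concatenated alongside these."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Synthetic stand-ins for recorded utterances (real use would load .wav files)
sr = 16000
rng = np.random.default_rng(0)
utterances = [rng.standard_normal(sr) for _ in range(8)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]        # e.g. non-standard vs. standard pronunciation

X = np.stack([utterance_features(y, sr) for y in utterances])
clf = SVC(kernel="rbf", C=1.0).fit(X, labels)
prediction = clf.predict(utterance_features(utterances[0], sr).reshape(1, -1))
```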
NASA Astrophysics Data System (ADS)
Viswanathan, V. R.; Karnofsky, K. F.; Stevens, K. N.; Alakel, M. N.
1983-12-01
The use of multiple sensors to transduce speech was investigated. A data base of speech and noise was collected from a number of transducers located on and around the head of the speaker. The transducers included pressure, first order gradient, second order gradient microphones and an accelerometer. The effort analyzed this data and evaluated the performance of a multiple sensor configuration. The conclusion was: multiple transducer configurations can provide a signal containing more useable speech information than that provided by a microphone.
Temporal factors affecting somatosensory–auditory interactions in speech processing
Ito, Takayuki; Gracco, Vincent L.; Ostry, David J.
2014-01-01
Speech perception is known to rely on both auditory and visual information. However, sound-specific somatosensory input has been shown also to influence speech perceptual processing (Ito et al., 2009). In the present study, we addressed further the relationship between somatosensory information and speech perceptual processing by addressing the hypothesis that the temporal relationship between orofacial movement and sound processing contributes to somatosensory–auditory interaction in speech perception. We examined the changes in event-related potentials (ERPs) in response to multisensory synchronous (simultaneous) and asynchronous (90 ms lag and lead) somatosensory and auditory stimulation compared to individual unisensory auditory and somatosensory stimulation alone. We used a robotic device to apply facial skin somatosensory deformations that were similar in timing and duration to those experienced in speech production. Following synchronous multisensory stimulation the amplitude of the ERP was reliably different from the two unisensory potentials. More importantly, the magnitude of the ERP difference varied as a function of the relative timing of the somatosensory–auditory stimulation. Event-related activity change due to stimulus timing was seen between 160 and 220 ms following somatosensory onset, mostly around the parietal area. The results demonstrate a dynamic modulation of somatosensory–auditory convergence and suggest that the contribution of somatosensory information to speech processing is dependent on the specific temporal order of sensory inputs in speech production. PMID:25452733
Polur, Prasad D; Miller, Gerald E
2006-10-01
Computer speech recognition for individuals with dysarthria, such as cerebral palsy patients, requires a robust technique that can handle conditions of very high variability and limited training data. In this study, application of a 10-state ergodic hidden Markov model (HMM)/artificial neural network (ANN) hybrid structure for a dysarthric speech (isolated word) recognition system, intended to act as an assistive tool, was investigated. A small vocabulary spoken by three cerebral palsy subjects was chosen. The effect of such a structure on the recognition rate of the system was investigated by comparing it with an ergodic hidden Markov model as a control tool. This was done in order to determine if this modified technique contributed to enhanced recognition of dysarthric speech. The speech was sampled at 11 kHz. Mel-frequency cepstral coefficients were extracted using 15 ms frames and served as training input to the hybrid model setup. The subsequent results demonstrated that the hybrid model structure was quite robust in its ability to handle the large variability and non-conformity of dysarthric speech. The level of variability in input dysarthric speech patterns sometimes limits the reliability of the system. However, its application as a rehabilitation/control tool to assist dysarthric motor-impaired individuals holds sufficient promise.
The influence of speech rate and accent on access and use of semantic information.
Sajin, Stanislav M; Connine, Cynthia M
2017-04-01
Circumstances in which the speech input is presented in sub-optimal conditions generally lead to processing costs affecting spoken word recognition. The current study indicates that some processing demands imposed by listening to difficult speech can be mitigated by feedback from semantic knowledge. A set of lexical decision experiments examined how foreign accented speech and word duration impact access to semantic knowledge in spoken word recognition. Results indicate that when listeners process accented speech, the reliance on semantic information increases. Speech rate was not observed to influence semantic access, except in the setting in which unusually slow accented speech was presented. These findings support interactive activation models of spoken word recognition in which attention is modulated based on speech demands.
Multimodal Infant-Directed Communication: How Caregivers Combine Tactile and Linguistic Cues
ERIC Educational Resources Information Center
Abu-Zhaya, Rana; Seidl, Amanda; Cristia, Alejandrina
2017-01-01
Both touch and speech independently have been shown to play an important role in infant development. However, little is known about how they may be combined in the input to the child. We examined the use of touch and speech together by having mothers read their 5-month-olds books about body parts and animals. Results suggest that speech+touch…
Speech-Enabled Interfaces for Travel Information Systems with Large Grammars
NASA Astrophysics Data System (ADS)
Zhao, Baoli; Allen, Tony; Bargiela, Andrzej
This paper introduces three grammar-segmentation methods capable of handling the large grammar issues associated with producing a real-time speech-enabled VXML bus travel application for London. Large grammars tend to produce relatively slow recognition interfaces and this work shows how this limitation can be successfully addressed. Comparative experimental results show that the novel last-word recognition based grammar segmentation method described here achieves an optimal balance between recognition rate, speed of processing and naturalness of interaction.
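The description above does not spell out the segmentation algorithm. One plausible reading of last-word based grammar segmentation is to index the large set of stop-name phrases by their final word, recognize that word first against a small grammar, and then search only the matching sub-grammar. The sketch below illustrates that idea with invented stop names and is not the authors' implementation.

```python
from collections import defaultdict

# Invented stop names standing in for a large bus-stop grammar.
stops = ["Baker Street", "Oxford Circus", "Liverpool Street", "Piccadilly Circus"]

# Build sub-grammars keyed by the final word of each phrase.
sub_grammars = defaultdict(list)
for phrase in stops:
    sub_grammars[phrase.split()[-1].lower()].append(phrase)

def candidates(recognised_last_word):
    """Return the small sub-grammar a full-phrase recognizer would then search."""
    return sub_grammars.get(recognised_last_word.lower(), [])

print(candidates("Circus"))   # ['Oxford Circus', 'Piccadilly Circus']
```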
NASA Astrophysics Data System (ADS)
Palaniswamy, Sumithra; Duraisamy, Prakash; Alam, Mohammad Showkat; Yuan, Xiaohui
2012-04-01
Automatic speech processing systems are widely used in everyday life such as mobile communication, speech and speaker recognition, and for assisting the hearing impaired. In speech communication systems, the quality and intelligibility of speech is of utmost importance for ease and accuracy of information exchange. To obtain an intelligible speech signal that is more pleasant to listen to, noise reduction is essential. In this paper a new Time Adaptive Discrete Bionic Wavelet Thresholding (TADBWT) scheme is proposed. The proposed technique uses the Daubechies mother wavelet to achieve better enhancement of speech from additive non-stationary noises which occur in real life, such as street noise and factory noise. Due to the integration of a human auditory system model into the wavelet transform, the bionic wavelet transform (BWT) has great potential for speech enhancement, which may lead to a new path in speech processing. In the proposed technique, at first, the discrete BWT is applied to noisy speech to derive TADBWT coefficients. Then the adaptive nature of the BWT is captured by introducing a time-varying linear factor which updates the coefficients at each scale over time. This approach has shown better performance than existing algorithms at lower input SNR due to modified soft level-dependent thresholding on time-adaptive coefficients. The objective and subjective test results confirmed the competency of the TADBWT technique. The effectiveness of the proposed technique is also evaluated for a speaker recognition task under noisy environments. The recognition results show that the TADBWT technique yields better performance when compared to alternate methods, specifically at lower input SNR.
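As a rough illustration of wavelet-domain soft thresholding, the sketch below denoises a synthetic signal with a standard Daubechies DWT from PyWavelets standing in for the bionic wavelet transform; it omits the paper's time-adaptive factor, and the universal threshold is an assumption.

```python
# Wavelet soft-thresholding sketch: a standard Daubechies DWT (PyWavelets) stands in
# for the bionic wavelet transform, and the universal threshold is an assumption.
import numpy as np
import pywt

sr = 8000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 220 * t)
noisy = clean + 0.3 * np.random.randn(sr)           # additive noise stand-in

coeffs = pywt.wavedec(noisy, "db8", level=5)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745      # noise estimate from the finest scale
thr = sigma * np.sqrt(2 * np.log(len(noisy)))       # universal threshold (assumed)

denoised_coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
denoised = pywt.waverec(denoised_coeffs, "db8")
```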
Language input and acquisition in a Mayan village: how important is directed speech?
Shneidman, Laura A; Goldin-Meadow, Susan
2012-09-01
Theories of language acquisition have highlighted the importance of adult speakers as active participants in children's language learning. However, in many communities children are reported to be directly engaged by their caregivers only rarely (Lieven, 1994). This observation raises the possibility that these children learn language from observing, rather than participating in, communicative exchanges. In this paper, we quantify naturally occurring language input in one community where directed interaction with children has been reported to be rare (Yucatec Mayan). We compare this input to the input heard by children growing up in large families in the United States, and we consider how directed and overheard input relate to Mayan children's later vocabulary. In Study 1, we demonstrate that 1-year-old Mayan children do indeed hear a smaller proportion of total input in directed speech than children from the US. In Study 2, we show that for Mayan (but not US) children, there are great increases in the proportion of directed input that children receive between 13 and 35 months. In Study 3, we explore the validity of using videotaped data in a Mayan village. In Study 4, we demonstrate that word types directed to Mayan children from adults at 24 months (but not word types overheard by children or word types directed from other children) predict later vocabulary. These findings suggest that adult talk directed to children is important for early word learning, even in communities where much of children's early language input comes from overheard speech. © 2012 Blackwell Publishing Ltd.
Compensation for electrical converter nonlinearities
Perisic, Milun; Ransom, Ray M; Kajouke, Lateef A
2013-11-19
Systems and methods are provided for delivering energy from an input interface to an output interface. An electrical system includes an input interface, an output interface, an energy conversion module between the input interface and the output interface, an inductive element between the input interface and the energy conversion module, and a control module. The control module determines a compensated duty cycle control value for operating the energy conversion module to produce a desired voltage at the output interface and operates the energy conversion module to deliver energy to the output interface with a duty cycle that is influenced by the compensated duty cycle control value. The compensated duty cycle control value is influenced by the current through the inductive element and accounts for voltage across the switching elements of the energy conversion module.
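The patent abstract gives the control idea but not its equations. The toy function below illustrates the general notion of compensating a nominal duty cycle for inductor current and switch voltage drop; the loss model, parameter names, and values are assumptions rather than the patented control law.

```python
# Toy illustration of duty-cycle compensation: the nominal duty cycle for a step-down
# stage is adjusted for voltage lost across the switching elements and the series
# resistance of the inductive element. The loss model and values are assumptions;
# this is not the patented control law.
def compensated_duty_cycle(v_in, v_out_desired, i_inductor,
                           r_series=0.05, v_switch_drop=0.7):
    # Voltage the converter must synthesize so that v_out_desired appears at the output.
    v_required = v_out_desired + i_inductor * r_series + v_switch_drop
    duty = v_required / v_in
    return min(max(duty, 0.0), 1.0)          # clamp to a physically valid duty cycle

print(compensated_duty_cycle(v_in=400.0, v_out_desired=12.0, i_inductor=8.0))
```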
Acquisition of ICU data: concepts and demands.
Imhoff, M
1992-12-01
As data overload is a problem in critical care today, it is of utmost importance to improve the acquisition, storage, integration, and presentation of medical data, which appears feasible only with the help of bedside computers. The data originate from four major sources: (1) the bedside medical devices, (2) the local area network (LAN) of the ICU, (3) the hospital information system (HIS) and (4) manual input. All sources differ markedly in quality and quantity of data and in the demands on the interfaces between the source of data and the patient database. The demands for data acquisition from bedside medical devices, ICU-LAN and HIS concentrate on technical problems, such as computational power, storage capacity, real-time processing, interfacing with different devices and networks and the unmistakable assignment of data to the individual patient. The main problem of manual data acquisition is the definition and configuration of the user interface, which must allow the inexperienced user to interact with the computer intuitively. Emphasis must be put on the construction of a pleasant, logical and easy-to-handle graphical user interface (GUI). Short response times will require high graphical processing capacity. Moreover, high computational resources are necessary in the future for additional interfacing devices such as speech recognition and 3D-GUI. Therefore, in an ICU environment the demands for computational power are enormous. These problems are complicated by the urgent need for friendly and easy-to-handle user interfaces. Both facts place ICU bedside computing at the vanguard of present and future workstation development, leaving no room for solutions based on traditional concepts of personal computers. (ABSTRACT TRUNCATED AT 250 WORDS)
Input Frequency and the Acquisition of Syllable Structure in Polish
ERIC Educational Resources Information Center
Jarosz, Gaja; Calamaro, Shira; Zentz, Jason
2017-01-01
This article examines phonological development and its relationship to input statistics. Using novel data from a longitudinal corpus of spontaneous child speech in Polish, we evaluate and compare the predictions of a variety of input-based phonotactic models for syllable structure acquisition. We find that many commonly examined input statistics…
Eye-gaze and intent: Application in 3D interface control
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schryver, J.C.; Goldberg, J.H.
1993-06-01
Computer interface control is typically accomplished with an input "device" such as a keyboard, mouse, trackball, etc. An input device translates a user's input actions, such as mouse clicks and key presses, into appropriate computer commands. To control the interface, the user must first convert intent into the syntax of the input device. A more natural means of computer control is possible when the computer can directly infer user intent, without need of intervening input devices. We describe an application of eye-gaze-contingent control of an interactive three-dimensional (3D) user interface. A salient feature of the user interface is natural input, with a heightened impression of controlling the computer directly by the mind. With this interface, input of rotation and translation is intuitive, whereas other abstract features, such as zoom, are more problematic to match with user intent. This paper describes successes with implementation to date, and ongoing efforts to develop a more sophisticated intent inferencing methodology.
Davidson, Lisa S; Skinner, Margaret W; Holstad, Beth A; Fears, Beverly T; Richter, Marie K; Matusofsky, Margaret; Brenner, Christine; Holden, Timothy; Birath, Amy; Kettel, Jerrica L; Scollie, Susan
2009-06-01
The purpose of this study was to examine the effects of a wider instantaneous input dynamic range (IIDR) setting on speech perception and comfort in quiet and noise for children wearing the Nucleus 24 implant system and the Freedom speech processor. In addition, children's ability to understand soft and conversational level speech in relation to aided sound-field thresholds was examined. Thirty children (age, 7 to 17 years) with the Nucleus 24 cochlear implant system and the Freedom speech processor with two different IIDR settings (30 versus 40 dB) were tested on the Consonant Nucleus Consonant (CNC) word test at 50 and 60 dB SPL, the Bamford-Kowal-Bench Speech in Noise Test, and a loudness rating task for four-talker speech noise. Aided thresholds for frequency-modulated tones, narrowband noise, and recorded Ling sounds were obtained with the two IIDRs and examined in relation to CNC scores at 50 dB SPL. Speech Intelligibility Indices were calculated using the long-term average speech spectrum of the CNC words at 50 dB SPL measured at each test site and aided thresholds. Group mean CNC scores at 50 dB SPL with the 40 IIDR were significantly higher (p < 0.001) than with the 30 IIDR. Group mean CNC scores at 60 dB SPL, loudness ratings, and the signal-to-noise ratios for 50% correct performance (SNR-50) on the Bamford-Kowal-Bench Speech in Noise Test were not significantly different for the two IIDRs. Significantly improved aided thresholds at 250 to 6000 Hz as well as higher Speech Intelligibility Indices afforded improved audibility for speech presented at soft levels (50 dB SPL). These results indicate that an increased IIDR provides improved word recognition for soft levels of speech without compromising comfort at higher levels of speech sounds or sentence recognition in noise.
"Look What I Did!": Student Conferences with Text-to-Speech Software
ERIC Educational Resources Information Center
Young, Chase; Stover, Katie
2014-01-01
The authors describe a strategy that empowers students to edit and revise their own writing. Students input their writing into text-to-speech software that reads the text back aloud. While listening, students make necessary revisions and edits.
ERIC Educational Resources Information Center
Van Laere, E.; Braak, J.
2017-01-01
Text-to-speech technology can act as an important support tool in computer-based learning environments (CBLEs) as it provides auditory input, next to on-screen text. Particularly for students who use a language at home other than the language of instruction (LOI) applied at school, text-to-speech can be useful. The CBLE E-Validiv offers content in…
Objective speech quality evaluation of real-time speech coders
NASA Astrophysics Data System (ADS)
Viswanathan, V. R.; Russell, W. H.; Huggins, A. W. F.
1984-02-01
This report describes the work performed in two areas: subjective testing of a real-time 16 kbit/s adaptive predictive coder (APC) and objective speech quality evaluation of real-time coders. The speech intelligibility of the APC coder was tested using the Diagnostic Rhyme Test (DRT), and the speech quality was tested using the Diagnostic Acceptability Measure (DAM) test, under eight operating conditions involving channel error, acoustic background noise, and tandem link with two other coders. The test results showed that the DRT and DAM scores of the APC coder equalled or exceeded the corresponding test scores of the 32 kbit/s CVSD coder. In the area of objective speech quality evaluation, the report describes the development, testing, and validation of a procedure for automatically computing several objective speech quality measures, given only the tape recordings of the input speech and the corresponding output speech of a real-time speech coder.
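The report's automatic procedure is not reproduced here, but segmental SNR between a coder's input and output is one widely used objective speech quality measure and gives a flavor of such computations; the frame length and synthetic signals below are assumptions.

```python
# Generic segmental SNR between a coder's input and output; frame length and the
# synthetic signals are assumptions, not the report's specific procedure.
import numpy as np

def segmental_snr(reference, degraded, frame_len=256, eps=1e-10):
    n = min(len(reference), len(degraded))
    snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        err = ref - degraded[start:start + frame_len]
        snrs.append(10 * np.log10((np.sum(ref ** 2) + eps) / (np.sum(err ** 2) + eps)))
    return float(np.mean(snrs))

reference = np.random.randn(8000)
coded = reference + 0.05 * np.random.randn(8000)    # stand-in for a coder's output
print(segmental_snr(reference, coded))
```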
Liu, Xunying; Zhang, Chao; Woodland, Phil; Fonteneau, Elisabeth
2017-01-01
There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental ‘machine states’, generated as the ASR analysis progresses over time, to the incremental ‘brain states’, measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain. PMID:28945744
Speech emotion recognition methods: A literature review
NASA Astrophysics Data System (ADS)
Basharirad, Babak; Moradhaseli, Mohammadreza
2017-10-01
Recently, attention to research on emotional speech signals has been boosted in human-machine interfaces due to the availability of high computation capability. Many systems have been proposed in the literature to identify the emotional state through speech. Selection of suitable feature sets, design of proper classification methods, and preparation of an appropriate dataset are the main key issues of speech emotion recognition systems. This paper critically analyzes the currently available approaches to speech emotion recognition based on three evaluation parameters (feature set, classification of features, and accuracy). In addition, this paper also evaluates the performance and limitations of available methods. Furthermore, it highlights currently promising directions for the improvement of speech emotion recognition systems.
Hannah, Beverly; Wang, Yue; Jongman, Allard; Sereno, Joan A; Cao, Jiguo; Nie, Yunlong
2017-01-01
Speech perception involves multiple input modalities. Research has indicated that perceivers establish cross-modal associations between auditory and visuospatial events to aid perception. Such intermodal relations can be particularly beneficial for speech development and learning, where infants and non-native perceivers need additional resources to acquire and process new sounds. This study examines how facial articulatory cues and co-speech hand gestures mimicking pitch contours in space affect non-native Mandarin tone perception. Native English as well as Mandarin perceivers identified tones embedded in noise with either congruent or incongruent Auditory-Facial (AF) and Auditory-Facial-Gestural (AFG) inputs. Native Mandarin results showed the expected ceiling-level performance in the congruent AF and AFG conditions. In the incongruent conditions, while AF identification was primarily auditory-based, AFG identification was partially based on gestures, demonstrating the use of gestures as valid cues in tone identification. The English perceivers' performance was poor in the congruent AF condition, but improved significantly in AFG. While the incongruent AF identification showed some reliance on facial information, incongruent AFG identification relied more on gestural than auditory-facial information. These results indicate positive effects of facial and especially gestural input on non-native tone perception, suggesting that cross-modal (visuospatial) resources can be recruited to aid auditory perception when phonetic demands are high. The current findings may inform patterns of tone acquisition and development, suggesting how multi-modal speech enhancement principles may be applied to facilitate speech learning. PMID:29255435
NASA Astrophysics Data System (ADS)
Scharenborg, Odette; ten Bosch, Louis; Boves, Lou; Norris, Dennis
2003-12-01
This letter evaluates potential benefits of combining human speech recognition (HSR) and automatic speech recognition by building a joint model of an automatic phone recognizer (APR) and a computational model of HSR, viz., Shortlist [Norris, Cognition 52, 189-234 (1994)]. Experiments based on "real-life" speech highlight critical limitations posed by some of the simplifying assumptions made in models of human speech recognition. These limitations could be overcome by avoiding hard phone decisions at the output side of the APR, and by using a match between the input and the internal lexicon that flexibly copes with deviations from canonical phonemic representations.
NASA Astrophysics Data System (ADS)
Vassiliou, Marius S.; Sundareswaran, Venkataraman; Chen, S.; Behringer, Reinhold; Tam, Clement K.; Chan, M.; Bangayan, Phil T.; McGee, Joshua H.
2000-08-01
We describe new systems for improved integrated multimodal human-computer interaction and augmented reality for a diverse array of applications, including future advanced cockpits, tactical operations centers, and others. We have developed an integrated display system featuring: speech recognition of multiple concurrent users equipped with both standard air-coupled microphones and novel throat-coupled sensors (developed at Army Research Labs for increased noise immunity); lip reading for improving speech recognition accuracy in noisy environments; three-dimensional spatialized audio for improved display of warnings, alerts, and other information; wireless, coordinated handheld-PC control of a large display; real-time display of data and inferences from wireless integrated networked sensors with on-board signal processing and discrimination; gesture control with disambiguated point-and-speak capability; head- and eye-tracking coupled with speech recognition for 'look-and-speak' interaction; and integrated tetherless augmented reality on a wearable computer. The various interaction modalities (speech recognition, 3D audio, eyetracking, etc.) are implemented as 'modality servers' in an Internet-based client-server architecture. Each modality server encapsulates and exposes commercial and research software packages, presenting a socket network interface that is abstracted to a high-level interface, minimizing both vendor dependencies and required changes on the client side as the server's technology improves.
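To make the 'modality server' idea concrete, the sketch below wraps a placeholder recognizer behind a plain TCP socket so that clients remain independent of the underlying package; the one-JSON-message-per-line protocol and port number are assumptions for illustration only.

```python
# Minimal 'modality server' sketch: a placeholder recognizer behind a plain TCP socket,
# so clients stay independent of the wrapped vendor package. The one-JSON-message-per-line
# protocol and port number are assumptions for illustration only.
import json
import socketserver

class ModalityServer(socketserver.StreamRequestHandler):
    def handle(self):
        for line in self.rfile:                      # one request per line
            request = json.loads(line)
            # A real server would call the wrapped speech or gesture engine here.
            result = {"modality": "speech", "echo": request.get("utterance", "")}
            self.wfile.write((json.dumps(result) + "\n").encode())

if __name__ == "__main__":
    with socketserver.TCPServer(("127.0.0.1", 9099), ModalityServer) as server:
        server.serve_forever()
```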
Rimvall, M K; Clemmensen, L; Munkholm, A; Rask, C U; Larsen, J T; Skovgaard, A M; Simons, C J P; van Os, J; Jeppesen, P
2016-10-01
Auditory verbal hallucinations (AVH) are common during development and may arise due to dysregulation in top-down processing of sensory input. This study was designed to examine the frequency and correlates of speech illusions measured using the White Noise (WN) task in children from the general population. Associations between speech illusions and putative risk factors for psychotic disorder and negative affect were examined. A total of 1486 children aged 11-12 years of the Copenhagen Child Cohort 2000 were examined with the WN task. Psychotic experiences and negative affect were determined using the Kiddie-SADS-PL. Register data described family history of mental disorders. Exaggerated Theory of Mind functioning (hyper-ToM) was measured by the ToM Storybook Frederik. A total of 145 (10%) children experienced speech illusions (hearing speech in the absence of speech stimuli), of which 102 (70%) experienced illusions perceived by the child as positive or negative (affectively salient). Experiencing hallucinations during the last month was associated with affectively salient speech illusions in the WN task [general cognitive ability: adjusted odds ratio (aOR) 2.01, 95% confidence interval (CI) 1.03-3.93]. Negative affect, both last month and lifetime, was also associated with affectively salient speech illusions (aOR 2.01, 95% CI 1.05-3.83 and aOR 1.79, 95% CI 1.11-2.89, respectively). Speech illusions were not associated with delusions, hyper-ToM or family history of mental disorders. Speech illusions were elicited in typically developing children in a WN-test paradigm, and point to an affective pathway to AVH mediated by dysregulation in top-down processing of sensory input.
Huo, Xueliang; Park, Hangue; Kim, Jeonghee; Ghovanloo, Maysam
2015-01-01
We are presenting a new wireless and wearable human computer interface called the dual-mode Tongue Drive System (dTDS), which is designed to allow people with severe disabilities to use computers more effectively with increased speed, flexibility, usability, and independence through their tongue motion and speech. The dTDS detects users’ tongue motion using a magnetic tracer and an array of magnetic sensors embedded in a compact and ergonomic wireless headset. It also captures the users’ voice wirelessly using a small microphone embedded in the same headset. Preliminary evaluation results based on 14 able-bodied subjects and three individuals with high level spinal cord injuries at level C3–C5 indicated that the dTDS headset, combined with a commercially available speech recognition (SR) software, can provide end users with significantly higher performance than either unimodal forms based on the tongue motion or speech alone, particularly in completing tasks that require both pointing and text entry. PMID:23475380
Dynamics of infant cortical auditory evoked potentials (CAEPs) for tone and speech tokens.
Cone, Barbara; Whitaker, Richard
2013-07-01
Cortical auditory evoked potentials (CAEPs) to tones and speech sounds were obtained in infants to: (1) further knowledge of auditory development above the level of the brainstem during the first year of life; (2) establish CAEP input-output functions for tonal and speech stimuli as a function of stimulus level and (3) elaborate the data-base that establishes CAEP in infants tested while awake using clinically relevant stimuli, thus providing methodology that would have translation to pediatric audiological assessment. Hypotheses concerning CAEP development were that the latency and amplitude input-output functions would reflect immaturity in encoding stimulus level. In a second experiment, infants were tested with the same stimuli used to evoke the CAEPs. Thresholds for these stimuli were determined using observer-based psychophysical techniques. The hypothesis was that the behavioral thresholds would be correlated with CAEP input-output functions because of shared cortical response areas known to be active in sound detection. 36 infants, between the ages of 4 and 12 months (mean=8 months, s.d.=1.8 months) and 9 young adults (mean age 21 years) with normal hearing were tested. First, CAEP amplitude and latency input-output functions were obtained for 4 tone bursts and 7 speech tokens. The tone burst stimuli were 50 ms tokens of pure tones at 0.5, 1.0, 2.0 and 4.0 kHz. The speech sound tokens, /a/, /i/, /o/, /u/, /m/, /s/, and /∫/, were created from natural speech samples and were also 50 ms in duration. CAEPs were obtained for tone burst and speech token stimuli at 10 dB level decrements in descending order from 70 dB SPL. All CAEP tests were completed while the infants were awake and engaged in quiet play. For the second experiment, observer-based psychophysical methods were used to establish perceptual threshold for the same speech sound and tone tokens. Infant CAEP component latencies were prolonged by 100-150 ms in comparison to adults. CAEP latency-intensity input-output functions were steeper in infants compared to adults. CAEP amplitude growth functions with respect to stimulus SPL are adult-like at this age, particularly for the earliest component, P1-N1. Infant perceptual thresholds were elevated with respect to those found in adults. Furthermore, perceptual thresholds were higher, on average, than levels at which CAEPs could be obtained. When CAEP amplitudes were plotted with respect to perceptual threshold (dB SL), the infant CAEP amplitude growth slopes were steeper than in adults. Although CAEP latencies indicate immaturity in neural transmission at the level of the cortex, amplitude growth with respect to stimulus SPL is adult-like at this age, particularly for the earliest component, P1-N1. The latency and amplitude input-output functions may provide additional information as to how infants perceive stimulus level. The reasons for the discrepancy between electrophysiologic and perceptual threshold may be due to immaturity in perceptual temporal resolution abilities and the broad-band listening strategy employed by infants. The findings from the current study can be translated to the clinical setting. It is possible to use tonal or speech sound tokens to evoke CAEPs in an awake, passively alert infant, and thus determine whether these sounds activate the auditory cortex. This could be beneficial in the verification of hearing aid or cochlear implant benefit. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Application of the wavelet transform for speech processing
NASA Technical Reports Server (NTRS)
Maes, Stephane
1994-01-01
Speaker identification and word spotting will shortly play a key role in space applications. An approach based on the wavelet transform is presented that, in the context of the 'modulation model,' enables extraction of speech features which are used as input for the classification process.
Learner Involvement and Comprehensible Input.
ERIC Educational Resources Information Center
Tsui, Amy B. M.
1991-01-01
Studies on comprehensible input generally emphasize how input is made comprehensible to the nonnative speaker by examining native speaker speech or teacher talk in the classroom. This paper uses Hong Kong secondary school data to show that only when modification devices involve learner participation do they serve as indicators of comprehensible…
Children perceive speech onsets by ear and eye*
JERGER, SUSAN; DAMIAN, MARKUS F.; TYE-MURRAY, NANCY; ABDI, HERVÉ
2016-01-01
Adults use vision to perceive low-fidelity speech; yet how children acquire this ability is not well understood. The literature indicates that children show reduced sensitivity to visual speech from kindergarten to adolescence. We hypothesized that this pattern reflects the effects of complex tasks and a growth period with harder-to-utilize cognitive resources, not lack of sensitivity. We investigated sensitivity to visual speech in children via the phonological priming produced by low-fidelity (non-intact onset) auditory speech presented audiovisually (see dynamic face articulate consonant/rhyme b/ag; hear non-intact onset/rhyme: −b/ag) vs. auditorily (see still face; hear exactly same auditory input). Audiovisual speech produced greater priming from four to fourteen years, indicating that visual speech filled in the non-intact auditory onsets. The influence of visual speech depended uniquely on phonology and speechreading. Children – like adults – perceive speech onsets multimodally. Findings are critical for incorporating visual speech into developmental theories of speech perception. PMID:26752548
Neural Prediction Errors Distinguish Perception and Misperception of Speech.
Blank, Helen; Spangenberg, Marlene; Davis, Matthew H
2018-06-11
Humans use prior expectations to improve perception, especially of sensory signals that are degraded or ambiguous. However, if sensory input deviates from prior expectations, correct perception depends on adjusting or rejecting prior expectations. Failure to adjust or reject the prior leads to perceptual illusions, especially if there is partial overlap (hence partial mismatch) between expectations and input. With speech, "slips of the ear" occur when expectations lead to misperception. For instance, an entomologist might be more susceptible to hear "The ants are my friends" for "The answer, my friend" (in the Bob Dylan song "Blowing in the Wind"). Here, we contrast two mechanisms by which prior expectations may lead to misperception of degraded speech. Firstly, clear representations of the common sounds in the prior and input (i.e., expected sounds) may lead to incorrect confirmation of the prior. Secondly, insufficient representations of sounds that deviate between prior and input (i.e., prediction errors) could lead to deception. We used cross-modal predictions from written words that partially match degraded speech to compare neural responses when male and female human listeners were deceived into accepting the prior or correctly rejected it. Combined behavioural and multivariate representational similarity analysis of functional magnetic resonance imaging data shows that veridical perception of degraded speech is signalled by representations of prediction error in the left superior temporal sulcus. Instead of using top-down processes to support perception of expected sensory input, our findings suggest that the strength of neural prediction error representations distinguishes correct perception and misperception. SIGNIFICANCE STATEMENT Misperceiving spoken words is an everyday experience with outcomes that range from shared amusement to serious miscommunication. For hearing-impaired individuals, frequent misperception can lead to social withdrawal and isolation with severe consequences for well-being. In this work, we specify the neural mechanisms by which prior expectations - which are so often helpful for perception - can lead to misperception of degraded sensory signals. Most descriptive theories of illusory perception explain misperception as arising from a clear sensory representation of features or sounds that are in common between prior expectations and sensory input. Our work instead provides support for a complementary proposal; namely that misperception occurs when there is an insufficient sensory representation of the deviation between expectations and sensory signals. Copyright © 2018 the authors.
Polur, Prasad D; Miller, Gerald E
2005-01-01
Computer speech recognition of individuals with dysarthria, such as cerebral palsy patients, requires a robust technique that can handle conditions of very high variability and limited training data. In this study, a hidden Markov model (HMM) was constructed and conditions investigated that would provide improved performance for a dysarthric speech (isolated word) recognition system intended to act as an assistive/control tool. In particular, we investigated the effect of high-frequency spectral components on the recognition rate of the system to determine if they contributed useful additional information to the system. A small-size vocabulary spoken by three cerebral palsy subjects was chosen. Mel-frequency cepstral coefficients extracted with the use of 15 ms frames served as training input to an ergodic HMM setup. Subsequent results demonstrated that no significant useful information was available to the system for enhancing its ability to discriminate dysarthric speech above 5.5 kHz in the current set of dysarthric data. The level of variability in input dysarthric speech patterns limits the reliability of the system. However, its application as a rehabilitation/control tool to assist dysarthric motor-impaired individuals such as cerebral palsy subjects holds sufficient promise.
The varieties of speech to young children.
Huttenlocher, Janellen; Vasilyeva, Marina; Waterfall, Heidi R; Vevea, Jack L; Hedges, Larry V
2007-09-01
This article examines caregiver speech to young children. The authors obtained several measures of the speech used to children during early language development (14-30 months). For all measures, they found substantial variation across individuals and subgroups. Speech patterns vary with caregiver education, and the differences are maintained over time. While there are distinct levels of complexity for different caregivers, there is a common pattern of increase across age within the range that characterizes each educational group. Thus, caregiver speech exhibits both long-standing patterns of linguistic behavior and adjustment for the interlocutor. This information about the variability of speech by individual caregivers provides a framework for systematic study of the role of input in language acquisition. PsycINFO Database Record (c) 2007 APA, all rights reserved
Kharlamov, Viktor; Campbell, Kenneth; Kazanina, Nina
2011-11-01
Speech sounds are not always perceived in accordance with their acoustic-phonetic content. For example, an early and automatic process of perceptual repair, which ensures conformity of speech inputs to the listener's native language phonology, applies to individual input segments that do not exist in the native inventory or to sound sequences that are illicit according to the native phonotactic restrictions on sound co-occurrences. The present study with Russian and Canadian English speakers shows that listeners may perceive phonetically distinct and licit sound sequences as equivalent when the native language system provides robust evidence for mapping multiple phonetic forms onto a single phonological representation. In Russian, due to an optional but productive t-deletion process that affects /stn/ clusters, the surface forms [sn] and [stn] may be phonologically equivalent and map to a single phonological form /stn/. In contrast, [sn] and [stn] clusters are usually phonologically distinct in (Canadian) English. Behavioral data from identification and discrimination tasks indicated that [sn] and [stn] clusters were more confusable for Russian than for English speakers. The EEG experiment employed an oddball paradigm with nonwords [asna] and [astna] used as the standard and deviant stimuli. A reliable mismatch negativity response was elicited approximately 100 msec postchange in the English group but not in the Russian group. These findings point to a perceptual repair mechanism that is engaged automatically at a prelexical level to ensure immediate encoding of speech inputs in phonological terms, which in turn enables efficient access to the meaning of a spoken utterance.
Input and language development in bilingually developing children.
Hoff, Erika; Core, Cynthia
2013-11-01
Language skills in young bilingual children are highly varied as a result of the variability in their language experiences, making it difficult for speech-language pathologists to differentiate language disorder from language difference in bilingual children. Understanding the sources of variability in bilingual contexts and the resulting variability in children's skills will help improve language assessment practices by speech-language pathologists. In this article, we review literature on bilingual first language development for children under 5 years of age. We describe the rate of development in single and total language growth, we describe effects of quantity of input and quality of input on growth, and we describe effects of family composition on language input and language growth in bilingual children. We provide recommendations for language assessment of young bilingual children and consider implications for optimizing children's dual language development.
Segmental Properties of Input to Infants: A Study of Korean
ERIC Educational Resources Information Center
Lee, Soyoung; Davis, Barbara L.; MacNeilage, Peter F.
2008-01-01
Segmental distributions of Korean infant-directed speech (IDS) and adult-directed speech (ADS) were compared. Significant differences were found in both consonant and vowel patterns. Korean-speaking mothers using IDS displayed more frequent labial consonantal place and less frequent coronal and glottal place and fricative manner. They showed more…
Parent Telegraphic Speech Use and Spoken Language in Preschoolers with ASD
ERIC Educational Resources Information Center
Venker, Courtney E.; Bolt, Daniel M.; Meyer, Allison; Sindberg, Heidi; Weismer, Susan Ellis; Tager-Flusberg, Helen
2015-01-01
Purpose: There is considerable controversy regarding whether to use telegraphic or grammatical input when speaking to young children with language delays, including children with autism spectrum disorder (ASD). This study examined telegraphic speech use in parents of preschoolers with ASD and associations with children's spoken language 1 year…
Effect of Three Classroom Listening Conditions on Speech Intelligibility
ERIC Educational Resources Information Center
Ross, Mark; Giolas, Thomas G.
1971-01-01
Speech discrimination scores for 13 deaf children were obtained in a classroom under: usual listening condition (hearing aid or not), binaural listening situation using auditory trainer/FM receiver with wireless microphone transmitter turned off, and binaural condition with inputs from auditory trainer/FM receiver and wireless microphone/FM…
Pragmatic Elements in EFL Course Books
ERIC Educational Resources Information Center
Ulum, Ömer Gökhan
2015-01-01
Pragmatic development or competence has been of great concern, particularly in recent decades. Regarding this issue, questioning the existence and delivery of speech acts in EFL course books may be sententious, as learners employ them for pragmatic input. Although much research has been conducted referring to speech acts, comparably little…
High visual resolution matters in audiovisual speech perception, but only for some.
Alsius, Agnès; Wayne, Rachel V; Paré, Martin; Munhall, Kevin G
2016-07-01
The basis for individual differences in the degree to which visual speech input enhances comprehension of acoustically degraded speech is largely unknown. Previous research indicates that fine facial detail is not critical for visual enhancement when auditory information is available; however, these studies did not examine individual differences in ability to make use of fine facial detail in relation to audiovisual speech perception ability. Here, we compare participants based on their ability to benefit from visual speech information in the presence of an auditory signal degraded with noise, modulating the resolution of the visual signal through low-pass spatial frequency filtering and monitoring gaze behavior. Participants who benefited most from the addition of visual information (high visual gain) were more adversely affected by the removal of high spatial frequency information, compared to participants with low visual gain, for materials with both poor and rich contextual cues (i.e., words and sentences, respectively). Differences as a function of gaze behavior between participants with the highest and lowest visual gains were observed only for words, with participants with the highest visual gain fixating longer on the mouth region. Our results indicate that the individual variance in audiovisual speech in noise performance can be accounted for, in part, by better use of fine facial detail information extracted from the visual signal and increased fixation on mouth regions for short stimuli. Thus, for some, audiovisual speech perception may suffer when the visual input (in addition to the auditory signal) is less than perfect.
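As an aside on the stimulus manipulation, low-pass spatial-frequency filtering of a video frame can be approximated with a Gaussian blur, as in the hedged sketch below; the cutoff (sigma) and frame size are assumptions, not the filter settings used in the study.

```python
# Gaussian blur as a stand-in for low-pass spatial-frequency filtering of a video frame;
# sigma and frame size are assumptions, not the study's filter settings.
import numpy as np
from scipy.ndimage import gaussian_filter

frame = np.random.rand(480, 640)                    # stand-in grayscale frame of the talker
low_pass_frame = gaussian_filter(frame, sigma=8)    # larger sigma removes finer facial detail
```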
Network speech systems technology program
NASA Astrophysics Data System (ADS)
Weinstein, C. J.
1981-09-01
This report documents work performed during FY 1981 on the DCA-sponsored Network Speech Systems Technology Program. The two areas of work reported are: (1) communication system studies in support of the evolving Defense Switched Network (DSN) and (2) design and implementation of satellite/terrestrial interfaces for the Experimental Integrated Switched Network (EISN). The system studies focus on the development and evaluation of economical and endurable network routing procedures. Satellite/terrestrial interface development includes circuit-switched and packet-switched connections to the experimental wideband satellite network. Efforts in planning and coordination of EISN experiments are reported in detail in a separate EISN Experiment Plan.
A multimodal interface for real-time soldier-robot teaming
NASA Astrophysics Data System (ADS)
Barber, Daniel J.; Howard, Thomas M.; Walter, Matthew R.
2016-05-01
Recent research and advances in robotics have led to the development of novel platforms leveraging new sensing capabilities for semantic navigation. As these systems become increasingly robust, they support highly complex commands beyond direct teleoperation and waypoint finding, facilitating a transition away from robots as tools to robots as teammates. Supporting future Soldier-Robot teaming requires communication capabilities on par with human-human teams for successful integration of robots. Therefore, as robots increase in functionality, it is equally important that the interface between the Soldier and robot advances as well. Multimodal communication (MMC) enables human-robot teaming through redundancy and levels of communication more robust than single-mode interaction. Commercial-off-the-shelf (COTS) technologies released in recent years for smart-phones and gaming provide tools for the creation of portable interfaces incorporating MMC through the use of speech, gestures, and visual displays. However, for multimodal interfaces to be successfully used in the military domain, they must be able to classify speech and gestures and process natural language in real-time with high accuracy. For the present study, a prototype multimodal interface supporting real-time interactions with an autonomous robot was developed. This device integrated COTS Automated Speech Recognition (ASR), a custom gesture recognition glove, and natural language understanding on a tablet. This paper presents performance results (e.g. response times, accuracy) of the integrated device when commanding an autonomous robot to perform reconnaissance and surveillance activities in an unknown outdoor environment.
Visual speech information: a help or hindrance in perceptual processing of dysarthric speech.
Borrie, Stephanie A
2015-03-01
This study investigated the influence of visual speech information on perceptual processing of neurologically degraded speech. Fifty listeners identified spastic dysarthric speech under both audio (A) and audiovisual (AV) conditions. Condition comparisons revealed that the addition of visual speech information enhanced processing of the neurologically degraded input in terms of (a) acuity (percent phonemes correct) of vowels and consonants and (b) recognition (percent words correct) of predictive and nonpredictive phrases. Listeners exploited stress-based segmentation strategies more readily in AV conditions, suggesting that the perceptual benefit associated with adding visual speech information to the auditory signal (the AV advantage) has both segmental and suprasegmental origins. Results also revealed that the magnitude of the AV advantage can be predicted, to some degree, by the extent to which an individual utilizes syllabic stress cues to inform word recognition in AV conditions. Findings inform the development of a listener-specific model of speech perception that applies to processing of dysarthric speech in everyday communication contexts.
Visual Feedback of Tongue Movement for Novel Speech Sound Learning
Katz, William F.; Mehta, Sonya
2015-01-01
Pronunciation training studies have yielded important information concerning the processing of audiovisual (AV) information. Second language (L2) learners show increased reliance on bottom-up, multimodal input for speech perception (compared to monolingual individuals). However, little is known about the role of viewing one's own speech articulation processes during speech training. The current study investigated whether real-time, visual feedback for tongue movement can improve a speaker's learning of non-native speech sounds. An interactive 3D tongue visualization system based on electromagnetic articulography (EMA) was used in a speech training experiment. Native speakers of American English produced a novel speech sound (/ɖ/; a voiced, coronal, palatal stop) before, during, and after trials in which they viewed their own speech movements using the 3D model. Talkers' productions were evaluated using kinematic (tongue-tip spatial positioning) and acoustic (burst spectra) measures. The results indicated a rapid gain in accuracy associated with visual feedback training. The findings are discussed with respect to neural models for multimodal speech processing. PMID:26635571
ERIC Educational Resources Information Center
Hadley, Pamela A.; Rispoli, Matthew; Holt, Janet K.
2017-01-01
Purpose: This follow-up study examined whether a parent intervention that increased the diversity of lexical noun phrase subjects in parent input and accelerated children's sentence diversity (Hadley et al., 2017) had indirect benefits on tense/agreement (T/A) morphemes in parent input and children's spontaneous speech. Method: Differences in…
The Influence of Child-Directed Speech in Early Trilingualism
ERIC Educational Resources Information Center
Barnes, Julia
2011-01-01
Contexts of limited input such as trilingual families where a language is not spoken in the wider community but only by a reduced number of speakers in the home provide a unique opportunity to examine closely the relationship between a child's input and what she learns to say. Barnes reported on the relationship between maternal input and a…
Embedding speech into virtual realities
NASA Technical Reports Server (NTRS)
Bohn, Christian-Arved; Krueger, Wolfgang
1993-01-01
In this work a speaker-independent speech recognition system is presented which is suitable for implementation in Virtual Reality applications. The use of an artificial neural network in connection with a special compression of the acoustic input leads to a system which is robust, fast, easy to use, and needs no additional hardware besides common VR equipment.
Exploring Speech Recognition Technology: Children with Learning and Emotional/Behavioral Disorders.
ERIC Educational Resources Information Center
Faris-Cole, Debra; Lewis, Rena
2001-01-01
Intermediate grade students with disabilities in written expression and emotional/behavioral disorders were trained to use discrete or continuous speech input devices for written work. The study found extreme variability in the fidelity of the devices, PowerSecretary and Dragon NaturallySpeaking, ranging from 49 percent to 87 percent. Both devices…
Implicit Processing of Phonotactic Cues: Evidence from Electrophysiological and Vascular Responses
ERIC Educational Resources Information Center
Rossi, Sonja; Jurgenson, Ina B.; Hanulikova, Adriana; Telkemeyer, Silke; Wartenburger, Isabell; Obrig, Hellmuth
2011-01-01
Spoken word recognition is achieved via competition between activated lexical candidates that match the incoming speech input. The competition is modulated by prelexical cues that are important for segmenting the auditory speech stream into linguistic units. One such prelexical cue that listeners rely on in spoken word recognition is phonotactics.…
Quadcopter Control Using Speech Recognition
NASA Astrophysics Data System (ADS)
Malik, H.; Darma, S.; Soekirno, S.
2018-04-01
This research reports a comparison of the success rates of speech recognition systems using two types of databases, an existing database and a newly created database, implemented on a quadcopter for motion control. The speech recognition system used the Mel frequency cepstral coefficient (MFCC) method for feature extraction and was trained using the recursive neural network (RNN) method. MFCC is one of the feature extraction methods most widely used for speech recognition, with reported success rates of 80%-95%. The existing database was used to measure the success rate of the RNN method; the new database was created in Indonesian, and its success rate was then compared with the results from the existing database. Sound input from the microphone was processed on a DSP module with the MFCC method to obtain feature values, which were then classified by the trained RNN to produce a command. The command became a control input to the single-board computer (SBC), whose output drove the movement of the quadcopter. On the SBC, the Robot Operating System (ROS) was used as the operating framework.
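The abstract does not include code; the sketch below illustrates the command-classification stage it describes, with MFCC frame sequences as input and a small set of motion commands as output. A simple recurrent layer in Keras stands in for the paper's RNN, and the layer sizes, command set, and dummy data are assumptions.

```python
# Sketch of the command-classification stage: MFCC frame sequences in, a motion command out.
# A simple recurrent layer in Keras stands in for the paper's RNN; layer sizes, the command
# set, and the dummy data are assumptions.
import numpy as np
import tensorflow as tf

commands = ["takeoff", "land", "forward", "backward", "left", "right"]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 13)),        # (frames, 13 MFCCs) per utterance
    tf.keras.layers.SimpleRNN(64),
    tf.keras.layers.Dense(len(commands), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy batch standing in for recorded command words.
x = np.random.randn(8, 50, 13).astype("float32")
y = np.random.randint(0, len(commands), size=8)
model.fit(x, y, epochs=1, verbose=0)

predicted = commands[int(np.argmax(model.predict(x[:1], verbose=0)))]
print(predicted)   # this label would then be published as a command to the SBC running ROS
```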
Neural Entrainment to Rhythmically Presented Auditory, Visual, and Audio-Visual Speech in Children
Power, Alan James; Mead, Natasha; Barnes, Lisa; Goswami, Usha
2012-01-01
Auditory cortical oscillations have been proposed to play an important role in speech perception. It is suggested that the brain may take temporal “samples” of information from the speech stream at different rates, phase resetting ongoing oscillations so that they are aligned with similar frequency bands in the input (“phase locking”). Information from these frequency bands is then bound together for speech perception. To date, there are no explorations of neural phase locking and entrainment to speech input in children. However, it is clear from studies of language acquisition that infants use both visual speech information and auditory speech information in learning. In order to study neural entrainment to speech in typically developing children, we use a rhythmic entrainment paradigm (underlying 2 Hz or delta rate) based on repetition of the syllable “ba,” presented in either the auditory modality alone, the visual modality alone, or as auditory-visual speech (via a “talking head”). To ensure attention to the task, children aged 13 years were asked to press a button as fast as possible when the “ba” stimulus violated the rhythm for each stream type. Rhythmic violation depended on delaying the occurrence of a “ba” in the isochronous stream. Neural entrainment was demonstrated for all stream types, and individual differences in standardized measures of language processing were related to auditory entrainment at the theta rate. Further, there was significant modulation of the preferred phase of auditory entrainment in the theta band when visual speech cues were present, indicating cross-modal phase resetting. The rhythmic entrainment paradigm developed here offers a method for exploring individual differences in oscillatory phase locking during development. In particular, a method for assessing neural entrainment and cross-modal phase resetting would be useful for exploring developmental learning difficulties thought to involve temporal sampling, such as dyslexia. PMID:22833726
Malavasi, Massimiliano; Turri, Enrico; Atria, Jose Joaquin; Christensen, Heidi; Marxer, Ricard; Desideri, Lorenzo; Coy, Andre; Tamburini, Fabio; Green, Phil
2017-01-01
Better use of the increasing functional capabilities of home automation systems and Internet of Things (IoT) devices to support the needs of users with disabilities is the subject of a research project currently conducted by Area Ausili (Assistive Technology Area), a department of Polo Tecnologico Regionale Corte Roncati of the Local Health Trust of Bologna (Italy), in collaboration with the AIAS Ausilioteca Assistive Technology (AT) Team. The main aim of the project is to develop experimental low-cost systems for environmental control through simplified and accessible user interfaces. Many of the activities focus on automatic speech recognition and are developed in the framework of the CloudCAST project. In this paper we report on the first technical achievements of the project and discuss possible future developments and applications within and outside CloudCAST.
Dialogue enabling speech-to-text user assistive agent system for hearing-impaired person.
Lee, Seongjae; Kang, Sunmee; Han, David K; Ko, Hanseok
2016-06-01
A novel approach for assisting bidirectional communication between people with normal hearing and people with hearing impairment is presented. While existing hearing-impaired assistive devices such as hearing aids and cochlear implants are vulnerable to extreme noise conditions or post-surgery side effects, the proposed concept is an alternative approach in which spoken dialogue is achieved by employing a robust speech recognition technique that takes noisy environmental factors into consideration without requiring any attachment to the human body. The proposed system is a portable device with an acoustic beamformer for directional noise reduction, capable of performing speech-to-text transcription using a keyword spotting method. It is also equipped with a user interface optimized for hearing-impaired people, rendering device usage intuitive and natural across diverse domain contexts. The relevant experimental results confirm that the proposed interface design is feasible for realizing an effective and efficient intelligent agent for the hearing-impaired.
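The directional noise reduction stage is described only as an acoustic beamformer; a delay-and-sum beamformer is one common realization and is sketched below with an assumed array geometry and steering delays, not the device's actual front end.

```python
# Illustrative delay-and-sum beamformer, one common realization of the directional
# noise reduction front end mentioned above. Array size, sampling rate, and steering
# delays are assumptions.
import numpy as np

def delay_and_sum(channels, delays_samples):
    """channels: (n_mics, n_samples) array; integer per-channel delays in samples."""
    n_mics, _ = channels.shape
    out = np.zeros(channels.shape[1])
    for channel, delay in zip(channels, delays_samples):
        out += np.roll(channel, -int(delay))        # align each microphone to the look direction
    return out / n_mics

sr = 16000
mics = np.random.randn(4, sr)                       # stand-in 4-channel recording
steering_delays = [0, 2, 4, 6]                      # assumed delays for the chosen direction
enhanced = delay_and_sum(mics, steering_delays)
```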
NASA Astrophysics Data System (ADS)
Lightstone, P. C.; Davidson, W. M.
1982-04-01
The military detection assessment laboratory houses an experimental field system which assesses different alarm indicators such as fence disturbance sensors, MILES cables, and microwave Racons. A speech synthesis board was purchased which could be interfaced, by means of a computer, to an alarm logger, making verbal acknowledgement of alarms possible. Different products and different types of voice synthesis were analyzed before a linear predictive coding device produced by Telesensory Speech Systems of Palo Alto, California was chosen. This device, called the Speech 1000 Board, has a dedicated 8085 processor. A multiplexer card was designed, and the Sp 1000 was interfaced through the card to a TMS 990/100M Texas Instruments microcomputer. It was also necessary to design the software with the capability of recognizing and flagging an alarm on any one of 32 possible lines. The experimental field system was then packaged with a dc power supply, LED indicators, speakers, and switches, and deployed in the field, where it performed reliably.
SPAIDE: A Real-time Research Platform for the Clarion CII/90K Cochlear Implant
NASA Astrophysics Data System (ADS)
Van Immerseel, L.; Peeters, S.; Dykmans, P.; Vanpoucke, F.; Bracke, P.
2005-12-01
SPAIDE (sound-processing algorithm integrated development environment) is a real-time platform of Advanced Bionics Corporation (Sylmar, Calif, USA) to facilitate advanced research on sound-processing and electrical-stimulation strategies with the Clarion CII and 90K implants. The platform is meant for testing in the laboratory. SPAIDE is conceptually based on a clear separation of the sound-processing and stimulation strategies and, in particular, on the distinction between sound-processing and stimulation channels and electrode contacts. The development environment has a user-friendly interface for specifying sound-processing and stimulation strategies, and includes the possibility of simulating the electrical stimulation. SPAIDE allows for real-time sound capture from file or audio input on a PC, sound processing and application of the stimulation strategy, and streaming of the results to the implant. The platform is able to cover a broad range of research applications, from noise reduction and the mimicking of normal hearing, through complex (simultaneous) stimulation strategies, to psychophysics. The hardware setup consists of a personal computer, an interface board, and a speech processor. The software is both expandable and to a great extent reusable in other applications.
Baltus, Alina; Herrmann, Christoph Siegfried
2016-06-01
Oscillatory EEG activity in the human brain with frequencies in the gamma range (approx. 30-80Hz) is known to be relevant for a large number of cognitive processes. Interestingly, each subject reveals an individual frequency of the auditory gamma-band response (GBR) that coincides with the peak in the auditory steady state response (ASSR). A common resonance frequency of auditory cortex seems to underlie both the individual frequency of the GBR and the peak of the ASSR. This review sheds light on the functional role of oscillatory gamma activity for auditory processing. For successful processing, the auditory system has to track changes in auditory input over time and store information about past events in memory which allows the construction of auditory objects. Recent findings support the idea of gamma oscillations being involved in the partitioning of auditory input into discrete samples to facilitate higher order processing. We review experiments that seem to suggest that inter-individual differences in the resonance frequency are behaviorally relevant for gap detection and speech processing. A possible application of these resonance frequencies for brain computer interfaces is illustrated with regard to optimized individual presentation rates for auditory input to correspond with endogenous oscillatory activity. This article is part of a Special Issue entitled SI: Auditory working memory. Copyright © 2015 Elsevier B.V. All rights reserved.
Temporal order processing of syllables in the left parietal lobe.
Moser, Dana; Baker, Julie M; Sanchez, Carmen E; Rorden, Chris; Fridriksson, Julius
2009-10-07
Speech processing requires the temporal parsing of syllable order. Individuals suffering from posterior left hemisphere brain injury often exhibit temporal processing deficits as well as language deficits. Although the right posterior inferior parietal lobe has been implicated in temporal order judgments (TOJs) of visual information, there is limited evidence to support the role of the left inferior parietal lobe (IPL) in processing syllable order. The purpose of this study was to examine whether the left inferior parietal lobe is recruited during temporal order judgments of speech stimuli. Functional magnetic resonance imaging data were collected on 14 normal participants while they completed the following forced-choice tasks: (1) syllable order of multisyllabic pseudowords, (2) syllable identification of single syllables, and (3) gender identification of both multisyllabic and monosyllabic speech stimuli. Results revealed increased neural recruitment in the left inferior parietal lobe when participants made judgments about syllable order compared with both syllable identification and gender identification. These findings suggest that the left inferior parietal lobe plays an important role in processing syllable order and support the hypothesized role of this region as an interface between auditory speech and the articulatory code. Furthermore, a breakdown in this interface may explain some components of the speech deficits observed after posterior damage to the left hemisphere.
Interfaces. Working Papers in Linguistics No. 32.
ERIC Educational Resources Information Center
Zwicky, Arnold M.
The papers collected here concern the interfaces between various components of grammar (semantics, syntax, morphology, and phonology) and between grammar itself and various extragrammatical domains. They include: "The OSU Random, Unorganized Collection of Speech Act Examples"; "In and Out in Phonology"; "Forestress and…
ERGONOMICS ABSTRACTS 48347-48982.
ERIC Educational Resources Information Center
Ministry of Technology, London (England). Warren Spring Lab.
In this collection of ergonomics abstracts and annotations the following areas of concern are represented: general references; methods, facilities, and equipment relating to ergonomics; systems of man and machines; visual, auditory, and other sensory inputs and processes (including speech and intelligibility); input channels; body measurements,…
Maternal speech to preterm infants during the first 2 years of life: stability and change.
Suttora, Chiara; Salerni, Nicoletta
2011-01-01
Studies on typical language development have documented that mothers fine-tune their verbal input to children's advancing skills and development. Although premature birth has often been associated with delays in communicative and language development, studies investigating maternal language addressed to these children are still rare. The principal aim of this longitudinal study was to investigate maternal speech directed at very preterm children by examining its changes across time and the stability of maternal individual styles. A sample of 16 mother-preterm infant dyads participated in semi-structured play sessions when the children were 6, 12, 18 and 24 months of corrected age. Maternal speech directed at the children was analysed in terms of lexical and syntactical complexity as well as verbal productivity. Children's motor, cognitive and communicative skills were also assessed. Results highlight an overall increase in the lexical and syntactical complexity and in the amount of maternal speech across the first years of life. At the same time, individual maternal communicative styles appear stable as infants grow older, although between 12 and 18 months the predictive values of all the indices decrease, indicating a noteworthy modification in individual maternal styles. Furthermore, between 12 and 18 months predictive relationships between children's motor and vocal skills and maternal changes in input were found. Verbal input addressed to children born preterm during the first 2 years of life does not seem to differ considerably from the language usually used with full-term infants. Nevertheless, maternal verbal adjustments seem to be predicted by earlier infant achievements in vocal and motor development. This suggests that infants' motor skill maturation may function as a major signal for mothers of preterm babies to adjust aspects of their linguistic interactive style. © 2011 Royal College of Speech & Language Therapists.
Optical mass memory system (AMM-13). AMM/DBMS interface control document
NASA Technical Reports Server (NTRS)
Bailey, G. A.
1980-01-01
The baseline for external interfaces of a 10 to the 13th power bit, optical archival mass memory system (AMM-13) is established. The types of interfaces addressed include data transfer; AMM-13, Data Base Management System, NASA End-to-End Data System computer interconnect; data/control input and output interfaces; test input data source; file management; and facilities interface.
"Who" is saying "what"? Brain-based decoding of human voice and speech.
Formisano, Elia; De Martino, Federico; Bonte, Milene; Goebel, Rainer
2008-11-07
Can we decipher speech content ("what" is being said) and speaker identity ("who" is saying it) from observations of brain activity of a listener? Here, we combine functional magnetic resonance imaging with a data-mining algorithm and retrieve what and whom a person is listening to from the neural fingerprints that speech and voice signals elicit in the listener's auditory cortex. These cortical fingerprints are spatially distributed and insensitive to acoustic variations of the input so as to permit the brain-based recognition of learned speech from unknown speakers and of learned voices from previously unheard utterances. Our findings unravel the detailed cortical layout and computational properties of the neural populations at the basis of human speech recognition and speaker identification.
Multimodal infant-directed communication: how caregivers combine tactile and linguistic cues.
Abu-Zhaya, Rana; Seidl, Amanda; Cristia, Alejandrina
2017-09-01
Both touch and speech independently have been shown to play an important role in infant development. However, little is known about how they may be combined in the input to the child. We examined the use of touch and speech together by having mothers read their 5-month-olds books about body parts and animals. Results suggest that speech+touch multimodal events are characterized by more exaggerated touch and speech cues. Further, our results suggest that maternal touches are aligned with speech and that mothers tend to touch their infants in locations that are congruent with names of body parts. Thus, our results suggest that tactile cues could potentially aid both infant word segmentation and word learning.
Probing the Electrode–Neuron Interface With Focused Cochlear Implant Stimulation
Bierer, Julie Arenberg
2010-01-01
Cochlear implants are highly successful neural prostheses for persons with severe or profound hearing loss who gain little benefit from hearing aid amplification. Although implants are capable of providing important spectral and temporal cues for speech perception, performance on speech tests is variable across listeners. Psychophysical measures obtained from individual implant subjects can also be highly variable across implant channels. This review discusses evidence that such variability reflects deviations in the electrode–neuron interface, which refers to an implant channel's ability to effectively stimulate the auditory nerve. It is proposed that focused electrical stimulation is ideally suited to assess channel-to-channel irregularities in the electrode–neuron interface. In implant listeners, it is demonstrated that channels with relatively high thresholds, as measured with the tripolar configuration, exhibit broader psychophysical tuning curves and smaller dynamic ranges than channels with relatively low thresholds. Broader tuning implies that frequency-specific information intended for one population of neurons in the cochlea may activate more distant neurons, and a compressed dynamic range could make it more difficult to resolve intensity-based information, particularly in the presence of competing noise. Degradation of both types of cues would negatively affect speech perception. PMID:20724356
D'Mello, Sidney K; Dowell, Nia; Graesser, Arthur
2011-03-01
An open question is whether learning differs when students speak rather than type their responses while interacting with intelligent tutoring systems with natural language dialogues. Theoretical bases exist for three contrasting hypotheses. The speech facilitation hypothesis predicts that spoken input will increase learning, whereas the text facilitation hypothesis predicts typed input will be superior. The modality equivalence hypothesis claims that learning gains will be equivalent. Previous experiments that tested these hypotheses were confounded by automated speech recognition systems with substantial error rates that were detected by learners. We addressed this concern in two experiments via a Wizard of Oz procedure, where a human intercepted the learner's speech and transcribed the utterances before submitting them to the tutor. The overall pattern of the results supported the following conclusions: (1) learning gains associated with spoken and typed input were on par with each other and quantitatively higher than a no-intervention control, (2) participants' evaluations of the session were not influenced by modality, and (3) there were no modality effects associated with differences in prior knowledge and typing proficiency. Although the results generally support the modality equivalence hypothesis, highly motivated learners reported lower cognitive load and demonstrated increased learning when typing compared with speaking. We discuss the implications of our findings for intelligent tutoring systems that can support typed and spoken input.
Terrestrial interface architecture (DSI/DNI)
NASA Astrophysics Data System (ADS)
Rieser, J. H.; Onufry, M.
The 64-kbit/s digital speech interpolation (DSI)/digital noninterpolation (DNI) equipment interfaces the TDMA satellite system with the terrestrial network. This paper provides a functional description of the 64-kbit/s DSI/DNI equipment built at Comsat Laboratories in conformance with the Intelsat TDMA/DSI system specification, and discusses the theoretical and experimental performance of the DSI system. Several DSI-related network and interface issues are discussed, including the interaction between echo-control devices and DSI speech detectors, single and multidestinational DSI operation, location of the DSI equipment relative to the international switching center, and the location and need for Doppler and plesiochronous alignment buffers. The transition from 64-kbit/s DSI to 32-kbit/s low-rate encoding/DSI is expected to begin in 1988. The impact of this transition is discussed as it relates to existing 64-kbit/s DSI/DNI equipment.
Now you hear it, now you don't: vowel devoicing in Japanese infant-directed speech.
Fais, Laurel; Kajikawa, Sachiyo; Amano, Shigeaki; Werker, Janet F
2010-03-01
In this work, we examine a context in which a conflict arises between two roles that infant-directed speech (IDS) plays: making language structure salient and modeling the adult form of a language. Vowel devoicing in fluent adult Japanese creates violations of the canonical Japanese consonant-vowel word structure pattern by systematically devoicing particular vowels, yielding surface consonant clusters. We measured vowel devoicing rates in a corpus of infant- and adult-directed Japanese speech, for both read and spontaneous speech, and found that the mothers in our study preserve the fluent adult form of the language and mask underlying phonological structure by devoicing vowels in infant-directed speech at virtually the same rates as those for adult-directed speech. The results highlight the complex interrelationships among the modifications to adult speech that comprise infant-directed speech, and that form the input from which infants begin to build the eventual mature form of their native language.
Orthography Affects Second Language Speech: Double Letters and Geminate Production in English
ERIC Educational Resources Information Center
Bassetti, Bene
2017-01-01
Second languages (L2s) are often learned through spoken and written input, and L2 orthographic forms (spellings) can lead to non-native-like pronunciation. The present study investigated whether orthography can lead experienced learners of English[subscript L2] to make a phonological contrast in their speech production that does not exist in…
ERIC Educational Resources Information Center
Metz, Dale Evan; And Others
1992-01-01
A preliminary scheme for estimating the speech intelligibility of hearing-impaired speakers from acoustic parameters, using a computerized artificial neural network to process mathematically the acoustic input variables, is outlined. Tests with 60 hearing-impaired speakers found the scheme to be highly accurate in identifying speakers separated by…
Input, Output, and Negotiation of Meaning in Spanish Conversation Classes
ERIC Educational Resources Information Center
Rondon-Pari, Graziela
2014-01-01
This research study is based on the analysis of speech in three Spanish conversation classes. Research questions are: What is the ratio of English and Spanish spoken in class? Is classroom speech more predominant in students or the instructor? And, are teachers' beliefs in regards to the use of English and Spanish consistent with their classroom…
NLEdit: A generic graphical user interface for Fortran programs
NASA Technical Reports Server (NTRS)
Curlett, Brian P.
1994-01-01
NLEdit is a generic graphical user interface for the preprocessing of Fortran namelist input files. The interface consists of a menu system, a message window, a help system, and data entry forms. A form is generated for each namelist. The form has an input field for each namelist variable along with a one-line description of that variable. Detailed help information, default values, and minimum and maximum allowable values can all be displayed via menu picks. Inputs are processed through a scientific calculator program that allows complex equations to be used instead of simple numeric inputs. A custom user interface is generated simply by entering information about the namelist input variables into an ASCII file. There is no need to learn a new graphics system or programming language. NLEdit can be used as a stand-alone program or as part of a larger graphical user interface. Although NLEdit is intended for files using namelist format, it can be easily modified to handle other file formats.
Real-Time Extended Interface Automata for Software Testing Cases Generation
Yang, Shunkun; Xu, Jiaqi; Man, Tianlong; Liu, Bin
2014-01-01
Testing and verification of the interface between software components are particularly important due to the large number of complex interactions, which requires traditional modeling languages to overcome their shortcomings in describing temporal information and controlling software testing inputs. This paper presents real-time extended interface automata (RTEIA), which add clearer and more detailed temporal information description through the use of time words. We also establish an input interface automaton for every input in order to handle input control and interface coverage flexibly when applied to software testing. Detailed definitions of the RTEIA and the test case generation algorithm are provided in this paper. The feasibility and efficiency of this method have been verified in the testing of a real aircraft braking system. PMID:24892080
2005-01-01
… Interface Compatibility); the tool is written in Ocaml [10], and the symbolic algorithms for interface compatibility and refinement are built on top of … Interface automata for a fire detection and reporting system can be encoded in the input language of the tool TIC. The refinement of sociable interfaces is discussed …; these are closely related to the I/O Automata Language (IOA) of [11]. Interface models are games between Input and Output, and in the models it is …
Role of Linguistic Input in Third Person Singular -"s" Use in the Speech of Young Children
ERIC Educational Resources Information Center
Finneran, Denise A.; Leonard, Laurence B.
2010-01-01
Purpose: To examine the role of linguistic input in how young, typically developing children use the 3rd person singular -"s" (3S) inflection. Method: Novel verbs were presented to 16 young children in either 3S contexts (e.g., "The tiger heens") or nonfinite (NF) contexts (e.g., "Will the tiger heen?"). The input was further manipulated for…
Wang, Yulin; Tian, Xuelong
2014-08-01
In order to improve the speech quality and auditory perceptiveness of an electronic cochlear implant against a strong noise background, a speech enhancement system for the electronic cochlear implant front-end was constructed. With digital signal processing (DSP) at its core, the system combines the DSP's multi-channel buffered serial port (McBSP) data transmission channel with the extended audio interface chip TLV320AIC10, enabling high-speed speech signal acquisition and output. Because the traditional speech enhancement method suffers from poor adaptability, slow convergence, and large steady-state error, a versiera function and a de-correlation principle were used to improve the existing adaptive filtering algorithm, which effectively enhanced the quality of voice communication. Test results verified the stability of the system and the de-noising performance of the algorithm, and showed that they can provide clearer speech signals for deaf or tinnitus patients.
A Mis-recognized Medical Vocabulary Correction System for Speech-based Electronic Medical Record
Seo, Hwa Jeong; Kim, Ju Han; Sakabe, Nagamasa
2002-01-01
Speech recognition as an input tool for the electronic medical record (EMR) enables efficient data entry at the point of care. However, the recognition accuracy for medical vocabulary is much poorer than that for doctor-patient dialogue. We developed a mis-recognized medical vocabulary correction system based on syllable-by-syllable comparison of the recognized text against a medical vocabulary database. Using specialty medical vocabulary, the algorithm detects and corrects mis-recognized medical terms in narrative text. Our preliminary evaluation showed 94% accuracy in mis-recognized medical vocabulary correction.
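To illustrate the syllable-by-syllable matching idea described above, the following Python sketch compares a recognized syllable sequence against a small, hypothetical syllabified vocabulary and returns the closest entry; the vocabulary, the similarity measure (difflib's SequenceMatcher), and the acceptance threshold are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the published system): correct a mis-recognized term by
# finding the dictionary entry whose syllable sequence is closest to it.
from difflib import SequenceMatcher

# Hypothetical syllabified medical vocabulary (syllables as list elements).
VOCABULARY = {
    "appendicitis": ["ap", "pen", "di", "ci", "tis"],
    "pericarditis": ["pe", "ri", "car", "di", "tis"],
    "bronchitis":   ["bron", "chi", "tis"],
}

def correct_term(recognized_syllables, threshold=0.7):
    """Return the best-matching vocabulary term, or None if nothing is close enough."""
    best_term, best_score = None, 0.0
    for term, syllables in VOCABULARY.items():
        score = SequenceMatcher(None, recognized_syllables, syllables).ratio()
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else None

# A recognizer might drop the first syllable and return "pen-di-ci-tis".
print(correct_term(["pen", "di", "ci", "tis"]))  # -> "appendicitis"
```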
Smart command recognizer (SCR) - For development, test, and implementation of speech commands
NASA Technical Reports Server (NTRS)
Simpson, Carol A.; Bunnell, John W.; Krones, Robert R.
1988-01-01
The SCR, a rapid prototyping system for the development, testing, and implementation of speech commands in a flight simulator or test aircraft, is described. A single unit performs all functions needed during these three phases of system development, while the use of common software and speech command data structure files greatly reduces the preparation time for successive development phases. As a smart peripheral to a simulation or flight host computer, the SCR interprets the pilot's spoken input and passes command codes to the simulation or flight computer.
A multilingual audiometer simulator software for training purposes.
Kompis, Martin; Steffen, Pascal; Caversaccio, Marco; Brugger, Urs; Oesch, Ivo
2012-04-01
A set of algorithms is presented which allows a computer to determine the answers of simulated patients during pure tone and speech audiometry. Based on these algorithms, a computer program for training in audiometry was written and found to be useful for teaching purposes. The aim was to develop flexible audiometer simulator software as a teaching and training tool for pure tone and speech audiometry, both with and without masking. First, a set of algorithms was developed which allows a computer to determine the answers of a simulated, hearing-impaired patient. The software was then implemented. Extensive use was made of simple, editable text files to define all texts in the user interface and all patient definitions. The 'audiometer simulator' software is available for free download. It can be used to train pure tone audiometry (both with and without masking), speech audiometry, measurement of the uncomfortable level, and simple simulation tests. Owing to the use of text files, the user can alter or add patient definitions and all texts and labels shown on the screen. So far, English, French, German, and Portuguese user interfaces are available, and the user can choose between German and French speech audiometry.
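As a rough illustration of how a simulated patient can be made to answer during pure tone audiometry, the sketch below has a virtual listener respond whenever the presented level reaches a stored threshold, with a small probabilistic region around it; the threshold values, the 5 dB transition slope, and the ascending probe are assumptions for illustration rather than the published algorithms.

```python
# Minimal sketch (assumptions, not the published algorithms): a simulated patient
# "hears" a pure tone once the presentation level reaches their threshold, with a
# probabilistic transition region around the threshold.
import random

# Hypothetical air-conduction thresholds (dB HL) for one simulated ear.
THRESHOLDS = {250: 20, 500: 25, 1000: 30, 2000: 45, 4000: 60, 8000: 70}

def patient_responds(freq_hz, level_db_hl, slope_db=5.0):
    """Return True if the simulated patient signals that the tone was heard."""
    threshold = THRESHOLDS[freq_hz]
    if level_db_hl >= threshold + slope_db:
        return True
    if level_db_hl <= threshold - slope_db:
        return False
    # Near threshold: respond probabilistically, as real patients do.
    p = (level_db_hl - (threshold - slope_db)) / (2 * slope_db)
    return random.random() < p

# Simple ascending probe: raise the level in 5 dB steps until a response occurs.
level = 0
while not patient_responds(1000, level):
    level += 5
print(f"First response at 1 kHz: {level} dB HL")
```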
An algorithm that improves speech intelligibility in noise for normal-hearing listeners.
Kim, Gibak; Lu, Yang; Hu, Yi; Loizou, Philipos C
2009-09-01
Traditional noise-suppression algorithms have been shown to improve speech quality, but not speech intelligibility. Motivated by prior intelligibility studies of speech synthesized using the ideal binary mask, an algorithm is proposed that decomposes the input signal into time-frequency (T-F) units and makes binary decisions, based on a Bayesian classifier, as to whether each T-F unit is dominated by the target or the masker. Speech corrupted at low signal-to-noise ratio (SNR) levels (-5 and 0 dB) using different types of maskers is synthesized by this algorithm and presented to normal-hearing listeners for identification. Results indicated substantial improvements in intelligibility (over 60% points in -5 dB babble) over that attained by human listeners with unprocessed stimuli. The findings from this study suggest that algorithms that can estimate reliably the SNR in each T-F unit can improve speech intelligibility.
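The sketch below illustrates the time-frequency masking idea that the proposed algorithm approximates. It computes and applies an ideal binary mask using separate access to the target and masker signals; the paper's contribution is a Bayesian classifier that estimates such a mask from the noisy mixture alone, so the STFT parameters, the local SNR criterion, and the stand-in signals here are illustrative assumptions.

```python
# Sketch of the ideal-binary-mask idea the algorithm approximates (assumes access to
# the separate target and masker signals, unlike the Bayesian classifier described,
# which estimates the mask from the mixture).
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(target, masker, fs, local_snr_db=0.0, nperseg=512):
    f, t, T = stft(target, fs, nperseg=nperseg)
    _, _, M = stft(masker, fs, nperseg=nperseg)
    # Keep a T-F unit only if the target dominates the masker by the criterion SNR.
    mask = (20 * np.log10(np.abs(T) + 1e-12)
            - 20 * np.log10(np.abs(M) + 1e-12)) > local_snr_db
    _, _, X = stft(target + masker, fs, nperseg=nperseg)
    _, y = istft(X * mask, fs, nperseg=nperseg)
    return y

fs = 16000
rng = np.random.default_rng(0)
target = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # stand-in for "speech"
masker = rng.normal(scale=0.5, size=fs)                  # stand-in for babble noise
enhanced = ideal_binary_mask(target, masker, fs)
```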
How should a speech recognizer work?
Scharenborg, Odette; Norris, Dennis; Bosch, Louis; McQueen, James M
2005-11-12
Although researchers studying human speech recognition (HSR) and automatic speech recognition (ASR) share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of communication follows largely from the fact that research in these related fields has focused on the mechanics of how speech can be recognized. In Marr's (1982) terms, emphasis has been on the algorithmic and implementational levels rather than on the computational level. In this article, we provide a computational-level analysis of the task of speech recognition, which reveals the close parallels between research concerned with HSR and ASR. We illustrate this relation by presenting a new computational model of human spoken-word recognition, built using techniques from the field of ASR that, in contrast to current existing models of HSR, recognizes words from real speech input. 2005 Lawrence Erlbaum Associates, Inc.
ERIC Educational Resources Information Center
Falcao, Taciana Pontual; Price, Sara
2011-01-01
Tangible technologies and shared interfaces create new paradigms for mediating collaboration through dynamic, synchronous environments, where action is as important as speech for participating and contributing to the activity. However, interaction with shared interfaces has been shown to be inherently susceptible to peer interference, potentially…
Robot Command Interface Using an Audio-Visual Speech Recognition System
NASA Astrophysics Data System (ADS)
Ceballos, Alexánder; Gómez, Juan; Prieto, Flavio; Redarce, Tanneguy
In recent years audio-visual speech recognition has emerged as an active field of research thanks to advances in pattern recognition, signal processing and machine vision. Its ultimate goal is to allow human-computer communication using voice, taking into account the visual information contained in the audio-visual speech signal. This document presents an automatic command recognition system using audio-visual information. The system is expected to control the laparoscopic robot da Vinci. The audio signal is processed using the Mel Frequency Cepstral Coefficients parametrization method. In addition, features based on the points that define the mouth's outer contour according to the MPEG-4 standard are used in order to extract the visual speech information.
The cortical organization of lexical knowledge: A dual lexicon model of spoken language processing
Gow, David W.
2012-01-01
Current accounts of spoken language assume the existence of a lexicon where wordforms are stored and interact during spoken language perception, understanding and production. Despite the theoretical importance of the wordform lexicon, the exact localization and function of the lexicon in the broader context of language use is not well understood. This review draws on evidence from aphasia, functional imaging, neuroanatomy, laboratory phonology and behavioral results to argue for the existence of parallel lexica that facilitate different processes in the dorsal and ventral speech pathways. The dorsal lexicon, localized in the inferior parietal region including the supramarginal gyrus, serves as an interface between phonetic and articulatory representations. The ventral lexicon, localized in the posterior superior temporal sulcus and middle temporal gyrus, serves as an interface between phonetic and semantic representations. In addition to their interface roles, the two lexica contribute to the robustness of speech processing. PMID:22498237
NASA Technical Reports Server (NTRS)
Rasmussen, Robert D. (Inventor); Manning, Robert M. (Inventor); Lewis, Blair F. (Inventor); Bolotin, Gary S. (Inventor); Ward, Richard S. (Inventor)
1990-01-01
This is a distributed computing system providing flexible fault tolerance; ease of software design and concurrency specification; and dynamic balancing of loads. The system comprises a plurality of computers, each having a first input/output interface and a second input/output interface for interfacing to communications networks, each second input/output interface including a bypass for bypassing the associated computer. A global communications network interconnects the first input/output interfaces, providing each computer the ability to broadcast messages simultaneously to the remainder of the computers. A meshwork communications network interconnects the second input/output interfaces, providing each computer with the ability to establish a communications link with another of the computers while bypassing the remainder of the computers. Each computer is controlled by a resident copy of a common operating system. Communications between respective computers are by means of split tokens, each having a moving first portion which is sent from computer to computer and a resident second portion which is disposed in the memory of at least one of the computers, wherein the location of the second portion is part of the first portion. The split tokens represent both functions to be executed by the computers and data to be employed in the execution of the functions. The first input/output interfaces each include logic for detecting a collision between messages and for terminating the broadcasting of a message, whereby collisions between messages are detected and avoided.
TongueToSpeech (TTS): Wearable wireless assistive device for augmented speech.
Marjanovic, Nicholas; Piccinini, Giacomo; Kerr, Kevin; Esmailbeigi, Hananeh
2017-07-01
Speech is an important aspect of human communication; individuals with speech impairment are unable to communicate vocally in real time. Our team has developed the TongueToSpeech (TTS) device with the goal of augmenting speech communication for the vocally impaired. The proposed device is a wearable wireless assistive device that incorporates a capacitive touch keyboard interface embedded inside a discrete retainer. The device connects to a computer, tablet or smartphone via a Bluetooth connection. The TTS application we developed converts text typed by the tongue into audible speech. Our studies have concluded that an 8-contact-point configuration between the tongue and the TTS device yields the best user precision and speed performance. On average, using the TTS device inside the oral cavity takes 2.5 times longer than using the pointer finger on a T9 (Text on 9 keys) keyboard to type the same phrase. In conclusion, we have developed a discrete noninvasive wearable device that allows vocally impaired individuals to communicate in real time.
Joint Spatial-Spectral Feature Space Clustering for Speech Activity Detection from ECoG Signals
Kanas, Vasileios G.; Mporas, Iosif; Benz, Heather L.; Sgarbas, Kyriakos N.; Bezerianos, Anastasios; Crone, Nathan E.
2014-01-01
Brain machine interfaces for speech restoration have been extensively studied for more than two decades. The success of such a system will depend in part on selecting the best brain recording sites and signal features corresponding to speech production. The purpose of this study was to detect speech activity automatically from electrocorticographic signals based on joint spatial-frequency clustering of the ECoG feature space. For this study, the ECoG signals were recorded while a subject performed two different syllable repetition tasks. We found that the optimal frequency resolution to detect speech activity from ECoG signals was 8 Hz, achieving 98.8% accuracy by employing support vector machines (SVM) as a classifier. We also defined the cortical areas that held the most information about the discrimination of speech and non-speech time intervals. Additionally, the results shed light on the distinct cortical areas associated with the two syllable repetition tasks and may contribute to the development of portable ECoG-based communication. PMID:24658248
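A minimal sketch of the classification stage is shown below, using a support vector machine on synthetic stand-ins for channel-by-band power features; the feature construction, window counts, and class separability are assumptions, and the study's joint spatial-spectral clustering step is not reproduced.

```python
# Sketch of the speech/non-speech classification stage only (synthetic features
# stand in for ECoG band power at 8 Hz resolution; the study first selected
# channels and bands via joint spatial-spectral clustering).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_windows, n_features = 400, 20          # windows x (channel, band) power features
speech = rng.normal(1.0, 1.0, (n_windows // 2, n_features))   # hypothetical speech windows
silence = rng.normal(0.0, 1.0, (n_windows // 2, n_features))  # hypothetical non-speech windows
X = np.vstack([speech, silence])
y = np.array([1] * (n_windows // 2) + [0] * (n_windows // 2))

clf = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Cross-validated speech/non-speech accuracy: {scores.mean():.2f}")
```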
NASA Astrophysics Data System (ADS)
Jelinek, H. J.
1986-01-01
This is the Final Report of Electronic Design Associates on its Phase I SBIR project. The purpose of this project is to develop a method for correcting helium speech, as experienced in diver-surface communication. The goal of the Phase I study was to design, prototype, and evaluate a real-time helium speech corrector system based upon digital signal processing techniques. The general approach was to develop hardware (an IBM PC board) to digitize helium speech and software (a LAMBDA computer based simulation) to translate the speech. As planned in the study proposal, this initial prototype may now be used to assess the expected performance of a self-contained real-time system which uses an identical algorithm. The Final Report details the work carried out to produce the prototype system. The four major project tasks were as follows: a signal processing scheme for converting helium speech to normal-sounding speech was generated; the signal processing scheme was simulated on a general purpose (LAMBDA) computer, actual helium speech was supplied to the simulation, and the converted speech was generated; an IBM-PC based 14-bit data input/output board was designed and built; and a bibliography of references on speech processing was generated.
Onojima, Takayuki; Kitajo, Keiichi; Mizuhara, Hiroaki
2017-01-01
Neural oscillation is attracting attention as an underlying mechanism for speech recognition. Speech intelligibility is enhanced by the synchronization of speech rhythms and slow neural oscillation, which is typically observed with human scalp electroencephalography (EEG). In addition to the effect of neural oscillation, it has been proposed that speech recognition is enhanced by the identification of a speaker's motor signals, which are used for speech production. To verify the relationship between the effect of neural oscillation and motor cortical activity, we measured scalp EEG, and simultaneous EEG and functional magnetic resonance imaging (fMRI), during a speech recognition task in which participants were required to recognize spoken words embedded in noise. We proposed an index to quantitatively evaluate the EEG phase effect on behavioral performance. The results showed that the delta and theta EEG phase before speech input modulated the participants' response times when conducting speech recognition tasks. The simultaneous EEG-fMRI experiment showed that slow EEG activity was correlated with motor cortical activity. These results suggest that the effect of the slow oscillatory phase is associated with the activity of the motor cortex during speech recognition.
Effects of human fatigue on speech signals
NASA Astrophysics Data System (ADS)
Stamoulis, Catherine
2004-05-01
Cognitive performance may be significantly affected by fatigue. In the case of critical personnel, such as pilots, monitoring human fatigue is essential to ensure safety and success of a given operation. One of the modalities that may be used for this purpose is speech, which is sensitive to respiratory changes and increased muscle tension of vocal cords, induced by fatigue. Age, gender, vocal tract length, physical and emotional state may significantly alter speech intensity, duration, rhythm, and spectral characteristics. In addition to changes in speech rhythm, fatigue may also affect the quality of speech, such as articulation. In a noisy environment, detecting fatigue-related changes in speech signals, particularly subtle changes at the onset of fatigue, may be difficult. Therefore, in a performance-monitoring system, speech parameters which are significantly affected by fatigue need to be identified and extracted from input signals. For this purpose, a series of experiments was performed under slowly varying cognitive load conditions and at different times of the day. The results of the data analysis are presented here.
Noise suppression methods for robust speech processing
NASA Astrophysics Data System (ADS)
Boll, S. F.; Ravindra, H.; Randall, G.; Armantrout, R.; Power, R.
1980-05-01
Robust speech processing in practical operating environments requires effective environmental and processor noise suppression. This report describes the technical findings and accomplishments during this reporting period for the research program funded to develop real-time, compressed speech analysis-synthesis algorithms whose performance is invariant under signal contamination. Fulfillment of this requirement is necessary to ensure reliable, secure compressed speech transmission within realistic military command and control environments. Overall contributions resulting from this research program include an understanding of how environmental noise degrades narrowband coded speech, development of appropriate real-time noise suppression algorithms, and development of speech parameter identification methods that treat signal contamination as a fundamental element in the estimation process. This report describes the current research and results in the areas of noise suppression using dual-input adaptive noise cancellation and short-time Fourier transform algorithms, articulation rate change techniques, and an experiment which demonstrated that the spectral subtraction noise suppression algorithm can improve the intelligibility of 2400 bps, LPC-10 coded helicopter speech by 10.6 points.
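As a textbook-style illustration of the dual-input adaptive noise cancellation mentioned above, the sketch below runs a time-domain LMS filter that estimates the noise in the primary (speech plus noise) channel from a reference noise channel; the filter length, step size, and synthetic signals are assumptions, and the report's actual short-time Fourier transform implementation is not reproduced.

```python
# Minimal LMS adaptive noise canceller (a textbook sketch of the dual-input idea):
# the reference channel picks up a correlated version of the noise, and the adaptive
# filter subtracts its estimate from the primary (speech + noise) channel.
import numpy as np

def lms_cancel(primary, reference, n_taps=32, mu=0.005):
    w = np.zeros(n_taps)
    out = np.zeros_like(primary)
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]      # reference tap vector
        noise_est = w @ x
        out[n] = primary[n] - noise_est        # error = enhanced speech estimate
        w += 2 * mu * out[n] * x               # LMS weight update
    return out

fs = 8000
t = np.arange(2 * fs) / fs
speech = np.sin(2 * np.pi * 300 * t)                       # stand-in for speech
noise = np.random.default_rng(1).normal(size=t.size)
primary = speech + np.convolve(noise, [0.6, 0.3], mode="same")
enhanced = lms_cancel(primary, noise)
```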
Brancalioni, Ana Rita; Magnago, Karine Faverzani; Keske-Soares, Marcia
2012-09-01
The objective of this study is to create a new proposal for classifying the severity of speech disorders using a fuzzy model in accordance with a linguistic model that represents the speech acquisition of Brazilian Portuguese. The fuzzy linguistic model was run in the MATLAB software fuzzy toolbox from a set of fuzzy rules, and it encompassed three input variables: path routing, level of complexity and phoneme acquisition. The output was the Speech Disorder Severity Index, and it used the following fuzzy subsets: severe, moderate severe, mild moderate and mild. The proposal was used for 204 children with speech disorders who were monolingual speakers of Brazilian Portuguese. The fuzzy linguistic model provided the Speech Disorder Severity Index for all of the evaluated phonological systems in a fast and practical manner. It was then possible to classify the systems according to the severity of the speech disorder as severe, moderate severe, mild moderate and mild; the speech disorders could also be differentiated according to the severity index.
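The following plain-Python sketch illustrates the kind of fuzzy machinery involved (triangular membership functions, rule clipping, and centroid defuzzification); the membership ranges, the single input used, and the rule base are hypothetical stand-ins, not the published MATLAB model with its three inputs and full rule set.

```python
# Plain-Python sketch of fuzzy severity scoring (hypothetical memberships and rules;
# the published model was built in the MATLAB fuzzy toolbox with three inputs).
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def severity_index(phoneme_acquisition):
    """Map a 0-100 'phonemes acquired' score to a 0-1 severity index (centroid)."""
    # Firing strengths of three hypothetical rules driven by a single input.
    severe = tri(phoneme_acquisition, 0, 20, 45)
    moderate = tri(phoneme_acquisition, 30, 55, 80)
    mild = tri(phoneme_acquisition, 65, 90, 115)
    # Output universe for the severity index and its fuzzy output sets.
    x = np.linspace(0.0, 1.0, 501)
    agg = np.maximum.reduce([
        np.minimum(severe, tri(x, 0.55, 0.80, 1.05)),    # "severe" region near 1
        np.minimum(moderate, tri(x, 0.30, 0.50, 0.70)),
        np.minimum(mild, tri(x, -0.05, 0.20, 0.45)),     # "mild" region near 0
    ])
    return float((x * agg).sum() / (agg.sum() + 1e-9))   # centroid defuzzification

print(severity_index(30))   # few phonemes acquired -> index near the "severe" end
print(severity_index(85))   # most phonemes acquired -> index near the "mild" end
```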
All words are not created equal: Expectations about word length guide infant statistical learning
Lew-Williams, Casey; Saffran, Jenny R.
2011-01-01
Infants have been described as ‘statistical learners’ capable of extracting structure (such as words) from patterned input (such as language). Here, we investigated whether prior knowledge influences how infants track transitional probabilities in word segmentation tasks. Are infants biased by prior experience when engaging in sequential statistical learning? In a laboratory simulation of learning across time, we exposed 9- and 10-month-old infants to a list of either bisyllabic or trisyllabic nonsense words, followed by a pause-free speech stream composed of a different set of bisyllabic or trisyllabic nonsense words. Listening times revealed successful segmentation of words from fluent speech only when words were uniformly bisyllabic or trisyllabic throughout both phases of the experiment. Hearing trisyllabic words during the pre-exposure phase derailed infants’ abilities to segment speech into bisyllabic words, and vice versa. We conclude that prior knowledge about word length equips infants with perceptual expectations that facilitate efficient processing of subsequent language input. PMID:22088408
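For readers unfamiliar with the statistic infants are thought to track, the sketch below computes syllable-to-syllable transitional probabilities over a pause-free stream built from three hypothetical trisyllabic nonsense words; within-word transitions approach 1.0 while transitions across word boundaries fall to about one third.

```python
# Sketch of the transitional-probability statistic thought to drive segmentation
# (hypothetical syllable stream built from the "words" pabiku / tibudo / golatu).
from collections import Counter
import random

words = [("pa", "bi", "ku"), ("ti", "bu", "do"), ("go", "la", "tu")]
stream = [s for _ in range(300) for s in random.choice(words)]  # pause-free stream

pair_counts = Counter(zip(stream, stream[1:]))
first_counts = Counter(stream[:-1])

def transitional_probability(a, b):
    """P(b | a): how predictive syllable a is of syllable b."""
    return pair_counts[(a, b)] / first_counts[a]

print(transitional_probability("pa", "bi"))  # within-word transition: 1.0
print(transitional_probability("ku", "ti"))  # across a word boundary: ~0.33
```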
Communication system with adaptive noise suppression
NASA Technical Reports Server (NTRS)
Kozel, David (Inventor); Devault, James A. (Inventor); Birr, Richard B. (Inventor)
2007-01-01
A signal-to-noise ratio dependent adaptive spectral subtraction process eliminates noise from noise-corrupted speech signals. The process first pre-emphasizes the frequency components of the input sound signal which contain the consonant information in human speech. Next, a signal-to-noise ratio is determined and a spectral subtraction proportion adjusted appropriately. After spectral subtraction, low amplitude signals can be squelched. A single microphone is used to obtain both the noise-corrupted speech and the average noise estimate. This is done by determining if the frame of data being sampled is a voiced or unvoiced frame. During unvoiced frames an estimate of the noise is obtained. A running average of the noise is used to approximate the expected value of the noise. Spectral subtraction may be performed on a composite noise-corrupted signal, or upon individual sub-bands of the noise-corrupted signal. Pre-averaging of the input signal's magnitude spectrum over multiple time frames may be performed to reduce musical noise.
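A simplified numpy sketch of SNR-dependent spectral subtraction is given below; the noise-spectrum estimate, the over-subtraction rule, and the spectral floor are illustrative assumptions, and the patented process additionally includes consonant pre-emphasis, voiced/unvoiced noise tracking, sub-band processing, and squelching.

```python
# Simplified sketch of SNR-dependent magnitude spectral subtraction (the noise
# estimate and subtraction rule here are illustrative assumptions).
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, noise_estimate, fs, nperseg=256):
    _, _, X = stft(noisy, fs, nperseg=nperseg)
    _, _, N = stft(noise_estimate, fs, nperseg=nperseg)
    noise_mag = np.mean(np.abs(N), axis=1, keepdims=True)    # running-average stand-in
    frame_snr = 10 * np.log10(np.mean(np.abs(X) ** 2, axis=0)
                              / (np.mean(noise_mag ** 2) + 1e-12))
    alpha = np.clip(4.0 - 0.15 * frame_snr, 1.0, 5.0)        # subtract more at low SNR
    mag = np.maximum(np.abs(X) - alpha * noise_mag, 0.05 * noise_mag)
    _, y = istft(mag * np.exp(1j * np.angle(X)), fs, nperseg=nperseg)
    return y

fs = 8000
rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 200 * np.arange(fs) / fs)          # stand-in for speech
noise = rng.normal(scale=0.3, size=fs)
enhanced = spectral_subtract(clean + noise, noise, fs)
```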
ERIC Educational Resources Information Center
McLeod, Sharynne; Baker, Elise; McCormack, Jane; Wren, Yvonne; Roulstone, Sue; Crowe, Kathryn; Masso, Sarah; White, Paul; Howland, Charlotte
2017-01-01
Purpose: The aim was to evaluate the effectiveness of computer-assisted input-based intervention for children with speech sound disorders (SSD). Method: The Sound Start Study was a cluster-randomized controlled trial. Seventy-nine early childhood centers were invited to participate, 45 were recruited, and 1,205 parents and educators of 4- and…
Effect of hearing loss on semantic access by auditory and audiovisual speech in children.
Jerger, Susan; Tye-Murray, Nancy; Damian, Markus F; Abdi, Hervé
2013-01-01
This research studied whether the mode of input (auditory versus audiovisual) influenced semantic access by speech in children with sensorineural hearing impairment (HI). Participants, 31 children with HI and 62 children with normal hearing (NH), were tested with the authors' new multimodal picture word task. Children were instructed to name pictures displayed on a monitor and ignore auditory or audiovisual speech distractors. The semantic content of the distractors was varied to be related versus unrelated to the pictures (e.g., picture distractor of dog-bear versus dog-cheese, respectively). In children with NH, picture-naming times were slower in the presence of semantically related distractors. This slowing, called semantic interference, is attributed to the meaning-related picture-distractor entries competing for selection and control of the response (the lexical selection by competition hypothesis). Recently, a modification of the lexical selection by competition hypothesis, called the competition threshold (CT) hypothesis, proposed that (1) the competition between the picture-distractor entries is determined by a threshold, and (2) distractors with experimentally reduced fidelity cannot reach the CT. Thus, semantically related distractors with reduced fidelity do not produce the normal interference effect, but instead no effect or semantic facilitation (faster picture naming times for semantically related versus unrelated distractors). Facilitation occurs because the activation level of the semantically related distractor with reduced fidelity (1) is not sufficient to exceed the CT and produce interference but (2) is sufficient to activate its concept, which then strengthens the activation of the picture and facilitates naming. This research investigated whether the proposals of the CT hypothesis generalize to the auditory domain, to the natural degradation of speech due to HI, and to participants who are children. Our multimodal picture word task allowed us to (1) quantify picture naming results in the presence of auditory speech distractors and (2) probe whether the addition of visual speech enriched the fidelity of the auditory input sufficiently to influence results. In the HI group, the auditory distractors produced no effect or a facilitative effect, in agreement with proposals of the CT hypothesis. In contrast, the audiovisual distractors produced the normal semantic interference effect. Results in the HI versus NH groups differed significantly for the auditory mode, but not for the audiovisual mode. This research indicates that the lower fidelity auditory speech associated with HI affects the normalcy of semantic access by children. Further, adding visual speech enriches the lower fidelity auditory input sufficiently to produce the semantic interference effect typical of children with NH.
Noise Hampers Children’s Expressive Word Learning
Riley, Kristine Grohne; McGregor, Karla K.
2013-01-01
Purpose: To determine the effects of noise and speech style on word learning in typically developing school-age children. Method: Thirty-one participants ages 9;0 (years; months) to 10;11 attempted to learn 2 sets of 8 novel words and their referents. They heard all of the words 13 times each within meaningful narrative discourse. Signal-to-noise ratio (noise vs. quiet) and speech style (plain vs. clear) were manipulated such that half of the children heard the new words in broadband white noise and half heard them in quiet; within those conditions, each child heard one set of words produced in a plain speech style and another set in a clear speech style. Results: Children who were trained in quiet learned to produce the word forms more accurately than those who were trained in noise. Clear speech resulted in more accurate word form productions than plain speech, whether the children had learned in noise or quiet. Learning from clear speech in noise and plain speech in quiet produced comparable results. Conclusion: Noise limits expressive vocabulary growth in children, reducing the quality of word form representation in the lexicon. Clear speech input can aid expressive vocabulary growth in children, even in noisy environments. PMID:22411494
Speech coding at 4800 bps for mobile satellite communications
NASA Technical Reports Server (NTRS)
Gersho, Allen; Chan, Wai-Yip; Davidson, Grant; Chen, Juin-Hwey; Yong, Mei
1988-01-01
A speech compression project has recently been completed to develop a speech coding algorithm suitable for operation in a mobile satellite environment aimed at providing telephone quality natural speech at 4.8 kbps. The work has resulted in two alternative techniques which achieve reasonably good communications quality at 4.8 kbps while tolerating vehicle noise and rather severe channel impairments. The algorithms are embodied in a compact self-contained prototype consisting of two AT and T 32-bit floating-point DSP32 digital signal processors (DSP). A Motorola 68HC11 microcomputer chip serves as the board controller and interface handler. On a wirewrapped card, the prototype's circuit footprint amounts to only 200 sq cm, and consumes about 9 watts of power.
Wolfe, Jace; Schafer, Erin; Parkinson, Aaron; John, Andrew; Hudson, Mary; Wheeler, Julie; Mucci, Angie
2013-01-01
The objective of this study was to compare speech recognition in quiet and in noise for cochlear implant recipients using two different types of personal frequency modulation (FM) systems (directly coupled [direct auditory input] versus induction neckloop) with each of two sound processors (Cochlear Nucleus Freedom versus Cochlear Nucleus 5). Two different experiments were conducted within this study. In both these experiments, mixing of the FM signal within the Freedom processor was implemented via the same scheme used clinically for the Freedom sound processor. In Experiment 1, the aforementioned comparisons were conducted with the Nucleus 5 programmed so that the microphone and FM signals were mixed and then the mixed signals were subjected to autosensitivity control (ASC). In Experiment 2, comparisons between the two FM systems and processors were conducted again with the Nucleus 5 programmed to provide a more complex multistage implementation of ASC during the preprocessing stage. This study was a within-subject, repeated-measures design. Subjects were recruited from the patient population at the Hearts for Hearing Foundation in Oklahoma City, OK. Fifteen subjects participated in Experiment 1, and 16 subjects participated in Experiment 2. Subjects were adults who had used either unilateral or bilateral cochlear implants for at least 1 year. In this experiment, no differences were found in speech recognition in quiet obtained with the two different FM systems or the various sound-processor conditions. With each sound processor, speech recognition in noise was better with the directly coupled direct auditory input system relative to the neckloop system. The multistage ASC processing of the Nucleus 5 sound processor provided better performance than the single-stage approach for the Nucleus 5 and the Nucleus Freedom sound processor. Speech recognition in noise is substantially affected by the type of sound processor, FM system, and implementation of ASC used by a Cochlear implant recipient.
1983-12-31
…perception as much as binaural backward masking. Dichotic backward masking effects have also been found with more complex stimuli, such as CV syllables… On the basis of these results and of binaural masking effects, it has been suggested that an auditory input produces a preperceptual auditory image that… four, in two sessions separated by at least 48 hours. In the "speech" session, subjects were first presented binaurally with the series of [ba] and [ga]…
1988-09-01
Command and control; computational linguistics; expert system voice recognition; man-machine interface. … simulates the characteristics of FRESH on a smaller scale. This study assisted NOSC in developing a voice-recognition, man-machine interface that could be used with TONE and upgraded at a later date.
Guest editorial: Introduction to the special issue on modern control for computer games.
Argyriou, Vasileios; Kotsia, Irene; Zafeiriou, Stefanos; Petrou, Maria
2013-12-01
A typical gaming scenario, as developed in the past 20 years, involves a player interacting with a game using a specialized input device, such as a joystick, a mouse, or a keyboard. Recent technological advances and new sensors (for example, low-cost commodity depth cameras) have enabled the introduction of more elaborate approaches in which the player is now able to interact with the game using his body pose, facial expressions, actions, and even his physiological signals. A new era of games has already started, employing computer vision techniques, brain-computer interface systems, and haptic and wearable devices. The future lies in games that will be intelligent enough not only to extract the player's commands provided by his speech and gestures but also his behavioral cues and emotional states, and adjust their game plot accordingly in order to ensure a more realistic and satisfactory gameplay experience. This special issue on modern control for computer games discusses several interdisciplinary factors that influence a user's input to a game, something directly linked to the gaming experience. These include, but are not limited to, the following: behavioral affective gaming, user satisfaction and perception, motion capture and scene modeling, and complete software frameworks that address several challenges arising in such scenarios.
Evaluation of the importance of time-frequency contributions to speech intelligibility in noise
Yu, Chengzhu; Wójcicki, Kamil K.; Loizou, Philipos C.; Hansen, John H. L.; Johnson, Michael T.
2014-01-01
Recent studies on binary masking techniques make the assumption that each time-frequency (T-F) unit contributes an equal amount to the overall intelligibility of speech. The present study demonstrated that the importance of each T-F unit to speech intelligibility varies in accordance with speech content. Specifically, T-F units are categorized into two classes, speech-present T-F units and speech-absent T-F units. Results indicate that the importance of each speech-present T-F unit to speech intelligibility is highly related to the loudness of its target component, while the importance of each speech-absent T-F unit varies according to the loudness of its masker component. Two types of mask errors are also considered, which include miss and false alarm errors. Consistent with previous work, false alarm errors are shown to be more harmful to speech intelligibility than miss errors when the mixture signal-to-noise ratio (SNR) is below 0 dB. However, the relative importance between the two types of error is conditioned on the SNR level of the input speech signal. Based on these observations, a mask-based objective measure, the loudness weighted hit-false, is proposed for predicting speech intelligibility. The proposed objective measure shows significantly higher correlation with intelligibility compared to two existing mask-based objective measures. PMID:24815280
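The sketch below shows one plausible reading of a loudness-weighted hit-minus-false-alarm score: hits in speech-present T-F units are weighted by target loudness and false alarms in speech-absent units by masker loudness. The weighting scheme, mask shapes, and loudness values are assumptions for illustration and should not be taken as the published formula.

```python
# Hedged sketch of a loudness-weighted hit-minus-false-alarm score (an interpretation
# of the measure described, not the published definition).
import numpy as np

def loudness_weighted_hit_false(ideal_mask, estimated_mask, target_loud, masker_loud):
    present, absent = ideal_mask == 1, ideal_mask == 0
    hits = (estimated_mask == 1) & present
    false_alarms = (estimated_mask == 1) & absent
    hit_rate = target_loud[hits].sum() / (target_loud[present].sum() + 1e-12)
    fa_rate = masker_loud[false_alarms].sum() / (masker_loud[absent].sum() + 1e-12)
    return hit_rate - fa_rate

rng = np.random.default_rng(3)
shape = (64, 100)                               # (frequency bands, time frames)
ideal = rng.integers(0, 2, shape)
estimate = np.where(rng.random(shape) < 0.9, ideal, 1 - ideal)   # 10% mask errors
score = loudness_weighted_hit_false(ideal, estimate, rng.random(shape), rng.random(shape))
print(f"Loudness-weighted hit-false: {score:.2f}")
```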
Speech and gesture interfaces for squad-level human-robot teaming
NASA Astrophysics Data System (ADS)
Harris, Jonathan; Barber, Daniel
2014-06-01
As the military increasingly adopts semi-autonomous unmanned systems for military operations, utilizing redundant and intuitive interfaces for communication between Soldiers and robots is vital to mission success. Currently, Soldiers use a common lexicon to verbally and visually communicate maneuvers between teammates. In order for robots to be seamlessly integrated within mixed-initiative teams, they must be able to understand this lexicon. Recent innovations in gaming platforms have led to advancements in speech and gesture recognition technologies, but the reliability of these technologies for enabling communication in human-robot teaming is unclear. The purpose of the present study is to investigate the performance of Commercial-Off-The-Shelf (COTS) speech and gesture recognition tools in classifying a Squad Level Vocabulary (SLV) for a spatial navigation reconnaissance and surveillance task. The SLV for this study was based on findings from a survey conducted with Soldiers at Fort Benning, GA. The items of the survey focused on communication between the Soldier and the robot, specifically with regard to verbally instructing the robot to execute reconnaissance and surveillance tasks. Resulting commands, identified from the survey, were then converted to equivalent arm and hand gestures, leveraging existing visual signals (e.g., the U.S. Army Field Manual for Visual Signaling). A study was then run to test the ability of commercially available automated speech recognition technologies and a gesture recognition glove to classify these commands in a simulated intelligence, surveillance, and reconnaissance task. This paper presents the classification accuracy of these devices for both speech and gesture modalities independently.
A Hierarchical multi-input and output Bi-GRU Model for Sentiment Analysis on Customer Reviews
NASA Astrophysics Data System (ADS)
Zhang, Liujie; Zhou, Yanquan; Duan, Xiuyu; Chen, Ruiqi
2018-03-01
Multi-label sentiment classification of customer reviews is a practical and challenging task in Natural Language Processing. In this paper, we propose a hierarchical multi-input and multi-output model based on a bi-directional recurrent neural network, which considers both the semantic and the lexical information of emotional expression. Our model applies two independent Bi-GRU layers to generate part-of-speech and sentence representations. The lexical information is then incorporated via attention over the softmax output of the part-of-speech representation. In addition, we combine the probabilities of auxiliary labels as features with the hidden layer to capture crucial correlations between output labels. The experimental results show that our model is computationally efficient and achieves breakthrough improvements on a customer reviews dataset.
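For illustration, the following is a minimal, hypothetical PyTorch sketch of the general arrangement described above: two independent Bi-GRU branches over word and part-of-speech sequences, attention weights derived from the POS branch, and auxiliary label probabilities concatenated before the multi-label output. Layer sizes, vocabulary sizes, and the exact fusion scheme are assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class HierBiGRU(nn.Module):
        """Illustrative two-branch Bi-GRU for multi-label sentiment (not the paper's exact model)."""
        def __init__(self, vocab=10000, pos_tags=50, emb=100, hid=128, aux=5, labels=8):
            super().__init__()
            self.word_emb = nn.Embedding(vocab, emb)
            self.pos_emb = nn.Embedding(pos_tags, emb)
            self.word_gru = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
            self.pos_gru = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
            self.pos_score = nn.Linear(2 * hid, 1)      # per-token attention logit from the POS branch
            self.aux_head = nn.Linear(2 * hid, aux)     # auxiliary label probabilities
            self.out = nn.Linear(2 * hid + aux, labels) # main multi-label output

        def forward(self, words, pos):
            w, _ = self.word_gru(self.word_emb(words))      # (B, T, 2H) sentence representation
            p, _ = self.pos_gru(self.pos_emb(pos))          # (B, T, 2H) part-of-speech representation
            attn = torch.softmax(self.pos_score(p), dim=1)  # lexical attention over tokens
            sent = (attn * w).sum(dim=1)                    # attention-weighted sentence vector
            aux = torch.sigmoid(self.aux_head(sent))        # auxiliary label probabilities as features
            return torch.sigmoid(self.out(torch.cat([sent, aux], dim=-1)))

    model = HierBiGRU()
    scores = model(torch.randint(0, 10000, (2, 20)), torch.randint(0, 50, (2, 20)))
    print(scores.shape)  # torch.Size([2, 8])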
NASA Astrophysics Data System (ADS)
Přibil, Jiří; Přibilová, Anna; Ďuračková, Daniela
2014-01-01
The paper describes our experiment with using Gaussian mixture models (GMM) for classification of speech uttered by a person wearing orthodontic appliances. For the GMM classification, the input feature vectors comprise the basic and the complementary spectral properties as well as the supra-segmental parameters. The dependence of classification correctness on the number of parameters in the input feature vector and on the computational complexity is also evaluated. In addition, the influence of the initial parameter settings for the GMM training process was analyzed. The obtained recognition results are compared visually in the form of graphs as well as numerically in the form of tables and confusion matrices for the tested sentences uttered using three configurations of orthodontic appliances.
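As an illustration of the general GMM classification scheme described above (one model per appliance configuration, decision by maximum likelihood over a feature vector), here is a minimal scikit-learn sketch; the random toy features, the number of mixture components, and the covariance type are placeholders, not the authors' settings.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_gmms(features_by_class, n_components=8):
        """Fit one GMM per class; features_by_class maps class label -> (n_frames, n_features) array."""
        return {label: GaussianMixture(n_components=n_components, covariance_type="diag",
                                       random_state=0).fit(feats)
                for label, feats in features_by_class.items()}

    def classify(gmms, utterance_feats):
        """Pick the class whose GMM gives the highest total log-likelihood for the utterance."""
        scores = {label: gmm.score_samples(utterance_feats).sum() for label, gmm in gmms.items()}
        return max(scores, key=scores.get)

    # Toy example with random "spectral" feature vectors for three appliance configurations.
    rng = np.random.default_rng(0)
    train = {c: rng.normal(loc=i, size=(200, 13))
             for i, c in enumerate(["none", "appliance_A", "appliance_B"])}
    gmms = train_gmms(train)
    print(classify(gmms, rng.normal(loc=1, size=(50, 13))))  # expected: "appliance_A"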
Rollout of Endeavour at Palmdale, California (Part 1 of 2)
NASA Technical Reports Server (NTRS)
1991-01-01
Footage shows the rollout ceremonies for Endeavour, including the display of colors, invocation, and speeches by Sam Iacobellis, Executive Vice-President and CEO of Rockwell International, Richard H. Truly, Administrator for NASA, and Senator Jake Garn (Utah). The tape ends during the speech by Senator Garn and continues on part two (Input Processing ID 2000152220, Document ID 20010010951). Endeavour rolls out to music provided by the band on-site.
Audio-visual speech perception in adult readers with dyslexia: an fMRI study.
Rüsseler, Jascha; Ye, Zheng; Gerth, Ivonne; Szycik, Gregor R; Münte, Thomas F
2018-04-01
Developmental dyslexia is a specific deficit in reading and spelling that often persists into adulthood. In the present study, we used slow event-related fMRI and independent component analysis to identify brain networks involved in the perception of audio-visual speech in a group of adult readers with dyslexia (RD) and a group of fluent readers (FR). Participants saw a video of a female speaker saying a disyllabic word. In the congruent condition, audio and video input were identical, whereas in the incongruent condition, the two inputs differed. Participants had to respond to occasionally occurring animal names. The independent component analysis (ICA) identified several components that were differentially modulated in FR and RD. Two of these components, including fusiform gyrus and occipital gyrus, showed less activation in RD compared to FR, possibly indicating a deficit in extracting the face information needed to integrate auditory and visual information in natural speech perception. A further component centered on the superior temporal sulcus (STS) also exhibited less activation in RD compared to FR. This finding is corroborated by the univariate analysis, which shows less activation in STS for RD compared to FR. These findings suggest a general impairment in the recruitment of audiovisual processing areas in dyslexia during the perception of natural speech.
Piquado, Tepring; Benichov, Jonathan I.; Brownell, Hiram; Wingfield, Arthur
2013-01-01
Objective: The purpose of this research was to determine whether negative effects of hearing loss on recall accuracy for spoken narratives can be mitigated by allowing listeners to control the rate of speech input. Design: Paragraph-length narratives were presented for recall under two listening conditions in a within-participants design: presentation without interruption (continuous) at an average speech-rate of 150 words per minute; and presentation interrupted at periodic intervals at which participants were allowed to pause before initiating the next segment (self-paced). Study sample: Participants were 24 adults ranging from 21 to 33 years of age. Half had age-normal hearing acuity and half had mild-to-moderate hearing loss. The two groups were comparable for age, years of formal education, and vocabulary. Results: When narrative passages were presented continuously, without interruption, participants with hearing loss recalled significantly fewer story elements, both main ideas and narrative details, than those with age-normal hearing. The recall difference was eliminated when the two groups were allowed to self-pace the speech input. Conclusion: Results support the hypothesis that the listening effort associated with reduced hearing acuity can slow processing operations and increase demands on working memory, with consequent negative effects on accuracy of narrative recall. PMID:22731919
Brain Volume Differences Associated With Hearing Impairment in Adults
Vriend, Chris; Heslenfeld, Dirk J.; Versfeld, Niek J.; Kramer, Sophia E.
2018-01-01
Speech comprehension depends on the successful operation of a network of brain regions. Processing of degraded speech is associated with different patterns of brain activity in comparison with that of high-quality speech. In this exploratory study, we studied whether processing degraded auditory input in daily life because of hearing impairment is associated with differences in brain volume. We compared T1-weighted structural magnetic resonance images of 17 hearing-impaired (HI) adults with those of 17 normal-hearing (NH) controls using a voxel-based morphometry analysis. HI adults were individually matched with NH adults based on age and educational level. Gray and white matter brain volumes were compared between the groups by region-of-interest analyses in structures associated with speech processing, and by whole-brain analyses. The results suggest increased gray matter volume in the right angular gyrus and decreased white matter volume in the left fusiform gyrus in HI listeners as compared with NH ones. In the HI group, there was a significant correlation between hearing acuity and cluster volume of the gray matter cluster in the right angular gyrus. This correlation supports the link between partial hearing loss and altered brain volume. The alterations in volume may reflect the operation of compensatory mechanisms that are related to decoding meaning from degraded auditory input. PMID:29557274
NASA Technical Reports Server (NTRS)
Tobey, G. L.
1978-01-01
Tests were performed to evaluate the operating characteristics of the interface between the Space Lab Bus Interface Unit (SL/BIU) and the Orbiter Multiplexer-Demultiplexer (MDM) serial data input-output (SIO) module. This volume contains the test equipment preparation procedures and a detailed description of the Nova/Input Output Processor Simulator (IOPS) software used during the data transfer tests to determine word error rates (WER).
Development of speech prostheses: current status and recent advances
Brumberg, Jonathan S; Guenther, Frank H
2010-01-01
Brain–computer interfaces (BCIs) have been developed over the past decade to restore communication to persons with severe paralysis. In the most severe cases of paralysis, known as locked-in syndrome, patients retain cognition and sensation, but are capable of only slight voluntary eye movements. For these patients, no standard communication method is available, although some can use BCIs to communicate by selecting letters or words on a computer. Recent research has sought to improve on existing techniques by using BCIs to create a direct prediction of speech utterances rather than to simply control a spelling device. Such methods are the first steps towards speech prostheses as they are intended to entirely replace the vocal apparatus of paralyzed users. This article outlines many well known methods for restoration of communication by BCI and illustrates the difference between spelling devices and direct speech prediction or speech prosthesis. PMID:20822389
Speech processing and production in two-year-old children acquiring isiXhosa: A tale of two children
Rossouw, Kate; Fish, Laura; Jansen, Charne; Manley, Natalie; Powell, Michelle; Rosen, Loren
2016-01-01
We investigated the speech processing and production of 2-year-old children acquiring isiXhosa in South Africa. Two children (2 years, 5 months; 2 years, 8 months) are presented as single cases. Speech input processing, stored phonological knowledge and speech output are described, based on data from auditory discrimination, naming, and repetition tasks. Both children were approximating adult levels of accuracy in their speech output, although naming was constrained by vocabulary. Performance across tasks was variable: One child showed a relative strength with repetition, and experienced most difficulties with auditory discrimination. The other performed equally well in naming and repetition, and obtained 100% for her auditory task. There is limited data regarding typical development of isiXhosa, and the focus has mainly been on speech production. This exploratory study describes typical development of isiXhosa using a variety of tasks understood within a psycholinguistic framework. We describe some ways in which speech and language therapists can devise and carry out assessment with children in situations where few formal assessments exist, and also detail the challenges of such work. PMID:27245131
V2S: Voice to Sign Language Translation System for Malaysian Deaf People
NASA Astrophysics Data System (ADS)
Mean Foong, Oi; Low, Tang Jung; La, Wai Wan
The process of learning and understanding sign language may be cumbersome to some; therefore, this paper proposes a solution to this problem by providing a voice (English language) to sign language translation system using speech and image processing techniques. Speech processing, which includes speech recognition, is the study of recognizing the words being spoken, regardless of who the speaker is. This project uses template-based recognition as the main approach, in which the V2S system first needs to be trained with speech patterns based on some generic spectral parameter set. These spectral parameter sets are then stored as templates in a database. The system performs the recognition process by matching the parameter set of the input speech with the stored templates and finally displays the sign language in video format. Empirical results show that the system has an 80.3% recognition rate.
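Template-based recognition of the kind described above is commonly implemented with dynamic time warping (DTW) between the spectral parameter sequence of the input and each stored template. The sketch below is a generic, illustrative version of that matching step under that assumption, not the V2S code.

    import numpy as np

    def dtw_distance(a, b):
        """Dynamic time warping distance between two feature sequences (frames x features)."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def recognize(input_feats, templates):
        """Return the word whose stored template is closest to the input under DTW."""
        return min(templates, key=lambda w: dtw_distance(input_feats, templates[w]))

    # Toy usage with random "spectral parameter" sequences standing in for trained templates.
    rng = np.random.default_rng(1)
    templates = {"hello": rng.normal(size=(40, 12)), "thanks": rng.normal(size=(55, 12))}
    query = templates["thanks"] + 0.1 * rng.normal(size=(55, 12))
    print(recognize(query, templates))  # "thanks"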
Building Interfaces between the Humanities and Cognitive Sciences: The Case of Human Speech
ERIC Educational Resources Information Center
Benus, Stefan
2010-01-01
I argue that creating "interfaces" between the humanities and cognitive sciences would be intellectually stimulating for both groups. More specifically for the humanities: they might gain challenging and rewarding avenues of inquiry, attract more funding, and advance their position in the 21st-century universities and among the general public, if…
Tona, Risa; Naito, Yasushi; Moroto, Saburo; Yamamoto, Rinko; Fujiwara, Keizo; Yamazaki, Hiroshi; Shinohara, Shogo; Kikuchi, Masahiro
2015-12-01
To investigate the McGurk effect in profoundly deafened Japanese children with cochlear implants (CI) and in normal-hearing children. This was done to identify how children with profound deafness using CI established audiovisual integration during the speech acquisition period. Twenty-four prelingually deafened children with CI and 12 age-matched normal-hearing children participated in this study. Responses to audiovisual stimuli were compared between deafened and normal-hearing controls. Additionally, responses of the children with CI younger than 6 years of age were compared with those of the children with CI at least 6 years of age at the time of the test. Responses to stimuli combining auditory labials and visual non-labials were significantly different between deafened children with CI and normal-hearing controls (p<0.05). Additionally, the McGurk effect tended to be more induced in deafened children older than 6 years of age than in their younger counterparts. The McGurk effect was more significantly induced in prelingually deafened Japanese children with CI than in normal-hearing, age-matched Japanese children. Despite having good speech-perception skills and auditory input through their CI, from early childhood, deafened children may use more visual information in speech perception than normal-hearing children. As children using CI need to communicate based on insufficient speech signals coded by CI, additional activities of higher-order brain function may be necessary to compensate for the incomplete auditory input. This study provided information on the influence of deafness on the development of audiovisual integration related to speech, which could contribute to our further understanding of the strategies used in spoken language communication by prelingually deafened children.
The Use of an Eight-Step Instructional Model to Train School Staff in Partner-Augmented Input
ERIC Educational Resources Information Center
Senner, Jill E.; Baud, Matthew R.
2017-01-01
An eight-step instruction model was used to train a self-contained classroom teacher, speech-language pathologist, and two instructional assistants in partner-augmented input, a modeling strategy for teaching augmentative and alternative communication use. With the exception of a 2-hr training session, instruction primarily was conducted during…
Why Not Non-Native Varieties of English as Listening Comprehension Test Input?
ERIC Educational Resources Information Center
Abeywickrama, Priyanvada
2013-01-01
The existence of different varieties of English in target language use (TLU) domains calls into question the usefulness of listening comprehension tests whose input is limited only to a native speaker variety. This study investigated the impact of non-native varieties or accented English speech on test takers from three different English use…
Gesture as Input in Language Acquisition: Learning "Who She Is" from "Where She Is"
ERIC Educational Resources Information Center
Goodrich, Whitney Sarah-Iverson
2009-01-01
This dissertation explores the role co-speech gesture plays as input in language learning, specifically with respect to the acquisition of anaphoric pronouns. Four studies investigate how both adults and children interpret ambiguous pronouns, and how the order-of-mention tendency develops in children. The results suggest that gesture is a useful…
Shin, Young Hoon; Seo, Jiwon
2016-01-01
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing. PMID:27801867
The brain dynamics of rapid perceptual adaptation to adverse listening conditions.
Erb, Julia; Henry, Molly J; Eisner, Frank; Obleser, Jonas
2013-06-26
Listeners show a remarkable ability to quickly adjust to degraded speech input. Here, we aimed to identify the neural mechanisms of such short-term perceptual adaptation. In a sparse-sampling, cardiac-gated functional magnetic resonance imaging (fMRI) acquisition, human listeners heard and repeated back 4-band-vocoded sentences (in which the temporal envelope of the acoustic signal is preserved, while spectral information is highly degraded). Clear-speech trials were included as baseline. An additional fMRI experiment on amplitude modulation rate discrimination quantified the convergence of neural mechanisms that subserve coping with challenging listening conditions for speech and non-speech. First, the degraded speech task revealed an "executive" network (comprising the anterior insula and anterior cingulate cortex), parts of which were also activated in the non-speech discrimination task. Second, trial-by-trial fluctuations in successful comprehension of degraded speech drove hemodynamic signal change in classic "language" areas (bilateral temporal cortices). Third, as listeners perceptually adapted to degraded speech, downregulation in a cortico-striato-thalamo-cortical circuit was observable. The present data highlight differential upregulation and downregulation in auditory-language and executive networks, respectively, with important subcortical contributions when successfully adapting to a challenging listening situation.
Local area network with fault-checking, priorities, and redundant backup
NASA Technical Reports Server (NTRS)
Morales, Sergio (Inventor); Friedman, Gary L. (Inventor)
1989-01-01
This invention is a redundant error detecting and correcting local area networked computer system having a plurality of nodes each including a network connector board within the node for connecting to an interfacing transceiver operably attached to a network cable. There is a first network cable disposed along a path to interconnect the nodes. The first network cable includes a plurality of first interfacing transceivers attached thereto. A second network cable is disposed in parallel with the first cable and, in like manner, includes a plurality of second interfacing transceivers attached thereto. There are a plurality of three position switches each having a signal input, three outputs for individual selective connection to the input, and a control input for receiving signals designating which of the outputs is to be connected to the signal input. Each of the switches includes means for designating a response address for responding to addressed signals appearing at the control input and each of the switches further has its signal input connected to a respective one of the input/output lines from the nodes. Also, one of the three outputs is connected to a respective one of the plurality of first interfacing transceivers. There is master switch control means having an output connected to the control inputs of the plurality of three position switches and an input for receiving directive signals for outputting addressed switch position signals to the three position switches as well as monitor and control computer means having a pair of network connector boards therein connected to respective ones of one of the first interfacing transceivers and one of the second interfacing transceivers and an output connected to the input of the master switch means for monitoring the status of the networked computer system by sending messages to the nodes and receiving and verifying messages therefrom and for sending control signals to the master switch to cause the master switch to cause respective ones of the nodes to use a desired one of the first and second cables for transmitting and receiving messages and for disconnecting desired ones of the nodes from both cables.
High voltage photo switch package module
Sullivan, James S; Sanders, David M; Hawkins, Steven A; Sampayan, Stephen E
2014-02-18
A photo-conductive switch package module having a photo-conductive substrate or wafer with opposing electrode-interface surfaces, and at least one light-input surface. First metallic layers are formed on the electrode-interface surfaces, and one or more optical waveguides having input and output ends are bonded to the substrate so that the output end of each waveguide is bonded to a corresponding one of the light-input surfaces of the photo-conductive substrate. This forms a waveguide-substrate interface for coupling light into the photo-conductive wafer. A dielectric material such as epoxy is then used to encapsulate the photo-conductive substrate and optical waveguide so that only the metallic layers and the input end of the optical waveguide are exposed. Second metallic layers are then formed on the first metallic layers so that the waveguide-substrate interface is positioned under the second metallic layers.
17 Ways to Say Yes: Toward Nuanced Tone of Voice in AAC and Speech Technology
Pullin, Graham; Hennig, Shannon
2015-01-01
People with complex communication needs who use speech-generating devices have very little expressive control over their tone of voice. Despite its importance in human interaction, however, the issue of tone of voice remains all but absent from AAC research and development. In this paper, we describe three interdisciplinary projects, past, present and future: the critical design collection Six Speaking Chairs has provoked deeper discussion and inspired a social model of tone of voice; the speculative concept Speech Hedge illustrates challenges and opportunities in designing more expressive user interfaces; the pilot project Tonetable could enable participatory research and seed a research network around tone of voice. We speculate that more radical interactions might expand the frontiers of AAC and disrupt speech technology as a whole. PMID:25965913
Measuring Input Thresholds on an Existing Board
NASA Technical Reports Server (NTRS)
Kuperman, Igor; Gutrich, Daniel G.; Berkun, Andrew C.
2011-01-01
A critical PECL (positive emitter-coupled logic) to Xilinx interface needed to be changed on an existing flight board. The new Xilinx input interface used a CMOS (complementary metal-oxide semiconductor) type of input, and the driver could meet its thresholds typically, but not in the worst case, according to the data sheet. The previous interface had been based on comparison with an external reference, but the CMOS input is based on comparison with an internal divider from the power supply. A way was needed to measure the exact input threshold of this device for 64 inputs on a flight board. The measurement technique allowed an accurate measurement of the voltage required to switch a Xilinx input from high to low for each of the 64 lines, while only probing two of them. Directly driving an external voltage was considered too risky, and tests done on any other unit could not be used to qualify the flight board. The two lines directly probed gave an absolute voltage threshold calibration, while data collected on the remaining 62 lines without probing gave relative measurements that could be used to identify any outliers. The PECL interface was forced to a long-period square wave by driving a saturated square wave into the ADC (analog to digital converter). The active pull-down circuit was turned off, causing each line to rise rapidly and fall slowly according to the input's weak pull-down circuitry. The fall time shows up as a change in the pulse width of the signal read by the Xilinx. This change in pulse width is a function of capacitance, pull-down current, and input threshold. Capacitance was known from the different trace lengths, plus a gate input capacitance, which is the same for all inputs. The pull-down current is the same for all inputs including the two that are probed directly. The data was combined, and the Excel solver tool was used to find input thresholds for the 62 lines. This was repeated over different supply voltages and temperatures to show that the interface had voltage margin under all worst-case conditions. Gate input thresholds are normally measured at the manufacturer when the device is on a chip tester. A key function of this machine was duplicated on an existing flight board with no modifications to the nets to be tested, with the exception of changes in the FPGA program.
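To make the relationship concrete, here is a small numeric sketch under a simplified constant-current discharge model (extra pulse width = C x (V_high - V_th) / I_pd). The model, the value of V_HIGH, and all numbers below are illustrative assumptions; the actual work used an Excel solver over all 64 lines, supply voltages, and temperatures.

    import numpy as np

    # Hypothetical model: delta_t_i = C_i * (V_HIGH - Vth_i) / I_pd.
    # A directly probed line gives absolute calibration; the rest are inferred
    # from their measured pulse-width changes and known trace + gate capacitances.
    V_HIGH = 2.5  # assumed high level before the slow fall (V)

    def pulldown_current(dt_cal, c_cal, vth_cal):
        """Solve the common pull-down current from one calibrated (probed) line."""
        return c_cal * (V_HIGH - vth_cal) / dt_cal

    def thresholds(dt, c, i_pd):
        """Infer per-line input thresholds from pulse-width change and capacitance."""
        return V_HIGH - i_pd * np.asarray(dt) / np.asarray(c)

    # Toy numbers (not measured data): one calibrated line, then three unprobed lines.
    i_pd = pulldown_current(dt_cal=80e-9, c_cal=10e-12, vth_cal=1.25)
    print(thresholds(dt=[75e-9, 82e-9, 90e-9], c=[10e-12, 11e-12, 12e-12], i_pd=i_pd))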
The Input-Interface of Webcam Applied in 3D Virtual Reality Systems
ERIC Educational Resources Information Center
Sun, Huey-Min; Cheng, Wen-Lin
2009-01-01
Our research explores a virtual reality application based on a Web camera (Webcam) input-interface. The interface can replace the mouse for capturing the directional intention of a user via the method of frame difference. We divide a frame from the Webcam into nine grids and make use of background registration to compute the moving object. In order to…
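A minimal OpenCV sketch of a frame-difference pointer of this general kind follows: the centroid of moving pixels is mapped to a 3x3 grid of direction commands. The grid labels, threshold value, and centroid-based mapping are guesses at the scheme, not the authors' code.

    import cv2

    GRID = [["up-left", "up", "up-right"],
            ["left", "center", "right"],
            ["down-left", "down", "down-right"]]

    cap = cv2.VideoCapture(0)
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(gray, prev)                     # frame difference against previous frame
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        m = cv2.moments(mask)
        if m["m00"] > 0:                                   # centroid of the moving region
            cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
            h, w = mask.shape
            print(GRID[int(3 * cy / h)][int(3 * cx / w)])  # which of the nine grids moved
        prev = gray
        cv2.imshow("motion", mask)
        if cv2.waitKey(30) & 0xFF == 27:                   # Esc to quit
            break
    cap.release()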
BrainIACS: a system for web-based medical image processing
NASA Astrophysics Data System (ADS)
Kishore, Bhaskar; Bazin, Pierre-Louis; Pham, Dzung L.
2009-02-01
We describe BrainIACS, a web-based medical image processing system that enables algorithm developers to quickly create extensible user interfaces for their algorithms. Designed to address the challenges faced by algorithm developers in providing user-friendly graphical interfaces, BrainIACS is completely implemented using freely available, open-source software. The system, which is based on a client-server architecture, utilizes an AJAX front-end written using the Google Web Toolkit (GWT) and Java Servlets running on Apache Tomcat as its back-end. To enable developers to quickly and simply create user interfaces for configuring their algorithms, the interfaces are described using XML and are parsed by our system to create the corresponding user interface elements. Most of the commonly found elements, such as check boxes, drop-down lists, input boxes, radio buttons, tab panels, and group boxes, are supported. Some elements, such as the input box, support input validation. Changes to the user interface, such as addition and deletion of elements, are performed by editing the XML file or by using the system's user interface creator. In addition to user interface generation, the system also provides its own interfaces for data transfer, previewing of input and output files, and algorithm queuing. As the system is programmed using Java (and finally JavaScript after compilation of the front-end code), it is platform independent, with the only requirements being that a Servlet implementation be available and that the processing algorithms can execute on the server platform.
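To illustrate the XML-to-widget idea, here is a small Python sketch that parses a made-up interface description into widget specifications a front-end could render. The tag and attribute names are hypothetical; the actual BrainIACS schema is not reproduced here.

    import xml.etree.ElementTree as ET

    # Hypothetical interface description in the spirit of the abstract.
    SPEC = """
    <interface algorithm="skull-stripping">
      <checkbox name="apply_bias_correction" label="Bias correction" default="true"/>
      <dropdown name="atlas" label="Atlas">
        <option>adult</option>
        <option>pediatric</option>
      </dropdown>
      <inputbox name="threshold" label="Threshold" type="float" min="0" max="1" default="0.5"/>
    </interface>
    """

    def parse_interface(xml_text):
        """Turn the XML description into a list of widget specs a front-end could render."""
        root = ET.fromstring(xml_text)
        widgets = []
        for el in root:
            spec = {"widget": el.tag, **el.attrib}
            if el.tag == "dropdown":
                spec["options"] = [o.text for o in el.findall("option")]
            widgets.append(spec)
        return {"algorithm": root.get("algorithm"), "widgets": widgets}

    print(parse_interface(SPEC))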
Larm, Petra; Hongisto, Valtteri
2006-02-01
During the acoustical design of, e.g., auditoria or open-plan offices, it is important to know how speech can be perceived in various parts of the room. Different objective methods have been developed to measure and predict speech intelligibility, and these have been extensively used in various spaces. In this study, two such methods were compared: the speech transmission index (STI) and the speech intelligibility index (SII). A simplification of the STI, the room acoustics speech transmission index (RASTI), was also considered. These quantities are all based on determining an apparent speech-to-noise ratio on selected frequency bands and summing them using a specific weighting. For comparison, data were needed on the possible differences between these methods resulting from the calculation scheme and from the measuring equipment. Their prediction accuracy was also of interest. Measurements were made in a laboratory having adjustable noise level and absorption, and in a real auditorium. It was found that the measurement equipment, especially the selection of the loudspeaker, can greatly affect the accuracy of the results. The prediction accuracy of the RASTI was found acceptable, provided the input values for the prediction are accurately known, even though the studied space was not ideally diffuse.
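The common core of these indices is a per-band apparent speech-to-noise ratio, clipped to a fixed range, normalized to a transmission index, and combined with band weights. The sketch below shows that schematic core using the STI-style +/-15 dB clipping; the equal band weights are placeholders, not the standardized STI or SII weightings, and the full standards include further corrections not shown here.

    import numpy as np

    def band_transmission_index(apparent_snr_db):
        """Clip apparent SNR to +/-15 dB and map linearly to a 0..1 transmission index."""
        snr = np.clip(apparent_snr_db, -15.0, 15.0)
        return (snr + 15.0) / 30.0

    def weighted_index(apparent_snr_db, band_weights):
        """Weighted sum of per-band transmission indices (schematic STI/SII-style combination)."""
        w = np.asarray(band_weights, dtype=float)
        return float(np.sum(w / w.sum() * band_transmission_index(apparent_snr_db)))

    # Seven octave bands (125 Hz ... 8 kHz) with placeholder equal weights.
    snr_per_band = np.array([12.0, 9.0, 6.0, 3.0, 0.0, -3.0, -6.0])
    print(weighted_index(snr_per_band, band_weights=np.ones(7)))  # ~0.60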
Using the Electrocorticographic Speech Network to Control a Brain-Computer Interface in Humans
Leuthardt, Eric C.; Gaona, Charles; Sharma, Mohit; Szrama, Nicholas; Roland, Jarod; Freudenberg, Zac; Solis, Jamie; Breshears, Jonathan; Schalk, Gerwin
2013-01-01
Electrocorticography (ECoG) has emerged as a new signal platform for brain-computer interface (BCI) systems. Classically, the cortical physiology that has been commonly investigated and utilized for device control in humans has been brain signals from sensorimotor cortex. Hence, it was unknown whether other neurophysiological substrates, such as the speech network, could be used to further improve on or complement existing motor-based control paradigms. We demonstrate here for the first time that ECoG signals associated with different overt and imagined phoneme articulation can enable invasively monitored human patients to control a one-dimensional computer cursor rapidly and accurately. This phonetic content was distinguishable within higher gamma frequency oscillations and enabled users to achieve final target accuracies between 68 and 91% within 15 minutes. Additionally, one of the patients achieved robust control using recordings from a microarray consisting of 1 mm spaced microwires. These findings suggest that the cortical network associated with speech could provide an additional cognitive and physiologic substrate for BCI operation and that these signals can be acquired from a cortical array that is small and minimally invasive. PMID:21471638
Smart mobility solution with multiple input Output interface.
Sethi, Aartika; Deb, Sujay; Ranjan, Prabhat; Sardar, Arghya
2017-07-01
Smart wheelchairs are commonly used to provide a solution for mobility impairment. However, their usage is limited, primarily due to the high cost of the sensors required for input, a lack of adaptability to different categories of input, and limited functionality. In this paper we propose a smart mobility solution using a smartphone with inbuilt sensors (accelerometer, camera and speaker) as an input interface. An Emotiv EPOC+ is also used for motor-imagery-based input control synced with facial expressions in cases of extreme disability. Apart from traction, additional functions like home security and automation are provided using the Internet of Things (IoT) and web interfaces. Although preliminary, our results suggest that this system can be used as an integrated and efficient solution for people suffering from mobility impairment. The results also indicate that a decent accuracy is obtained for the overall system.
Text-to-audiovisual speech synthesizer for children with learning disabilities.
Mendi, Engin; Bayrak, Coskun
2013-01-01
Learning disabilities affect the ability of children to learn, despite their having normal intelligence. Assistive tools can highly increase functional capabilities of children with learning disorders such as writing, reading, or listening. In this article, we describe a text-to-audiovisual synthesizer that can serve as an assistive tool for such children. The system automatically converts an input text to audiovisual speech, providing synchronization of the head, eye, and lip movements of the three-dimensional face model with appropriate facial expressions and word flow of the text. The proposed system can enhance speech perception and help children having learning deficits to improve their chances of success.
End-to-End ASR-Free Keyword Search From Speech
NASA Astrophysics Data System (ADS)
Audhkhasi, Kartik; Rosenberg, Andrew; Sethy, Abhinav; Ramabhadran, Bhuvana; Kingsbury, Brian
2017-12-01
End-to-end (E2E) systems have achieved competitive results compared to conventional hybrid hidden Markov model (HMM)-deep neural network based automatic speech recognition (ASR) systems. Such E2E systems are attractive due to the lack of dependence on alignments between input acoustic and output grapheme or HMM state sequence during training. This paper explores the design of an ASR-free end-to-end system for text query-based keyword search (KWS) from speech trained with minimal supervision. Our E2E KWS system consists of three sub-systems. The first sub-system is a recurrent neural network (RNN)-based acoustic auto-encoder trained to reconstruct the audio through a finite-dimensional representation. The second sub-system is a character-level RNN language model using embeddings learned from a convolutional neural network. Since the acoustic and text query embeddings occupy different representation spaces, they are input to a third feed-forward neural network that predicts whether the query occurs in the acoustic utterance or not. This E2E ASR-free KWS system performs respectably despite lacking a conventional ASR system and trains much faster.
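A minimal PyTorch sketch of the three-part arrangement outlined above follows: an acoustic encoder producing a fixed-dimensional embedding, a character-level query encoder, and a feed-forward network predicting whether the query occurs in the utterance. The use of GRUs in place of the paper's specific RNN/CNN choices, the layer sizes, and the output formulation are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class ASRFreeKWS(nn.Module):
        """Illustrative ASR-free keyword search: does the text query occur in the audio?"""
        def __init__(self, n_feats=40, n_chars=30, hid=128, emb=64):
            super().__init__()
            self.audio_enc = nn.GRU(n_feats, hid, batch_first=True)   # acoustic encoder
            self.char_emb = nn.Embedding(n_chars, emb)
            self.query_enc = nn.GRU(emb, hid, batch_first=True)       # character-level query encoder
            self.classifier = nn.Sequential(                          # fuses the two embeddings
                nn.Linear(2 * hid, hid), nn.ReLU(), nn.Linear(hid, 1))

        def forward(self, audio_feats, query_chars):
            _, h_a = self.audio_enc(audio_feats)                      # final hidden state: (1, B, hid)
            _, h_q = self.query_enc(self.char_emb(query_chars))
            fused = torch.cat([h_a[-1], h_q[-1]], dim=-1)
            return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # P(query occurs in utterance)

    model = ASRFreeKWS()
    p = model(torch.randn(4, 200, 40), torch.randint(0, 30, (4, 12)))
    print(p.shape)  # torch.Size([4])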
Applying Spatial Audio to Human Interfaces: 25 Years of NASA Experience
NASA Technical Reports Server (NTRS)
Begault, Durand R.; Wenzel, Elizabeth M.; Godfrey, Martine; Miller, Joel D.; Anderson, Mark R.
2010-01-01
From the perspective of human factors engineering, the inclusion of spatial audio within a human-machine interface is advantageous from several perspectives. Demonstrated benefits include the ability to monitor multiple streams of speech and non-speech warning tones using a cocktail party advantage, and for aurally-guided visual search. Other potential benefits include the spatial coordination and interaction of multimodal events, and evaluation of new communication technologies and alerting systems using virtual simulation. Many of these technologies were developed at NASA Ames Research Center, beginning in 1985. This paper reviews examples and describes the advantages of spatial sound in NASA-related technologies, including space operations, aeronautics, and search and rescue. The work has involved hardware and software development as well as basic and applied research.
1984-12-01
Routines included in the frame I/O interface (recovered fragment) -- BLOCK DATA: default values for variables input by menus. LIBR: interface with frame I/O routines; selects options for input or output to a data library. SNSR: interface with sensor routines. ATMOS: interface with... FRREAD: reads frame from file and/or...
Techniques and applications for binaural sound manipulation in human-machine interfaces
NASA Technical Reports Server (NTRS)
Begault, Durand R.; Wenzel, Elizabeth M.
1990-01-01
The implementation of binaural sound to speech and auditory sound cues (auditory icons) is addressed from both an applications and technical standpoint. Techniques overviewed include processing by means of filtering with head-related transfer functions. Application to advanced cockpit human interface systems is discussed, although the techniques are extendable to any human-machine interface. Research issues pertaining to three-dimensional sound displays under investigation at the Aerospace Human Factors Division at NASA Ames Research Center are described.
Hybrid Speaker Recognition Using Universal Acoustic Model
NASA Astrophysics Data System (ADS)
Nishimura, Jun; Kuroda, Tadahiro
We propose a novel speaker recognition approach using a speaker-independent universal acoustic model (UAM) for sensornet applications. In sensornet applications such as “Business Microscope”, interactions among knowledge workers in an organization can be visualized by sensing face-to-face communication using wearable sensor nodes. In conventional studies, speakers are detected by comparing the energy of the input speech signals among the nodes. However, there are often synchronization errors among the nodes, which degrade the speaker recognition performance. By focusing on the properties of the speaker's acoustic channel, the UAM can provide robustness against such synchronization errors. The overall speaker recognition accuracy is improved by combining the UAM with the energy-based approach. For 0.1-s speech inputs and 4 subjects, a speaker recognition accuracy of 94% is achieved at synchronization errors of less than 100 ms.
Moore, Brian C J; Füllgrabe, Christian; Stone, Michael A
2011-01-01
To determine preferred parameters of multichannel compression using individually fitted simulated hearing aids and a method of paired comparisons. Fourteen participants with mild to moderate hearing loss listened via a simulated five-channel compression hearing aid fitted using the CAMEQ2-HF method to pairs of speech sounds (a male talker and a female talker) and musical sounds (a percussion instrument, orchestral classical music, and a jazz trio) presented sequentially and indicated which sound of the pair was preferred and by how much. The sounds in each pair were derived from the same token and differed along a single dimension in the type of processing applied. For the speech sounds, participants judged either pleasantness or clarity; in the latter case, the speech was presented in noise at a 2-dB signal-to-noise ratio. For musical sounds, they judged pleasantness. The parameters explored were time delay of the audio signal relative to the gain control signal (the alignment delay), compression speed (attack and release times), bandwidth (5, 7.5, or 10 kHz), and gain at high frequencies relative to that prescribed by CAMEQ2-HF. Pleasantness increased with increasing alignment delay only for the percussive musical sound. Clarity was not affected by alignment delay. There was a trend for pleasantness to decrease slightly with increasing bandwidth, but this was significant only for female speech with fast compression. Judged clarity was significantly higher for the 7.5- and 10-kHz bandwidths than for the 5-kHz bandwidth for both slow and fast compression and for both talker genders. Compression speed had little effect on pleasantness for 50- or 65-dB SPL input levels, but slow compression was generally judged as slightly more pleasant than fast compression for an 80-dB SPL input level. Clarity was higher for slow than for fast compression for input levels of 80 and 65 dB SPL but not for a level of 50 dB SPL. Preferences for pleasantness were approximately equal with CAMEQ2-HF gains and with gains slightly reduced at high frequencies and were lower when gains were slightly increased at high frequencies. Speech clarity was not affected by changing the gain at high frequencies. Effects of alignment delay were small except for the percussive sound. A wider bandwidth was slightly preferred for speech clarity. Speech clarity was slightly greater with slow compression, especially at high levels. Preferred high-frequency gains were close to or a little below those prescribed by CAMEQ2-HF.
ERIC Educational Resources Information Center
Trebits, Anna
2016-01-01
The aim of this study was to investigate the effects of cognitive task complexity and individual differences in input, processing, and output anxiety (IPOA) on L2 narrative production. The participants were enrolled in a bilingual secondary educational program. They performed two narrative tasks in speech and writing. The participants' level of…
ERIC Educational Resources Information Center
Zhuang, Jie; Randall, Billi; Stamatakis, Emmanuel A.; Marslen-Wilson, William D.; Tyler, Lorraine K.
2011-01-01
Spoken word recognition involves the activation of multiple word candidates on the basis of the initial speech input--the "cohort"--and selection among these competitors. Selection may be driven primarily by bottom-up acoustic-phonetic inputs or it may be modulated by other aspects of lexical representation, such as a word's meaning…
Rohlfing, Katharina J.; Nachtigäller, Kerstin
2016-01-01
The learning of spatial prepositions is assumed to be based on experience in space. In a slow mapping study, we investigated whether 31 German 28-month-old children could robustly learn the German spatial prepositions hinter [behind] and neben [next to] from pictures, and whether a narrative input can compensate for a lack of immediate experience in space. One group of children received pictures with a narrative input as training to understand spatial prepositions. In two further groups, we controlled (a) for the narrative input by providing unconnected speech during the training and (b) for the learning material by training the children on toys rather than pictures. We assessed children’s understanding of spatial prepositions at three different time points: pretest, immediate test, and delayed posttest. Results showed improved word retention in children from the narrative group but not from the control group receiving unconnected speech. Neither of the trained groups succeeded in generalization to novel referents. Finally, all groups were instructed to deal with untrained material in the test to investigate the robustness of learning across tasks. None of the groups succeeded in this task transfer. PMID:27471479
Dynamic Encoding of Speech Sequence Probability in Human Temporal Cortex
Leonard, Matthew K.; Bouchard, Kristofer E.; Tang, Claire
2015-01-01
Sensory processing involves identification of stimulus features, but also integration with the surrounding sensory and cognitive context. Previous work in animals and humans has shown fine-scale sensitivity to context in the form of learned knowledge about the statistics of the sensory environment, including relative probabilities of discrete units in a stream of sequential auditory input. These statistics are a defining characteristic of one of the most important sequential signals humans encounter: speech. For speech, extensive exposure to a language tunes listeners to the statistics of sound sequences. To address how speech sequence statistics are neurally encoded, we used high-resolution direct cortical recordings from human lateral superior temporal cortex as subjects listened to words and nonwords with varying transition probabilities between sound segments. In addition to their sensitivity to acoustic features (including contextual features, such as coarticulation), we found that neural responses dynamically encoded the language-level probability of both preceding and upcoming speech sounds. Transition probability first negatively modulated neural responses, followed by positive modulation of neural responses, consistent with coordinated predictive and retrospective recognition processes, respectively. Furthermore, transition probability encoding was different for real English words compared with nonwords, providing evidence for online interactions with high-order linguistic knowledge. These results demonstrate that sensory processing of deeply learned stimuli involves integrating physical stimulus features with their contextual sequential structure. Despite not being consciously aware of phoneme sequence statistics, listeners use this information to process spoken input and to link low-level acoustic representations with linguistic information about word identity and meaning. PMID:25948269
Ylinen, Sari; Nora, Anni; Leminen, Alina; Hakala, Tero; Huotilainen, Minna; Shtyrov, Yury; Mäkelä, Jyrki P; Service, Elisabet
2015-06-01
Speech production, both overt and covert, down-regulates the activation of auditory cortex. This is thought to be due to forward prediction of the sensory consequences of speech, contributing to a feedback control mechanism for speech production. Critically, however, these regulatory effects should be specific to speech content to enable accurate speech monitoring. To determine the extent to which such forward prediction is content-specific, we recorded the brain's neuromagnetic responses to heard multisyllabic pseudowords during covert rehearsal in working memory, contrasted with a control task. The cortical auditory processing of target syllables was significantly suppressed during rehearsal compared with control, but only when they matched the rehearsed items. This critical specificity to speech content enables accurate speech monitoring by forward prediction, as proposed by current models of speech production. The one-to-one phonological motor-to-auditory mappings also appear to serve the maintenance of information in phonological working memory. Further findings of right-hemispheric suppression in the case of whole-item matches and left-hemispheric enhancement for last-syllable mismatches suggest that speech production is monitored by 2 auditory-motor circuits operating on different timescales: Finer grain in the left versus coarser grain in the right hemisphere. Taken together, our findings provide hemisphere-specific evidence of the interface between inner and heard speech.
Attoliter Control of Microliquid
NASA Astrophysics Data System (ADS)
Imura, Fumito; Kuroiwa, Hiroyuki; Nakada, Akira; Kosaka, Kouji; Kubota, Hiroshi
2007-11-01
The technology of sub-femtoliter volume control of liquids in nanometer-range pipettes (nanopipettes) has been developed for carrying out surgical operations on living cells. We focus attention on an interface formed between oil and water in a nanopipette. The interface position can be moved by increasing or decreasing the input pressure. If the volume of liquid in the nanopipette can be controlled by moving the position of the interface, cell organelles can be discharged or suctioned and a drug solution can be injected into the cell. Volume control in the pico- to attoliter range using a tapered nanopipette is achieved by maintaining an interface with a convex shape toward the top of the nanopipette. The volume can be controlled by the input pressure corresponding to the interfacial radius, without the use of a microscope, by preliminarily characterizing the pipette shape and the interface radius as a function of the input pressure.
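The pressure-to-radius correspondence that underlies this control is presumably ordinary capillarity at the oil-water meniscus; the relations below are the standard Young-Laplace law and the volume of an idealized conical taper, offered as a hedged sketch of the likely picture rather than equations taken from the paper.

    \Delta P = \frac{2\gamma \cos\theta}{r}, \qquad
    V(z) \approx \frac{\pi}{3}\,\tan^{2}\!\alpha\,\left(z^{3} - z_{0}^{3}\right)

Here \gamma is the oil-water interfacial tension, \theta the contact angle, r the local pipette radius at the meniscus, \alpha the half-angle of the assumed conical taper, and z the axial position of the interface measured from the virtual apex; setting the input pressure selects r, and moving the meniscus between z_0 and z meters the corresponding volume.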
ERIC Educational Resources Information Center
Choi, Soojung; Lantolf, James P.
2008-01-01
This study investigates the interface between speech and gesture in second language (L2) narration within Slobin's (2003) thinking-for-speaking (TFS) framework as well as with respect to McNeill's (1992, 2005) growth point (GP) hypothesis. Specifically, our interest is in whether speakers shift from a first language (L1) to a L2 TFS pattern as…
A pilot study comparing mouse and mouse-emulating interface devices for graphic input.
Kanny, E M; Anson, D K
1991-01-01
Adaptive interface devices make it possible for individuals with physical disabilities to use microcomputers and thus perform many tasks that they would otherwise be unable to accomplish. Special equipment is available that purports to allow functional access to the computer for users with disabilities. As technology moves from purely keyboard applications to include graphic input, it will be necessary for assistive interface devices to support graphics as well as text entry. Headpointing systems that emulate the mouse in combination with on-screen keyboards are of particular interest to persons with severe physical impairment such as high-level quadriplegia. Two such systems currently on the market are the HeadMaster and the Free Wheel. The authors have conducted a pilot study comparing graphic input speed using the mouse and two headpointing interface systems on the Macintosh computer. The study used a single-subject design with six able-bodied subjects to establish a baseline for comparison with persons with severe disabilities. These preliminary data indicated that the HeadMaster was nearly as effective as the mouse and that it was superior to the Free Wheel for graphics input. This pilot study, however, demonstrated several experimental design problems that need to be addressed to make the study more robust. It also demonstrated the need to include the evaluation of text input so that the effectiveness of the interface devices with text and graphic input could be compared.
Inferring imagined speech using EEG signals: a new approach using Riemannian manifold features
NASA Astrophysics Data System (ADS)
Nguyen, Chuong H.; Karavas, George K.; Artemiadis, Panagiotis
2018-02-01
Objective. In this paper, we investigate the suitability of imagined speech for brain-computer interface (BCI) applications. Approach. A novel method based on covariance matrix descriptors, which lie in a Riemannian manifold, and the relevance vector machines classifier is proposed. The method is applied to electroencephalographic (EEG) signals and tested in multiple subjects. Main results. The method is shown to outperform other approaches in the field with respect to accuracy and robustness. The algorithm is validated on various categories of speech, such as imagined pronunciation of vowels, short words and long words. The classification accuracy of our methodology is in all cases significantly above chance level, reaching a maximum of 70% for cases where we classify three words and 95% for cases of two words. Significance. The results reveal certain aspects that may affect the success of speech imagery classification from EEG signals, such as sound, meaning and word complexity. This can potentially extend the capability of utilizing speech imagery in future BCI applications. The dataset of speech imagery collected from a total of 15 subjects is also published.
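A minimal sketch of a covariance-descriptor pipeline of the kind described above: per-trial spatial covariance matrices, mapped to a Euclidean tangent space via the matrix logarithm, then fed to a classifier. For simplicity the sketch uses the log-Euclidean mapping and a support-vector classifier in place of the paper's Riemannian-mean tangent space and relevance vector machine; both substitutions, and the toy data, are illustrative only.

    import numpy as np
    from scipy.linalg import logm
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def covariance_features(trials, eps=1e-6):
        """trials: (n_trials, n_channels, n_samples) EEG; returns vectorized log-covariances."""
        feats = []
        for x in trials:
            c = np.cov(x) + eps * np.eye(x.shape[0])   # regularized spatial covariance
            l = logm(c).real                           # log-Euclidean tangent-space mapping
            iu = np.triu_indices_from(l)
            feats.append(l[iu])                        # upper triangle as a feature vector
        return np.array(feats)

    # Toy data: two imagined-speech classes with slightly different channel covariance.
    rng = np.random.default_rng(2)
    a = rng.normal(size=(30, 8, 256))
    b = 1.5 * rng.normal(size=(30, 8, 256))
    X = covariance_features(np.concatenate([a, b]))
    y = np.array([0] * 30 + [1] * 30)
    clf = make_pipeline(StandardScaler(), SVC()).fit(X, y)
    print(clf.score(X, y))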
ERIC Educational Resources Information Center
Lam, Christa; Kitamura, Christine
2010-01-01
Purpose: This study examined a mother's speech style and interactive behaviors with her twin sons: 1 with bilateral hearing impairment (HI) and the other with normal hearing (NH). Method: The mother was video-recorded interacting with her twin sons when the boys were 12.5 and 22 months of age. Mean F0, F0 range, duration, and F1/F2 vowel space of…
Exploring expressivity and emotion with artificial voice and speech technologies.
Pauletto, Sandra; Balentine, Bruce; Pidcock, Chris; Jones, Kevin; Bottaci, Leonardo; Aretoulaki, Maria; Wells, Jez; Mundy, Darren P; Balentine, James
2013-10-01
Emotion in audio-voice signals, as synthesized by text-to-speech (TTS) technologies, was investigated to formulate a theory of expression for user interface design. Emotional parameters were specified with markup tags, and the resulting audio was further modulated with post-processing techniques. Software was then developed to link a selected TTS synthesizer with an automatic speech recognition (ASR) engine, producing a chatbot that could speak and listen. Using these two artificial voice subsystems, investigators explored both artistic and psychological implications of artificial speech emotion. Goals of the investigation were interdisciplinary, with interest in musical composition, augmentative and alternative communication (AAC), commercial voice announcement applications, human-computer interaction (HCI), and artificial intelligence (AI). The work-in-progress points towards an emerging interdisciplinary ontology for artificial voices. As one study output, HCI tools are proposed for future collaboration.
de Taillez, Tobias; Grimm, Giso; Kollmeier, Birger; Neher, Tobias
2018-06-01
To investigate the influence of an algorithm designed to enhance or magnify interaural difference cues on speech signals in noisy, spatially complex conditions using both technical and perceptual measurements. To also investigate the combination of interaural magnification (IM), monaural microphone directionality (DIR), and binaural coherence-based noise reduction (BC). Speech-in-noise stimuli were generated using virtual acoustics. A computational model of binaural hearing was used to analyse the spatial effects of IM. Predicted speech quality changes and signal-to-noise-ratio (SNR) improvements were also considered. Additionally, a listening test was carried out to assess speech intelligibility and quality. Listeners aged 65-79 years with and without sensorineural hearing loss (N = 10 each). IM increased the horizontal separation of concurrent directional sound sources without introducing any major artefacts. In situations with diffuse noise, however, the interaural difference cues were distorted. Preprocessing the binaural input signals with DIR reduced distortion. IM influenced neither speech intelligibility nor speech quality. The IM algorithm tested here failed to improve speech perception in noise, probably because of the dispersion and inconsistent magnification of interaural difference cues in complex environments.
Speech Rate Normalization and Phonemic Boundary Perception in Cochlear-Implant Users.
Jaekel, Brittany N; Newman, Rochelle S; Goupell, Matthew J
2017-05-24
Normal-hearing (NH) listeners rate normalize, temporarily remapping phonemic category boundaries to account for a talker's speech rate. It is unknown if adults who use auditory prostheses called cochlear implants (CI) can rate normalize, as CIs transmit degraded speech signals to the auditory nerve. Ineffective adjustment to rate information could explain some of the variability in this population's speech perception outcomes. Phonemes with manipulated voice-onset-time (VOT) durations were embedded in sentences with different speech rates. Twenty-three CI and 29 NH participants performed a phoneme identification task. NH participants heard the same unprocessed stimuli as the CI participants or stimuli degraded by a sine vocoder, simulating aspects of CI processing. CI participants showed larger rate normalization effects (6.6 ms) than the NH participants (3.7 ms) and had shallower (less reliable) category boundary slopes. NH participants showed similarly shallow slopes when presented acoustically degraded vocoded signals, but an equal or smaller rate effect in response to reductions in available spectral and temporal information. CI participants can rate normalize, despite their degraded speech input, and show a larger rate effect compared to NH participants. CI participants may particularly rely on rate normalization to better maintain perceptual constancy of the speech signal.
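As a hedged illustration of the boundary-and-slope analysis this type of study relies on, the sketch below fits a logistic psychometric function to hypothetical phoneme-identification data to estimate a voice-onset-time category boundary and its slope; the VOT values and response proportions are invented, not the study's data.

```python
# Illustrative only: estimate a phonemic category boundary and slope by fitting
# a logistic psychometric function to identification data (made-up numbers).
import numpy as np
from scipy.optimize import curve_fit

def logistic(vot, boundary, slope):
    return 1.0 / (1.0 + np.exp(-slope * (vot - boundary)))

vot_ms = np.array([5, 10, 15, 20, 25, 30, 35, 40], dtype=float)
p_voiceless = np.array([0.02, 0.05, 0.15, 0.40, 0.65, 0.85, 0.95, 0.98])

(boundary, slope), _ = curve_fit(logistic, vot_ms, p_voiceless, p0=[20.0, 0.5])
print(f"category boundary = {boundary:.1f} ms VOT, slope = {slope:.2f}")
# A rate-normalization effect would appear as a shift of `boundary` between
# sentence contexts with fast versus slow speech rates.
```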
Address entry while driving: speech recognition versus a touch-screen keyboard.
Tsimhoni, Omer; Smith, Daniel; Green, Paul
2004-01-01
A driving simulator experiment was conducted to determine the effects of entering addresses into a navigation system during driving. Participants drove on roads of varying visual demand while entering addresses. Three address entry methods were explored: word-based speech recognition, character-based speech recognition, and typing on a touch-screen keyboard. For each method, vehicle control and task measures, glance timing, and subjective ratings were examined. During driving, word-based speech recognition yielded the shortest total task time (15.3 s), followed by character-based speech recognition (41.0 s) and touch-screen keyboard (86.0 s). The standard deviation of lateral position when performing keyboard entry (0.21 m) was 60% higher than that for all other address entry methods (0.13 m). Degradation of vehicle control associated with address entry using a touch screen suggests that the use of speech recognition is favorable. Speech recognition systems with visual feedback, however, even with excellent accuracy, are not without performance consequences. Applications of this research include the design of in-vehicle navigation systems as well as other systems requiring significant driver input, such as E-mail, the Internet, and text messaging.
Bidelman, Gavin M; Dexter, Lauren
2015-04-01
We examined a consistent deficit observed in bilinguals: poorer speech-in-noise (SIN) comprehension for their nonnative language. We recorded neuroelectric mismatch potentials in mono- and bi-lingual listeners in response to contrastive speech sounds in noise. Behaviorally, late bilinguals required ∼10dB more favorable signal-to-noise ratios to match monolinguals' SIN abilities. Source analysis of cortical activity demonstrated monotonic increase in response latency with noise in superior temporal gyrus (STG) for both groups, suggesting parallel degradation of speech representations in auditory cortex. Contrastively, we found differential speech encoding between groups within inferior frontal gyrus (IFG)-adjacent to Broca's area-where noise delays observed in nonnative listeners were offset in monolinguals. Notably, brain-behavior correspondences double dissociated between language groups: STG activation predicted bilinguals' SIN, whereas IFG activation predicted monolinguals' performance. We infer higher-order brain areas act compensatorily to enhance impoverished sensory representations but only when degraded speech recruits linguistic brain mechanisms downstream from initial auditory-sensory inputs. Copyright © 2015 Elsevier Inc. All rights reserved.
Is talking to an automated teller machine natural and fun?
Chan, F Y; Khalid, H M
Usability and affective issues of using automatic speech recognition technology to interact with an automated teller machine (ATM) are investigated in two experiments. The first uncovered dialogue patterns of ATM users for the purpose of designing the user interface for a simulated speech ATM system. Applying the Wizard-of-Oz methodology, multiple mapping and word spotting techniques, the speech-driven ATM accommodates bilingual users of Bahasa Melayu and English. The second experiment evaluates the usability of a hybrid speech ATM, comparing it with a simulated manual ATM. The aim is to investigate how natural and fun talking to a speech ATM can be for these first-time users. Subjects performed the withdrawal and balance enquiry tasks. An ANOVA was performed on the usability and affective data. The results showed significant differences between systems in the ability to complete the tasks as well as in transaction errors. Performance was measured on the time taken by subjects to complete the task and the number of speech recognition errors that occurred. On the basis of user emotions, it can be said that the hybrid speech system enabled pleasurable interaction. Despite the limitations of speech recognition technology, users are set to talk to the ATM when it becomes available for public use.
NASA Astrophysics Data System (ADS)
Jitsuhiro, Takatoshi; Toriyama, Tomoji; Kogure, Kiyoshi
We propose a noise suppression method based on multi-model compositions and multi-pass search. In real environments, input speech for speech recognition includes many kinds of noise signals. To obtain good recognition candidates, it is important to suppress many kinds of noise signals at once and to find the target speech. Before noise suppression, to find speech and noise label sequences, we introduce multi-pass search with acoustic models that include many kinds of noise models and their compositions, their n-gram models, and their lexicon. Noise suppression is then performed frame-synchronously using the multiple models selected by the recognized label sequences with time alignments. We evaluated this method using the E-Nightingale task, which contains voice memoranda spoken by nurses during actual work at hospitals. The proposed method obtained higher performance than the conventional method.
Early speech perception in Mandarin-speaking children at one-year post cochlear implantation.
Chen, Yuan; Wong, Lena L N; Zhu, Shufeng; Xi, Xin
2016-01-01
The aim in this study was to examine early speech perception outcomes in Mandarin-speaking children during the first year of cochlear implant (CI) use. A hierarchical early speech perception battery was administered to 80 children before and 3, 6, and 12 months after implantation. Demographic information was obtained to evaluate its relationship with these outcomes. Regardless of dialect exposure and whether a hearing aid was trialed before implantation, implant recipients were able to attain similar pre-lingual auditory skills after 12 months of CI use. Children speaking Mandarin developed early Mandarin speech perception faster than those with greater exposure to other Chinese dialects. In addition, children with better pre-implant hearing levels and younger age at implantation attained significantly better speech perception scores after 12 months of CI use. Better pre-implant hearing levels and higher maternal education level were also associated with a significantly steeper growth in early speech perception ability. Mandarin-speaking children with CIs are able to attain early speech perception results comparable to those of their English-speaking counterparts. In addition, consistent single language input via CI probably enhances early speech perception development at least during the first-year of CI use. Copyright © 2015 Elsevier Ltd. All rights reserved.
Brochier, Tim; McDermott, Hugh J; McKay, Colette M
2017-06-01
In order to improve speech understanding for cochlear implant users, it is important to maximize the transmission of temporal information. The combined effects of stimulation rate and presentation level on temporal information transfer and speech understanding remain unclear. The present study systematically varied presentation level (60, 50, and 40 dBA) and stimulation rate [500 and 2400 pulses per second per electrode (pps)] in order to observe how the effect of rate on speech understanding changes for different presentation levels. Speech recognition in quiet and noise, and acoustic amplitude modulation detection thresholds (AMDTs) were measured with acoustic stimuli presented to speech processors via direct audio input (DAI). With the 500 pps processor, results showed significantly better performance for consonant-vowel nucleus-consonant words in quiet, and a reduced effect of noise on sentence recognition. However, no rate or level effect was found for AMDTs, perhaps partly because of amplitude compression in the sound processor. AMDTs were found to be strongly correlated with the effect of noise on sentence perception at low levels. These results indicate that AMDTs, at least when measured with the CP910 Freedom speech processor via DAI, explain between-subject variance of speech understanding, but do not explain within-subject variance for different rates and levels.
NASA Astrophysics Data System (ADS)
Gaik Tay, Kim; Cheong, Tau Han; Foong Lee, Ming; Kek, Sie Long; Abdul-Kahar, Rosmila
2017-08-01
In previous work on Euler's spreadsheet calculator for solving an ordinary differential equation, Visual Basic for Applications (VBA) programming was used; however, no graphical user interface was developed to capture user input. This weakness may confuse users about the input and output, since both are displayed in the same worksheet. In addition, the existing Euler's spreadsheet calculator is not interactive, as no prompt message appears if a mistake is made when entering the parameters, and there are no instructions to guide users in entering the derivative function. Hence, in this paper, we address these limitations by developing a user-friendly and interactive graphical user interface. The improved interface captures user input with instructions and interactive error prompts implemented in VBA. The resulting Euler's graphical-user-interface spreadsheet calculator does not act as a black box, since users can click on any cell in the worksheet to see the formula used to implement the numerical scheme. In this way, it can enhance self-directed and lifelong learning in implementing the numerical scheme in a spreadsheet and, later, in any programming language.
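For readers unfamiliar with the underlying numerical scheme, a minimal Python rendition of the forward Euler method is sketched below; it mirrors the column-by-column tabulation a spreadsheet implementation would produce. The example ODE, step size, and number of steps are arbitrary choices, not values from the paper, and the VBA/worksheet code itself is not reproduced.

```python
# Forward Euler sketch for y' = f(t, y); the ODE and step size are arbitrary examples.
def euler(f, t0, y0, h, n_steps):
    """Advance y' = f(t, y) from (t0, y0) with fixed step h."""
    t, y = t0, y0
    rows = [(t, y)]
    for _ in range(n_steps):
        y = y + h * f(t, y)   # Euler update, as tabulated row by row in the sheet
        t = t + h
        rows.append((t, y))
    return rows

for t, y in euler(lambda t, y: -2.0 * y, 0.0, 1.0, 0.1, 10):
    print(f"t = {t:.1f}, y = {y:.4f}")
```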
Newell, Matthew R [Los Alamos, NM; Jones, David Carl [Los Alamos, NM
2009-09-01
A portable multiplicity counter has signal input circuitry, processing circuitry and a user/computer interface disposed in a housing. The processing circuitry, which can comprise a microcontroller integrated circuit operably coupled to shift register circuitry implemented in a field programmable gate array, is configured to be operable via the user/computer interface to count input signal pulses receivable at said signal input circuitry and record time correlations thereof in a total counting mode, coincidence counting mode and/or a multiplicity counting mode. The user/computer interface can be for example an LCD display/keypad and/or a USB interface. The counter can include a battery pack for powering the counter and low/high voltage power supplies for biasing external detectors so that the counter can be configured as a hand-held device for counting neutron events.
Won, Jong Ho; Shim, Hyun Joon; Lorenzi, Christian; Rubinstein, Jay T
2014-06-01
Won et al. (J Acoust Soc Am 132:1113-1119, 2012) reported that cochlear implant (CI) speech processors generate amplitude-modulation (AM) cues recovered from broadband speech frequency modulation (FM) and that CI users can use these cues for speech identification in quiet. The present study was designed to extend this finding for a wide range of listening conditions, where the original speech cues were severely degraded by manipulating either the acoustic signals or the speech processor. The manipulation of the acoustic signals included the presentation of background noise, simulation of reverberation, and amplitude compression. The manipulation of the speech processor included changing the input dynamic range and the number of channels. For each of these conditions, multiple levels of speech degradation were tested. Speech identification was measured for CI users and compared for stimuli having both AM and FM information (intact condition) or FM information only (FM condition). Each manipulation degraded speech identification performance for both intact and FM conditions. Performance for the intact and FM conditions became similar for stimuli having the most severe degradations. Identification performance generally overlapped for the intact and FM conditions. Moreover, identification performance for the FM condition was better than chance performance even at the maximum level of distortion. Finally, significant correlations were found between speech identification scores for the intact and FM conditions. Altogether, these results suggest that despite poor frequency selectivity, CI users can make efficient use of AM cues recovered from speech FM in difficult listening situations.
A software tool for analyzing multichannel cochlear implant signals.
Lai, Wai Kong; Bögli, Hans; Dillier, Norbert
2003-10-01
A useful and convenient means to analyze the radio frequency (RF) signals being sent by a speech processor to a cochlear implant would be to actually capture and display them with appropriate software. This is particularly useful for development or diagnostic purposes. sCILab (Swiss Cochlear Implant Laboratory) is such a PC-based software tool intended for the Nucleus family of Multichannel Cochlear Implants. Its graphical user interface provides a convenient and intuitive means for visualizing and analyzing the signals encoding speech information. Both numerical and graphic displays are available for detailed examination of the captured CI signals, as well as an acoustic simulation of these CI signals. sCILab has been used in the design and verification of new speech coding strategies, and has also been applied as an analytical tool in studies of how different parameter settings of existing speech coding strategies affect speech perception. As a diagnostic tool, it is also useful for troubleshooting problems with the external equipment of the cochlear implant systems.
Universal sensor interface module (USIM)
NASA Astrophysics Data System (ADS)
King, Don; Torres, A.; Wynn, John
1999-01-01
A universal sensor interface module (USIM) is being developed by the Raytheon-TI Systems Company for use with fields of unattended distributed sensors. In its production configuration, the USIM will be a multichip module consisting of a set of common modules. The common module USIM set consists of (1) a sensor adapter interface (SAI) module, (2) a digital signal processor (DSP) and associated memory module, and (3) an RF transceiver module. The multispectral sensor interface is designed around a low-power A/D converter whose input/output interface consists of eight buffered, sampled inputs from various devices including environmental, acoustic, seismic and magnetic sensors. The eight sensor inputs are each high-impedance, low-capacitance, differential amplifiers. The inputs are ideally suited for interface with discrete or MEMS sensors, since the differential input will allow direct connection with high-impedance bridge sensors and capacitance voltage sources. Each amplifier is connected to a 22-bit delta-sigma A/D converter to enable simultaneous samples. The low-power delta-sigma converter provides 22-bit resolution at sample frequencies up to 142 hertz (used for magnetic sensors) and 16-bit resolution at frequencies up to 1168 hertz (used for acoustic and seismic sensors). The video interface module is based around the TMS320C5410 DSP. It can provide sensor array addressing, video data input, data calibration and correction. The processor module is based upon an MPC555. It will be used for mode control, synchronization of complex sensors, sensor signal processing, array processing, target classification and tracking. Many functions of the A/D, DSP and transceiver can be powered down by using variable clock speeds under software command or chip power switches. They can be returned to intermediate or full operation by DSP command. Power management may be based on the USIM's internal timer, command from the USIM transceiver, or by sleep mode processing management. The low-power detection mode is implemented by monitoring any of the sensor analog outputs at lower sample rates for detection over a software-controllable threshold.
Development of a User Interface for a Regression Analysis Software Tool
NASA Technical Reports Server (NTRS)
Ulbrich, Norbert Manfred; Volden, Thomas R.
2010-01-01
An easy-to-use user interface was implemented in a highly automated regression analysis tool. The user interface was developed from the start to run on computers that use the Windows, Macintosh, Linux, or UNIX operating system. Many user interface features were specifically designed such that a novice or inexperienced user can apply the regression analysis tool with confidence. Therefore, the user interface's design minimizes interactive input from the user. In addition, reasonable default combinations are assigned to those analysis settings that influence the outcome of the regression analysis. These default combinations will lead to a successful regression analysis result for most experimental data sets. The user interface comes in two versions. The text user interface version is used for the ongoing development of the regression analysis tool. The official release of the regression analysis tool, on the other hand, has a graphical user interface that is more efficient to use. This graphical user interface displays all input file names, output file names, and analysis settings for a specific software application mode on a single screen, which makes it easier to generate reliable analysis results and to perform input parameter studies. An object-oriented approach was used for the development of the graphical user interface. This choice keeps future software maintenance costs to a reasonable limit. Examples of both the text user interface and graphical user interface are discussed in order to illustrate the user interface's overall design approach.
Effect of increased IIDR in the nucleus freedom cochlear implant system.
Holden, Laura K; Skinner, Margaret W; Fourakis, Marios S; Holden, Timothy A
2007-10-01
The objective of this study was to evaluate the effect of the increased instantaneous input dynamic range (IIDR) in the Nucleus Freedom cochlear implant (CI) system on recipients' ability to perceive soft speech and speech in noise. Ten adult Freedom CI recipients participated. Two maps differing in IIDR were placed on each subject's processor at initial activation. The IIDR was set to 30 dB for one map and 40 dB for the other. Subjects used both maps for at least one month prior to speech perception testing. Results revealed significantly higher scores for words (50 dB SPL), for sentences in background babble (65 dB SPL), and significantly lower sound field threshold levels with the 40 compared to the 30 dB IIDR map. Ceiling effects may have contributed to non-significant findings for sentences in quiet (50 dB SPL). The Freedom's increased IIDR allows better perception of soft speech and speech in noise.
Voice input/output capabilities at Perception Technology Corporation
NASA Technical Reports Server (NTRS)
Ferber, Leon A.
1977-01-01
Condensed resumes of key company personnel at the Perception Technology Corporation are presented. The staff possesses capabilities in speech recognition, speech synthesis, speaker authentication, and language identification. Hardware and software engineers' capabilities are included.
Data-Driven Subclassification of Speech Sound Disorders in Preschool Children
Vick, Jennell C.; Campbell, Thomas F.; Shriberg, Lawrence D.; Green, Jordan R.; Truemper, Klaus; Rusiewicz, Heather Leavy; Moore, Christopher A.
2015-01-01
Purpose The purpose of the study was to determine whether distinct subgroups of preschool children with speech sound disorders (SSD) could be identified using a subgroup discovery algorithm (SUBgroup discovery via Alternate Random Processes, or SUBARP). Of specific interest was finding evidence of a subgroup of SSD exhibiting performance consistent with atypical speech motor control. Method Ninety-seven preschool children with SSD completed speech and nonspeech tasks. Fifty-three kinematic, acoustic, and behavioral measures from these tasks were input to SUBARP. Results Two distinct subgroups were identified from the larger sample. The 1st subgroup (76%; population prevalence estimate = 67.8%–84.8%) did not have characteristics that would suggest atypical speech motor control. The 2nd subgroup (10.3%; population prevalence estimate = 4.3%– 16.5%) exhibited significantly higher variability in measures of articulatory kinematics and poor ability to imitate iambic lexical stress, suggesting atypical speech motor control. Both subgroups were consistent with classes of SSD in the Speech Disorders Classification System (SDCS; Shriberg et al., 2010a). Conclusion Characteristics of children in the larger subgroup were consistent with the proportionally large SDCS class termed speech delay; characteristics of children in the smaller subgroup were consistent with the SDCS subtype termed motor speech disorder—not otherwise specified. The authors identified candidate measures to identify children in each of these groups. PMID:25076005
Multichannel spatial auditory display for speech communications
NASA Technical Reports Server (NTRS)
Begault, D. R.; Erbe, T.; Wenzel, E. M. (Principal Investigator)
1994-01-01
A spatial auditory display for multiple speech communications was developed at NASA/Ames Research Center. Input is spatialized by the use of simplified head-related transfer functions, adapted for FIR filtering on Motorola 56001 digital signal processors. Hardware and firmware design implementations are overviewed for the initial prototype developed for NASA-Kennedy Space Center. An adaptive staircase method was used to determine intelligibility levels of four-letter call signs used by launch personnel at NASA against diotic speech babble. Spatial positions at 30 degrees azimuth increments were evaluated. The results from eight subjects showed a maximum intelligibility improvement of about 6-7 dB when the signal was spatialized to 60 or 90 degrees azimuth positions.
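The sketch below gives a rough, hypothetical picture of the core signal path described here: a mono signal is rendered to two ears by FIR filtering with a left/right head-related impulse response pair. The two-tap "HRIRs" and the noise stand-in for speech are placeholders, not the simplified HRTFs or DSP firmware used in the prototype.

```python
# Rough sketch of binaural spatialization by FIR filtering with an HRIR pair;
# the impulse responses below are crude placeholders, not measured HRTFs.
import numpy as np
from scipy.signal import lfilter

fs = 16000
speech = np.random.default_rng(1).standard_normal(fs)  # 1 s of stand-in audio

hrir_right = np.zeros(64)
hrir_right[0] = 1.0      # near (right) ear: direct, unattenuated
hrir_left = np.zeros(64)
hrir_left[10] = 0.5      # far (left) ear: delayed and attenuated

binaural = np.stack([lfilter(hrir_left, [1.0], speech),
                     lfilter(hrir_right, [1.0], speech)], axis=1)  # (samples, 2)
```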
Multi-channel spatial auditory display for speech communications
NASA Astrophysics Data System (ADS)
Begault, Durand; Erbe, Tom
1993-10-01
A spatial auditory display for multiple speech communications was developed at NASA-Ames Research Center. Input is spatialized by use of simplified head-related transfer functions, adapted for FIR filtering on Motorola 56001 digital signal processors. Hardware and firmware design implementations are overviewed for the initial prototype developed for NASA-Kennedy Space Center. An adaptive staircase method was used to determine intelligibility levels of four letter call signs used by launch personnel at NASA, against diotic speech babble. Spatial positions at 30 deg azimuth increments were evaluated. The results from eight subjects showed a maximal intelligibility improvement of about 6 to 7 dB when the signal was spatialized to 60 deg or 90 deg azimuth positions.
Multichannel spatial auditory display for speech communications.
Begault, D R; Erbe, T
1994-10-01
A spatial auditory display for multiple speech communications was developed at NASA/Ames Research Center. Input is spatialized by the use of simplified head-related transfer functions, adapted for FIR filtering on Motorola 56001 digital signal processors. Hardware and firmware design implementations are overviewed for the initial prototype developed for NASA-Kennedy Space Center. An adaptive staircase method was used to determine intelligibility levels of four-letter call signs used by launch personnel at NASA against diotic speech babble. Spatial positions at 30 degrees azimuth increments were evaluated. The results from eight subjects showed a maximum intelligibility improvement of about 6-7 dB when the signal was spatialized to 60 or 90 degrees azimuth positions.
Multi-channel spatial auditory display for speech communications
NASA Technical Reports Server (NTRS)
Begault, Durand; Erbe, Tom
1993-01-01
A spatial auditory display for multiple speech communications was developed at NASA-Ames Research Center. Input is spatialized by use of simplified head-related transfer functions, adapted for FIR filtering on Motorola 56001 digital signal processors. Hardware and firmware design implementations are overviewed for the initial prototype developed for NASA-Kennedy Space Center. An adaptive staircase method was used to determine intelligibility levels of four letter call signs used by launch personnel at NASA, against diotic speech babble. Spatial positions at 30 deg azimuth increments were evaluated. The results from eight subjects showed a maximal intelligibility improvement of about 6 to 7 dB when the signal was spatialized to 60 deg or 90 deg azimuth positions.
Control Board Digital Interface Input Devices – Touchscreen, Trackpad, or Mouse?
DOE Office of Scientific and Technical Information (OSTI.GOV)
Thomas A. Ulrich; Ronald L. Boring; Roger Lew
The authors collaborated with a power utility to evaluate input devices for use in the human system interface (HSI) for a new digital Turbine Control System (TCS) at a nuclear power plant (NPP) undergoing a TCS upgrade. A standalone dynamic software simulation of the new digital TCS and a mobile kiosk were developed to conduct an input device study to evaluate operator preference and input device effectiveness. The TCS software presented the anticipated HSI for the TCS and mimicked (i.e., simulated) the turbine systems' responses to operator commands. Twenty-four licensed operators from the two nuclear power units participated in the study. Three input devices were tested: a trackpad, mouse, and touchscreen. The subjective feedback from the survey indicates the operators preferred the touchscreen interface. The operators subjectively rated the touchscreen as the fastest and most comfortable input device given the range of tasks they performed during the study, but also noted a lack of accuracy for selecting small targets. The empirical data suggest the mouse input device provides the most consistent performance for screen navigation and manipulating on-screen controls. The trackpad input device was both empirically and subjectively found to be the least effective and least desired input device.
Pattern learning with deep neural networks in EMG-based speech recognition.
Wand, Michael; Schultz, Tanja
2014-01-01
We report on classification of phones and phonetic features from facial electromyographic (EMG) data, within the context of our EMG-based Silent Speech interface. In this paper we show that a Deep Neural Network can be used to perform this classification task, yielding a significant improvement over conventional Gaussian Mixture models. Our central contribution is the visualization of patterns which are learned by the neural network. With increasing network depth, these patterns represent more and more intricate electromyographic activity.
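For orientation, a minimal sketch of the general frame-wise classification setup (EMG feature vectors in, phone labels out, deep feed-forward network as classifier) follows. It does not reproduce the authors' network or features; scikit-learn's MLPClassifier, the feature dimensionality, the phone set, and the random data are all stand-ins.

```python
# Hedged sketch of frame-wise phone classification from EMG features with a
# feed-forward network; architecture, features, and data are invented.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_frames, n_emg_features, n_phones = 2000, 32, 10
X = rng.standard_normal((n_frames, n_emg_features))  # stacked EMG frame features
y = rng.integers(0, n_phones, n_frames)               # frame-level phone labels

dnn = MLPClassifier(hidden_layer_sizes=(128, 128, 128), max_iter=300)
dnn.fit(X, y)
print("training accuracy:", dnn.score(X, y))
```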
Neural Oscillations Carry Speech Rhythm through to Comprehension
Peelle, Jonathan E.; Davis, Matthew H.
2012-01-01
A key feature of speech is the quasi-regular rhythmic information contained in its slow amplitude modulations. In this article we review the information conveyed by speech rhythm, and the role of ongoing brain oscillations in listeners’ processing of this content. Our starting point is the fact that speech is inherently temporal, and that rhythmic information conveyed by the amplitude envelope contains important markers for place and manner of articulation, segmental information, and speech rate. Behavioral studies demonstrate that amplitude envelope information is relied upon by listeners and plays a key role in speech intelligibility. Extending behavioral findings, data from neuroimaging – particularly electroencephalography (EEG) and magnetoencephalography (MEG) – point to phase locking by ongoing cortical oscillations to low-frequency information (~4–8 Hz) in the speech envelope. This phase modulation effectively encodes a prediction of when important events (such as stressed syllables) are likely to occur, and acts to increase sensitivity to these relevant acoustic cues. We suggest a framework through which such neural entrainment to speech rhythm can explain effects of speech rate on word and segment perception (i.e., that the perception of phonemes and words in connected speech is influenced by preceding speech rate). Neuroanatomically, acoustic amplitude modulations are processed largely bilaterally in auditory cortex, with intelligible speech resulting in differential recruitment of left-hemisphere regions. Notable among these is lateral anterior temporal cortex, which we propose functions in a domain-general fashion to support ongoing memory and integration of meaningful input. Together, the reviewed evidence suggests that low-frequency oscillations in the acoustic speech signal form the foundation of a rhythmic hierarchy supporting spoken language, mirrored by phase-locked oscillations in the human brain. PMID:22973251
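A brief sketch of how the slow amplitude envelope discussed here can be extracted and restricted to the roughly 4-8 Hz range is given below. It assumes a generic Hilbert-envelope plus band-pass approach rather than any specific pipeline from the reviewed studies, and the white-noise input is a placeholder for real speech.

```python
# Illustrative envelope extraction: broadband Hilbert envelope, then a 4-8 Hz
# band-pass to isolate the theta-range modulations discussed in the review.
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

fs = 16000
speech = np.random.default_rng(2).standard_normal(5 * fs)  # stand-in for 5 s of speech

envelope = np.abs(hilbert(speech))                   # broadband amplitude envelope
b, a = butter(2, [4, 8], btype="band", fs=fs)        # theta-range band-pass
theta_envelope = filtfilt(b, a, envelope)
```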
2015-05-28
Fragmentary record of a report on a neural network for real-time speech-emotion recognition; the recoverable text notes that speech-based emotion recognition is simpler and requires fewer computational resources than other inputs such as facial expressions, and cites the Berlin database of emotional speech.
Başkent, Deniz; Eiler, Cheryl L; Edwards, Brent
2007-06-01
To present a comprehensive analysis of the feasibility of genetic algorithms (GA) for finding the best fit of hearing aids or cochlear implants for individual users in clinical or research settings, where the algorithm is solely driven by subjective human input. Due to varying pathology, the best settings of an auditory device differ for each user. It is also likely that listening preferences vary at the same time. The settings of a device customized for a particular user can only be evaluated by the user. When optimization algorithms are used for fitting purposes, this situation poses a difficulty for a systematic and quantitative evaluation of the suitability of the fitting parameters produced by the algorithm. In the present study, an artificial listening environment was generated by distorting speech using a noiseband vocoder. The settings produced by the GA for this listening problem could objectively be evaluated by measuring speech recognition and comparing the performance to the best vocoder condition where speech was least distorted. Nine normal-hearing subjects participated in the study. The parameters to be optimized were the number of vocoder channels, the shift between the input frequency range and the synthesis frequency range, and the compression-expansion of the input frequency range over the synthesis frequency range. The subjects listened to pairs of sentences processed with the vocoder, and entered a preference for the sentence with better intelligibility. The GA modified the solutions iteratively according to the subject preferences. The program converged when the user ranked the same set of parameters as the best in three consecutive steps. The results produced by the GA were analyzed for quality by measuring speech intelligibility, for test-retest reliability by running the GA three times with each subject, and for convergence properties. Speech recognition scores averaged across subjects were similar for the best vocoder solution and for the solutions produced by the GA. The average number of iterations was 8 and the average convergence time was 25.5 minutes. The settings produced by different GA runs for the same subject were slightly different; however, speech recognition scores measured with these settings were similar. Individual data from subjects showed that in each run, a small number of GA solutions produced poorer speech intelligibility than for the best setting. This was probably a result of the combination of the inherent randomness of the GA, the convergence criterion used in the present study, and possible errors that the users might have made during the paired comparisons. On the other hand, the effect of these errors was probably small compared to the other two factors, as a comparison between subjective preferences and objective measures showed that for many subjects the two were in good agreement. The results showed that the GA was able to produce good solutions by using listener preferences in a relatively short time. For practical applications, the program can be made more robust by running the GA twice or by not using an automatic stopping criterion, and it can be made faster by optimizing the number of the paired comparisons completed in each iteration.
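The toy sketch below conveys the overall idea of a genetic algorithm steered solely by paired-comparison listener preferences, as described in the abstract. The parameter names and ranges, the tournament scheme, and the simulated "listener" are all invented for illustration and do not reproduce the study's vocoder parameters or GA settings.

```python
# Toy GA driven only by paired comparisons; everything here is a placeholder.
import random

PARAM_RANGES = {"channels": (2, 16), "shift": (0, 6), "compression": (0, 4)}

def random_setting():
    return {k: random.randint(*bounds) for k, bounds in PARAM_RANGES.items()}

def mutate(setting):
    child = dict(setting)
    k = random.choice(list(PARAM_RANGES))
    child[k] = random.randint(*PARAM_RANGES[k])
    return child

def listener_prefers(a, b):
    # Placeholder for a human paired comparison; here "more channels sounds better".
    return a["channels"] >= b["channels"]

population = [random_setting() for _ in range(6)]
for generation in range(8):
    # Pair up settings, keep the preferred one of each pair, refill by mutation.
    random.shuffle(population)
    winners = [a if listener_prefers(a, b) else b
               for a, b in zip(population[::2], population[1::2])]
    population = winners + [mutate(random.choice(winners)) for _ in winners]

print("best surviving setting:", max(population, key=lambda s: s["channels"]))
```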
ERIC Educational Resources Information Center
Tinker, Robert
1984-01-01
The game paddle inputs of Apple microcomputers provide a simple way to get laboratory measurements into the computer. Discusses these game paddles and the necessary interface software. Includes schematics for Apple built-in paddle electronics, TRS-80 game paddle I/O, Commodore circuit for user port, and bus interface for Sinclair/Timex, Commodore,…
Interface Metaphors for Interactive Machine Learning
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jasper, Robert J.; Blaha, Leslie M.
To promote more interactive and dynamic machine learning, we revisit the notion of user-interface metaphors. User-interface metaphors provide intuitive constructs for supporting user needs through interface design elements. A user-interface metaphor provides a visual or action pattern that leverages a user's knowledge of another domain. Metaphors suggest both the visual representations that should be used in a display as well as the interactions that should be afforded to the user. We argue that user-interface metaphors can also offer a method of extracting interaction-based user feedback for use in machine learning. Metaphors offer indirect, context-based information that can be used in addition to explicit user inputs, such as user-provided labels. Implicit information from user interactions with metaphors can augment explicit user input for active learning paradigms. Or it might be leveraged in systems where explicit user inputs are more challenging to obtain. Each interaction with the metaphor provides an opportunity to gather data and learn. We argue this approach is especially important in streaming applications, where we desire machine learning systems that can adapt to dynamic, changing data.
Keidser, Gitte; Best, Virginia; Freeston, Katrina; Boyce, Alexandra
2015-01-01
It is well-established that communication involves the working memory system, which becomes increasingly engaged in understanding speech as the input signal degrades. The more resources allocated to recovering a degraded input signal, the fewer resources, referred to as cognitive spare capacity (CSC), remain for higher-level processing of speech. Using simulated natural listening environments, the aims of this paper were to (1) evaluate an English version of a recently introduced auditory test to measure CSC that targets the updating process of the executive function, (2) investigate if the test predicts speech comprehension better than the reading span test (RST) commonly used to measure working memory capacity, and (3) determine if the test is sensitive to increasing the number of attended locations during listening. In Experiment I, the CSC test was presented using a male and a female talker, in quiet and in spatially separated babble- and cafeteria-noises, in an audio-only and in an audio-visual mode. Data collected on 21 listeners with normal and impaired hearing confirmed that the English version of the CSC test is sensitive to population group, noise condition, and clarity of speech, but not presentation modality. In Experiment II, performance by 27 normal-hearing listeners on a novel speech comprehension test presented in noise was significantly associated with working memory capacity, but not with CSC. Moreover, this group showed no significant difference in CSC as the number of talker locations in the test increased. There was no consistent association between the CSC test and the RST. It is recommended that future studies investigate the psychometric properties of the CSC test, and examine its sensitivity to the complexity of the listening environment in participants with both normal and impaired hearing. PMID:25999904
Rapid recalibration of speech perception after experiencing the McGurk illusion.
Lüttke, Claudia S; Pérez-Bellido, Alexis; de Lange, Floris P
2018-03-01
The human brain can quickly adapt to changes in the environment. One example is phonetic recalibration: a speech sound is interpreted differently depending on the visual speech and this interpretation persists in the absence of visual information. Here, we examined the mechanisms of phonetic recalibration. Participants categorized the auditory syllables /aba/ and /ada/, which were sometimes preceded by the so-called McGurk stimuli (in which an /aba/ sound, due to visual /aga/ input, is often perceived as 'ada'). We found that only one trial of exposure to the McGurk illusion was sufficient to induce a recalibration effect, i.e. an auditory /aba/ stimulus was subsequently more often perceived as 'ada'. Furthermore, phonetic recalibration took place only when auditory and visual inputs were integrated to 'ada' (McGurk illusion). Moreover, this recalibration depended on the sensory similarity between the preceding and current auditory stimulus. Finally, signal detection theoretical analysis showed that McGurk-induced phonetic recalibration resulted in both a criterion shift towards /ada/ and a reduced sensitivity to distinguish between /aba/ and /ada/ sounds. The current study shows that phonetic recalibration is dependent on the perceptual integration of audiovisual information and leads to a perceptual shift in phoneme categorization.
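As a pointer to the signal detection theoretic quantities mentioned at the end of the abstract, the following minimal sketch computes sensitivity (d') and the criterion (c) from a hit rate and a false-alarm rate; the numbers are made up, not results from the study.

```python
# Standard signal-detection formulas; the hit and false-alarm rates are invented.
from scipy.stats import norm

hit_rate = 0.80          # proportion of /ada/ trials labeled "ada"
false_alarm_rate = 0.30  # proportion of /aba/ trials labeled "ada"

d_prime = norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)
criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(false_alarm_rate))
print(f"d' = {d_prime:.2f}, criterion c = {criterion:.2f}")
# Recalibration after McGurk exposure would show up as a criterion shift toward
# "ada" together with a reduced d'.
```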
What happens to the motor theory of perception when the motor system is damaged?
Stasenko, Alena; Garcea, Frank E; Mahon, Bradford Z
2013-09-01
Motor theories of perception posit that motor information is necessary for successful recognition of actions. Perhaps the most well known of this class of proposals is the motor theory of speech perception, which argues that speech recognition is fundamentally a process of identifying the articulatory gestures (i.e. motor representations) that were used to produce the speech signal. Here we review neuropsychological evidence from patients with damage to the motor system, in the context of motor theories of perception applied to both manual actions and speech. Motor theories of perception predict that patients with motor impairments will have impairments for action recognition. Contrary to that prediction, the available neuropsychological evidence indicates that recognition can be spared despite profound impairments to production. These data falsify strong forms of the motor theory of perception, and frame new questions about the dynamical interactions that govern how information is exchanged between input and output systems.
What happens to the motor theory of perception when the motor system is damaged?
Stasenko, Alena; Garcea, Frank E.; Mahon, Bradford Z.
2016-01-01
Motor theories of perception posit that motor information is necessary for successful recognition of actions. Perhaps the most well known of this class of proposals is the motor theory of speech perception, which argues that speech recognition is fundamentally a process of identifying the articulatory gestures (i.e. motor representations) that were used to produce the speech signal. Here we review neuropsychological evidence from patients with damage to the motor system, in the context of motor theories of perception applied to both manual actions and speech. Motor theories of perception predict that patients with motor impairments will have impairments for action recognition. Contrary to that prediction, the available neuropsychological evidence indicates that recognition can be spared despite profound impairments to production. These data falsify strong forms of the motor theory of perception, and frame new questions about the dynamical interactions that govern how information is exchanged between input and output systems. PMID:26823687
To speak or not to speak - A multiple resource perspective
NASA Technical Reports Server (NTRS)
Tsang, P. S.; Hartzell, E. J.; Rothschild, R. A.
1985-01-01
The desirability of employing speech response in a dynamic dual task situation was discussed from a multiple resource perspective. A secondary task technique was employed to examine the time-sharing performance of five dual tasks with various degrees of resource overlap according to the structure-specific resource model of Wickens (1980). The primary task was a visual/manual tracking task which required spatial processing. The secondary task was either another tracking task or a spatial transformation task with one of four input (visual or auditory) and output (manual or speech) configurations. The results show that the dual task performance was best when the primary tracking task was paired with the visual/speech transformation task. This finding was explained by an interaction of the stimulus-central processing-response compatibility of the transformation task and the degree of resource competition between the time-shared tasks. Implications on the utility of speech response were discussed.
Atypical coordination of cortical oscillations in response to speech in autism
Jochaut, Delphine; Lehongre, Katia; Saitovitch, Ana; Devauchelle, Anne-Dominique; Olasagasti, Itsaso; Chabane, Nadia; Zilbovicius, Monica; Giraud, Anne-Lise
2015-01-01
Subjects with autism often show language difficulties, but it is unclear how they relate to neurophysiological anomalies of cortical speech processing. We used combined EEG and fMRI in 13 subjects with autism and 13 control participants and show that in autism, gamma and theta cortical activity do not engage synergistically in response to speech. Theta activity in left auditory cortex fails to track speech modulations, and to down-regulate gamma oscillations in the group with autism. This deficit predicts the severity of both verbal impairment and autism symptoms in the affected sample. Finally, we found that oscillation-based connectivity between auditory and other language cortices is altered in autism. These results suggest that the verbal disorder in autism could be associated with an altered balance of slow and fast auditory oscillations, and that this anomaly could compromise the mapping between sensory input and higher-level cognitive representations. PMID:25870556
36 CFR 1193.41 - Input, control, and mechanical functions.
Code of Federal Regulations, 2012 CFR
2012-07-01
.... Provide at least one mode that does not require user speech. (i) Operable with limited cognitive skills. Provide at least one mode that minimizes the cognitive, memory, language, and learning skills required of...
36 CFR 1193.41 - Input, control, and mechanical functions.
Code of Federal Regulations, 2014 CFR
2014-07-01
.... Provide at least one mode that does not require user speech. (i) Operable with limited cognitive skills. Provide at least one mode that minimizes the cognitive, memory, language, and learning skills required of...
36 CFR § 1193.41 - Input, control, and mechanical functions.
Code of Federal Regulations, 2013 CFR
2013-07-01
.... Provide at least one mode that does not require user speech. (i) Operable with limited cognitive skills. Provide at least one mode that minimizes the cognitive, memory, language, and learning skills required of...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vondy, D.R.; Fowler, T.B.; Cunningham, G.W.
1979-07-01
User input data requirements are presented for certain special processors in a nuclear reactor computation system. These processors generally read data in formatted form and generate binary interface data files. Some data processing is done to convert from the user oriented form to the interface file forms. The VENTURE diffusion theory neutronics code and other computation modules in this system use the interface data files which are generated.
Schaadt, Gesa; van der Meer, Elke; Pannekamp, Ann; Oberecker, Regine; Männel, Claudia
2018-01-17
During information processing, individuals benefit from bimodally presented input, as has been demonstrated for speech perception (i.e., printed letters and speech sounds) or the perception of emotional expressions (i.e., facial expression and voice tuning). While typically developing individuals show this bimodal benefit, school children with dyslexia do not. Currently, it is unknown whether the bimodal processing deficit in dyslexia also occurs for visual-auditory speech processing that is independent of reading and spelling acquisition (i.e., no letter-sound knowledge is required). Here, we tested school children with and without spelling problems on their bimodal perception of video-recorded mouth movements pronouncing syllables. We analyzed the event-related potential Mismatch Response (MMR) to visual-auditory speech information and compared this response to the MMR to monomodal speech information (i.e., auditory-only, visual-only). We found a reduced MMR with later onset to visual-auditory speech information in children with spelling problems compared to children without spelling problems. Moreover, when comparing bimodal and monomodal speech perception, we found that children without spelling problems showed significantly larger responses in the visual-auditory experiment compared to the visual-only response, whereas children with spelling problems did not. Our results suggest that children with dyslexia exhibit general difficulties in bimodal speech perception independently of letter-speech sound knowledge, as apparent in altered bimodal speech perception and lacking benefit from bimodal information. This general deficit in children with dyslexia may underlie the previously reported reduced bimodal benefit for letter-speech sound combinations and similar findings in emotion perception. Copyright © 2018 Elsevier Ltd. All rights reserved.
Speech perception as an active cognitive process
Heald, Shannon L. M.; Nusbaum, Howard C.
2014-01-01
One view of speech perception is that acoustic signals are transformed into representations for pattern matching to determine linguistic structure. This process can be taken as a statistical pattern-matching problem, assuming relatively stable linguistic categories are characterized by neural representations related to auditory properties of speech that can be compared to speech input. This kind of pattern matching can be termed a passive process which implies rigidity of processing with few demands on cognitive processing. An alternative view is that speech recognition, even in early stages, is an active process in which speech analysis is attentionally guided. Note that this does not mean consciously guided but that information-contingent changes in early auditory encoding can occur as a function of context and experience. Active processing assumes that attention, plasticity, and listening goals are important in considering how listeners cope with adverse circumstances that impair hearing by masking noise in the environment or hearing loss. Although theories of speech perception have begun to incorporate some active processing, they seldom treat early speech encoding as plastic and attentionally guided. Recent research has suggested that speech perception is the product of both feedforward and feedback interactions between a number of brain regions that include descending projections perhaps as far downstream as the cochlea. It is important to understand how the ambiguity of the speech signal and constraints of context dynamically determine cognitive resources recruited during perception including focused attention, learning, and working memory. Theories of speech perception need to go beyond the current corticocentric approach in order to account for the intrinsic dynamics of the auditory encoding of speech. In doing so, this may provide new insights into ways in which hearing disorders and loss may be treated either through augmentation or therapy. PMID:24672438
NASA Astrophysics Data System (ADS)
Yildirim, Serdar; Montanari, Simona; Andersen, Elaine; Narayanan, Shrikanth S.
2003-10-01
Understanding the fine details of children's speech and gestural characteristics helps, among other things, in creating natural computer interfaces. We analyze the acoustic, lexical/non-lexical and spoken/gestural discourse characteristics of young children's speech using audio-video data gathered with a Wizard of Oz technique from 4- to 6-year-old children engaged in resolving a series of age-appropriate cognitive challenges. Fundamental and formant frequencies exhibited greater variations between subjects, consistent with previous results on read speech [Lee et al., J. Acoust. Soc. Am. 105, 1455-1468 (1999)]. Also, our analysis showed that, in a given bandwidth, the phonemic information contained in the speech of young children is significantly less than that in the speech of older children and adults. To enable an integrated analysis, a multi-track annotation board was constructed using the ANVIL tool kit [M. Kipp, Eurospeech 1367-1370 (2001)]. Along with speech transcriptions and acoustic analysis, non-lexical and discourse characteristics and children's gestures (facial expressions, body movements, hand/head movements) were annotated in a synchronized multilayer system. Initial results showed that younger children rely more on gestures to emphasize their verbal assertions. Younger children use non-lexical speech (e.g., um, huh) associated with frustration and pondering/reflecting more frequently than older ones. Younger children also repair more with humans than with the computer.
A speech-controlled environmental control system for people with severe dysarthria.
Hawley, Mark S; Enderby, Pam; Green, Phil; Cunningham, Stuart; Brownsell, Simon; Carmichael, James; Parker, Mark; Hatzis, Athanassios; O'Neill, Peter; Palmer, Rebecca
2007-06-01
Automatic speech recognition (ASR) can provide a rapid means of controlling electronic assistive technology. Off-the-shelf ASR systems function poorly for users with severe dysarthria because of the increased variability of their articulations. We have developed a limited vocabulary speaker dependent speech recognition application which has greater tolerance to variability of speech, coupled with a computerised training package which assists dysarthric speakers to improve the consistency of their vocalisations and provides more data for recogniser training. These applications, and their implementation as the interface for a speech-controlled environmental control system (ECS), are described. The results of field trials to evaluate the training program and the speech-controlled ECS are presented. The user-training phase increased the recognition rate from 88.5% to 95.4% (p<0.001). Recognition rates were good for people with even the most severe dysarthria in everyday usage in the home (mean word recognition rate 86.9%). Speech-controlled ECS were less accurate (mean task completion accuracy 78.6% versus 94.8%) but were faster to use than switch-scanning systems, even taking into account the need to repeat unsuccessful operations (mean task completion time 7.7s versus 16.9s, p<0.001). It is concluded that a speech-controlled ECS is a viable alternative to switch-scanning systems for some people with severe dysarthria and would lead, in many cases, to more efficient control of the home.
Speech Rate Normalization and Phonemic Boundary Perception in Cochlear-Implant Users
Newman, Rochelle S.; Goupell, Matthew J.
2017-01-01
Purpose Normal-hearing (NH) listeners rate normalize, temporarily remapping phonemic category boundaries to account for a talker's speech rate. It is unknown if adults who use auditory prostheses called cochlear implants (CI) can rate normalize, as CIs transmit degraded speech signals to the auditory nerve. Ineffective adjustment to rate information could explain some of the variability in this population's speech perception outcomes. Method Phonemes with manipulated voice-onset-time (VOT) durations were embedded in sentences with different speech rates. Twenty-three CI and 29 NH participants performed a phoneme identification task. NH participants heard the same unprocessed stimuli as the CI participants or stimuli degraded by a sine vocoder, simulating aspects of CI processing. Results CI participants showed larger rate normalization effects (6.6 ms) than the NH participants (3.7 ms) and had shallower (less reliable) category boundary slopes. NH participants showed similarly shallow slopes when presented acoustically degraded vocoded signals, but an equal or smaller rate effect in response to reductions in available spectral and temporal information. Conclusion CI participants can rate normalize, despite their degraded speech input, and show a larger rate effect compared to NH participants. CI participants may particularly rely on rate normalization to better maintain perceptual constancy of the speech signal. PMID:28395319
Electrophysiological evidence for a self-processing advantage during audiovisual speech integration.
Treille, Avril; Vilain, Coriandre; Kandel, Sonia; Sato, Marc
2017-09-01
Previous electrophysiological studies have provided strong evidence for early multisensory integrative mechanisms during audiovisual speech perception. From these studies, one unanswered issue is whether hearing our own voice and seeing our own articulatory gestures facilitate speech perception, possibly through a better processing and integration of sensory inputs with our own sensory-motor knowledge. The present EEG study examined the impact of self-knowledge during the perception of auditory (A), visual (V) and audiovisual (AV) speech stimuli that were previously recorded from the participant or from a speaker he/she had never met. Audiovisual interactions were estimated by comparing N1 and P2 auditory evoked potentials during the bimodal condition (AV) with the sum of those observed in the unimodal conditions (A + V). In line with previous EEG studies, our results revealed an amplitude decrease of P2 auditory evoked potentials in AV compared to A + V conditions. Crucially, a temporal facilitation of N1 responses was observed during the visual perception of self speech movements compared to those of another speaker. This facilitation was negatively correlated with the saliency of visual stimuli. These results provide evidence for a temporal facilitation of the integration of auditory and visual speech signals when the visual situation involves our own speech gestures.
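A compact, hypothetical sketch of the additive-model comparison described here (the audiovisual response versus the sum of the unimodal auditory and visual responses) is given below with random placeholder epochs; it is meant only to make the AV versus A + V contrast concrete, not to reproduce the study's EEG analysis.

```python
# AV vs. A + V additive-model comparison on placeholder epochs.
import numpy as np

rng = np.random.default_rng(3)
n_trials, n_times = 100, 300   # e.g., 300 samples around stimulus onset

def erp(epochs):
    return epochs.mean(axis=0)  # average across trials

A, V, AV = (rng.standard_normal((n_trials, n_times)) for _ in range(3))
interaction = erp(AV) - (erp(A) + erp(V))  # nonzero values suggest AV integration
# N1/P2 amplitudes and latencies would then be measured on erp(AV) and on
# erp(A) + erp(V) within their typical time windows.
```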
A Selective Deficit in Phonetic Recalibration by Text in Developmental Dyslexia.
Keetels, Mirjam; Bonte, Milene; Vroomen, Jean
2018-01-01
Upon hearing an ambiguous speech sound, listeners may adjust their perceptual interpretation of the speech input in accordance with contextual information, like accompanying text or lipread speech (i.e., phonetic recalibration; Bertelson et al., 2003). As developmental dyslexia (DD) has been associated with reduced integration of text and speech sounds, we investigated whether this deficit becomes manifest when text is used to induce this type of audiovisual learning. Adults with DD and normal readers were exposed to ambiguous consonants halfway between /aba/ and /ada/ together with text or lipread speech. After this audiovisual exposure phase, they categorized auditory-only ambiguous test sounds. Results showed that individuals with DD, unlike normal readers, did not use text to recalibrate their phoneme categories, whereas their recalibration by lipread speech was spared. Individuals with DD demonstrated similar deficits when ambiguous vowels (halfway between /wIt/ and /wet/) were recalibrated by text. These findings indicate that DD is related to a specific letter-speech sound association deficit that extends over phoneme classes (vowels and consonants), but - as lipreading was spared - does not extend to a more general audio-visual integration deficit. In particular, these results highlight diminished reading-related audiovisual learning in addition to the commonly reported phonological problems in developmental dyslexia.
Using DEDICOM for completely unsupervised part-of-speech tagging.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chew, Peter A.; Bader, Brett William; Rozovskaya, Alla
A standard and widespread approach to part-of-speech tagging is based on Hidden Markov Models (HMMs). An alternative approach, pioneered by Schuetze (1993), induces parts of speech from scratch using singular value decomposition (SVD). We introduce DEDICOM as an alternative to SVD for part-of-speech induction. DEDICOM retains the advantages of SVD in that it is completely unsupervised: no prior knowledge is required to induce either the tagset or the associations of terms with tags. However, unlike SVD, it is also fully compatible with the HMM framework, in that it can be used to estimate emission- and transition-probability matrices which can then be used as the input for an HMM. We apply the DEDICOM method to the CONLL corpus (CONLL 2000) and compare the output of DEDICOM to the part-of-speech tags given in the corpus, and find that the correlation (almost 0.5) is quite high. Using DEDICOM, we also estimate part-of-speech ambiguity for each term, and find that these estimates correlate highly with part-of-speech ambiguity as measured in the original corpus (around 0.88). Finally, we show how the output of DEDICOM can be evaluated and compared against the more familiar output of supervised HMM-based tagging.
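As a rough illustration of the factorization behind this approach (and not the authors' implementation), the sketch below fits the two-way DEDICOM model X ≈ A R Aᵀ to a toy term-by-term co-occurrence matrix by alternating an exact least-squares update of R with a normalized gradient step on A; real implementations use more careful alternating least-squares schemes.

```python
import numpy as np

def dedicom(X, k, n_iter=500, step=0.02):
    """Fit X ~ A @ R @ A.T (two-way DEDICOM) by crude alternating minimization.

    A (n x k): term-to-'tag' loadings; R (k x k): asymmetric tag-to-tag structure,
    which can be read as unnormalized transition weights for an HMM-style model.
    """
    # Initialize A from the leading left singular vectors of X.
    U, _, _ = np.linalg.svd(X)
    A = U[:, :k].copy()
    for _ in range(n_iter):
        P = np.linalg.pinv(A)
        R = P @ X @ P.T                          # exact least-squares R given A
        E = X - A @ R @ A.T
        grad = -2.0 * (E @ A @ R.T + E.T @ A @ R)
        A -= step * grad / (np.linalg.norm(grad) + 1e-12)  # normalized gradient step
    return A, R

# Toy co-occurrence matrix (rows: preceding term, columns: following term).
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(50, 50)).astype(float)
A, R = dedicom(X, k=5)
print("relative reconstruction error:",
      np.linalg.norm(X - A @ R @ A.T) / np.linalg.norm(X))
```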
Yildiz, Izzet B.; von Kriegstein, Katharina; Kiebel, Stefan J.
2013-01-01
Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents—an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments. PMID:24068902
Field-testing the new DECtalk PC system for medical applications
NASA Technical Reports Server (NTRS)
Grams, R. R.; Smillov, A.; Li, B.
1992-01-01
Synthesized human speech has now reached a new level of performance. With the introduction of DEC's new DECtalk PC, the small system developer will have a very powerful tool for creative design. It has been our privilege to be involved in the beta-testing of this new device and to add a medical dictionary which covers a wide range of medical terminology. With the inherent board level understanding of speech synthesis and the medical dictionary, it is now possible to provide full digital speech output for all medical files and terms. The application of these tools will cover a wide range of options for the future and allow a new dimension in dealing with the complex user interface experienced in medical practice.
Emerging technologies with potential for objectively evaluating speech recognition skills.
Rawool, Vishakha Waman
2016-01-01
Work-related exposure to noise and other ototoxins can cause damage to the cochlea, synapses between the inner hair cells, the auditory nerve fibers, and higher auditory pathways, leading to difficulties in recognizing speech. Procedures designed to determine speech recognition scores (SRS) in an objective manner can be helpful in disability compensation cases where the worker claims to have poor speech perception due to exposure to noise or ototoxins. Such measures can also be helpful in determining SRS in individuals who cannot provide reliable responses to speech stimuli, including patients with Alzheimer's disease, traumatic brain injuries, and infants with and without hearing loss. Cost-effective neural monitoring hardware and software is being rapidly refined due to the high demand for neurogaming (games involving the use of brain-computer interfaces), health, and other applications. More specifically, two related advances in neuro-technology include relative ease in recording neural activity and availability of sophisticated analysing techniques. These techniques are reviewed in the current article and their applications for developing objective SRS procedures are proposed. Issues related to neuroaudioethics (ethics related to collection of neural data evoked by auditory stimuli including speech) and neurosecurity (preservation of a person's neural mechanisms and free will) are also discussed.
Dimension-Based Statistical Learning Affects Both Speech Perception and Production.
Lehet, Matthew; Holt, Lori L
2017-04-01
Multiple acoustic dimensions signal speech categories. However, dimensions vary in their informativeness; some are more diagnostic of category membership than others. Speech categorization reflects these dimensional regularities such that diagnostic dimensions carry more "perceptual weight" and more effectively signal category membership to native listeners. Yet perceptual weights are malleable. When short-term experience deviates from long-term language norms, such as in a foreign accent, the perceptual weight of acoustic dimensions in signaling speech category membership rapidly adjusts. The present study investigated whether rapid adjustments in listeners' perceptual weights in response to speech that deviates from the norms also affects listeners' own speech productions. In a word recognition task, the correlation between two acoustic dimensions signaling consonant categories, fundamental frequency (F0) and voice onset time (VOT), matched the correlation typical of English, and then shifted to an "artificial accent" that reversed the relationship, and then shifted back. Brief, incidental exposure to the artificial accent caused participants to down-weight perceptual reliance on F0, consistent with previous research. Throughout the task, participants were intermittently prompted with pictures to produce these same words. In the block in which listeners heard the artificial accent with a reversed F0 × VOT correlation, F0 was a less robust cue to voicing in listeners' own speech productions. The statistical regularities of short-term speech input affect both speech perception and production, as evidenced via shifts in how acoustic dimensions are weighted. Copyright © 2016 Cognitive Science Society, Inc.
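One common way to quantify the "perceptual weight" of each acoustic dimension (not necessarily the analysis used above) is to regress listeners' categorical responses on the standardized cue values; down-weighting of F0 in the artificial-accent block would then appear as a smaller F0 coefficient. A sketch with invented trial data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def perceptual_weights(vot_ms, f0_hz, responses):
    """Standardized logistic-regression weights for VOT and F0.

    responses: 1 for a 'voiceless' categorization, 0 for 'voiced'.
    """
    X = StandardScaler().fit_transform(np.column_stack([vot_ms, f0_hz]))
    model = LogisticRegression().fit(X, responses)
    w_vot, w_f0 = model.coef_[0]
    return w_vot, w_f0

# Hypothetical block of trials with the canonical English F0-VOT correlation;
# in an 'artificial accent' block the fitted F0 weight would be expected to shrink.
rng = np.random.default_rng(0)
vot = rng.uniform(5, 45, 400)
f0 = 200 + 0.8 * vot + rng.normal(0, 5, 400)                    # canonical correlation
resp = ((0.15 * vot + 0.05 * (f0 - 200) + rng.normal(0, 1, 400)) > 3).astype(int)
print("canonical block weights (VOT, F0):", perceptual_weights(vot, f0, resp))
```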
A measure for assessing the effects of audiovisual speech integration.
Altieri, Nicholas; Townsend, James T; Wenger, Michael J
2014-06-01
We propose a measure of audiovisual speech integration that takes into account accuracy and response times. This measure should prove beneficial for researchers investigating multisensory speech recognition, since it relates to normal-hearing and aging populations. As an example, age-related sensory decline influences both the rate at which one processes information and the ability to utilize cues from different sensory modalities. Our function assesses integration when both auditory and visual information are available, by comparing performance on these audiovisual trials with theoretical predictions for performance under the assumptions of parallel, independent self-terminating processing of single-modality inputs. We provide example data from an audiovisual identification experiment and discuss applications for measuring audiovisual integration skills across the life span.
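The comparison described above, observed audiovisual performance against the prediction of parallel, independent, self-terminating processing of the single modalities, is in the spirit of a workload-capacity analysis. The sketch below computes the standard RT-based capacity coefficient as an illustration; it does not reproduce the authors' accuracy-weighted measure, and the response times are simulated.

```python
import numpy as np

def survivor(rts, t_grid):
    """Empirical survivor function S(t) = P(RT > t)."""
    rts = np.asarray(rts, dtype=float)
    return np.array([(rts > t).mean() for t in t_grid])

def capacity(rt_av, rt_a, rt_v, t_grid):
    """Capacity coefficient C(t) = log S_AV(t) / log(S_A(t) * S_V(t)).

    C(t) > 1 suggests audiovisual gains beyond the parallel, independent,
    self-terminating (race) prediction from the single-modality conditions.
    """
    s_av, s_a, s_v = (survivor(x, t_grid) for x in (rt_av, rt_a, rt_v))
    eps = 1e-9
    return np.log(s_av + eps) / np.log(s_a * s_v + eps)

# Hypothetical correct-trial response times (seconds) for the three conditions.
rng = np.random.default_rng(0)
rt_a = rng.normal(0.90, 0.12, 200)
rt_v = rng.normal(0.95, 0.12, 200)
rt_av = rng.normal(0.80, 0.10, 200)
t_grid = np.linspace(0.6, 1.2, 13)
print(np.round(capacity(rt_av, rt_a, rt_v, t_grid), 2))
```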
NASA Astrophysics Data System (ADS)
Balbin, Jessie R.; Padilla, Dionis A.; Fausto, Janette C.; Vergara, Ernesto M.; Garcia, Ramon G.; Delos Angeles, Bethsedea Joy S.; Dizon, Neil John A.; Mardo, Mark Kevin N.
2017-02-01
This research translates a series of hand gestures into a word and produces its equivalent sound, as it is read and said with a Filipino accent, using Support Vector Machine classification and Mel Frequency Cepstral Coefficient analysis. The system also detects Filipino speech input and translates the spoken words into Filipino text. The study aims to help the Filipino deaf community impart their thoughts through hand gestures and communicate with people who cannot read hand gestures. It also helps literate deaf users simply read the spoken words relayed to them through the Filipino speech-to-text system.
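The recognition pipeline described (MFCC features classified by a support vector machine) can be sketched roughly as follows, using librosa and scikit-learn as stand-ins; the feature summary, word labels, and synthetic data are assumptions for illustration, not the authors' system.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def mfcc_features(wav_path, n_mfcc=13):
    """Summarize one recorded word as its mean MFCC vector."""
    y, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

# With a real corpus, X would be built from mfcc_features(path) for each recording;
# random vectors stand in here so the sketch runs end to end.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (40, 13)), rng.normal(1, 1, (40, 13))])
y = np.array(["salamat"] * 40 + ["kumusta"] * 40)   # hypothetical Filipino word labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
clf = SVC(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```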
Implementing Artificial Intelligence Behaviors in a Virtual World
NASA Technical Reports Server (NTRS)
Krisler, Brian; Thome, Michael
2012-01-01
In this paper, we present a look at the current state of the art in human-computer interface technologies, including intelligent interactive agents, natural speech interaction and gesture-based interfaces. We describe our use of these technologies to implement a cost-effective, immersive experience in a public region in Second Life. We provision our artificial agent as a German Shepherd Dog avatar with an external rules engine controlling its behavior and movement. To interact with the avatar, we implemented a natural language and gesture system allowing human avatars to use speech and physical gestures rather than interacting via a keyboard and mouse. The result is a system that allows multiple humans to interact naturally with AI avatars by playing games such as fetch with a flying disk and even practicing obedience exercises using voice and gesture, a natural-seeming day in the park.
Spacelab interface development test, volume 1, sections 1-6
NASA Technical Reports Server (NTRS)
Harris, L. H.
1979-01-01
Data recorded during the following tests is presented: pulse coded modulator master unit to Spacelab (S/L) interface, master timing unit to S/L interface, multiplexer-demultiplexer/serial input-output to S/L interface, and special tests.
2018-01-01
Abstract In real-world environments, humans comprehend speech by actively integrating prior knowledge (P) and expectations with sensory input. Recent studies have revealed effects of prior information in temporal and frontal cortical areas and have suggested that these effects are underpinned by enhanced encoding of speech-specific features, rather than a broad enhancement or suppression of cortical activity. However, in terms of the specific hierarchical stages of processing involved in speech comprehension, the effects of integrating bottom-up sensory responses and top-down predictions are still unclear. In addition, it is unclear whether the predictability that comes with prior information may differentially affect speech encoding relative to the perceptual enhancement that comes with that prediction. One way to investigate these issues is through examining the impact of P on indices of cortical tracking of continuous speech features. Here, we did this by presenting participants with degraded speech sentences that either were or were not preceded by a clear recording of the same sentences while recording non-invasive electroencephalography (EEG). We assessed the impact of prior information on an isolated index of cortical tracking that reflected phoneme-level processing. Our findings suggest the possibility that prior information affects the early encoding of natural speech in a dual manner. Firstly, the availability of prior information, as hypothesized, enhanced the perceived clarity of degraded speech, which was positively correlated with changes in phoneme-level encoding across subjects. In addition, P induced an overall reduction of this cortical measure, which we interpret as resulting from the increase in predictability. PMID:29662947
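Cortical "tracking" of a continuous speech feature is often quantified with a temporal response function: a time-lagged ridge regression from the stimulus feature (for example, a phoneme-level predictor) to the EEG, whose prediction accuracy indexes encoding strength. The sketch below is a generic version of that analysis, not the authors' pipeline, and uses simulated signals.

```python
import numpy as np

def lagged_design(stimulus, max_lag):
    """Stack time-lagged copies of a 1-D stimulus feature (lags 0..max_lag samples)."""
    n = len(stimulus)
    X = np.zeros((n, max_lag + 1))
    for lag in range(max_lag + 1):
        X[lag:, lag] = stimulus[:n - lag]
    return X

def fit_trf(stimulus, eeg, max_lag, ridge=1.0):
    """Ridge-regression temporal response function mapping stimulus -> one EEG channel."""
    X = lagged_design(stimulus, max_lag)
    w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ eeg)
    return w, X @ w

# Simulated data: a sparse phoneme-onset predictor and an EEG channel that responds
# to it with a ~100 ms latency (fs = 100 Hz), plus noise.
rng = np.random.default_rng(0)
fs, n = 100, 6000
stim = (rng.random(n) < 0.05).astype(float)
eeg = np.convolve(stim, np.hanning(20), mode="full")[:n]
eeg = np.roll(eeg, 10) + rng.normal(0, 0.5, n)

w, pred = fit_trf(stim, eeg, max_lag=30)
tracking = np.corrcoef(pred, eeg)[0, 1]   # prediction correlation as a tracking index
print(f"prediction correlation: {tracking:.2f}")
```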
Analog-to-digital conversion to accommodate the dynamics of live music in hearing instruments.
Hockley, Neil S; Bahlmann, Frauke; Fulton, Bernadette
2012-09-01
Hearing instrument design focuses on the amplification of speech to reduce the negative effects of hearing loss. Many amateur and professional musicians, along with music enthusiasts, also require their hearing instruments to perform well when listening to the frequent, high amplitude peaks of live music. One limitation, in most current digital hearing instruments with 16-bit analog-to-digital (A/D) converters, is that the compressor before the A/D conversion is limited to 95 dB (SPL) or less at the input. This is more than adequate for the dynamic range of speech; however, this does not accommodate the amplitude peaks present in live music. The hearing instrument input compression system can be adjusted to accommodate for the amplitudes present in music that would otherwise be compressed before the A/D converter in the hearing instrument. The methodology behind this technological approach will be presented along with measurements to demonstrate its effectiveness.
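The headroom issue can be made concrete with a quick calculation: an ideal 16-bit converter spans roughly 96 dB of dynamic range, so if the analog stage before the A/D converter limits input to about 95 dB SPL, live-music peaks are compressed or clipped before digitization. A small illustrative computation; the peak level for live music used here is a rough assumption, not a figure from the article.

```python
import math

def adc_dynamic_range_db(bits):
    """Ideal dynamic range of a linear PCM converter: 20*log10(2**bits)."""
    return 20 * math.log10(2 ** bits)

print(f"16-bit range: {adc_dynamic_range_db(16):.1f} dB")   # ~96.3 dB

input_ceiling_spl = 95      # pre-A/D compressor limit quoted in the abstract
live_music_peak_spl = 110   # assumed on-stage peak level
print(f"music peaks exceed the input ceiling by ~{live_music_peak_spl - input_ceiling_spl} dB")
```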
Phonologic-graphemic transcodifier for Portuguese Language spoken in Brazil (PLB)
NASA Astrophysics Data System (ADS)
Fragadasilva, Francisco Jose; Saotome, Osamu; Deoliveira, Carlos Alberto
An automatic speech-to-text transformer system, suited to unlimited vocabulary, is presented. The basic acoustic unit considered are the allophones of the phonemes corresponding to the Portuguese language spoken in Brazil (PLB). The input to the system is a phonetic sequence, from a former step of isolated word recognition of slowly spoken speech. In a first stage, the system eliminates phonetic elements that don't belong to PLB. Using knowledge sources such as phonetics, phonology, orthography, and PLB specific lexicon, the output is a sequence of written words, ordered by probabilistic criterion that constitutes the set of graphemic possibilities to that input sequence. Pronunciation differences of some regions of Brazil are considered, but only those that cause differences in phonological transcription, because those of phonetic level are absorbed, during the transformation to phonological level. In the final stage, all possible written words are analyzed for orthography and grammar point of view, to eliminate the incorrect ones.
[The endpoint detection of cough signal in continuous speech].
Yang, Guoqing; Mo, Hongqiang; Li, Wen; Lian, Lianfang; Zheng, Zeguang
2010-06-01
The endpoint detection of cough signals in continuous speech was studied in order to improve the efficiency and accuracy of manual and computer-based automatic recognition. First, the short-time zero-crossing rate (ZCR) is used to identify suspicious coughs, and the short-time energy threshold is derived from the acoustic characteristics of coughing. The short-time energy is then combined with the short-time ZCR to implement endpoint detection of coughs in continuous speech. To evaluate the method, the actual number of coughs in each recording was first identified by two experienced doctors using a graphical user interface (GUI). Second, the recordings were analyzed by the automatic endpoint detection program under Matlab 7.0. Finally, the comparison between these two results showed that the error rate of undetected coughs was 2.18%, and 98.13% of noise, silence and speech was removed. The method of setting the short-time energy threshold is robust. The endpoint detection program can remove most speech and noise, thus maintaining a low error rate.
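A minimal frame-based version of the described detector (short-time energy gated by the zero-crossing rate) might look like the following; the frame length, thresholds, and sampling rate are placeholders, not the values used in the study.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def detect_cough_frames(x, fs, frame_ms=25, hop_ms=10, energy_thresh=0.01, zcr_thresh=0.15):
    """Flag frames whose short-time energy and zero-crossing rate suggest a cough burst."""
    frame_len, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    frames = frame_signal(x, frame_len, hop)
    energy = (frames ** 2).mean(axis=1)
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    # A cough tends to combine a high-energy burst with a fairly high ZCR (noisy onset).
    return (energy > energy_thresh) & (zcr > zcr_thresh)

# Synthetic example: quiet background with one loud noisy burst standing in for a cough.
fs = 8000
rng = np.random.default_rng(0)
x = rng.normal(0, 0.005, 2 * fs)
x[fs:fs + 800] += rng.normal(0, 0.5, 800)
flags = detect_cough_frames(x, fs)
print("suspected cough frames:", np.where(flags)[0])
```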
Musical melody and speech intonation: singing a different tune.
Zatorre, Robert J; Baum, Shari R
2012-01-01
Music and speech are often cited as characteristically human forms of communication. Both share the features of hierarchical structure, complex sound systems, and sensorimotor sequencing demands, and both are used to convey and influence emotions, among other functions [1]. Both music and speech also prominently use acoustical frequency modulations, perceived as variations in pitch, as part of their communicative repertoire. Given these similarities, and the fact that pitch perception and production involve the same peripheral transduction system (cochlea) and the same production mechanism (vocal tract), it might be natural to assume that pitch processing in speech and music would also depend on the same underlying cognitive and neural mechanisms. In this essay we argue that the processing of pitch information differs significantly for speech and music; specifically, we suggest that there are two pitch-related processing systems, one for more coarse-grained, approximate analysis and one for more fine-grained accurate representation, and that the latter is unique to music. More broadly, this dissociation offers clues about the interface between sensory and motor systems, and highlights the idea that multiple processing streams are a ubiquitous feature of neuro-cognitive architectures.
DeVries, Lindsay; Scheperle, Rachel; Bierer, Julie Arenberg
2016-06-01
Variability in speech perception scores among cochlear implant listeners may largely reflect the variable efficacy of implant electrodes to convey stimulus information to the auditory nerve. In the present study, three metrics were applied to assess the quality of the electrode-neuron interface of individual cochlear implant channels: the electrically evoked compound action potential (ECAP), the estimation of electrode position using computerized tomography (CT), and behavioral thresholds using focused stimulation. The primary motivation of this approach is to evaluate the ECAP as a site-specific measure of the electrode-neuron interface in the context of two peripheral factors that likely contribute to degraded perception: large electrode-to-modiolus distance and reduced neural density. Ten unilaterally implanted adults with Advanced Bionics HiRes90k devices participated. ECAPs were elicited with monopolar stimulation within a forward-masking paradigm to construct channel interaction functions (CIF), behavioral thresholds were obtained with quadrupolar (sQP) stimulation, and data from imaging provided estimates of electrode-to-modiolus distance and scalar location (scala tympani (ST), intermediate, or scala vestibuli (SV)) for each electrode. The width of the ECAP CIF was positively correlated with electrode-to-modiolus distance; both of these measures were also influenced by scalar position. The ECAP peak amplitude was negatively correlated with behavioral thresholds. Moreover, subjects with low behavioral thresholds and large ECAP amplitudes, averaged across electrodes, tended to have higher speech perception scores. These results suggest a potential clinical role for the ECAP in the objective assessment of individual cochlear implant channels, with the potential to improve speech perception outcomes.
Detecting Nasal Vowels in Speech Interfaces Based on Surface Electromyography
Freitas, João; Teixeira, António; Silva, Samuel; Oliveira, Catarina; Dias, Miguel Sales
2015-01-01
Nasality is a very important characteristic of several languages, European Portuguese being one of them. This paper addresses the challenge of nasality detection in surface electromyography (EMG) based speech interfaces. We explore the existence of useful information about the velum movement and also assess if muscles deeper down in the face and neck region can be measured using surface electrodes, and the best electrode location to do so. The procedure we adopted uses Real-Time Magnetic Resonance Imaging (RT-MRI), collected from a set of speakers, providing a method to interpret EMG data. By ensuring compatible data recording conditions, and proper time alignment between the EMG and the RT-MRI data, we are able to accurately estimate the time when the velum moves and the type of movement when a nasal vowel occurs. The combination of these two sources revealed interesting and distinct characteristics in the EMG signal when a nasal vowel is uttered, which motivated a classification experiment. Overall results of this experiment provide evidence that it is possible to detect velum movement using sensors positioned below the ear, between mastoid process and the mandible, in the upper neck region. In a frame-based classification scenario, error rates as low as 32.5% for all speakers and 23.4% for the best speaker have been achieved, for nasal vowel detection. This outcome stands as an encouraging result, fostering the grounds for deeper exploration of the proposed approach as a promising route to the development of an EMG-based speech interface for languages with strong nasal characteristics. PMID:26069968
Modifying Speech to Children based on their Perceived Phonetic Accuracy
Julien, Hannah M.; Munson, Benjamin
2014-01-01
Purpose We examined the relationship between adults' perception of the accuracy of children's speech, and acoustic detail in their subsequent productions to children. Methods Twenty-two adults participated in a task in which they rated the accuracy of 2- and 3-year-old children's word-initial /s/and /∫/ using a visual analog scale (VAS), then produced a token of the same word as if they were responding to the child whose speech they had just rated. Result The duration of adults' fricatives varied as a function of their perception of the accuracy of children's speech: longer fricatives were produced following productions that they rated as inaccurate. This tendency to modify duration in response to perceived inaccurate tokens was mediated by measures of self-reported experience interacting with children. However, speakers did not increase the spectral distinctiveness of their fricatives following the perception of inaccurate tokens. Conclusion These results suggest that adults modify temporal features of their speech in response to perceiving children's inaccurate productions. These longer fricatives are potentially both enhanced input to children, and an error-corrective signal. PMID:22744140
Live Speech Driven Head-and-Eye Motion Generators.
Le, Binh H; Ma, Xiaohan; Deng, Zhigang
2012-11-01
This paper describes a fully automated framework to generate realistic head motion, eye gaze, and eyelid motion simultaneously based on live (or recorded) speech input. Its central idea is to learn separate yet interrelated statistical models for each component (head motion, gaze, or eyelid motion) from a prerecorded facial motion data set: 1) Gaussian Mixture Models and a gradient descent optimization algorithm are employed to generate head motion from speech features; 2) a Nonlinear Dynamic Canonical Correlation Analysis model is used to synthesize eye gaze from head motion and speech features; and 3) nonnegative linear regression is used to model voluntary eyelid motion and a log-normal distribution is used to describe involuntary eye blinks. Several user studies are conducted to evaluate the effectiveness of the proposed speech-driven head and eye motion generator using the well-established paired comparison methodology. Our evaluation results clearly show that this approach can significantly outperform the state-of-the-art head and eye motion generation algorithms. In addition, a novel mocap+video hybrid data acquisition technique is introduced to record high-fidelity head movement, eye gaze, and eyelid motion simultaneously.
ERIC Educational Resources Information Center
Vanormelingen, Liesbeth; Gillis, Steven
2016-01-01
This article investigates the amount of input and the quality of mother-child interactions in mothers who differ in socio-economic status (SES): mid-to-high SES (mhSES) and low SES. The amount of input was measured as the number of utterances per hour, the total duration of speech per hour and the number of turns per hour. The quality of the…
Hu, Hongmei; Krasoulis, Agamemnon; Lutman, Mark; Bleeck, Stefan
2013-01-01
Cochlear implants (CIs) require efficient speech processing to maximize information transmission to the brain, especially in noise. A novel CI processing strategy was proposed in our previous studies, in which sparsity-constrained non-negative matrix factorization (NMF) was applied to the envelope matrix in order to improve the CI performance in noisy environments. It showed that the algorithm needs to be adaptive, rather than fixed, in order to adjust to acoustical conditions and individual characteristics. Here, we explore the benefit of a system that allows the user to adjust the signal processing in real time according to their individual listening needs and their individual hearing capabilities. In this system, which is based on MATLAB®, SIMULINK® and the xPC Target™ environment, the input/output (I/O) boards are interfaced between the SIMULINK blocks and the CI stimulation system, such that the output can be controlled successfully in the manner of a hardware-in-the-loop (HIL) simulation, hence offering a convenient way to implement a real time signal processing module that does not require any low level language. The sparsity constrained parameter of the algorithm was adapted online subjectively during an experiment with normal-hearing subjects and noise vocoded speech simulation. Results show that subjects chose different parameter values according to their own intelligibility preferences, indicating that adaptive real time algorithms are beneficial to fully explore subjective preferences. We conclude that the adaptive real time systems are beneficial for the experimental design, and such systems allow one to conduct psychophysical experiments with high ecological validity. PMID:24129021
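The core operation, a sparsity-constrained NMF of the channel-envelope matrix, can be sketched with standard multiplicative updates plus an L1 penalty on the activation matrix. This is a generic toy version, not the authors' strategy or their real-time implementation, and the envelope matrix here is random stand-in data.

```python
import numpy as np

def sparse_nmf(V, k, n_iter=200, sparsity=0.1, seed=0):
    """Factor V (channels x time, nonnegative) as W @ H with an L1 penalty on H.

    Multiplicative updates for the Euclidean cost; `sparsity` controls how
    sparse the temporal activations in H are pushed to be.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    eps = 1e-9
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + eps)   # L1 term enters the denominator
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy 'envelope matrix': 22 CI channels x 500 time frames of nonnegative envelopes.
rng = np.random.default_rng(1)
V = np.abs(rng.normal(0, 1, (22, 500)))
W, H = sparse_nmf(V, k=8)
V_reduced = W @ H   # reduced-rank, sparser reconstruction of the channel envelopes
print("relative reconstruction error:", np.linalg.norm(V - V_reduced) / np.linalg.norm(V))
```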
Choi, Kup-Sze; Chan, Tak-Yin
2015-03-01
The aim was to investigate the feasibility of using a tablet device as a user interface for students with upper extremity disabilities to input mathematics efficiently into a computer. A touch-input system using a tablet device as the user interface was proposed to assist these students in writing mathematics. User-switchable and context-specific keyboard layouts were designed to streamline the input process. The system could be integrated with conventional computer systems with only minor software setup. A two-week pre-post test study involving five participants was conducted to evaluate the performance of the system and collect user feedback. The mathematics input efficiency of the participants was found to improve over the experiment sessions. In particular, their performance in entering trigonometric expressions with the touch-input system was significantly better than with conventional mathematics editing software using keyboard and mouse. The participants rated the touch-input system positively and were confident that they could operate it with ease given more practice. The proposed touch-input system provides a convenient way for students with hand impairment to write mathematics and has the potential to facilitate their mathematics learning. Implications for Rehabilitation: Students with upper extremity disabilities often face barriers to learning mathematics, which is largely based on handwriting. Conventional computer user interfaces are inefficient for them to input mathematics into a computer. A touch-input system with context-specific and user-switchable keyboard layouts was designed to improve the efficiency of mathematics input. Experimental results and user feedback suggested that the system has the potential to facilitate mathematics learning for these students.
A Voice and Mouse Input Interface for 3D Virtual Environments
NASA Technical Reports Server (NTRS)
Kao, David L.; Bryson, Steve T.
2003-01-01
There have been many success stories about how 3D input devices can be fully integrated into an immersive virtual environment. Electromagnetic trackers, optical trackers, gloves, and flying mice are just some of these input devices. Though we can use existing 3D input devices that are commonly used for VR applications, several factors prevent us from choosing these input devices for our applications. One main factor is that most of these tracking devices are not suitable for prolonged use due to the human fatigue associated with using them. A second factor is that many of them would occupy additional office space. Another factor is that many of the 3D input devices are expensive due to the unusual hardware that is required. For our VR applications, we want a user interface that works naturally with standard equipment. In this paper, we demonstrate applications of our proposed multimodal interface using a 3D dome display. We also show that effective data analysis can be achieved while scientists view their data rendered inside the dome display and perform user interactions simply using mouse and voice input. Though the spherical coordinate grid seems ideal for interaction using a 3D dome display, other non-spherical grids can be used as well.
CARE 3 user-friendly interface user's guide
NASA Technical Reports Server (NTRS)
Martensen, A. L.
1987-01-01
CARE 3 predicts the unreliability of highly reliable reconfigurable fault-tolerant systems that include redundant computers or computer systems. CARE3MENU is a user-friendly interface used to create an input for the CARE 3 program. The CARE3MENU interface has been designed to minimize user input errors. Although a CARE3MENU session may be successfully completed and all parameters may be within specified limits or ranges, the CARE 3 program is not guaranteed to produce meaningful results if the user incorrectly interprets the CARE 3 stochastic model. The CARE3MENU User Guide provides complete information on how to create a CARE 3 model with the interface. The CARE3MENU interface runs under the VAX/VMS operating system.
Look Who’s Talking NOW! Parentese Speech, Social Context, and Language Development Across Time
Ramírez-Esparza, Nairán; García-Sierra, Adrián; Kuhl, Patricia K.
2017-01-01
In previous studies, we found that the social interactions infants experience in their everyday lives at 11- and 14-months of age affect language ability at 24 months of age. These studies investigated relationships between the speech style (i.e., parentese speech vs. standard speech) and social context [i.e., one-on-one (1:1) vs. group] of language input in infancy and later speech development (i.e., at 24 months of age), controlling for socioeconomic status (SES). Results showed that the amount of exposure to parentese speech-1:1 in infancy was related to productive vocabulary at 24 months. The general goal of the present study was to investigate changes in (1) the pattern of social interactions between caregivers and their children from infancy to childhood and (2) relationships among speech style, social context, and language learning across time. Our study sample consisted of 30 participants from the previously published infant studies, evaluated at 33 months of age. Social interactions were assessed at home using digital first-person perspective recordings of the auditory environment. We found that caregivers use less parentese speech-1:1, and more standard speech-1:1, as their children get older. Furthermore, we found that the effects of parentese speech-1:1 in infancy on later language development at 24 months persist at 33 months of age. Finally, we found that exposure to standard speech-1:1 in childhood was the only social interaction that related to concurrent word production/use. Mediation analyses showed that standard speech-1:1 in childhood fully mediated the effects of parentese speech-1:1 in infancy on language development in childhood, controlling for SES. This study demonstrates that engaging in one-on-one interactions in infancy and later in life has important implications for language development. PMID:28676774
Speech Recognition as a Transcription Aid: A Randomized Comparison With Standard Transcription
Mohr, David N.; Turner, David W.; Pond, Gregory R.; Kamath, Joseph S.; De Vos, Cathy B.; Carpenter, Paul C.
2003-01-01
Objective. Speech recognition promises to reduce information entry costs for clinical information systems. It is most likely to be accepted across an organization if physicians can dictate without concerning themselves with real-time recognition and editing; assistants can then edit and process the computer-generated document. Our objective was to evaluate the use of speech-recognition technology in a randomized controlled trial using our institutional infrastructure. Design. Clinical note dictations from physicians in two specialty divisions were randomized to either a standard transcription process or a speech-recognition process. Secretaries and transcriptionists also were assigned randomly to each of these processes. Measurements. The duration of each dictation was measured. The amount of time spent processing a dictation to yield a finished document also was measured. Secretarial and transcriptionist productivity, defined as hours of secretary work per minute of dictation processed, was determined for speech recognition and standard transcription. Results. Secretaries in the endocrinology division were 87.3% (confidence interval, 83.3%, 92.3%) as productive with the speech-recognition technology as implemented in this study as they were using standard transcription. Psychiatry transcriptionists and secretaries were similarly less productive. Author, secretary, and type of clinical note were significant (p < 0.05) predictors of productivity. Conclusion. When implemented in an organization with an existing document-processing infrastructure (which included training and interfaces of the speech-recognition editor with the existing document entry application), speech recognition did not improve the productivity of secretaries or transcriptionists. PMID:12509359
A real-time phoneme counting algorithm and application for speech rate monitoring.
Aharonson, Vered; Aharonson, Eran; Raichlin-Levi, Katia; Sotzianu, Aviv; Amir, Ofer; Ovadia-Blechman, Zehava
2017-03-01
Adults who stutter can learn to control and improve their speech fluency by modifying their speaking rate. Existing speech therapy technologies can assist this practice by monitoring speaking rate and providing feedback to the patient, but cannot provide an accurate, quantitative measurement of speaking rate. Moreover, most technologies are too complex and costly to be used for home practice. We developed an algorithm and a smartphone application that monitor a patient's speaking rate in real time and provide user-friendly feedback to both patient and therapist. Our speaking rate computation is performed by a phoneme counting algorithm which implements spectral transition measure extraction to estimate phoneme boundaries. The algorithm is implemented in real time in a mobile application that presents its results in a user-friendly interface. The application incorporates two modes: one provides the patient with visual feedback of his/her speech rate for self-practice and another provides the speech therapist with recordings, speech rate analysis and tools to manage the patient's practice. The algorithm's phoneme counting accuracy was validated on ten healthy subjects who read a paragraph at slow, normal and fast paces, and was compared to manual counting of speech experts. Test-retest and intra-counter reliability were assessed. Preliminary results indicate differences of -4% to 11% between automatic and human phoneme counting. Differences were largest for slow speech. The application can thus provide reliable, user-friendly, real-time feedback for speaking rate control practice. Copyright © 2017 Elsevier Inc. All rights reserved.
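The described rate estimate, phoneme boundaries detected from a spectral transition measure, can be approximated generically as below; librosa and scipy are used as stand-ins, and the frame, mel, and peak-picking parameters are assumptions rather than the published algorithm's settings.

```python
import numpy as np
import librosa
from scipy.signal import find_peaks

def phonemes_per_second(y, sr, hop_length=160):
    """Rough speaking-rate estimate: count peaks of a spectral transition measure."""
    # Log-mel spectrogram (one frame every hop_length samples).
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40, hop_length=hop_length)
    logS = librosa.power_to_db(S)
    # Spectral transition measure: magnitude of frame-to-frame spectral change.
    stm = np.linalg.norm(np.diff(logS, axis=1), axis=0)
    # Peaks in the transition measure are taken as candidate phoneme boundaries.
    frame_rate = sr / hop_length
    peaks, _ = find_peaks(stm, distance=int(0.04 * frame_rate), prominence=np.std(stm))
    return len(peaks) / (len(y) / sr)

# Synthetic stand-in signal (alternating noise bursts); with real speech you would
# pass the recorded waveform instead.
sr = 16000
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, a, sr // 10) for a in [0.1, 0.5, 0.1, 0.6, 0.1, 0.4] * 5])
print(f"estimated rate: {phonemes_per_second(y, sr):.1f} 'phonemes'/s")
```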
The functional neuroanatomy of language
NASA Astrophysics Data System (ADS)
Hickok, Gregory
2009-09-01
There has been substantial progress over the last several years in understanding aspects of the functional neuroanatomy of language. Some of these advances are summarized in this review. It will be argued that recognizing speech sounds is carried out in the superior temporal lobe bilaterally, that the superior temporal sulcus bilaterally is involved in phonological-level aspects of this process, that the frontal/motor system is not central to speech recognition although it may modulate auditory perception of speech, that conceptual access mechanisms are likely located in the lateral posterior temporal lobe (middle and inferior temporal gyri), that speech production involves sensory-related systems in the posterior superior temporal lobe in the left hemisphere, that the interface between perceptual and motor systems is supported by a sensory-motor circuit for vocal tract actions (not dedicated to speech) that is very similar to sensory-motor circuits found in primate parietal lobe, and that verbal short-term memory can be understood as an emergent property of this sensory-motor circuit. These observations are considered within the context of a dual stream model of speech processing in which one pathway supports speech comprehension and the other supports sensory-motor integration. Additional topics of discussion include the functional organization of the planum temporale for spatial hearing and speech-related sensory-motor processes, the anatomical and functional basis of a form of acquired language disorder, conduction aphasia, the neural basis of vocabulary development, and sentence-level/grammatical processing.
Noise Robust Speech Recognition Applied to Voice-Driven Wheelchair
NASA Astrophysics Data System (ADS)
Sasou, Akira; Kojima, Hiroaki
2009-12-01
Conventional voice-driven wheelchairs usually employ headset microphones that are capable of achieving sufficient recognition accuracy, even in the presence of surrounding noise. However, such interfaces require users to wear sensors such as a headset microphone, which can be an impediment, especially for the hand disabled. Conversely, it is also well known that the speech recognition accuracy drastically degrades when the microphone is placed far from the user. In this paper, we develop a noise robust speech recognition system for a voice-driven wheelchair. This system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors. We verified the effectiveness of our system in experiments in different environments, and confirmed that our system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors.
Neurophysiology underlying influence of stimulus reliability on audiovisual integration.
Shatzer, Hannah; Shen, Stanley; Kerlin, Jess R; Pitt, Mark A; Shahin, Antoine J
2018-01-24
We tested the predictions of the dynamic reweighting model (DRM) of audiovisual (AV) speech integration, which posits that spectrotemporally reliable (informative) AV speech stimuli induce a reweighting of processing from low-level to high-level auditory networks. This reweighting decreases sensitivity to acoustic onsets and in turn increases tolerance to AV onset asynchronies (AVOA). EEG was recorded while subjects watched videos of a speaker uttering trisyllabic nonwords that varied in spectrotemporal reliability and asynchrony of the visual and auditory inputs. Subjects judged the stimuli as in-sync or out-of-sync. Results showed that subjects exhibited greater AVOA tolerance for non-blurred than blurred visual speech and for less than more degraded acoustic speech. Increased AVOA tolerance was reflected in reduced amplitude of the P1-P2 auditory evoked potentials, a neurophysiological indication of reduced sensitivity to acoustic onsets and successful AV integration. There was also sustained visual alpha band (8-14 Hz) suppression (desynchronization) following acoustic speech onsets for non-blurred vs. blurred visual speech, consistent with continuous engagement of the visual system as the speech unfolds. The current findings suggest that increased spectrotemporal reliability of acoustic and visual speech promotes robust AV integration, partly by suppressing sensitivity to acoustic onsets, in support of the DRM's reweighting mechanism. Increased visual signal reliability also sustains the engagement of the visual system with the auditory system to maintain alignment of information across modalities. © 2018 Federation of European Neuroscience Societies and John Wiley & Sons Ltd.
Natural user interface as a supplement of the holographic Raman tweezers
NASA Astrophysics Data System (ADS)
Tomori, Zoltan; Kanka, Jan; Kesa, Peter; Jakl, Petr; Sery, Mojmir; Bernatova, Silvie; Antalik, Marian; Zemánek, Pavel
2014-09-01
Holographic Raman tweezers (HRT) manipulate microobjects by controlling the positions of multiple optical traps via the mouse or joystick. Several recent attempts have exploited touch tablets, 2D cameras or the Kinect game console instead. We propose a multimodal "Natural User Interface" (NUI) approach integrating hand tracking, gesture recognition, eye tracking and speech recognition. For this purpose we exploited the low-cost "Leap Motion" and "MyGaze" sensors and a simple speech recognition program, "Tazti". We developed our own NUI software, which processes signals from the sensors and sends control commands to the HRT, which in turn controls the positions of the trapping beams, the micropositioning stage and the Raman spectra acquisition system. The system allows various modes of operation suited to specific tasks. Virtual tools (called "pin" and "tweezers") for manipulating particles are displayed on a transparent "overlay" window above the live camera image. The eye tracker identifies the position of the observed particle and uses it for autofocus. Laser trap manipulation navigated by the dominant hand can be combined with gesture recognition of the secondary hand. Speech command recognition is useful when both hands are busy. The proposed methods make manual control of HRT more efficient and also provide a good platform for future semi-automated and fully automated operation.
Cracking the Language Code: Neural Mechanisms Underlying Speech Parsing
McNealy, Kristin; Mazziotta, John C.; Dapretto, Mirella
2013-01-01
Word segmentation, detecting word boundaries in continuous speech, is a critical aspect of language learning. Previous research in infants and adults demonstrated that a stream of speech can be readily segmented based solely on the statistical and speech cues afforded by the input. Using functional magnetic resonance imaging (fMRI), the neural substrate of word segmentation was examined on-line as participants listened to three streams of concatenated syllables, containing either statistical regularities alone, statistical regularities and speech cues, or no cues. Despite the participants’ inability to explicitly detect differences between the speech streams, neural activity differed significantly across conditions, with left-lateralized signal increases in temporal cortices observed only when participants listened to streams containing statistical regularities, particularly the stream containing speech cues. In a second fMRI study, designed to verify that word segmentation had implicitly taken place, participants listened to trisyllabic combinations that occurred with different frequencies in the streams of speech they just heard (“words,” 45 times; “partwords,” 15 times; “nonwords,” once). Reliably greater activity in left inferior and middle frontal gyri was observed when comparing words with partwords and, to a lesser extent, when comparing partwords with nonwords. Activity in these regions, taken to index the implicit detection of word boundaries, was positively correlated with participants’ rapid auditory processing skills. These findings provide a neural signature of on-line word segmentation in the mature brain and an initial model with which to study developmental changes in the neural architecture involved in processing speech cues during language learning. PMID:16855090
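The statistical cue that supports this kind of segmentation is the transitional probability between adjacent syllables, P(B|A) = count(AB)/count(A); dips in transitional probability mark likely word boundaries. A small sketch on a made-up syllable stream (the syllable inventory and threshold are invented for illustration):

```python
import random
from collections import Counter

def transitional_probabilities(syllables):
    """P(next | current) for each adjacent syllable pair in the stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {pair: c / first_counts[pair[0]] for pair, c in pair_counts.items()}

def segment(syllables, tp, threshold=0.5):
    """Insert a word boundary wherever the forward transitional probability dips."""
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tp[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# Toy stream: the 'words' tupiro, golabu, bidaku concatenated in random order,
# so within-word TPs are high and across-word TPs are low.
random.seed(0)
stream = [s for _ in range(60) for w in [random.choice(["tupiro", "golabu", "bidaku"])]
          for s in [w[0:2], w[2:4], w[4:6]]]
tp = transitional_probabilities(stream)
print(segment(stream, tp)[:8])
```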
Dynamic speech representations in the human temporal lobe.
Leonard, Matthew K; Chang, Edward F
2014-09-01
Speech perception requires rapid integration of acoustic input with context-dependent knowledge. Recent methodological advances have allowed researchers to identify underlying information representations in primary and secondary auditory cortex and to examine how context modulates these representations. We review recent studies that focus on contextual modulations of neural activity in the superior temporal gyrus (STG), a major hub for spectrotemporal encoding. Recent findings suggest a highly interactive flow of information processing through the auditory ventral stream, including influences of higher-level linguistic and metalinguistic knowledge, even within individual areas. Such mechanisms may give rise to more abstract representations, such as those for words. We discuss the importance of characterizing representations of context-dependent and dynamic patterns of neural activity in the approach to speech perception research. Copyright © 2014 Elsevier Ltd. All rights reserved.
Do parents lead their children by the hand?
Ozçalişkan, Seyda; Goldin-Meadow, Susan
2005-08-01
The types of gesture+speech combinations children produce during the early stages of language development change over time. This change, in turn, predicts the onset of two-word speech and thus might reflect a cognitive transition that the child is undergoing. An alternative, however, is that the change merely reflects changes in the types of gesture + speech combinations that their caregivers produce. To explore this possibility, we videotaped 40 American child-caregiver dyads in their homes for 90 minutes when the children were 1;2, 1;6, and 1;10. Each gesture was classified according to type (deictic, conventional, representational) and the relation it held to speech (reinforcing, disambiguating, supplementary). Children and their caregivers produced the same types of gestures and in approximately the same distribution. However, the children differed from their caregivers in the way they used gesture in relation to speech. Over time, children produced many more REINFORCING (bike+point at bike), DISAMBIGUATING (that one+ point at bike), and SUPPLEMENTARY combinations (ride+point at bike). In contrast, the frequency and distribution of caregivers' gesture+speech combinations remained constant over time. Thus, the changing relation between gesture and speech observed in the children cannot be traced back to the gestural input the children receive. Rather, it appears to reflect changes in the children's own skills, illustrating once again gesture's ability to shed light on developing cognitive and linguistic processes.
Treating speech subsystems in childhood apraxia of speech with tactual input: the PROMPT approach.
Dale, Philip S; Hayden, Deborah A
2013-11-01
Prompts for Restructuring Oral Muscular Phonetic Targets (PROMPT; Hayden, 2004; Hayden, Eigen, Walker, & Olsen, 2010), a treatment approach for the improvement of speech sound disorders in children, uses tactile-kinesthetic-proprioceptive (TKP) cues to support and shape movements of the oral articulators. No research to date has systematically examined the efficacy of PROMPT for children with childhood apraxia of speech (CAS). Four children (ages 3;6 [years;months] to 4;8), all meeting the American Speech-Language-Hearing Association (2007) criteria for CAS, were treated using PROMPT. All children received 8 weeks of 2 × per week treatment, including at least 4 weeks of full PROMPT treatment that included TKP cues. During the first 4 weeks, 2 of the 4 children received treatment that included all PROMPT components except TKP cues. This design permitted both between-subjects and within-subjects comparisons to evaluate the effect of TKP cues. Gains in treatment were measured by standardized tests and by criterion-referenced measures based on the production of untreated probe words, reflecting change in speech movements and auditory perceptual accuracy. All 4 children made significant gains during treatment, but measures of motor speech control and untreated word probes provided evidence for more gain when TKP cues were included. PROMPT as a whole appears to be effective for treating children with CAS, and the inclusion of TKP cues appears to facilitate greater effect.
Development of a Math Input Interface with Flick Operation for Mobile Devices
ERIC Educational Resources Information Center
Nakamura, Yasuyuki; Nakahara, Takahiro
2016-01-01
Developing online test environments for e-learning for mobile devices will be useful to increase drill practice opportunities. In order to provide a drill practice environment for calculus using an online math test system, such as STACK, we develop a flickable math input interface that can be easily used on mobile devices. The number of taps…
Rapid recalibration of speech perception after experiencing the McGurk illusion
Pérez-Bellido, Alexis; de Lange, Floris P.
2018-01-01
The human brain can quickly adapt to changes in the environment. One example is phonetic recalibration: a speech sound is interpreted differently depending on the visual speech and this interpretation persists in the absence of visual information. Here, we examined the mechanisms of phonetic recalibration. Participants categorized the auditory syllables /aba/ and /ada/, which were sometimes preceded by the so-called McGurk stimuli (in which an /aba/ sound, due to visual /aga/ input, is often perceived as ‘ada’). We found that only one trial of exposure to the McGurk illusion was sufficient to induce a recalibration effect, i.e. an auditory /aba/ stimulus was subsequently more often perceived as ‘ada’. Furthermore, phonetic recalibration took place only when auditory and visual inputs were integrated to ‘ada’ (McGurk illusion). Moreover, this recalibration depended on the sensory similarity between the preceding and current auditory stimulus. Finally, signal detection theoretical analysis showed that McGurk-induced phonetic recalibration resulted in both a criterion shift towards /ada/ and a reduced sensitivity to distinguish between /aba/ and /ada/ sounds. The current study shows that phonetic recalibration is dependent on the perceptual integration of audiovisual information and leads to a perceptual shift in phoneme categorization. PMID:29657743
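The signal detection analysis mentioned above (a criterion shift toward /ada/ plus reduced sensitivity) boils down to computing d' and criterion c from the 'ada' response rates to /ada/ and /aba/ sounds before and after McGurk exposure. A generic sketch with invented response counts:

```python
from scipy.stats import norm

def dprime_criterion(hits, misses, false_alarms, correct_rejections):
    """d' and criterion c, treating 'ada' responses to /ada/ as hits and to /aba/ as false alarms."""
    # Log-linear correction avoids infinite z-scores at rates of exactly 0 or 1.
    hr = (hits + 0.5) / (hits + misses + 1)
    far = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z_hr, z_far = norm.ppf(hr), norm.ppf(far)
    d_prime = z_hr - z_far
    criterion = -0.5 * (z_hr + z_far)   # more negative c = stronger bias toward 'ada'
    return d_prime, criterion

# Hypothetical counts (per 100 /ada/ and 100 /aba/ trials) before and after exposure:
# after McGurk exposure, 'ada' responses to /aba/ increase and sensitivity drops.
print("baseline:   ", dprime_criterion(hits=85, misses=15, false_alarms=10, correct_rejections=90))
print("post-McGurk:", dprime_criterion(hits=88, misses=12, false_alarms=30, correct_rejections=70))
```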
Multibody dynamics model building using graphical interfaces
NASA Technical Reports Server (NTRS)
Macala, Glenn A.
1989-01-01
In recent years, the extremely laborious task of manually deriving equations of motion for the simulation of multibody spacecraft dynamics has largely been eliminated. Instead, the dynamicist now works with commonly available general purpose dynamics simulation programs which generate the equations of motion either explicitly or implicitly via computer codes. The user interface to these programs has predominantly been via input data files, each with its own required format and peculiarities, causing errors and frustrations during program setup. Recent progress in a more natural method of data input for dynamics programs, the graphical interface, is described.
Online EEG Classification of Covert Speech for Brain-Computer Interfacing.
Sereshkeh, Alborz Rezazadeh; Trott, Robert; Bricout, Aurélien; Chau, Tom
2017-12-01
Brain-computer interfaces (BCIs) for communication can be nonintuitive, often requiring the performance of hand motor imagery or some other conversation-irrelevant task. In this paper, electroencephalography (EEG) was used to develop two intuitive online BCIs based solely on covert speech. The goal of the first BCI was to differentiate between 10 s of mental repetitions of the word "no" and an equivalent duration of unconstrained rest. The second BCI was designed to discern between 10 s each of covert repetition of the words "yes" and "no". Twelve participants used these two BCIs to answer yes or no questions. Each participant completed four sessions, comprising two offline training sessions and two online sessions, one for testing each of the BCIs. With a support vector machine and a combination of spectral and time-frequency features, an average accuracy of [Formula: see text] was reached across participants in the online classification of no versus rest, with 10 out of 12 participants surpassing the chance level (60.0% for [Formula: see text]). The online classification of yes versus no yielded an average accuracy of [Formula: see text], with eight participants exceeding the chance level. Task-specific changes in EEG beta and gamma power in language-related brain areas tended to provide discriminatory information. To our knowledge, this is the first report of online EEG classification of covert speech. Our findings support further study of covert speech as a BCI activation task, potentially leading to the development of more intuitive BCIs for communication.
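The classification stage described (spectral features of EEG epochs fed to a support vector machine) can be sketched generically as below; the band definitions, channel count, and feature set are assumptions rather than the authors' implementation, and the EEG here is simulated.

```python
import numpy as np
from scipy.signal import welch
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def band_power_features(epoch, fs, bands=((13, 30), (30, 45))):
    """Mean beta and low-gamma band power per channel for one epoch (channels x samples)."""
    freqs, psd = welch(epoch, fs=fs, nperseg=fs)
    feats = []
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(psd[:, mask].mean(axis=1))
    return np.concatenate(feats)

# Simulated 10 s epochs: 'covert speech' epochs get extra 20 Hz (beta) power on a few channels.
fs, n_ch, n_ep = 256, 16, 40
rng = np.random.default_rng(0)
t = np.arange(10 * fs) / fs

def make_epoch(covert):
    x = rng.normal(0, 1, (n_ch, len(t)))
    if covert:
        x[:4] += 0.6 * np.sin(2 * np.pi * 20 * t)
    return x

epochs = [make_epoch(covert=(i % 2 == 0)) for i in range(n_ep)]
y = np.array([i % 2 for i in range(n_ep)])
X = np.array([band_power_features(e, fs) for e in epochs])
print("cross-validated accuracy:", cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=5).mean())
```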
Speech versus manual control of camera functions during a telerobotic task
NASA Technical Reports Server (NTRS)
Bierschwale, John M.; Sampaio, Carlos E.; Stuart, Mark A.; Smith, Randy L.
1993-01-01
This investigation has evaluated the voice-commanded camera control concept. For this particular task, total voice control of continuous and discrete camera functions was significantly slower than manual control. There was no significant difference between voice and manual input for several types of errors. There was not a clear trend in subjective preference of camera command input modality. Task performance, in terms of both accuracy and speed, was very similar across both levels of experience.
Schafer, Erin C; Romine, Denise; Musgrave, Elizabeth; Momin, Sadaf; Huynh, Christy
2013-01-01
Previous research has suggested that electrically coupled frequency modulation (FM) systems substantially improved speech-recognition performance in noise in individuals with cochlear implants (CIs). However, there is limited evidence to support the use of electromagnetically coupled (neck loop) FM receivers with contemporary CI sound processors containing telecoils. The primary goal of this study was to compare speech-recognition performance in noise and subjective ratings of adolescents and adults using one of three contemporary CI sound processors coupled to electromagnetically and electrically coupled FM receivers from Oticon. A repeated-measures design was used to compare speech-recognition performance in noise and subjective ratings without and with the FM systems across three test sessions (Experiment 1) and to compare performance at different FM-gain settings (Experiment 2). Descriptive statistics were used in Experiment 3 to describe output differences measured through a CI sound processor. Experiment 1 included nine adolescents or adults with unilateral or bilateral Advanced Bionics Harmony (n = 3), Cochlear Nucleus 5 (n = 3), and MED-EL OPUS 2 (n = 3) CI sound processors. In Experiment 2, seven of the original nine participants were tested. In Experiment 3, electroacoustic output was measured from a Nucleus 5 sound processor when coupled to the electromagnetically coupled Oticon Arc neck loop and electrically coupled Oticon R2. In Experiment 1, participants completed a field trial with each FM receiver and three test sessions that included speech-recognition performance in noise and a subjective rating scale. In Experiment 2, participants were tested in three receiver-gain conditions. Results in both experiments were analyzed using repeated-measures analysis of variance. Experiment 3 involved electroacoustic-test measures to determine the monitor-earphone output of the CI alone and CI coupled to the two FM receivers. The results in Experiment 1 suggested that both FM receivers provided significantly better speech-recognition performance in noise than the CI alone; however, the electromagnetically coupled receiver provided significantly better speech-recognition performance in noise and better ratings in some situations than the electrically coupled receiver when set to the same gain. In Experiment 2, the primary analysis suggested significantly better speech-recognition performance in noise for the neck-loop versus electrically coupled receiver, but a second analysis, using the best performance across gain settings for each device, revealed no significant differences between the two FM receivers. Experiment 3 revealed monitor-earphone output differences in the Nucleus 5 sound processor for the two FM receivers when set to the +8 setting used in Experiment 1 but equal output when the electrically coupled device was set to a +16 gain setting and the electromagnetically coupled device was set to the +8 gain setting. Individuals with contemporary sound processors may show more favorable speech-recognition performance in noise with electromagnetically coupled FM systems (i.e., Oticon Arc), which is most likely related to the input processing and signal processing pathway within the CI sound processor for direct input versus telecoil input. Further research is warranted to replicate these findings with a larger sample size and to develop and validate a more objective approach to fitting FM systems to CI sound processors. American Academy of Audiology.
Language learning, socioeconomic status, and child-directed speech.
Schwab, Jessica F; Lew-Williams, Casey
2016-07-01
Young children's language experiences and language outcomes are highly variable. Research in recent decades has focused on understanding the extent to which family socioeconomic status (SES) relates to parents' language input to their children and, subsequently, children's language learning. Here, we first review research demonstrating differences in the quantity and quality of language that children hear across low-, mid-, and high-SES groups, but also, and perhaps more importantly, research showing that differences in input and learning also exist within SES groups. Second, in order to better understand the defining features of 'high-quality' input, we highlight findings from laboratory studies examining specific characteristics of the sounds, words, sentences, and social contexts of child-directed speech (CDS) that influence children's learning. Finally, after narrowing in on these particular features of CDS, we broaden our discussion by considering family and community factors that may constrain parents' ability to participate in high-quality interactions with their young children. A unification of research on SES and CDS will facilitate a more complete understanding of the specific means by which input shapes learning, as well as generate ideas for crafting policies and programs designed to promote children's language outcomes. WIREs Cogn Sci 2016, 7:264-275. doi: 10.1002/wcs.1393 For further resources related to this article, please visit the WIREs website. © 2016 Wiley Periodicals, Inc.
Yoo, Sejin; Chung, Jun-Young; Jeon, Hyeon-Ae; Lee, Kyoung-Min; Kim, Young-Bo; Cho, Zang-Hee
2012-07-01
Speech production is inextricably linked to speech perception, yet they are usually investigated in isolation. In this study, we employed a verbal-repetition task to identify the neural substrates of speech processing with both ends (perception and production) active simultaneously, using functional MRI. Subjects verbally repeated auditory stimuli containing an ambiguous vowel sound that could be perceived as either a word or a pseudoword depending on the interpretation of the vowel. We found that verbal repetition commonly activated the audition-articulation interface bilaterally at the Sylvian fissures and superior temporal sulci. Contrasting word-versus-pseudoword trials revealed neural activities unique to word repetition in the left posterior middle temporal areas and activities unique to pseudoword repetition in the left inferior frontal gyrus. These findings imply that the tasks are carried out using different speech codes: an articulation-based code for pseudowords and an acoustic-phonetic code for words. It also supports the dual-stream model and imitative learning of vocabulary. Copyright © 2012 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Viswanathan, V. R.; Makhoul, J.; Schwartz, R. M.; Huggins, A. W. F.
1982-04-01
The variable frame rate (VFR) transmission methodology developed, implemented, and tested in the years 1973-1978 for efficiently transmitting linear predictive coding (LPC) vocoder parameters extracted from the input speech at a fixed frame rate is reviewed. With the VFR method, parameters are transmitted only when their values have changed sufficiently over the interval since their preceding transmission. Two distinct approaches to automatic implementation of the VFR method are discussed. The first bases the transmission decisions on comparisons between the parameter values of the present frame and the last transmitted frame. The second, which is based on a functional perceptual model of speech, compares the parameter values of all the frames that lie in the interval between the present frame and the last transmitted frame against a linear model of parameter variation over that interval. Also considered is the application of VFR transmission to the design of narrow-band LPC speech coders with average bit rates of 2000-2400 bits/s.
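A minimal sketch of the first VFR approach (transmit a frame only when its parameters differ sufficiently from the last transmitted frame) is shown below; the distance measure, threshold, and synthetic parameter track are assumptions chosen for illustration.

```python
import numpy as np

def vfr_select(frames, threshold):
    """frames: (n_frames, n_params) vocoder parameters at a fixed analysis rate.
    Transmit a frame only when it differs sufficiently from the last transmitted frame."""
    sent, last = [0], frames[0]            # the first frame is always transmitted
    for i in range(1, len(frames)):
        if np.max(np.abs(frames[i] - last)) > threshold:   # illustrative distance measure
            sent.append(i)
            last = frames[i]
    return sent

# Hypothetical parameter track: slow drift with one abrupt change halfway through.
rng = np.random.default_rng(1)
frames = np.concatenate([np.full((50, 10), 0.1), np.full((50, 10), 0.5)])
frames += 0.01 * rng.standard_normal(frames.shape)
print(len(vfr_select(frames, threshold=0.05)), "of", len(frames), "frames transmitted")
```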
Radicevic, Zoran; Jelicic Dobrijevic, Ljiljana; Sovilj, Mirjana; Barlov, Ivana
2009-06-01
Aim of the research was to examine similarities and differences between the periods of experiencing visually stimulated directed speech-language information and periods of undirected attention. The examined group comprised N = 64 children, aged 4-5, with different speech-language disorders (developmental dysphasia, hyperactive syndrome with attention disorder, children with borderline intellectual abilities, autistic complex). Theta EEG was registered in children in the period of watching and describing the picture ("task"), and in the period of undirected attention ("passive period"). The children were recorded in standard EEG conditions, at 19 points of EEG registration and in longitudinal bipolar montage. Results in the observed age-operative theta rhythm indicated significant similarities and differences in the prevalence of spatial engagement of certain regions between the two hemispheres at the input and output of processing, which opens the possibility for more detailed analysis of conscious control of speech-language processing and its disorders.
Simeon, Katherine M.; Bicknell, Klinton; Grieco-Calub, Tina M.
2018-01-01
Individuals use semantic expectancy – applying conceptual and linguistic knowledge to speech input – to improve the accuracy and speed of language comprehension. This study tested how adults use semantic expectancy in quiet and in the presence of speech-shaped broadband noise at -7 and -12 dB signal-to-noise ratio. Twenty-four adults (22.1 ± 3.6 years, mean ±SD) were tested on a four-alternative-forced-choice task whereby they listened to sentences and were instructed to select an image matching the sentence-final word. The semantic expectancy of the sentences was unrelated to (neutral), congruent with, or conflicting with the acoustic target. Congruent expectancy improved accuracy and conflicting expectancy decreased accuracy relative to neutral, consistent with a theory where expectancy shifts beliefs toward likely words and away from unlikely words. Additionally, there were no significant interactions of expectancy and noise level when analyzed in log-odds, supporting the predictions of ideal observer models of speech perception. PMID:29472883
Simulation for noise cancellation using LMS adaptive filter
NASA Astrophysics Data System (ADS)
Lee, Jia-Haw; Ooi, Lu-Ean; Ko, Ying-Hao; Teoh, Choe-Yung
2017-06-01
In this paper, the fundamental noise-cancellation algorithm, the Least Mean Square (LMS) algorithm, is studied and implemented as an adaptive filter. A simulation of noise cancellation using the LMS adaptive filter algorithm is developed. The noise-corrupted speech signal and the engine noise signal are used as inputs to the LMS adaptive filter algorithm. The filtered signal is compared to the original noise-free speech signal in order to highlight the level of attenuation of the noise signal. The result shows that the noise signal is successfully canceled by the developed adaptive filter. The difference between the noise-free speech signal and the filtered signal is calculated, and the outcome implies that the filtered signal approaches the noise-free speech signal as adaptive filtering proceeds. The frequency range of the noise successfully canceled by the LMS adaptive filter algorithm is determined by performing a Fast Fourier Transform (FFT) on the signals. The LMS adaptive filter algorithm shows significant noise cancellation in the lower frequency range.
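The sketch below illustrates the general LMS adaptive noise-cancellation structure described (a reference noise input filtered to estimate the noise in the corrupted signal, with the error signal serving as the cleaned speech); the filter length, step size, and synthetic signals are assumptions, not the simulation reported in the paper.

```python
import numpy as np

def lms_cancel(corrupted, noise_ref, n_taps=32, mu=0.01):
    """Adaptive noise cancellation: estimate the noise in `corrupted` from the
    reference `noise_ref`, subtract it, and use the error as the cleaned speech."""
    w = np.zeros(n_taps)
    cleaned = np.zeros_like(corrupted)
    for n in range(n_taps, len(corrupted)):
        x = noise_ref[n - n_taps + 1:n + 1][::-1]   # most recent reference samples
        y = w @ x                                   # estimated noise at time n
        e = corrupted[n] - y                        # error = speech estimate
        w += 2 * mu * e * x                         # LMS weight update
        cleaned[n] = e
    return cleaned

# Hypothetical signals: a tone standing in for speech plus a filtered "engine" noise.
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(2 * fs) / fs
speech = 0.5 * np.sin(2 * np.pi * 440 * t)
noise_ref = rng.standard_normal(len(t))
noise_at_mic = np.convolve(noise_ref, [0.6, 0.3, 0.1])[:len(noise_ref)]  # causal noise path
cleaned = lms_cancel(speech + noise_at_mic, noise_ref)
print("noise power before:", np.mean(noise_at_mic[fs:] ** 2).round(3),
      "after:", np.mean((cleaned[fs:] - speech[fs:]) ** 2).round(3))
```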
Accurate visible speech synthesis based on concatenating variable length motion capture data.
Ma, Jiyong; Cole, Ron; Pellom, Bryan; Ward, Wayne; Wise, Barbara
2006-01-01
We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long-distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is described. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective evaluation and objective evaluation are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
Auditory Selective Attention to Speech Modulates Activity in the Visual Word Form Area
Yoncheva, Yuliya N.; Zevin, Jason D.; Maurer, Urs
2010-01-01
Selective attention to speech versus nonspeech signals in complex auditory input could produce top-down modulation of cortical regions previously linked to perception of spoken, and even visual, words. To isolate such top-down attentional effects, we contrasted 2 equally challenging active listening tasks, performed on the same complex auditory stimuli (words overlaid with a series of 3 tones). Instructions required selectively attending to either the speech signals (in service of rhyme judgment) or the melodic signals (tone-triplet matching). Selective attention to speech, relative to attention to melody, was associated with blood oxygenation level–dependent (BOLD) increases during functional magnetic resonance imaging (fMRI) in left inferior frontal gyrus, temporal regions, and the visual word form area (VWFA). Further investigation of the activity in visual regions revealed overall deactivation relative to baseline rest for both attention conditions. Topographic analysis demonstrated that while attending to melody drove deactivation equivalently across all fusiform regions of interest examined, attending to speech produced a regionally specific modulation: deactivation of all fusiform regions, except the VWFA. Results indicate that selective attention to speech can topographically tune extrastriate cortex, leading to increased activity in VWFA relative to surrounding regions, in line with the well-established connectivity between areas related to spoken and visual word perception in skilled readers. PMID:19571269
Altieri, Nicholas; Pisoni, David B.; Townsend, James T.
2012-01-01
Summerfield (1987) proposed several accounts of audiovisual speech perception, a field of research that has burgeoned in recent years. The proposed accounts included the integration of discrete phonetic features, vectors describing the values of independent acoustical and optical parameters, the filter function of the vocal tract, and articulatory dynamics of the vocal tract. The latter two accounts assume that the representations of audiovisual speech perception are based on abstract gestures, while the former two assume that the representations consist of symbolic or featural information obtained from visual and auditory modalities. Recent converging evidence from several different disciplines reveals that the general framework of Summerfield’s feature-based theories should be expanded. An updated framework building upon the feature-based theories is presented. We propose a processing model arguing that auditory and visual brain circuits provide facilitatory information when the inputs are correctly timed, and that auditory and visual speech representations do not necessarily undergo translation into a common code during information processing. Future research on multisensory processing in speech perception should investigate the connections between auditory and visual brain regions, and utilize dynamic modeling tools to further understand the timing and information processing mechanisms involved in audiovisual speech integration. PMID:21968081
Distinct developmental profiles in typical speech acquisition
Campbell, Thomas F.; Shriberg, Lawrence D.; Green, Jordan R.; Abdi, Hervé; Rusiewicz, Heather Leavy; Venkatesh, Lakshmi; Moore, Christopher A.
2012-01-01
Three- to five-year-old children produce speech that is characterized by a high level of variability within and across individuals. This variability, which is manifest in speech movements, acoustics, and overt behaviors, can be input to subgroup discovery methods to identify cohesive subgroups of speakers or to reveal distinct developmental pathways or profiles. This investigation characterized three distinct groups of typically developing children and provided normative benchmarks for speech development. These speech development profiles, identified among 63 typically developing preschool-aged speakers (ages 36–59 mo), were derived from the children's performance on multiple measures. The profiles were obtained by submitting 72 measures to a k-means cluster analysis; the measures spanned three levels of speech analysis: behavioral (e.g., task accuracy, percentage of consonants correct), acoustic (e.g., syllable duration, syllable stress), and kinematic (e.g., variability of movements of the upper lip, lower lip, and jaw). Two of the discovered group profiles were distinguished by measures of variability but not by phonemic accuracy; the third group of children was characterized by their relatively low phonemic accuracy but not by an increase in measures of variability. Analyses revealed that of the original 72 measures, 8 key measures were sufficient to best distinguish the 3 profile groups. PMID:22357794
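A minimal sketch of the clustering step (standardize the measures, then run k-means with k = 3) is shown below; the feature matrix is synthetic and merely stands in for the 72 behavioral, acoustic, and kinematic measures.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical data: 63 children x 72 behavioral/acoustic/kinematic measures.
rng = np.random.default_rng(42)
measures = rng.standard_normal((63, 72))

X = StandardScaler().fit_transform(measures)            # put all measures on a common scale
profiles = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(profiles))                             # number of children per profile group
```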
Heimbauer, Lisa A; Beran, Michael J; Owren, Michael J
2011-07-26
A long-standing debate concerns whether humans are specialized for speech perception, which some researchers argue is demonstrated by the ability to understand synthetic speech with significantly reduced acoustic cues to phonetic content. We tested a chimpanzee (Pan troglodytes) that recognizes 128 spoken words, asking whether she could understand such speech. Three experiments presented 48 individual words, with the animal selecting a corresponding visuographic symbol from among four alternatives. Experiment 1 tested spectrally reduced, noise-vocoded (NV) synthesis, originally developed to simulate input received by human cochlear-implant users. Experiment 2 tested "impossibly unspeechlike" sine-wave (SW) synthesis, which reduces speech to just three moving tones. Although receiving only intermittent and noncontingent reward, the chimpanzee performed well above chance level, including when hearing synthetic versions for the first time. Recognition of SW words was least accurate but improved in experiment 3 when natural words in the same session were rewarded. The chimpanzee was more accurate with NV than SW versions, as were 32 human participants hearing these items. The chimpanzee's ability to spontaneously recognize acoustically reduced synthetic words suggests that experience rather than specialization is critical for speech-perception capabilities that some have suggested are uniquely human. Copyright © 2011 Elsevier Ltd. All rights reserved.
An acoustic feature-based similarity scoring system for speech rehabilitation assistance.
Syauqy, Dahnial; Wu, Chao-Min; Setyawati, Onny
2016-08-01
The purpose of this study is to develop a tool to assist speech therapy and rehabilitation, focused on automatic scoring based on the comparison of the patient's speech with normal speech on several aspects including pitch, vowels, voiced-unvoiced segments, strident fricatives and sound intensity. The pitch estimation employed a cepstrum-based algorithm for its robustness; the vowel classification used a multilayer perceptron (MLP) to classify vowels from pitch and formants; and the strident fricative detection was based on the major peak spectral intensity, its location and the presence of pitch in the segment. In order to evaluate the performance of the system, this study analyzed eight patients' speech recordings (four males, four females; 4-58 years old), which had been recorded in a previous study in cooperation with Taipei Veterans General Hospital and Taoyuan General Hospital. The experimental results for the pitch algorithm showed that the cepstrum method had a 5.3% gross pitch error over a total of 2086 frames. For the vowel classification algorithm, the MLP method provided 93% accuracy (men), 87% (women) and 84% (children). In total, the overall results showed that 156 of the tool's grading results (81%) were consistent with the 192 audio and visual observations made by four experienced respondents. Implications for Rehabilitation: Difficulties in communication may limit the ability of a person to transfer and exchange information. The fact that speech is one of the primary means of communication has encouraged the need for speech diagnosis and rehabilitation. Advances in computer-assisted speech therapy (CAST) improve the quality and time efficiency of the diagnosis and treatment of these disorders. The present study attempted to develop a tool to assist speech therapy and rehabilitation that provides a simple interface, letting the assessment be done even by the patient himself without particular knowledge of speech processing, while also providing deeper analysis of the speech, which can be useful for the speech therapist.
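A minimal sketch of cepstrum-based pitch estimation, the approach named above, is given below; the window length, quefrency search range, and test signal are illustrative assumptions rather than the parameters used in the study.

```python
import numpy as np

def cepstrum_pitch(frame, fs, f_min=60.0, f_max=400.0):
    """Estimate the pitch (Hz) of a voiced frame from the peak of its real cepstrum."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
    q_lo, q_hi = int(fs / f_max), int(fs / f_min)        # quefrency search range, in samples
    peak = q_lo + np.argmax(cepstrum[q_lo:q_hi])
    return fs / peak

# Hypothetical voiced frame: 40 ms of a harmonic signal with a 150 Hz fundamental at 16 kHz.
fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = sum(np.sin(2 * np.pi * 150 * k * t) for k in range(1, 9))
print(round(cepstrum_pitch(frame, fs), 1))               # approximately 150 Hz
```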
Electro-optical processing of phased array data
NASA Technical Reports Server (NTRS)
Casasent, D.
1973-01-01
An on-line spatial light modulator for application as the input transducer for a real-time optical data processing system is described. The use of such a device in the analysis and processing of radar data in real time is reported. An interface from the optical processor to a control digital computer was designed, constructed, and tested. The input transducer, optical system, and computer interface have been operated in real time with real-time radar data, with the input data returns recorded on the input crystal, processed by the optical system, and the output plane pattern digitized, thresholded, and output to a display and storage in the computer memory. The correlation of theoretical and experimental results is discussed.
NASA Technical Reports Server (NTRS)
Lyons, J. T.; Borchers, William R.
1993-01-01
Documentation for the User Interface Program for the Minimum Hamiltonian Ascent Trajectory Evaluation (MASTRE) is provided. The User Interface Program is a separate software package designed to ease the user input requirements when using the MASTRE Trajectory Program. This document supplements documentation on the MASTRE Program that consists of the MASTRE Engineering Manual and the MASTRE Programmers Guide. The User Interface Program provides a series of menus and tables using the VAX Screen Management Guideline (SMG) software. These menus and tables allow the user to modify the MASTRE Program input without the need for learning the various program dependent mnemonics. In addition, the User Interface Program allows the user to modify and/or review additional input Namelist and data files, to build and review command files, to formulate and calculate mass properties related data, and to have a plotting capability.
Input Range Testing for the General Mission Analysis Tool (GMAT)
NASA Technical Reports Server (NTRS)
Hughes, Steven P.
2007-01-01
This document contains a test plan for testing input values to the General Mission Analysis Tool (GMAT). The plan includes four primary types of information, which rigorously define all tests that should be performed to validate that GMAT will accept allowable inputs and deny disallowed inputs. The first is a complete list of all allowed object fields in GMAT. The second type of information is the test input to be attempted for each field. The third type of information is the allowable input values for all object fields in GMAT. The final piece of information is how GMAT should respond to both valid and invalid information. It is VERY important to note that the tests below must be performed for both the Graphical User Interface and the script!! The examples are illustrated from a scripting perspective, because it is simpler to write up. However, the tests must be performed for both interfaces to GMAT.
A Smartphone Application for Customized Frequency Table Selection in Cochlear Implants.
Jethanamest, Daniel; Azadpour, Mahan; Zeman, Annette M; Sagi, Elad; Svirsky, Mario A
2017-09-01
A novel smartphone-based software application can facilitate self-selection of frequency allocation tables (FAT) in postlingually deaf cochlear implant (CI) users. CIs use FATs to represent the tonotopic organization of a normal cochlea. Current CI fitting methods typically use a standard FAT for all patients regardless of individual differences in cochlear size and electrode location. In postlingually deaf patients, different amounts of mismatch can result between the frequency-place function they experienced when they had normal hearing and the frequency-place function that results from the standard FAT. For some CI users, an alternative FAT may enhance sound quality or speech perception. Currently, no widely available tools exist to aid real-time selection of different FATs. This study aims to develop a new smartphone tool for this purpose and to evaluate speech perception and sound quality measures in a pilot study of CI subjects using this application. A smartphone application for a widely available mobile platform (iOS) was developed to serve as a preprocessor of auditory input to a clinical CI speech processor and enable interactive real-time selection of FATs. The application's output was validated by measuring electrodograms for various inputs. A pilot study was conducted in six CI subjects. Speech perception was evaluated using word recognition tests. All subjects successfully used the portable application with their clinical speech processors to experience different FATs while listening to running speech. The users were all able to select one table that they judged provided the best sound quality. All subjects chose a FAT different from the standard FAT in their everyday clinical processor. Using the smartphone application, the mean consonant-nucleus-consonant score with the default FAT selection was 28.5% (SD 16.8) and 29.5% (SD 16.4) when using a self-selected FAT. A portable smartphone application enables CI users to self-select frequency allocation tables in real time. Even though the self-selected FATs that were deemed to have better sound quality were only tested acutely (i.e., without long-term experience with them), speech perception scores were not inferior to those obtained with the clinical FATs. This software application may be a valuable tool for improving future methods of CI fitting.
Reading Machines for Blind People.
ERIC Educational Resources Information Center
Fender, Derek H.
1983-01-01
Ten stages of developing reading machines for blind people are analyzed: handling of text material; optics; electro-optics; pattern recognition; character recognition; storage; speech synthesizers; browsing and place finding; computer indexing; and other sources of input. Cost considerations of the final product are emphasized. (CL)
ERIC Educational Resources Information Center
Alden, John D.
Contained in this booklet are the speeches given at the annual joint meeting of the Engineering Manpower Commission and the Scientific Manpower Commission. Each dealt with some problem aspect of the engineer-scientist interface. The presentation by Rear Admiral W. C. Hushing of the U. S. Navy was entitled "The Impact of High Performance Science…
Li, Kan; Príncipe, José C.
2018-01-01
This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime. PMID:29666568
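A minimal sketch of a leaky integrate-and-fire (LIF) unit turning one audio-derived channel into a spike train is shown below; the time constant, threshold, and single-channel drive are assumptions for illustration, whereas the cited framework uses a multi-channel, biologically inspired front-end.

```python
import numpy as np

def lif_spikes(drive, fs, tau=0.01, threshold=1.0):
    """Leaky integrate-and-fire: accumulate a non-negative drive signal with an
    exponential leak and emit a spike (then reset) when the threshold is crossed."""
    alpha = np.exp(-1.0 / (tau * fs))      # per-sample leak factor
    v, spike_times = 0.0, []
    for n, x in enumerate(drive):
        v = alpha * v + x
        if v >= threshold:
            spike_times.append(n / fs)     # spike time in seconds
            v = 0.0                        # reset after firing
    return np.array(spike_times)

# Hypothetical drive: rectified output of one band-pass "cochlear" channel that
# becomes active halfway through a one-second audio clip.
fs = 16000
t = np.arange(fs) / fs
channel = np.sin(2 * np.pi * 500 * t) * (t > 0.5)
spikes = lif_spikes(np.abs(channel) / 50.0, fs)
print(f"{len(spikes)} spikes; first at {spikes[0]:.3f} s")
```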
Maffei, Chiara; Capasso, Rita; Cazzolli, Giulia; Colosimo, Cesare; Dell'Acqua, Flavio; Piludu, Francesca; Catani, Marco; Miceli, Gabriele
2017-12-01
Pure Word Deafness (PWD) is a rare disorder, characterized by selective loss of speech input processing. Its most common cause is temporal damage to the primary auditory cortex of both hemispheres, but it has been reported also following unilateral lesions. In unilateral cases, PWD has been attributed to the disconnection of Wernicke's area from both right and left primary auditory cortex. Here we report behavioral and neuroimaging evidence from a new case of left unilateral PWD with both cortical and white matter damage due to a relatively small stroke lesion in the left temporal gyrus. Selective impairment in auditory language processing was accompanied by intact processing of nonspeech sounds and normal speech, reading and writing. Performance on dichotic listening was characterized by a reversal of the right-ear advantage typically observed in healthy subjects. Cortical thickness and gyral volume were severely reduced in the left superior temporal gyrus (STG), although abnormalities were not uniformly distributed and residual intact cortical areas were detected, for example in the medial portion of Heschl's gyrus. Diffusion tractography documented partial damage to the acoustic radiations (AR), callosal temporal connections and intralobar tracts dedicated to single-word comprehension. Behavioral and neuroimaging results in this case are difficult to integrate in a purely cortical or disconnection framework, as damage to primary auditory cortex in the left STG was only partial and Wernicke's area was not completely isolated from left- or right-hemisphere input. On the basis of our findings we suggest that in this case of PWD, concurrent partial topological (cortical) and disconnection mechanisms have contributed to a selective impairment of speech sounds. The discrepancy between speech and non-speech sounds suggests selective damage to a language-specific, left-lateralized network involved in phoneme processing. Copyright © 2017 Elsevier Ltd. All rights reserved.
JAva GUi for Applied Research (JAGUAR) v 3.0
DOE Office of Scientific and Technical Information (OSTI.GOV)
JAGUAR is a Java software tool for automatically rendering a graphical user interface (GUI) from a structured input specification. It is designed as a plug-in to the Eclipse workbench to enable users to create, edit, and externally execute analysis application input decks and then view the results. JAGUAR serves as a GUI for Sandia's DAKOTA software toolkit for optimization and uncertainty quantification. It will include problem (input deck) set-up, option specification, analysis execution, and results visualization. Through the use of wizards, templates, and views, JAGUAR helps users navigate the complexity of DAKOTA's complete input specification. JAGUAR is implemented in Java, leveraging Eclipse extension points and the Eclipse user interface. JAGUAR parses a DAKOTA NIDR input specification and presents the user with linked graphical and plain text representations of problem set-up and option specification for DAKOTA studies. After the data has been input by the user, JAGUAR generates one or more input files for DAKOTA, executes DAKOTA, and captures and interprets the results.
Noise-Robust Monitoring of Lombard Speech Using a Wireless Neck-surface Accelerometer and Microphone
2017-08-20
rechargeable, lithium-ion polymer battery that can be charged through a micro-USB input on the circuit. The micro-USB input also allows for communication to... protection, an on/off switch for the battery, status LEDs, and a logic switch that enables the Bluetooth module to be fully functional when... simultaneously powered via USB and battery. The system contains a small receiver that is equipped with the same Bluetooth module as the transmitter (BC127
Boyd, Paul J
2006-12-01
The principal task in the programming of a cochlear implant (CI) speech processor is the setting of the electrical dynamic range (output) for each electrode, to ensure that a comfortable loudness percept is obtained for a range of input levels. This typically involves separate psychophysical measurement of electrical threshold (θe) and upper tolerance levels using short current bursts generated by the fitting software. Anecdotal clinical experience and some experimental studies suggest that the measurement of θe is relatively unimportant and that the setting of upper tolerance limits is more critical for processor programming. The present study aims to test this hypothesis and examines in detail how acoustic thresholds and speech recognition are affected by setting of the lower limit of the output ("Programming threshold" or "PT") to understand better the influence of this parameter and how it interacts with certain other programming parameters. Test programs (maps) were generated with PT set to artificially high and low values and tested on users of the MED-EL COMBI 40+ CI system. Acoustic thresholds and speech recognition scores (sentence tests) were measured for each of the test maps. Acoustic thresholds were also measured using maps with a range of output compression functions ("maplaws"). In addition, subjective reports were recorded regarding the presence of "background threshold stimulation" which is occasionally reported by CI users if PT is set to relatively high values when using the CIS strategy. Manipulation of PT was found to have very little effect. Setting PT to minimum produced a mean 5 dB (S.D. = 6.25) increase in acoustic thresholds, relative to thresholds with PT set normally, and had no statistically significant effect on speech recognition scores on a sentence test. On the other hand, maplaw setting was found to have a significant effect on acoustic thresholds (raised as maplaw is made more linear), which provides some theoretical explanation as to why PT has little effect when using the default maplaw of c = 500. Subjective reports of background threshold stimulation showed that most users could perceive a relatively loud auditory percept, in the absence of microphone input, when PT was set to double the behaviorally measured electrical thresholds (θe), but that this produced little intrusion when microphone input was present. The results of these investigations have direct clinical relevance, showing that setting of PT is indeed relatively unimportant in terms of speech discrimination, but that it is worth ensuring that PT is not set excessively high, as this can produce distracting background stimulation. Indeed, it may even be set to minimum values without deleterious effect.
CBM First-level Event Selector Input Interface Demonstrator
NASA Astrophysics Data System (ADS)
Hutter, Dirk; de Cuveland, Jan; Lindenstruth, Volker
2017-10-01
CBM is a heavy-ion experiment at the future FAIR facility in Darmstadt, Germany. Featuring self-triggered front-end electronics and free-streaming read-out, event selection will exclusively be done by the First Level Event Selector (FLES). Designed as an HPC cluster with several hundred nodes, its task is the online analysis and selection of the physics data at a total input data rate exceeding 1 TByte/s. To allow efficient event selection, the FLES performs timeslice building, which combines the data from all given input links into self-contained, potentially overlapping processing intervals and distributes them to compute nodes. Partitioning the input data streams into specialized containers allows performing this task very efficiently. The FLES Input Interface defines the linkage between the FEE and the FLES data transport framework. A custom FPGA PCIe board, the FLES Interface Board (FLIB), is used to receive data via optical links and transfer them via DMA to the host's memory. The current prototype of the FLIB features a Kintex-7 FPGA and provides up to eight 10 GBit/s optical links. A custom FPGA design has been developed for this board. DMA transfers and data structures are optimized for subsequent timeslice building. Index tables generated by the FPGA enable fast random access to the written data containers. In addition, the DMA target buffers can directly serve as InfiniBand RDMA source buffers without copying the data. The usage of POSIX shared memory for these buffers allows data access from multiple processes. An accompanying HDL module has been developed to integrate the FLES link into the front-end FPGA designs. It implements the front-end logic interface as well as the link protocol. Prototypes of all Input Interface components have been implemented and integrated into the FLES test framework. This allows the implementation and evaluation of the foreseen CBM read-out chain.
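A minimal sketch of the timeslice-building idea (combining per-link containers into self-contained, overlapping processing intervals) is given below; the container layout, core length, and overlap are illustrative assumptions, not the FLES data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Container:
    """Data from one input link covering one fixed time interval (illustrative layout)."""
    link: int
    interval: int
    payload: bytes

def build_timeslices(per_link: List[List[Container]], core_len: int, overlap: int):
    """Combine containers from all links into self-contained, overlapping timeslices:
    each timeslice holds `core_len` intervals plus `overlap` extra intervals so that
    data near a boundary can still be processed within a single timeslice."""
    n_intervals = min(len(link_data) for link_data in per_link)
    timeslices, start = [], 0
    while start + core_len <= n_intervals:
        end = min(start + core_len + overlap, n_intervals)
        timeslices.append([link_data[start:end] for link_data in per_link])
        start += core_len
    return timeslices

# Hypothetical input: 4 links with 12 intervals each; 4-interval cores, 1-interval overlap.
links = [[Container(l, i, b"") for i in range(12)] for l in range(4)]
ts = build_timeslices(links, core_len=4, overlap=1)
print(len(ts), "timeslices; first covers intervals", [c.interval for c in ts[0][0]])
```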
Expressive facial animation synthesis by learning speech coarticulation and expression spaces.
Deng, Zhigang; Neumann, Ulrich; Lewis, J P; Kim, Tae-Yong; Bulut, Murtaza; Narayanan, Shrikanth
2006-01-01
Synthesizing expressive facial animation is a very challenging topic within the graphics community. In this paper, we present an expressive facial animation synthesis system enabled by automated learning from facial motion capture data. Accurate 3D motions of the markers on the face of a human subject are captured while he/she recites a predesigned corpus, with specific spoken and visual expressions. We present a novel motion capture mining technique that "learns" speech coarticulation models for diphones and triphones from the recorded data. A Phoneme-Independent Expression Eigenspace (PIEES) that encloses the dynamic expression signals is constructed by motion signal processing (phoneme-based time-warping and subtraction) and Principal Component Analysis (PCA) reduction. New expressive facial animations are synthesized as follows: First, the learned coarticulation models are concatenated to synthesize neutral visual speech according to novel speech input, then a texture-synthesis-based approach is used to generate a novel dynamic expression signal from the PIEES model, and finally the synthesized expression signal is blended with the synthesized neutral visual speech to create the final expressive facial animation. Our experiments demonstrate that the system can effectively synthesize realistic expressive facial animation.
When cognition kicks in: working memory and speech understanding in noise.
Rönnberg, Jerker; Rudner, Mary; Lunner, Thomas; Zekveld, Adriana A
2010-01-01
Perceptual load and cognitive load can be separately manipulated and dissociated in their effects on speech understanding in noise. The Ease of Language Understanding model assumes a theoretical position where perceptual task characteristics interact with the individual's implicit capacities to extract the phonological elements of speech. Phonological precision and speed of lexical access are important determinants for listening in adverse conditions. If there are mismatches between the phonological elements perceived and phonological representations in long-term memory, explicit working memory (WM)-related capacities will be continually invoked to reconstruct and infer the contents of the ongoing discourse. Whether this induces a high cognitive load or not will in turn depend on the individual's storage and processing capacities in WM. Data suggest that modulated noise maskers may serve as triggers for speech maskers and therefore induce a WM, explicit mode of processing. Individuals with high WM capacity benefit more than low WM-capacity individuals from fast amplitude compression at low or negative input speech-to-noise ratios. The general conclusion is that there is an overarching interaction between the focal purpose of processing in the primary listening task and the extent to which a secondary, distracting task taps into these processes.
Visual contribution to the multistable perception of speech.
Sato, Marc; Basirat, Anahita; Schwartz, Jean-Luc
2007-11-01
The multistable perception of speech, or verbal transformation effect, refers to perceptual changes experienced while listening to a speech form that is repeated rapidly and continuously. In order to test whether visual information from the speaker's articulatory gestures may modify the emergence and stability of verbal auditory percepts, subjects were instructed to report any perceptual changes during unimodal, audiovisual, and incongruent audiovisual presentations of distinct repeated syllables. In a first experiment, the perceptual stability of reported auditory percepts was significantly modulated by the modality of presentation. In a second experiment, when audiovisual stimuli consisting of a stable audio track dubbed with a video track that alternated between congruent and incongruent stimuli were presented, a strong correlation between the timing of perceptual transitions and the timing of video switches was found. Finally, a third experiment showed that the vocal tract opening onset event provided by the visual input could play the role of a bootstrap mechanism in the search for transformations. Altogether, these results demonstrate the capacity of visual information to control the multistable perception of speech in its phonetic content and temporal course. The verbal transformation effect thus provides a useful experimental paradigm to explore audiovisual interactions in speech perception.
Simonyan, Kristina; Fuertinger, Stefan
2015-04-01
Speech production is one of the most complex human behaviors. Although brain activation during speaking has been well investigated, our understanding of interactions between the brain regions and neural networks remains scarce. We combined seed-based interregional correlation analysis with graph theoretical analysis of functional MRI data during the resting state and sentence production in healthy subjects to investigate the interface and topology of functional networks originating from the key brain regions controlling speech, i.e., the laryngeal/orofacial motor cortex, inferior frontal and superior temporal gyri, supplementary motor area, cingulate cortex, putamen, and thalamus. During both resting and speaking, the interactions between these networks were bilaterally distributed and centered on the sensorimotor brain regions. However, speech production preferentially recruited the inferior parietal lobule (IPL) and cerebellum into the large-scale network, suggesting the importance of these regions in facilitation of the transition from the resting state to speaking. Furthermore, the cerebellum (lobule VI) was the most prominent region showing functional influences on speech-network integration and segregation. Although networks were bilaterally distributed, interregional connectivity during speaking was stronger in the left vs. right hemisphere, which may have underlined a more homogeneous overlap between the examined networks in the left hemisphere. Among these, the laryngeal motor cortex (LMC) established a core network that fully overlapped with all other speech-related networks, determining the extent of network interactions. Our data demonstrate complex interactions of large-scale brain networks controlling speech production and point to the critical role of the LMC, IPL, and cerebellum in the formation of speech production network. Copyright © 2015 the American Physiological Society.
DOT National Transportation Integrated Search
2013-05-01
Cognitively oriented in-vehicle activities (cell-phone calls, speech interfaces, audio translations of text messages, etc.) increasingly place non-visual demands on a driver's attention. While a driver's eyes may remain oriented towards the r...
NASA Astrophysics Data System (ADS)
Gîlcă, G.; Bîzdoacă, N. G.; Diaconu, I.
2016-08-01
This article aims to implement several practical applications using the Socibot Desktop social robot. We realize three applications: creating a speech sequence using the Kiosk menu of the browser interface, creating a program in the Virtual Robot browser interface, and creating a new guise to be loaded into the robot's memory and projected onto its face. The first application is created in the Compose submenu, which contains five file categories (audio, eyes, face, head, mood) that help in building the projected sequence. The second application is more complex, the completed program containing audio files, speeches (which can be created in over 20 languages), head movements, the robot's facial parameters as a function of the action units (AUs) of the facial muscles, its expressions, and its line of sight. The last application changes the robot's appearance to the guise we created. The guise was created in Adobe Photoshop and then loaded into the robot's memory.
Subcortical processing of speech regularities underlies reading and music aptitude in children.
Strait, Dana L; Hornickel, Jane; Kraus, Nina
2011-10-17
Neural sensitivity to acoustic regularities supports fundamental human behaviors such as hearing in noise and reading. Although the failure to encode acoustic regularities in ongoing speech has been associated with language and literacy deficits, how auditory expertise, such as the expertise that is associated with musical skill, relates to the brainstem processing of speech regularities is unknown. An association between musical skill and neural sensitivity to acoustic regularities would not be surprising given the importance of repetition and regularity in music. Here, we aimed to define relationships between the subcortical processing of speech regularities, music aptitude, and reading abilities in children with and without reading impairment. We hypothesized that, in combination with auditory cognitive abilities, neural sensitivity to regularities in ongoing speech provides a common biological mechanism underlying the development of music and reading abilities. We assessed auditory working memory and attention, music aptitude, reading ability, and neural sensitivity to acoustic regularities in 42 school-aged children with a wide range of reading ability. Neural sensitivity to acoustic regularities was assessed by recording brainstem responses to the same speech sound presented in predictable and variable speech streams. Through correlation analyses and structural equation modeling, we reveal that music aptitude and literacy both relate to the extent of subcortical adaptation to regularities in ongoing speech as well as with auditory working memory and attention. Relationships between music and speech processing are specifically driven by performance on a musical rhythm task, underscoring the importance of rhythmic regularity for both language and music. These data indicate common brain mechanisms underlying reading and music abilities that relate to how the nervous system responds to regularities in auditory input. Definition of common biological underpinnings for music and reading supports the usefulness of music for promoting child literacy, with the potential to improve reading remediation.
Visual activity predicts auditory recovery from deafness after adult cochlear implantation.
Strelnikov, Kuzma; Rouger, Julien; Demonet, Jean-François; Lagleyre, Sebastien; Fraysse, Bernard; Deguine, Olivier; Barone, Pascal
2013-12-01
Modern cochlear implantation technologies allow deaf patients to understand auditory speech; however, the implants deliver only a coarse auditory input and patients must use long-term adaptive processes to achieve coherent percepts. In adults with post-lingual deafness, the high progress of speech recovery is observed during the first year after cochlear implantation, but there is a large range of variability in the level of cochlear implant outcomes and the temporal evolution of recovery. It has been proposed that when profoundly deaf subjects receive a cochlear implant, the visual cross-modal reorganization of the brain is deleterious for auditory speech recovery. We tested this hypothesis in post-lingually deaf adults by analysing whether brain activity shortly after implantation correlated with the level of auditory recovery 6 months later. Based on brain activity induced by a speech-processing task, we found strong positive correlations in areas outside the auditory cortex. The highest positive correlations were found in the occipital cortex involved in visual processing, as well as in the posterior-temporal cortex known for audio-visual integration. The other area, which positively correlated with auditory speech recovery, was localized in the left inferior frontal area known for speech processing. Our results demonstrate that the visual modality's functional level is related to the proficiency level of auditory recovery. Based on the positive correlation of visual activity with auditory speech recovery, we suggest that visual modality may facilitate the perception of the word's auditory counterpart in communicative situations. The link demonstrated between visual activity and auditory speech perception indicates that visuoauditory synergy is crucial for cross-modal plasticity and fostering speech-comprehension recovery in adult cochlear-implanted deaf patients.
Optimal input selection for neural machine interfaces predicting multiple non-explicit outputs.
Krepkovich, Eileen T; Perreault, Eric J
2008-01-01
This study implemented a novel algorithm that optimally selects inputs for neural machine interface (NMI) devices intended to control multiple outputs and evaluated its performance on systems lacking explicit output. NMIs often incorporate signals from multiple physiological sources and provide predictions for multidimensional control, leading to multiple-input multiple-output systems. Further, NMIs often are used with subjects who have motor disabilities and thus lack explicit motor outputs. Our algorithm was tested on simulated multiple-input multiple-output systems and on electromyogram and kinematic data collected from healthy subjects performing arm reaches. Effects of output noise in simulated systems indicated that the algorithm could be useful for systems with poor estimates of the output states, as is true for systems lacking explicit motor output. To test efficacy on physiological data, selection was performed using inputs from one subject and outputs from a different subject. Selection was effective for these cases, again indicating that this algorithm will be useful for predictions where there is no motor output, as often is the case for disabled subjects. Further, prediction results generalized for different movement types not used for estimation. These results demonstrate the efficacy of this algorithm for the development of neural machine interfaces.
Analog-to-Digital Conversion to Accommodate the Dynamics of Live Music in Hearing Instruments
Bahlmann, Frauke; Fulton, Bernadette
2012-01-01
Hearing instrument design focuses on the amplification of speech to reduce the negative effects of hearing loss. Many amateur and professional musicians, along with music enthusiasts, also require their hearing instruments to perform well when listening to the frequent, high-amplitude peaks of live music. One limitation of most current digital hearing instruments with 16-bit analog-to-digital (A/D) converters is that the compressor before the A/D conversion is limited to an input of 95 dB SPL or less. This is more than adequate for the dynamic range of speech; however, it does not accommodate the amplitude peaks present in live music. The hearing instrument input compression system can be adjusted to accommodate the amplitudes present in music that would otherwise be compressed before the A/D converter in the hearing instrument. The methodology behind this technological approach will be presented along with measurements to demonstrate its effectiveness. PMID:23258618
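As a rough back-of-envelope illustration of the headroom issue described above (the 110 dB SPL music peak below is an assumed figure, not one from the article), the theoretical dynamic range of a 16-bit converter can be compared with the quoted 95 dB SPL front-end limit:

```python
bits = 16
# Theoretical SNR of an ideal N-bit quantizer for a full-scale sine wave.
quantizer_snr_db = 6.02 * bits + 1.76      # ~98 dB
front_end_limit_db_spl = 95                # input limit quoted in the abstract
live_music_peak_db_spl = 110               # assumed peak level for loud live music

shortfall = live_music_peak_db_spl - front_end_limit_db_spl
print(f"ideal 16-bit quantizer SNR ~= {quantizer_snr_db:.0f} dB")
print(f"assumed music peaks exceed the front-end limit by ~{shortfall} dB, "
      "so they clip or are compressed before the A/D converter")
```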
Electrophysiological evidence for a general auditory prediction deficit in adults who stutter
Daliri, Ayoub; Max, Ludo
2015-01-01
We previously found that stuttering individuals do not show the typical auditory modulation observed during speech planning in nonstuttering individuals. In this follow-up study, we further elucidate this difference by investigating whether stuttering speakers’ atypical auditory modulation is observed only when sensory predictions are based on movement planning or also when predictable auditory input is not a consequence of one’s own actions. We recorded 10 stuttering and 10 nonstuttering adults’ auditory evoked potentials in response to random probe tones delivered while anticipating either speaking aloud or hearing one’s own speech played back and in a control condition without auditory input (besides probe tones). N1 amplitude of nonstuttering speakers was reduced prior to both speaking and hearing versus the control condition. Stuttering speakers, however, showed no N1 amplitude reduction in either the speaking or hearing condition as compared with control. Thus, findings suggest that stuttering speakers have general auditory prediction difficulties. PMID:26335995
Ultrasonic speech translator and communications system
Akerman, M.A.; Ayers, C.W.; Haynes, H.D.
1996-07-23
A wireless communication system undetectable by radio frequency methods for converting audio signals, including human voice, to electronic signals in the ultrasonic frequency range, transmitting the ultrasonic signal by way of acoustical pressure waves across a carrier medium, including gases, liquids, or solids, and reconverting the ultrasonic acoustical pressure waves back to the original audio signal. The ultrasonic speech translator and communication system includes an ultrasonic transmitting device and an ultrasonic receiving device. The ultrasonic transmitting device accepts as input an audio signal such as human voice input from a microphone or tape deck. The ultrasonic transmitting device frequency modulates an ultrasonic carrier signal with the audio signal producing a frequency modulated ultrasonic carrier signal, which is transmitted via acoustical pressure waves across a carrier medium such as gases, liquids or solids. The ultrasonic receiving device converts the frequency modulated ultrasonic acoustical pressure waves to a frequency modulated electronic signal, demodulates the audio signal from the ultrasonic carrier signal, and conditions the demodulated audio signal to reproduce the original audio signal at its output. 7 figs.
NASA Astrophysics Data System (ADS)
Kajiwara, Yusuke; Murata, Hiroaki; Kimura, Haruhiko; Abe, Koji
As a communication support tool for people with amyotrophic lateral sclerosis (ALS), research on eye-gaze human-computer interfaces has been active. However, since voluntary and involuntary eye movements cannot be distinguished in these interfaces, their performance is still not sufficient for practical use. This paper presents a high-performance human-computer interface system which unites high-quality recognition of horizontal directional eye movements and voluntary blinks. The experimental results show that, compared with an existing system that recognizes horizontal and vertical directional eye movements in addition to voluntary blinks, the number of incorrect inputs is decreased by 35.1% and character input is sped up by 17.4%.
Early Sign Language Exposure and Cochlear Implantation Benefits.
Geers, Ann E; Mitchell, Christine M; Warner-Czyz, Andrea; Wang, Nae-Yuh; Eisenberg, Laurie S
2017-07-01
Most children with hearing loss who receive cochlear implants (CI) learn spoken language, and parents must choose early on whether to use sign language to accompany speech at home. We address whether parents' use of sign language before and after CI positively influences auditory-only speech recognition, speech intelligibility, spoken language, and reading outcomes. Three groups of children with CIs from a nationwide database who differed in the duration of early sign language exposure provided in their homes were compared in their progress through elementary grades. The groups did not differ in demographic, auditory, or linguistic characteristics before implantation. Children without early sign language exposure achieved better speech recognition skills over the first 3 years postimplant and exhibited a statistically significant advantage in spoken language and reading near the end of elementary grades over children exposed to sign language. Over 70% of children without sign language exposure achieved age-appropriate spoken language compared with only 39% of those exposed for 3 or more years. Early speech perception predicted speech intelligibility in middle elementary grades. Children without sign language exposure produced speech that was more intelligible (mean = 70%) than those exposed to sign language (mean = 51%). This study provides the most compelling support yet available in CI literature for the benefits of spoken language input for promoting verbal development in children implanted by 3 years of age. Contrary to earlier published assertions, there was no advantage to parents' use of sign language either before or after CI. Copyright © 2017 by the American Academy of Pediatrics.
'Fly Like This': Natural Language Interface for UAV Mission Planning
NASA Technical Reports Server (NTRS)
Chandarana, Meghan; Meszaros, Erica L.; Trujillo, Anna; Allen, B. Danette
2017-01-01
With the increasing presence of unmanned aerial vehicles (UAVs) in everyday environments, the user base of these powerful and potentially intelligent machines is expanding beyond exclusively highly trained vehicle operators to include non-expert system users. Scientists seeking to augment costly and often inflexible methods of data collection historically used are turning towards lower cost and reconfigurable UAVs. These new users require more intuitive and natural methods for UAV mission planning. This paper explores two natural language interfaces - gesture and speech - for UAV flight path generation through individual user studies. Subjects who participated in the user studies also used a mouse-based interface for a baseline comparison. Each interface allowed the user to build flight paths from a library of twelve individual trajectory segments. Individual user studies evaluated performance, efficacy, and ease-of-use of each interface using background surveys, subjective questionnaires, and observations on time and correctness. Analysis indicates that natural language interfaces are promising alternatives to traditional interfaces. The user study data collected on the efficacy and potential of each interface will be used to inform future intuitive UAV interface design for non-expert users.
Souza, Pamela; Arehart, Kathryn; Neher, Tobias
2015-01-01
Working memory—the ability to process and store information—has been identified as an important aspect of speech perception in difficult listening environments. Working memory can be envisioned as a limited-capacity system which is engaged when an input signal cannot be readily matched to a stored representation or template. This “mismatch” is expected to occur more frequently when the signal is degraded. Because working memory capacity varies among individuals, those with smaller capacity are expected to demonstrate poorer speech understanding when speech is degraded, such as in background noise. However, it is less clear whether (and how) working memory should influence practical decisions, such as hearing treatment. Here, we consider the relationship between working memory capacity and response to specific hearing aid processing strategies. Three types of signal processing are considered, each of which will alter the acoustic signal: fast-acting wide-dynamic range compression, which smooths the amplitude envelope of the input signal; digital noise reduction, which may inadvertently remove speech signal components as it suppresses noise; and frequency compression, which alters the relationship between spectral peaks. For fast-acting wide-dynamic range compression, a growing body of data suggests that individuals with smaller working memory capacity may be more susceptible to such signal alterations, and may receive greater amplification benefit with “low alteration” processing. While the evidence for a relationship between wide-dynamic range compression and working memory appears robust, the effects of working memory on perceptual response to other forms of hearing aid signal processing are less clear cut. We conclude our review with a discussion of the opportunities (and challenges) in translating information on individual working memory into clinical treatment, including clinically feasible measures of working memory. PMID:26733899
Applications of Microcomputers in the Education of the Physically Disabled Child.
ERIC Educational Resources Information Center
Foulds, Richard A.
1982-01-01
Microcomputers can serve as expressive communication tools for severely physically disabled persons. Features such as single input devices, direct selection aids, and speech synthesis capabilities can be extremely useful. The trend toward portable battery-operated computers will make the technology even more accessible. (CL)
NASA Technical Reports Server (NTRS)
1973-01-01
The user's manual for the word recognition computer program contains flow charts of the logical diagram, the memory map for templates, the speech analyzer card arrangement, minicomputer input/output routines, and assembly language program listings.
Second Language Acquisition Research and Second Language Teaching.
ERIC Educational Resources Information Center
Corder, S. Pit
1985-01-01
Discusses second language acquisition, the importance of comprehensible input to this acquisition, and the inadequacy of the theory of language interference as an explanation for errors in second language speech. The role of the teacher in the language classroom and the "procedural syllabus" are described. (SED)
Child implant users' imitation of happy- and sad-sounding speech
Wang, David J.; Trehub, Sandra E.; Volkova, Anna; van Lieshout, Pascal
2013-01-01
Cochlear implants have enabled many congenitally or prelingually deaf children to acquire their native language and communicate successfully on the basis of electrical rather than acoustic input. Nevertheless, degraded spectral input provided by the device reduces the ability to perceive emotion in speech. We compared the vocal imitations of 5- to 7-year-old deaf children who were highly successful bilateral implant users with those of a control sample of children who had normal hearing. First, the children imitated several happy and sad sentences produced by a child model. When adults in Experiment 1 rated the similarity of imitated to model utterances, ratings were significantly higher for the hearing children. Both hearing and deaf children produced poorer imitations of happy than sad utterances because of difficulty matching the greater pitch modulation of the happy versions. When adults in Experiment 2 rated electronically filtered versions of the utterances, which obscured the verbal content, ratings of happy and sad utterances were significantly differentiated for deaf as well as hearing children. The ratings of deaf children, however, were significantly less differentiated. Although deaf children's utterances exhibited culturally typical pitch modulation, their pitch modulation was reduced relative to that of hearing children. One practical implication is that therapeutic interventions for deaf children could expand their focus on suprasegmental aspects of speech perception and production, especially intonation patterns. PMID:23801976
Modeling the Development of Audiovisual Cue Integration in Speech Perception
Getz, Laura M.; Nordeen, Elke R.; Vrabic, Sarah C.; Toscano, Joseph C.
2017-01-01
Adult speech perception is generally enhanced when information is provided from multiple modalities. In contrast, infants do not appear to benefit from combining auditory and visual speech information early in development. This is true despite the fact that both modalities are important to speech comprehension even at early stages of language acquisition. How then do listeners learn how to process auditory and visual information as part of a unified signal? In the auditory domain, statistical learning processes provide an excellent mechanism for acquiring phonological categories. Is this also true for the more complex problem of acquiring audiovisual correspondences, which require the learner to integrate information from multiple modalities? In this paper, we present simulations using Gaussian mixture models (GMMs) that learn cue weights and combine cues on the basis of their distributional statistics. First, we simulate the developmental process of acquiring phonological categories from auditory and visual cues, asking whether simple statistical learning approaches are sufficient for learning multi-modal representations. Second, we use this time course information to explain audiovisual speech perception in adult perceivers, including cases where auditory and visual input are mismatched. Overall, we find that domain-general statistical learning techniques allow us to model the developmental trajectory of audiovisual cue integration in speech, and in turn, allow us to better understand the mechanisms that give rise to unified percepts based on multiple cues. PMID:28335558
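As a loose illustration of the modeling approach summarized above, the following sketch fits a two-component Gaussian mixture to a simulated joint audiovisual cue distribution and inspects its response to a mismatched (McGurk-like) token. The cue dimensions, the simulated values, and the use of scikit-learn are assumptions for illustration, not the authors' simulations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Simulated tokens of two phonological categories, each described by an
# auditory cue (e.g., VOT-like) and a visual cue (e.g., lip-aperture-like).
cat_a = rng.multivariate_normal([-1.0, -0.8], [[0.4, 0.1], [0.1, 0.5]], 400)
cat_b = rng.multivariate_normal([1.0, 0.9], [[0.4, 0.1], [0.1, 0.5]], 400)
tokens = np.vstack([cat_a, cat_b])

# Unsupervised "statistical learning": fit a 2-component GMM to the joint
# audiovisual distribution; the learned components stand in for categories.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(tokens)

# A mismatched token: auditory cue from one category, visual cue from the
# other. The posterior probabilities show how the model weighs the modalities.
mismatched = np.array([[-1.0, 0.9]])
print(gmm.predict_proba(mismatched))
```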
Speech Enhancement Using Gaussian Scale Mixture Models
Hao, Jiucang; Lee, Te-Won; Sejnowski, Terrence J.
2011-01-01
This paper presents a novel probabilistic approach to speech enhancement. Instead of a deterministic logarithmic relationship, we assume a probabilistic relationship between the frequency coefficients and the log-spectra. The speech model in the log-spectral domain is a Gaussian mixture model (GMM). The frequency coefficients obey a zero-mean Gaussian whose covariance equals the exponential of the log-spectra. This results in a Gaussian scale mixture model (GSMM) for the speech signal in the frequency domain, since the log-spectra can be regarded as scaling factors. The probabilistic relation between frequency coefficients and log-spectra allows these to be treated as two random variables, both to be estimated from the noisy signals. Expectation-maximization (EM) was used to train the GSMM and Bayesian inference was used to compute the posterior signal distribution. Because exact inference of this full probabilistic model is computationally intractable, we developed two approaches to enhance the efficiency: the Laplace method and a variational approximation. The proposed methods were applied to enhance speech corrupted by Gaussian noise and speech-shaped noise (SSN). For both approximations, signals reconstructed from the estimated frequency coefficients provided higher signal-to-noise ratio (SNR) and those reconstructed from the estimated log-spectra produced lower word recognition error rates because the log-spectra fit the inputs to the recognizer better. Our algorithms effectively reduced the SSN, which algorithms based on spectral analysis were not able to suppress. PMID:21359139
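The generative relationship described above (a frequency coefficient whose variance is the exponential of a GMM-distributed log-spectrum) can be sketched numerically for a single frequency bin. In the toy sketch below, the prior parameters and noise variance are invented, and a simple grid over the posterior replaces the paper's Laplace and variational approximations.

```python
import numpy as np

# Toy GMM prior over the log-spectrum a (one frequency bin).
weights = np.array([0.6, 0.4])
means = np.array([0.0, 3.0])
stds = np.array([1.0, 0.8])

def log_gmm_pdf(a):
    comp = -0.5 * ((a - means) / stds) ** 2 - np.log(stds * np.sqrt(2 * np.pi))
    return np.logaddexp.reduce(np.log(weights) + comp, axis=-1)

# Generative model: clean coefficient x ~ N(0, exp(a)); observation y = x + noise.
noise_var = 0.5
rng = np.random.default_rng(2)
a_true = 3.0
x_true = rng.normal(0.0, np.exp(a_true / 2))
y = x_true + rng.normal(0.0, np.sqrt(noise_var))

# Grid posterior over a given y, using y | a ~ N(0, exp(a) + noise_var).
grid = np.linspace(-4, 8, 2000)
obs_var = np.exp(grid) + noise_var
log_post = log_gmm_pdf(grid[:, None]) - 0.5 * (y ** 2 / obs_var
                                               + np.log(2 * np.pi * obs_var))
post = np.exp(log_post - log_post.max())
post /= post.sum()

a_hat = float(np.sum(post * grid))                       # posterior-mean log-spectrum
x_hat = np.exp(a_hat) / (np.exp(a_hat) + noise_var) * y  # Wiener-style clean estimate
print(a_true, round(a_hat, 2), round(x_hat, 2))
```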
Bilingualism modulates infants' selective attention to the mouth of a talking face.
Pons, Ferran; Bosch, Laura; Lewkowicz, David J
2015-04-01
Infants growing up in bilingual environments succeed at learning two languages. What adaptive processes enable them to master the more complex nature of bilingual input? One possibility is that bilingual infants take greater advantage of the redundancy of the audiovisual speech that they usually experience during social interactions. Thus, we investigated whether bilingual infants' need to keep languages apart increases their attention to the mouth as a source of redundant and reliable speech cues. We measured selective attention to talking faces in 4-, 8-, and 12-month-old Catalan and Spanish monolingual and bilingual infants. Monolinguals looked more at the eyes than the mouth at 4 months and more at the mouth than the eyes at 8 months in response to both native and nonnative speech, but they looked more at the mouth than the eyes at 12 months only in response to nonnative speech. In contrast, bilinguals looked equally at the eyes and mouth at 4 months, more at the mouth than the eyes at 8 months, and more at the mouth than the eyes at 12 months, and these patterns of responses were found for both native and nonnative speech at all ages. Thus, to support their dual-language acquisition processes, bilingual infants exploit the greater perceptual salience of redundant audiovisual speech cues at an earlier age and for a longer time than monolingual infants. © The Author(s) 2015.
François, Clément; Schön, Daniele
2014-02-01
There is increasing evidence that humans and other nonhuman mammals are sensitive to the statistical structure of auditory input. Indeed, neural sensitivity to statistical regularities seems to be a fundamental biological property underlying auditory learning. In the case of speech, statistical regularities play a crucial role in the acquisition of several linguistic features, from phonotactic to more complex rules such as morphosyntactic rules. Interestingly, a similar sensitivity has been shown with non-speech streams: sequences of sounds changing in frequency or timbre can be segmented on the sole basis of conditional probabilities between adjacent sounds. We recently ran a set of cross-sectional and longitudinal experiments showing that merging music and speech information in song facilitates stream segmentation and, further, that musical practice enhances sensitivity to statistical regularities in speech at both neural and behavioral levels. Based on recent findings showing the involvement of a fronto-temporal network in speech segmentation, we defend the idea that enhanced auditory learning observed in musicians originates via at least three distinct pathways: enhanced low-level auditory processing, enhanced phono-articulatory mapping via the left Inferior Frontal Gyrus and Pre-Motor cortex and increased functional connectivity within the audio-motor network. Finally, we discuss how these data predict a beneficial use of music for optimizing speech acquisition in both normal and impaired populations. Copyright © 2013 Elsevier B.V. All rights reserved.
Initial Development of a Spatially Separated Speech-in-Noise and Localization Training Program
Tyler, Richard S.; Witt, Shelley A.; Dunn, Camille C.; Wang, Wenjun
2010-01-01
Objective This article describes the initial development of a novel approach for training hearing-impaired listeners to improve their ability to understand speech in the presence of background noise and to also improve their ability to localize sounds. Design Most people with hearing loss, even those well fit with hearing devices, still experience significant problems understanding speech in noise. Prior research suggests that at least some subjects can experience improved speech understanding with training. However, all training systems that we are aware of have one basic, critical limitation: they do not provide spatial separation of the speech and noise, and therefore ignore the potential benefits of training binaural hearing. In this paper we describe our initial experience with a home-based training system that includes spatially separated speech-in-noise and localization training. Results Throughout the development of this system, patient input, training, and preliminary pilot data from individuals with bilateral cochlear implants were utilized. Positive feedback from subjective reports indicated that some individuals were engaged in the treatment, and formal testing showed benefit. Feedback and practical issues resulted in the reduction of the eight-loudspeaker system to a two-loudspeaker system. Conclusions These preliminary findings suggest we have successfully developed a viable spatial hearing training system that can improve binaural hearing in noise and localization. Applications include, but are not limited to, hearing with hearing aids and cochlear implants. PMID:20701836
Angelici, Bartolomeo; Mailand, Erik; Haefliger, Benjamin; Benenson, Yaakov
2016-08-30
One of the goals of synthetic biology is to develop programmable artificial gene networks that can transduce multiple endogenous molecular cues to precisely control cell behavior. Realizing this vision requires interfacing natural molecular inputs with synthetic components that generate functional molecular outputs. Interfacing synthetic circuits with endogenous mammalian transcription factors has been particularly difficult. Here, we describe a systematic approach that enables integration and transduction of multiple mammalian transcription factor inputs by a synthetic network. The approach is facilitated by a proportional amplifier sensor based on synergistic positive autoregulation. The circuits efficiently transduce endogenous transcription factor levels into RNAi, transcriptional transactivation, and site-specific recombination. They also enable AND logic between pairs of arbitrary transcription factors. The results establish a framework for developing synthetic gene networks that interface with cellular processes through transcriptional regulators. Copyright © 2016 The Author(s). Published by Elsevier Inc. All rights reserved.
Phonological universals constrain the processing of nonspeech stimuli.
Berent, Iris; Balaban, Evan; Lennertz, Tracy; Vaknin-Nusbaum, Vered
2010-08-01
Domain-specific systems are hypothetically specialized with respect to the outputs they compute and the inputs they allow (Fodor, 1983). Here, we examine whether these 2 conditions for specialization are dissociable. An initial experiment suggests that English speakers could extend a putatively universal phonological restriction to inputs identified as nonspeech. A subsequent comparison of English and Russian participants indicates that the processing of nonspeech inputs is modulated by linguistic experience. Striking qualitative differences between English and Russian participants suggest that they rely on linguistic principles, both universal and language-particular, rather than generic auditory processing strategies. Thus, the computation of idiosyncratic linguistic outputs is apparently not restricted to speech inputs. This conclusion presents various challenges to both domain-specific and domain-general accounts of cognition. (c) 2010 APA, all rights reserved
Circling motion and screen edges as an alternative input method for on-screen target manipulation.
Ka, Hyun W; Simpson, Richard C
2017-04-01
To investigate a new alternative interaction method, called the circling interface, for manipulating on-screen objects. To specify a target, the user makes a circling motion around the target. To specify a desired pointing command with the circling interface, each edge of the screen is used: the user selects a command before circling the target. To evaluate the circling interface, we conducted an experiment with 16 participants, comparing performance on pointing tasks with different combinations of selection method (circling interface, physical mouse, and dwelling interface) and input device (normal computer mouse, head pointer, and joystick mouse emulator). A circling interface is compatible with many types of pointing devices, does not require physical activation of mouse buttons, and is more efficient than dwell-clicking. Across all common pointing operations, the circling interface tended to produce faster performance with a head-mounted mouse emulator than with a joystick mouse. The accuracy of the circling interface outperformed that of the dwelling interface. The results demonstrate that the circling interface has potential as an alternative pointing method for selecting and manipulating objects in a graphical user interface. Implications for Rehabilitation A circling interface will improve clinical practice by providing an alternative pointing method that does not require physically activating mouse buttons and is more efficient than dwell-clicking. The circling interface can also work with AAC devices.
Parmanto, Bambang; Saptono, Andi; Murthi, Raymond; Safos, Charlotte; Lathan, Corinna E
2008-11-01
A secure telemonitoring system was developed to transform the CosmoBot system, a stand-alone speech-language therapy software package, into a telerehabilitation system. The CosmoBot system is a motivating, computer-based play character designed to enhance children's communication skills and stimulate verbal interaction during the remediation of speech and language disorders. The CosmoBot system consists of the Mission Control human interface device and Cosmo's Play and Learn software featuring a robot character named Cosmo that targets educational goals for children aged 3-5 years. The secure telemonitoring infrastructure links a distant speech-language therapist and the child/parents in home or school settings. The result is a telerehabilitation system that allows a speech-language therapist to monitor children's activities at home while providing feedback and therapy materials remotely. We have developed the means for telerehabilitation of communication skills that can be implemented in children's home settings. The architecture allows the therapist to remotely monitor the children after completion of the therapy session and to provide feedback for the following session.
Tremblay, Pascale; Small, Steven L.
2011-01-01
What is the nature of the interface between speech perception and production, where auditory and motor representations converge? One set of explanations suggests that during perception, the motor circuits involved in producing a perceived action are in some way enacting the action without actually causing movement (covert simulation) or sending along the motor information to be used to predict its sensory consequences (i.e., efference copy). Other accounts either reject entirely the involvement of motor representations in perception, or explain their role as being more supportive than integral, and not employing the identical circuits used in production. Using fMRI, we investigated whether there are brain regions that are conjointly active for both speech perception and production, and whether these regions are sensitive to articulatory (syllabic) complexity during both processes, which is predicted by a covert simulation account. A group of healthy young adults (1) observed a female speaker produce a set of familiar words (perception), and (2) observed and then repeated the words (production). There were two types of words, varying in articulatory complexity, as measured by the presence or absence of consonant clusters. The simple words contained no consonant cluster (e.g. “palace”), while the complex words contained one to three consonant clusters (e.g. “planet”). Results indicate that the left ventral premotor cortex (PMv) was significantly active during speech perception and speech production but that activation in this region was scaled to articulatory complexity only during speech production, revealing an incompletely specified efferent motor signal during speech perception. The right planum temporale (PT) was also active during speech perception and speech production, and activation in this region was scaled to articulatory complexity during both production and perception. These findings are discussed in the context of current theories of speech perception, with particular attention to accounts that include an explanatory role for mirror neurons. PMID:21664275
ATCA-based ATLAS FTK input interface system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Okumura, Yasuyuki; Liu, Tiehui Ted; Olsen, Jamieson
The first stage of the ATLAS Fast TracKer (FTK) is an ATCA-based input interface system, where hits from the entire silicon tracker are clustered and organized into overlapping eta-phi trigger towers before being sent to the tracking engines. First, FTK Input Mezzanine cards receive hit data and perform clustering to reduce data volume. Then, the ATCA-based Data Formatter system will organize the trigger tower data, sharing data among boards over full mesh backplanes and optic fibers. The board and system level design concepts and implementation details, as well as the operation experiences from the FTK full-chain testing, will be presented.
Nested-cone transformer antenna
Ekdahl, C.A.
1991-05-28
A plurality of conical transmission lines are concentrically nested to form an output antenna for pulsed-power, radio-frequency, and microwave sources. The diverging conical conductors enable a high power input density across a bulk dielectric to be reduced below a breakdown power density at the antenna interface with the transmitting medium. The plurality of cones maintain a spacing between conductors which minimizes the generation of high order modes between the conductors. Further, the power input feeds are isolated at the input while enabling the output electromagnetic waves to add at the transmission interface. Thus, very large power signals from a pulse rf, or microwave source can be radiated. 6 figures.
Nested-cone transformer antenna
Ekdahl, Carl A.
1991-01-01
A plurality of conical transmission lines are concentrically nested to form an output antenna for pulsed-power, radio-frequency, and microwave sources. The diverging conical conductors enable a high power input density across a bulk dielectric to be reduced below a breakdown power density at the antenna interface with the transmitting medium. The plurality of cones maintain a spacing between conductors which minimizes the generation of high order modes between the conductors. Further, the power input feeds are isolated at the input while enabling the output electromagnetic waves to add at the transmission interface. Thus, very large power signals from a pulse rf, or microwave source can be radiated.
Research Operations for Advanced Warfighter Interface Technologies
2009-06-01
Study of Man-Machine Communications Systems for the Handicapped. Interim Report.
ERIC Educational Resources Information Center
Kafafian, Haig
Newly developed communications systems for exceptional children include Cybercom; CYBERTYPE; Cyberplace, a keyless keyboard; Cyberphone, a telephonic communication system for deaf and speech impaired persons; Cyberlamp, a visual display; Cyberview, a fiber optic bundle remote visual display; Cybersem, an interface for the blind, fingerless, and…
Analysis of masking effects on speech intelligibility with respect to moving sound stimulus
NASA Astrophysics Data System (ADS)
Chen, Chiung Yao
2004-05-01
The purpose of this study is to compare the degree to which speech is disturbed by a stationary noise source and by an apparently moving one (AMN). In studies of sound localization, we found that source-directional sensitivity (SDS) is closely associated with the magnitude of the interaural cross-correlation (IACC). Ando et al. [Y. Ando, S. H. Kang, and H. Nagamatsu, J. Acoust. Soc. Jpn. (E) 8, 183-190 (1987)] reported that the correlation of potentials between the left and right inferior colliculus along the auditory pathway is consistent with the correlation function of the amplitudes input to the two ear-canal entrances. We assume that the degree of disturbance caused by an apparently moving noise source differs from that caused by a source fixed in front of the listener at a constant distance in a free field (no reflections). We then found that a moving source and a fixed source of 1/3-octave narrow-band noise centered at 2 kHz influence speech intelligibility differently. However, the effects of the moving speed on the masking of speech intelligibility remain uncertain.
Arnold, Denis; Tomaschek, Fabian; Sering, Konstantin; Lopez, Florence; Baayen, R Harald
2017-01-01
Sound units play a pivotal role in cognitive models of auditory comprehension. The general consensus is that during perception listeners break down speech into auditory words and subsequently phones. Indeed, cognitive speech recognition is typically taken to be computationally intractable without phones. Here we present a computational model trained on 20 hours of conversational speech that recognizes word meanings within the range of human performance (model 25%, native speakers 20-44%), without making use of phone or word form representations. Our model also successfully generates predictions about the speed and accuracy of human auditory comprehension. At the heart of the model is a 'wide' yet sparse two-layer artificial neural network with some hundred thousand input units representing summaries of changes in acoustic frequency bands, and proxies for lexical meanings as output units. We believe that our model holds promise for resolving longstanding theoretical problems surrounding the notion of the phone in linguistic theory.
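A minimal sketch of the general idea of such a wide, sparse two-layer network is given below, using a simple delta-rule update from cue sets to meaning units. The cue counts, learning rate, and training regime are illustrative assumptions; this is not the authors' model, and real acoustic summary features are replaced by randomly generated cue indices.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cues, n_meanings = 2000, 50          # "wide but sparse" input layer; lexical-meaning outputs

W = np.zeros((n_cues, n_meanings))
lr = 0.01

def train_step(active_cues, meaning_idx):
    """Delta-rule update: strengthen active acoustic cues toward the heard
    word's meaning and weaken them toward everything else."""
    active_cues = np.unique(active_cues)
    target = np.zeros(n_meanings)
    target[meaning_idx] = 1.0
    activation = W[active_cues].sum(axis=0)
    W[active_cues] += lr * (target - activation)

# Toy training: each meaning tends to co-occur with its own small cue set
# plus random "background" cues from the acoustic signal.
cue_sets = [rng.choice(n_cues, size=15, replace=False) for _ in range(n_meanings)]
for _ in range(3000):
    m = rng.integers(n_meanings)
    cues = np.concatenate([cue_sets[m], rng.choice(n_cues, size=10)])
    train_step(cues, m)

# Recognition: the most activated meaning unit for a test token of meaning 7.
test_cues = np.concatenate([cue_sets[7], rng.choice(n_cues, size=10)])
print(int(np.argmax(W[np.unique(test_cues)].sum(axis=0))))  # expected: 7
```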
Can mergers-in-progress be unmerged in speech accommodation?
Babel, Molly; McAuliffe, Michael; Haber, Graham
2013-01-01
This study examines spontaneous phonetic accommodation of a dialect with distinct categories by speakers who are in the process of merging those categories. We focus on the merger of the NEAR and SQUARE lexical sets in New Zealand English, presenting New Zealand participants with an unmerged speaker of Australian English. Mergers-in-progress are a uniquely interesting sound change as they showcase the asymmetry between speech perception and production. Yet, we examine mergers using spontaneous phonetic imitation, which is a phenomenon that is necessarily a behavior in which perceptual input influences speech production. Phonetic imitation is quantified by a perceptual measure and an acoustic calculation of mergedness using a Pillai-Bartlett trace. The results from both analyses indicate that spontaneous phonetic imitation is moderated by extra-linguistic factors such as the valence of assigned conditions and social bias. We also find evidence for a decrease in the degree of mergedness in post-exposure productions. Taken together, our results suggest that under the appropriate conditions New Zealanders phonetically accommodate to Australian English and that in the process of speech imitation, mergers-in-progress can, but do not consistently, become less merged. PMID:24069011
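The Pillai-Bartlett trace mentioned above can be computed directly from a one-way MANOVA of vowel formant data. The sketch below does this with numpy on simulated F1/F2 tokens; the formant values, covariances, and sample sizes are invented for illustration and are not the study's measurements.

```python
import numpy as np

def pillai_trace(groups):
    """One-way MANOVA Pillai-Bartlett trace for a list of (n_i, p) arrays.
    Higher values indicate better-separated (less merged) categories."""
    all_data = np.vstack(groups)
    grand_mean = all_data.mean(axis=0)
    # Between-group (H) and within-group (E) sums-of-squares-and-cross-products.
    H = sum(len(g) * np.outer(g.mean(axis=0) - grand_mean,
                              g.mean(axis=0) - grand_mean) for g in groups)
    E = sum(np.cov(g, rowvar=False) * (len(g) - 1) for g in groups)
    return float(np.trace(H @ np.linalg.inv(H + E)))

rng = np.random.default_rng(4)
# Simulated F1/F2 (Hz) for NEAR vs SQUARE tokens: a merged and an unmerged talker.
near_merged = rng.multivariate_normal([450, 2100], [[900, 0], [0, 10000]], 40)
square_merged = rng.multivariate_normal([460, 2080], [[900, 0], [0, 10000]], 40)
near_split = rng.multivariate_normal([420, 2200], [[900, 0], [0, 10000]], 40)
square_split = rng.multivariate_normal([560, 1900], [[900, 0], [0, 10000]], 40)

print(round(pillai_trace([near_merged, square_merged]), 3))  # near 0: merged
print(round(pillai_trace([near_split, square_split]), 3))    # closer to 1: distinct
```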
Comprehension: an overlooked component in augmented language development.
Sevcik, Rose A
2006-02-15
Despite the importance of children's receptive skills as a foundation for later productive word use, the role of receptive language traditionally has received very limited attention since the focus in linguistic development has centered on language production. For children with significant developmental disabilities and communication impairments, augmented language systems have been devised as a tool both for language input and output. The role of both speech and symbol comprehension skills is emphasized in this paper. Data collected from two longitudinal studies of children and youth with severe disabilities and limited speech serve as illustrations in this paper. The acquisition and use of the System for Augmenting Language (SAL) was studied in home and school settings. Communication behaviors of the children and youth and their communication partners were observed and language assessment measures were collected. Two patterns of symbol learning and achievement--beginning and advanced--were observed. Extant speech comprehension skills brought to the augmented language learning task impacted the participants' patterns of symbol learning and use. Though often overlooked, the importance of speech and symbol comprehension skills was underscored in the studies described. Future areas for research are identified.
Spatiotemporal dynamics of auditory attention synchronize with speech
Wöstmann, Malte; Herrmann, Björn; Maess, Burkhard
2016-01-01
Attention plays a fundamental role in selectively processing stimuli in our environment despite distraction. Spatial attention induces increasing and decreasing power of neural alpha oscillations (8–12 Hz) in brain regions ipsilateral and contralateral to the locus of attention, respectively. This study tested whether the hemispheric lateralization of alpha power codes not just the spatial location but also the temporal structure of the stimulus. Participants attended to spoken digits presented to one ear and ignored tightly synchronized distracting digits presented to the other ear. In the magnetoencephalogram, spatial attention induced lateralization of alpha power in parietal, but notably also in auditory cortical regions. This alpha power lateralization was not maintained steadily but fluctuated in synchrony with the speech rate and lagged the time course of low-frequency (1–5 Hz) sensory synchronization. Higher amplitude of alpha power modulation at the speech rate was predictive of a listener’s enhanced performance of stream-specific speech comprehension. Our findings demonstrate that alpha power lateralization is modulated in tune with the sensory input and acts as a spatiotemporal filter controlling the read-out of sensory content. PMID:27001861
Vector adaptive predictive coder for speech and audio
NASA Technical Reports Server (NTRS)
Chen, Juin-Hwey (Inventor); Gersho, Allen (Inventor)
1990-01-01
A real-time vector adaptive predictive coder which approximates each vector of K speech samples by using each of M fixed vectors in a first codebook to excite a time-varying synthesis filter and picking the vector that minimizes distortion. Predictive analysis for each frame determines parameters used for computing, from vectors in the first codebook, zero-state response vectors that are stored at the same address (index) in a second codebook. Encoding of input speech vectors s_n is then carried out using the second codebook. When the vector that minimizes distortion is found, its index is transmitted to a decoder which has a codebook identical to the first codebook of the encoder. There the index is used to read out a vector that is used to synthesize an output speech vector s_n. The parameters used in the encoder are quantized, for example by using a table, and the indices are transmitted to the decoder where they are decoded to specify the transfer characteristics of filters used in producing the vector s_n from the receiver codebook vector selected by the transmitted vector index.
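The core codebook search can be sketched as analysis-by-synthesis: pass each stored excitation vector through the synthesis filter and transmit only the index that minimizes squared error. The sketch below is a toy illustration of that idea; the random codebook, the fixed LPC coefficients, and the omission of the patent's zero-state-response second codebook and parameter quantization are all simplifying assumptions.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(5)
K, M = 40, 64                                  # samples per vector, codebook size
codebook = rng.standard_normal((M, K))         # first codebook of excitation vectors

# Time-varying all-pole synthesis filter; here a fixed toy 2nd-order predictor.
lpc = np.array([1.0, -1.2, 0.6])               # A(z) = 1 - 1.2 z^-1 + 0.6 z^-2

def encode(s):
    """Index of the codebook vector whose synthesized output is closest
    (least-squares) to the input speech vector s."""
    errors = [np.sum((s - lfilter([1.0], lpc, c)) ** 2) for c in codebook]
    return int(np.argmin(errors))

def decode(index):
    return lfilter([1.0], lpc, codebook[index])

# Toy "speech" vector generated from entry 17 plus a little noise.
s_n = lfilter([1.0], lpc, codebook[17]) + 0.05 * rng.standard_normal(K)
idx = encode(s_n)                               # only this index is transmitted
print(idx, round(float(np.sum((s_n - decode(idx)) ** 2)), 4))
```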
A keyword spotting model using perceptually significant energy features
NASA Astrophysics Data System (ADS)
Umakanthan, Padmalochini
The task of a keyword recognition system is to detect the presence of certain words in a conversation based on the linguistic information present in human speech. Such keyword spotting systems have applications in homeland security, telephone surveillance and human-computer interfacing. The general procedure of a keyword spotting system involves feature generation and matching. In this work, a new set of features based on the psycho-acoustic masking nature of human speech is proposed. After developing these features, a time-aligned pattern matching process was implemented to locate the keywords within a set of unknown words. A word boundary detection technique based on frame classification using the nonlinear characteristics of speech is also addressed in this work. Validation of this keyword spotting model was done using widely used cepstral features. The experimental results indicate the viability of using these perceptually significant features as an augmented feature set in keyword spotting.
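Time-aligned pattern matching of the kind described above is commonly done with dynamic time warping (DTW). The generic sketch below computes a DTW distance between a keyword template and a test segment over arbitrary feature frames; the feature dimensions and test sequences are invented, and this is not the thesis's perceptual-energy feature set.

```python
import numpy as np

def dtw_distance(template, test):
    """Dynamic time warping distance between two (frames, dims) feature
    sequences; smaller distances suggest the keyword is present."""
    n, m = len(template), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(template[i - 1] - test[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)                    # length-normalized path cost

rng = np.random.default_rng(6)
keyword = rng.standard_normal((20, 12))          # e.g., 20 frames of 12-dim features
# A time-stretched, slightly noisy version of the keyword versus unrelated speech.
match = np.repeat(keyword, 2, axis=0) + 0.1 * rng.standard_normal((40, 12))
nonmatch = rng.standard_normal((40, 12))

print(dtw_distance(keyword, match) < dtw_distance(keyword, nonmatch))  # expected: True
```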
Advances in EPG for treatment and research: an illustrative case study.
Scobbie, James M; Wood, Sara E; Wrench, Alan A
2004-01-01
Electropalatography (EPG), a technique which reveals tongue-palate contact patterns over time, is a highly effective tool for speech research. We report here on recent developments by Articulate Instruments Ltd. These include hardware for Windows-based computers, backwardly compatible (with Reading EPG3) software systems for clinical intervention and laboratory-based analysis for EPG and acoustic data, and an enhanced clinical interface with client and file management tools. We focus here on a single case study of a child aged 10+/-years who had been diagnosed with an intractable speech disorder possibly resulting ultimately from a complete cleft of hard and soft palate. We illustrate how assessment, diagnosis and treatment of the intractable speech disorder are undertaken using this new generation of instrumental phonetic support. We also look forward to future developments in articulatory phonetics that will link EPG with ultrasound for research and clinical communities.
What makes an automated teller machine usable by blind users?
Manzke, J M; Egan, D H; Felix, D; Krueger, H
1998-07-01
Fifteen blind and sighted subjects, the latter serving as a control group for acceptance, were asked for their requirements for automated teller machines (ATMs). Both groups also tested the usability of a partially operational ATM mock-up. This machine was based on an existing cash dispenser, providing natural speech output, different function menus and different key arrangements. Performance and subjective evaluation data of blind and sighted subjects were collected. All blind subjects were able to operate the ATM successfully. The implemented speech output was the main usability factor for them. The different interface designs did not significantly affect performance or subjective evaluation. Nevertheless, design recommendations can be derived from the requirement assessment. The sighted subjects were rather open to design modifications, especially the implementation of speech output. However, there was also a mismatch between the requirements of the two subject groups, mainly concerning the key arrangement.
Negative input for grammatical errors: effects after a lag of 12 weeks.
Saxton, Matthew; Backley, Phillip; Gallaway, Clare
2005-08-01
Effects of negative input for 13 categories of grammatical error were assessed in a longitudinal study of naturalistic adult-child discourse. Two-hour samples of conversational interaction were obtained at two points in time, separated by a lag of 12 weeks, for 12 children (mean age 2;0 at the start). The data were interpreted within the framework offered by Saxton's (1997, 2000) contrast theory of negative input. Corrective input was associated with subsequent improvements in the grammaticality of child speech for three of the target structures. No effects were found for two forms of positive input: non-contingent models, where the adult produces target structures in non-error-contingent contexts; and contingent models, where grammatical forms follow grammatical child usages. The findings lend support to the view that, in some cases at least, the structure of adult-child discourse yields information on the bounds of grammaticality for the language-learning child.
How to Display Hazards and other Scientific Data Using Google Maps
NASA Astrophysics Data System (ADS)
Venezky, D. Y.; Fee, J. M.
2007-12-01
The U.S. Geological Survey's (USGS) Volcano Hazard Program (VHP) is launching a map-based interface to display hazards information using the Google® Maps API (Application Program Interface). Map-based interfaces provide a synoptic view of data, making patterns easier to detect and allowing users to quickly ascertain where hazards are in relation to major population and infrastructure centers. Several map-based interfaces are now simple to run on a web server, providing ideal platforms for sharing information with colleagues, emergency managers, and the public. There are three main steps to making data accessible on a map-based interface: formatting the input data, plotting the data on the map, and customizing the user interface. The presentation, "Creating Geospatial RSS and ATOM feeds for Map-based Interfaces" (Fee and Venezky, this session), reviews key features for map input data. Join us for this presentation on how to plot data in a geographic context and then format the display with images, custom markers, and links to external data. Examples will show how the VHP Volcano Status Map was created and how to plot a field trip with driving directions.
Speech as a pilot input medium
NASA Technical Reports Server (NTRS)
Plummer, R. P.; Coler, C. R.
1977-01-01
The speech recognition system under development is a trainable pattern classifier based on a maximum-likelihood technique. An adjustable uncertainty threshold allows the rejection of borderline cases for which the probability of misclassification is high. The syntax of the command language spoken may be used as an aid to recognition, and the system adapts to changes in pronunciation if feedback from the user is available. Words must be separated by .25 second gaps. The system runs in real time on a mini-computer (PDP 11/10) and was tested on 120,000 speech samples from 10- and 100-word vocabularies. The results of these tests were 99.9% correct recognition for a vocabulary consisting of the ten digits, and 99.6% recognition for a 100-word vocabulary of flight commands, with a 5% rejection rate in each case. With no rejection, the recognition accuracies for the same vocabularies were 99.5% and 98.6% respectively.
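A schematic sketch of a trainable maximum-likelihood classifier with an adjustable rejection threshold, in the spirit of the system described above, is given below. The diagonal-Gaussian word models, feature dimensionality, and threshold value are assumptions for illustration only and not the NASA system's actual design.

```python
import numpy as np

class MLWordClassifier:
    """Per-word diagonal-Gaussian maximum-likelihood classifier with a
    rejection threshold on the posterior probability of the best class."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.vars = np.array([X[y == c].var(axis=0) + 1e-6 for c in self.classes])
        return self

    def classify(self, x, reject_below=0.9):
        # Log-likelihood of x under each word's Gaussian model.
        log_lik = -0.5 * np.sum((x - self.means) ** 2 / self.vars
                                + np.log(2 * np.pi * self.vars), axis=1)
        post = np.exp(log_lik - log_lik.max())
        post /= post.sum()
        best = int(np.argmax(post))
        # Borderline cases (uncertain posterior) are rejected rather than guessed.
        return self.classes[best] if post[best] >= reject_below else None

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(i, 0.3, size=(50, 8)) for i in range(10)])  # 10 "digit" models
y = np.repeat(np.arange(10), 50)
clf = MLWordClassifier().fit(X, y)
print(clf.classify(rng.normal(3, 0.3, size=8)))    # likely classified as 3
print(clf.classify(rng.normal(3.5, 0.3, size=8)))  # borderline token, may be rejected (None)
```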
Communication as group process mediator of aircrew performance
NASA Technical Reports Server (NTRS)
Kanki, B. G.; Foushee, H. C.
1989-01-01
This study of group process was motivated by a high-fidelity flight simulator project in which aircrew performance was found to be better when the crew had recently flown together. Considering recent operating experience as a group-level input factor, aspects of the communication process between crewmembers (Captain and First Officer) were explored as a possible mediator of performance. Communication patterns were defined by a speech act typology adapted for the flightdeck setting and distinguished crews that had previously flown together (FT) from those that had not flown together (NFT). FT crews showed a more open communication channel with respect to information exchange and validation, as well as greater First Officer participation in task-related topics, while NFT crews engaged in more non-task discourse, a speech mode less structured by roles and probably serving a more interpersonal function. Relationships between the speech categories themselves, representing linguistic and role-related interdependencies, provide guidelines for interpreting the primary findings.
NASA Technical Reports Server (NTRS)
1979-01-01
The pilot's perception and performance in flight simulators is examined. The areas investigated include: vestibular stimulation, flight management and man-cockpit information interfacing, and visual perception in flight simulation. The effects of higher levels of rotary acceleration on response time to constant acceleration, tracking performance, and thresholds for angular acceleration are examined. Areas of flight management examined are cockpit display of traffic information, workload, synthetic speech callouts during the landing phase of flight, perceptual factors in the use of a microwave landing system, automatic speech recognition, automation of aircraft operation, and total simulation of flight training.
Ching, Teresa Y C; Quar, Tian Kar; Johnson, Earl E; Newall, Philip; Sharma, Mridula
2015-03-01
An important goal of providing amplification to children with hearing loss is to ensure that hearing aids are adjusted to match targets of prescriptive procedures as closely as possible. The Desired Sensation Level (DSL) v5 and the National Acoustic Laboratories' prescription for nonlinear hearing aids, version 1 (NAL-NL1) procedures are widely used in fitting hearing aids to children. Little is known about hearing aid fitting outcomes for children with severe or profound hearing loss. The purpose of this study was to investigate the prescribed and measured gain of hearing aids fit according to the NAL-NL1 and DSL v5 procedures for children with moderately severe to profound hearing loss, and to examine the impact of the choice of prescription on predicted speech intelligibility and loudness. Participants were fit with Phonak Naida V SP hearing aids according to the NAL-NL1 and DSL v5 procedures. The Speech Intelligibility Index (SII) and estimated loudness were calculated using published models. The sample consisted of 16 children (30 ears) aged between 7 and 17 yr old. The measured hearing aid gains were compared with the prescribed gains at 50 (low), 65 (medium), and 80 dB SPL (high) input levels. The goodness of fit-to-targets was quantified by calculating the average root-mean-square (RMS) error of the measured gain compared with prescriptive gain targets for 0.5, 1, 2, and 4 kHz. The significance of differences between prescriptions for hearing aid gains, SII, and loudness was examined by performing analyses of variance. Correlation analyses were used to examine the relationship between measures. The DSL v5 prescribed significantly higher overall gain than the NAL-NL1 procedure for the same audiograms. For low and medium input levels, the hearing aids of all children fit with NAL-NL1 were within 5 dB RMS of prescribed targets, but 33% (10 ears) deviated from the DSL v5 targets by more than 5 dB RMS on average. For the high input level, the hearing aid fittings of 60% and 43% of ears deviated by more than 5 dB RMS from targets of NAL-NL1 and DSL v5, respectively. Greater deviations from targets were associated with more severe hearing loss. On average, the SII was higher for DSL v5 than for NAL-NL1 at the low input level. No significant difference in SII was found between prescriptions at the medium or high input level, despite greater loudness for DSL v5 than for NAL-NL1. Although targets between 0.25 and 2 kHz were well matched for both prescriptions in commercial hearing aids, gain targets at 4 kHz were matched for NAL-NL1 only. Although the two prescriptions differ markedly in estimated loudness, they resulted in comparable predicted speech intelligibility for medium and high input levels. American Academy of Audiology.
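As a worked illustration of the fit-to-target criterion described above, the short sketch below computes the RMS deviation between measured and prescribed gains at 0.5, 1, 2, and 4 kHz and applies the 5 dB rule; the gain numbers are invented for illustration and are not study data.

```python
import numpy as np

freqs_khz = [0.5, 1, 2, 4]
# Illustrative gains only (dB, at a 65 dB SPL input level), not study data.
prescribed = np.array([32, 38, 41, 45])   # targets from a prescriptive procedure
measured = np.array([30, 36, 37, 38])     # gains measured in the ear

rms_error = np.sqrt(np.mean((measured - prescribed) ** 2))
print(f"RMS deviation from target = {rms_error:.1f} dB "
      f"({'within' if rms_error <= 5 else 'outside'} the 5 dB criterion)")
```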
Neger, Thordis M.; Rietveld, Toni; Janse, Esther
2014-01-01
Within a few sentences, listeners learn to understand severely degraded speech such as noise-vocoded speech. However, individuals vary in the amount of such perceptual learning and it is unclear what underlies these differences. The present study investigates whether perceptual learning in speech relates to statistical learning, as sensitivity to probabilistic information may aid identification of relevant cues in novel speech input. If statistical learning and perceptual learning (partly) draw on the same general mechanisms, then statistical learning in a non-auditory modality using non-linguistic sequences should predict adaptation to degraded speech. In the present study, 73 older adults (aged over 60 years) and 60 younger adults (aged between 18 and 30 years) performed a visual artificial grammar learning task and were presented with 60 meaningful noise-vocoded sentences in an auditory recall task. Within age groups, sentence recognition performance over exposure was analyzed as a function of statistical learning performance, and other variables that may predict learning (i.e., hearing, vocabulary, attention switching control, working memory, and processing speed). Younger and older adults showed similar amounts of perceptual learning, but only younger adults showed significant statistical learning. In older adults, improvement in understanding noise-vocoded speech was constrained by age. In younger adults, amount of adaptation was associated with lexical knowledge and with statistical learning ability. Thus, individual differences in general cognitive abilities explain listeners' variability in adapting to noise-vocoded speech. Results suggest that perceptual and statistical learning share mechanisms of implicit regularity detection, but that the ability to detect statistical regularities is impaired in older adults if visual sequences are presented quickly. PMID:25225475
Preisig, Basil C; Eggenberger, Noëmi; Zito, Giuseppe; Vanbellingen, Tim; Schumacher, Rahel; Hopfner, Simone; Nyffeler, Thomas; Gutbrod, Klemens; Annoni, Jean-Marie; Bohlhalter, Stephan; Müri, René M
2015-03-01
Co-speech gestures are part of nonverbal communication during conversations. They either support the verbal message or provide the interlocutor with additional information. Furthermore, they prompt as nonverbal cues the cooperative process of turn taking. In the present study, we investigated the influence of co-speech gestures on the perception of dyadic dialogue in aphasic patients. In particular, we analysed the impact of co-speech gestures on gaze direction (towards speaker or listener) and fixation of body parts. We hypothesized that aphasic patients, who are restricted in verbal comprehension, adapt their visual exploration strategies. Sixteen aphasic patients and 23 healthy control subjects participated in the study. Visual exploration behaviour was measured by means of a contact-free infrared eye-tracker while subjects were watching videos depicting spontaneous dialogues between two individuals. Cumulative fixation duration and mean fixation duration were calculated for the factors co-speech gesture (present and absent), gaze direction (to the speaker or to the listener), and region of interest (ROI), including hands, face, and body. Both aphasic patients and healthy controls mainly fixated the speaker's face. We found a significant co-speech gesture × ROI interaction, indicating that the presence of a co-speech gesture encouraged subjects to look at the speaker. Further, there was a significant gaze direction × ROI × group interaction revealing that aphasic patients showed reduced cumulative fixation duration on the speaker's face compared to healthy controls. Co-speech gestures guide the observer's attention towards the speaker, the source of semantic input. It is discussed whether an underlying semantic processing deficit or a deficit to integrate audio-visual information may cause aphasic patients to explore less the speaker's face. Copyright © 2014 Elsevier Ltd. All rights reserved.
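The two eye-tracking outcome measures named above, cumulative fixation duration and mean fixation duration per region of interest, reduce to simple aggregation. A minimal sketch with invented fixation records (ROI labels and durations are hypothetical):

```python
from collections import defaultdict

# Hypothetical fixation records: (region_of_interest, duration_in_ms).
fixations = [
    ("speaker_face", 420), ("speaker_hands", 180), ("speaker_face", 310),
    ("listener_face", 250), ("speaker_body", 90),  ("speaker_face", 510),
]

totals = defaultdict(float)   # cumulative fixation duration per ROI
counts = defaultdict(int)

for roi, dur_ms in fixations:
    totals[roi] += dur_ms
    counts[roi] += 1

for roi in totals:
    cumulative = totals[roi]
    mean = totals[roi] / counts[roi]
    print(f"{roi}: cumulative = {cumulative:.0f} ms, mean = {mean:.0f} ms")
```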
Scientific bases of human-machine communication by voice.
Schafer, R W
1995-01-01
The scientific bases for human-machine communication by voice are in the fields of psychology, linguistics, acoustics, signal processing, computer science, and integrated circuit technology. The purpose of this paper is to highlight the basic scientific and technological issues in human-machine communication by voice and to point out areas of future research opportunity. The discussion is organized around the following major issues in implementing human-machine voice communication systems: (i) hardware/software implementation of the system, (ii) speech synthesis for voice output, (iii) speech recognition and understanding for voice input, and (iv) usability factors related to how humans interact with machines. PMID:7479802
The Next Wave: Humans, Computers, and Redefining Reality
NASA Technical Reports Server (NTRS)
Little, William
2018-01-01
The Augmented/Virtual Reality (AVR) Lab at KSC is dedicated to "exploration into the growing computer fields of Extended Reality and the Natural User Interface (it is) a proving ground for new technologies that can be integrated into future NASA projects and programs." The topics of Human Computer Interface, Human Computer Interaction, Augmented Reality, Virtual Reality, and Mixed Reality are defined; examples of work being done in these fields in the AVR Lab are given. Current and future work in Computer Vision, Speech Recognition, and Artificial Intelligence is also outlined.
Akce, Abdullah; Norton, James J S; Bretl, Timothy
2015-09-01
This paper presents a brain-computer interface for text entry using steady-state visually evoked potentials (SSVEP). Like other SSVEP-based spellers, ours identifies the desired input character by posing questions (or queries) to users through a visual interface. Each query defines a mapping from possible characters to steady-state stimuli. The user responds by attending to one of these stimuli. Unlike other SSVEP-based spellers, ours chooses from a much larger pool of possible queries-on the order of ten thousand instead of ten. The larger query pool allows our speller to adapt more effectively to the inherent structure of what is being typed and to the input performance of the user, both of which make certain queries provide more information than others. In particular, our speller chooses queries from this pool that maximize the amount of information to be received per unit of time, a measure of mutual information that we call information gain rate. To validate our interface, we compared it with two other state-of-the-art SSVEP-based spellers, which were re-implemented to use the same input mechanism. Results showed that our interface, with the larger query pool, allowed users to spell multiple-word texts nearly twice as fast as they could with the compared spellers.
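The query-selection idea described above can be illustrated with a hedged sketch: score each candidate query by the expected mutual information between the target character and the detected stimulus, divided by the query duration, and pick the highest-scoring query. The response model (correct detection with fixed probability, errors spread uniformly), the alphabet, the prior, and all numbers are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def information_gain_rate(prior, query, p_correct, duration_s):
    """Expected mutual information per second for one query.

    prior     : probability of each candidate character (e.g. from a language model)
    query     : group index (stimulus) assigned to each character
    p_correct : probability the attended stimulus is detected correctly
    duration_s: time needed to present the query and collect the response
    """
    prior = np.asarray(prior, dtype=float)
    query = np.asarray(query)
    groups = np.unique(query)
    k = len(groups)
    # Probability mass attending each stimulus.
    q = np.array([prior[query == g].sum() for g in groups])
    # Distribution of the detected stimulus: correct with p_correct,
    # otherwise spread uniformly over the remaining stimuli.
    p_detect = q * p_correct + (1.0 - q) * (1.0 - p_correct) / (k - 1)
    h_response = entropy(p_detect)
    h_given_char = entropy([p_correct] + [(1.0 - p_correct) / (k - 1)] * (k - 1))
    return (h_response - h_given_char) / duration_s

# Hypothetical 6-character alphabet and two candidate queries over 3 stimuli.
prior = [0.4, 0.2, 0.15, 0.1, 0.1, 0.05]
query_a = [0, 0, 1, 1, 2, 2]
query_b = [0, 1, 1, 1, 2, 2]
best = max([query_a, query_b],
           key=lambda qr: information_gain_rate(prior, qr, p_correct=0.85, duration_s=4.0))
print("chosen query:", best)
```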
User interface for ground-water modeling: Arcview extension
Tsou, Ming‐shu; Whittemore, Donald O.
2001-01-01
Numerical simulation for ground-water modeling often involves handling large input and output data sets. A geographic information system (GIS) provides an integrated platform to manage, analyze, and display disparate data and can greatly facilitate modeling efforts in data compilation, model calibration, and display of model parameters and results. Furthermore, GIS can be used to generate information for decision making through spatial overlay and processing of model results. ArcView is the most widely used Windows-based GIS software that provides a robust user-friendly interface to facilitate data handling and display. An extension is an add-on program to ArcView that provides additional specialized functions. An ArcView interface for the ground-water flow and transport models MODFLOW and MT3D was built as an extension for facilitating modeling. The extension includes preprocessing of spatially distributed (point, line, and polygon) data for model input and postprocessing of model output. An object database is used for linking user dialogs and model input files. The ArcView interface utilizes the capabilities of the 3D Analyst extension. Models can be automatically calibrated through the ArcView interface by external linking to such programs as PEST. The efficient pre- and postprocessing capabilities and calibration link were demonstrated for ground-water modeling in southwest Kansas.
Presentation planning using an integrated knowledge base
NASA Technical Reports Server (NTRS)
Arens, Yigal; Miller, Lawrence; Sondheimer, Norman
1988-01-01
A description is given of user interface research aimed at bringing together multiple input and output modes in a way that handles mixed mode input (commands, menus, forms, natural language), interacts with a diverse collection of underlying software utilities in a uniform way, and presents the results through a combination of output modes including natural language text, maps, charts and graphs. The system, Integrated Interfaces, derives much of its ability to interact uniformly with the user and the underlying services and to build its presentations, from the information present in a central knowledge base. This knowledge base integrates models of the application domain (Navy ships in the Pacific region, in the current demonstration version); the structure of visual displays and their graphical features; the underlying services (data bases and expert systems); and interface functions. The emphasis is on a presentation planner that uses the knowledge base to produce multi-modal output. There has been a flurry of recent work in user interface management systems. (Several recent examples are listed in the references). Existing work is characterized by an attempt to relieve the software designer of the burden of handcrafting an interface for each application. The work has generally focused on intelligently handling input. This paper deals with the other end of the pipeline - presentations.
Kelly, Spencer D.; Hirata, Yukari; Manansala, Michael; Huang, Jessica
2014-01-01
Co-speech hand gestures are a type of multimodal input that has received relatively little attention in the context of second language learning. The present study explored the role that observing and producing different types of gestures plays in learning novel speech sounds and word meanings in an L2. Naïve English-speakers were taught two components of Japanese—novel phonemic vowel length contrasts and vocabulary items comprised of those contrasts—in one of four different gesture conditions: Syllable Observe, Syllable Produce, Mora Observe, and Mora Produce. Half of the gestures conveyed intuitive information about syllable structure, and the other half, unintuitive information about Japanese mora structure. Within each Syllable and Mora condition, half of the participants only observed the gestures that accompanied speech during training, and the other half also produced the gestures that they observed along with the speech. The main finding was that participants across all four conditions had similar outcomes in two different types of auditory identification tasks and a vocabulary test. The results suggest that hand gestures may not be well suited for learning novel phonetic distinctions at the syllable level within a word, and thus, gesture-speech integration may break down at the lowest levels of language processing and learning. PMID:25071646
Van Lierde, K M; Mortier, G; Huysman, E; Vermeersch, H
2010-03-01
The purpose of the present case study was to determine the long-term impact of partial glossectomy (using the keyhole technique) on overall speech intelligibility and articulation in a Dutch-speaking child with Beckwith-Wiedemann syndrome (BWS). Furthermore the present study is meant as a contribution to the further delineation of the phonation, resonance, articulation and language characteristics and oral behaviour in a child with BWS. Detailed information on the speech and language characteristics of children with BWS may lead to better guidance of pediatric management programs. The child's speech was assessed 9 years after partial glossectomy with regard to ENT characteristics, overall intelligibility (perceptual consensus evaluation), articulation (phonetic and phonological errors), voice (videostroboscopy, vocal quality), resonance (perceptual, nasometric assessment), language (expressive and receptive) and oral behaviour. A class III malocclusion, an anterior open bite, diastema, overangulation of lower incisors and an enlarged but normal symmetric shaped tongue were present. The overall speech intelligibility improved from severely impaired (presurgical) to slightly impaired (5 months post-glossectomy) to normal (9 years postoperative). Comparative phonetic inventory showed a remarkable improvement of articulation. Nine years post-glossectomy three types of distortions seemed to predominate: a rhotacism and sigmatism and the substitution of the alveolar /z/. Oral behaviour, vocal characteristics and resonance were normal, but problems with expressive syntactic abilities were present. The long-term impact of partial glossectomy, using the keyhole technique (preserving the vascularity and the nervous input of the remaining intrinsic tongue muscles), on speech intelligibility, articulation, and oral behaviour in this Dutch-speaking child with congenital macroglossia can be regarded as successful. It is not clear how these expressive syntactical problems demonstrated in this child can be explained. Certainly they are not part of a more general developmental delay, hearing problems or cognitive malfunctioning. To what extent the presence of expressive syntactical problems is a possible aspect of the phenotypic spectrum of children with BWS is subject for further research. Multiple variables, both known and unknown can affect the long-term outcome after partial glossectomy in a child with BWS. The timing and type of the surgical technique, hearing and cognitive functioning are known variables in this study. But variables such as children's motivation, the contribution of the motor-oriented speech therapy, the parental articulation input and stimulation and other family, school and community factors are unknown and are all factors which can influence speech outcome after partial glossectomy. Detailed analyses in a greater number of subjects with BWS may help further illustrate the long-term impact of partial glossectomy. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.
Distributed neural signatures of natural audiovisual speech and music in the human auditory cortex.
Salmi, Juha; Koistinen, Olli-Pekka; Glerean, Enrico; Jylänki, Pasi; Vehtari, Aki; Jääskeläinen, Iiro P; Mäkelä, Sasu; Nummenmaa, Lauri; Nummi-Kuisma, Katarina; Nummi, Ilari; Sams, Mikko
2017-08-15
During a conversation or when listening to music, auditory and visual information are combined automatically into audiovisual objects. However, it is still poorly understood how specific types of visual information shape neural processing of sounds in lifelike stimulus environments. Here we applied multi-voxel pattern analysis to investigate how naturally matching visual input modulates supratemporal cortex activity during processing of naturalistic acoustic speech, singing and instrumental music. Bayesian logistic regression classifiers with sparsity-promoting priors were trained to predict whether the stimulus was audiovisual or auditory, and whether it contained piano playing, speech, or singing. The predictive performances of the classifiers were tested by leaving one participant out at a time for testing and training the model using the remaining 15 participants. The signature patterns associated with unimodal auditory stimuli encompassed distributed locations mostly in the middle and superior temporal gyrus (STG/MTG). A pattern regression analysis, based on a continuous acoustic model, revealed that activity in some of these MTG and STG areas was associated with acoustic features present in speech and music stimuli. Concurrent visual stimulation modulated activity in bilateral MTG (speech), the lateral aspect of the right anterior STG (singing), and bilateral parietal opercular cortex (piano). Our results suggest that specific supratemporal brain areas are involved in processing complex natural speech, singing, and piano playing, and that other brain areas located in anterior (facial speech) and posterior (music-related hand actions) supratemporal cortex are influenced by related visual information. Those anterior and posterior supratemporal areas have been linked to stimulus identification and sensory-motor integration, respectively. Copyright © 2017 Elsevier Inc. All rights reserved.
[Influence of mixing ratios of a FM-system on speech understanding of CI-users].
Hey, M; Anft, D; Hocke, T; Scholz, G; Hessel, H; Begall, K
2009-05-01
In the classroom there are two major acoustic situations: first, the "teacher is talking" situation, disturbed by pupils making noise, and second, the "pupil is talking" situation, disturbed by other pupils. The understanding of words and sentences by hearing-impaired patients with a cochlear implant (CI) in noisy situations can be improved by using an FM system. The aim of this study was to test speech understanding as a function of the mixing ratio between the FM input and the microphone input to the speech processor under different circumstances. Speech understanding was evaluated using the adaptive Oldenburger sentence test (OLSA) in background noise. CI patients used the Microlink FM system for Freedom CIs together with a Campus transmitter (Phonak AG). 17 postlingually deafened adults were tested, using unilateral Freedom cochlear implant systems (Cochlear Ltd). A group of eight normally hearing adults served as a control group in the same setup. The median value of L(50) = 1.6 dB in CI patients without an FM system was higher than the median value of L(50) = -13 dB in normally hearing subjects. Sentence recognition in CI patients with the FM system increased with increasing mixing ratio. Using the FM system to understand the teacher provided a clear benefit at every mixing ratio. The difference between the L(50) values with and without the FM system was 15 dB for the 3:1 mixing ratio (FM to microphone). Taking into account an increase of 15% per dB in the OLSA (at L(50)) in CI patients, this 15 dB difference corresponds to a calculated advantage of 225%. Speech understanding in the second condition ("pupil is talking"), however, remained nearly the same at all mixing ratios used; the calculations showed no statistical difference between this situation with and without an FM system. The two investigated listening conditions therefore showed different results: understanding in the "teacher is talking" situation increased with increasing mixing ratio (FM to microphone), whereas understanding in the "pupil is talking" situation remained at the same level. No single FM setting was optimal for both listening conditions, which leads to different recommendations for different listening conditions. All patients showed increased speech understanding in noisy environments. This result strongly encourages the use of an FM system in the classroom.
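A quick arithmetic check of the benefit estimate reported in the abstract above, using only the figures given there (a 15 dB shift in L(50) and a slope of about 15 percentage points per dB on the OLSA function near L(50)):

```python
# Worked check of the benefit calculation reported above: the 3:1 FM-to-microphone
# mixing ratio shifted the 50% speech reception threshold (L50) by 15 dB, and the
# OLSA intelligibility function is taken to rise by about 15 percentage points per dB.
delta_l50_db = 15          # dB improvement in L50 with the FM system
slope_percent_per_db = 15  # OLSA slope near L50 (percentage points per dB)

calculated_advantage = delta_l50_db * slope_percent_per_db
print(f"Calculated advantage: {calculated_advantage} %")  # 225 %
```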
Neural representations and mechanisms for the performance of simple speech sequences
Bohland, Jason W.; Bullock, Daniel; Guenther, Frank H.
2010-01-01
Speakers plan the phonological content of their utterances prior to their release as speech motor acts. Using a finite alphabet of learned phonemes and a relatively small number of syllable structures, speakers are able to rapidly plan and produce arbitrary syllable sequences that fall within the rules of their language. The class of computational models of sequence planning and performance termed competitive queuing (CQ) models have followed Lashley (1951) in assuming that inherently parallel neural representations underlie serial action, and this idea is increasingly supported by experimental evidence. In this paper we develop a neural model that extends the existing DIVA model of speech production in two complementary ways. The new model includes paired structure and content subsystems (cf. MacNeilage, 1998) that provide parallel representations of a forthcoming speech plan, as well as mechanisms for interfacing these phonological planning representations with learned sensorimotor programs to enable stepping through multi-syllabic speech plans. On the basis of previous reports, the model’s components are hypothesized to be localized to specific cortical and subcortical structures, including the left inferior frontal sulcus, the medial premotor cortex, the basal ganglia and thalamus. The new model, called GODIVA (Gradient Order DIVA), thus fills a void in current speech research by providing formal mechanistic hypotheses about both phonological and phonetic processes that are grounded by neuroanatomy and physiology. This framework also generates predictions that can be tested in future neuroimaging and clinical case studies. PMID:19583476
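The competitive queuing idea referenced above can be sketched in a few lines: a parallel activation gradient over planned items is read out serially by repeatedly selecting the most active item and then suppressing it. This is a generic CQ loop with invented activations, not the GODIVA model itself:

```python
import numpy as np

def competitive_queuing(plan_activations, labels):
    """Serial readout of a parallel plan: repeatedly select the most active
    item, output it, then suppress it (a minimal competitive-queuing loop)."""
    act = np.array(plan_activations, dtype=float)
    order = []
    while np.any(act > 0):
        winner = int(np.argmax(act))    # competitive choice layer
        order.append(labels[winner])
        act[winner] = 0.0               # self-inhibition after production
    return order

# Hypothetical activation gradient over three planned syllables.
print(competitive_queuing([0.9, 0.6, 0.3], ["go", "di", "va"]))  # ['go', 'di', 'va']
```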
Software handlers for process interfaces
NASA Technical Reports Server (NTRS)
Bercaw, R. W.
1976-01-01
The principles involved in the development of software handlers for custom interfacing problems are discussed. Handlers for the CAMAC standard are examined in detail. The types of transactions that must be supported have been established by standards groups, eliminating conflicting requirements arising out of different design philosophies and applications. Implementation of the standard handlers has been facilitated by standardization of hardware. The necessary local processors can be placed in the handler when it is written or at run time by means of input/output directives, or they can be built into a high-performance input/output processor. The full benefits of these process interfaces will only be realized when software requirements are incorporated uniformly into the hardware.
NASA Technical Reports Server (NTRS)
Adams, Richard J.
2015-01-01
The patent-pending Glove-Enabled Computer Operations (GECO) design leverages extravehicular activity (EVA) glove design features as platforms for instrumentation and tactile feedback, enabling the gloves to function as human-computer interface devices. Flexible sensors in each finger enable control inputs that can be mapped to any number of functions (e.g., a mouse click, a keyboard strike, or a button press). Tracking of hand motion is interpreted alternatively as movement of a mouse (change in cursor position on a graphical user interface) or a change in hand position on a virtual keyboard. Programmable vibro-tactile actuators aligned with each finger enrich the interface by creating the haptic sensations associated with control inputs, such as recoil of a button press.
Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.
Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina
2015-07-01
It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. These results are in line with the 'auditory-visual view' of auditory speech perception, which assumes that auditory speech recognition is optimized by using predictions from previously encoded speaker-specific audio-visual internal models. Copyright © 2015 Elsevier Ltd. All rights reserved.
Remote direct memory access over datagrams
Grant, Ryan Eric; Rashti, Mohammad Javad; Balaji, Pavan; Afsahi, Ahmad
2014-12-02
A communication stack for providing remote direct memory access (RDMA) over a datagram network is disclosed. The communication stack has a user level interface configured to accept datagram related input and communicate with an RDMA enabled network interface card (NIC) via an NIC driver. The communication stack also has an RDMA protocol layer configured to supply one or more data transfer primitives for the datagram related input of the user level. The communication stack further has a direct data placement (DDP) layer configured to transfer the datagram related input from a user storage to a transport layer based on the one or more data transfer primitives by way of a lower layer protocol (LLP) over the datagram network.
Scarbel, Lucie; Beautemps, Denis; Schwartz, Jean-Luc; Sato, Marc
2014-01-01
One classical argument in favor of a functional role of the motor system in speech perception comes from the close-shadowing task in which a subject has to identify and to repeat as quickly as possible an auditory speech stimulus. The fact that close-shadowing can occur very rapidly and much faster than manual identification of the speech target is taken to suggest that perceptually induced speech representations are already shaped in a motor-compatible format. Another argument is provided by audiovisual interactions often interpreted as referring to a multisensory-motor framework. In this study, we attempted to combine these two paradigms by testing whether the visual modality could speed motor response in a close-shadowing task. To this aim, both oral and manual responses were evaluated during the perception of auditory and audiovisual speech stimuli, clear or embedded in white noise. Overall, oral responses were faster than manual ones, but it also appeared that they were less accurate in noise, which suggests that motor representations evoked by the speech input could be rough at a first processing stage. In the presence of acoustic noise, the audiovisual modality led to both faster and more accurate responses than the auditory modality. No interaction was however, observed between modality and response. Altogether, these results are interpreted within a two-stage sensory-motor framework, in which the auditory and visual streams are integrated together and with internally generated motor representations before a final decision may be available. PMID:25009512
Deep bottleneck features for spoken language identification.
Jiang, Bing; Song, Yan; Wei, Si; Liu, Jun-Hua; McLoughlin, Ian Vince; Dai, Li-Rong
2014-01-01
A key problem in spoken language identification (LID) is to design effective representations which are specific to language information. For example, in recent years, representations based on both phonotactic and acoustic features have proven their effectiveness for LID. Although advances in machine learning have led to significant improvements, LID performance is still lacking, especially for short duration speech utterances. With the hypothesis that language information is weak and represented only latently in speech, and is largely dependent on the statistical properties of the speech content, existing representations may be insufficient. Furthermore they may be susceptible to the variations caused by different speakers, specific content of the speech segments, and background noise. To address this, we propose using Deep Bottleneck Features (DBF) for spoken LID, motivated by the success of Deep Neural Networks (DNN) in speech recognition. We show that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability. To evaluate the effectiveness of this, we design two acoustic models, termed DBF-TV and parallel DBF-TV (PDBF-TV), using a DBF based i-vector representation for each speech utterance. Results on NIST language recognition evaluation 2009 (LRE09) show significant improvements over state-of-the-art systems. By fusing the output of phonotactic and acoustic approaches, we achieve an EER of 1.08%, 1.89% and 7.01% for 30 s, 10 s and 3 s test utterances respectively. Furthermore, various DBF configurations have been extensively evaluated, and an optimal system proposed.
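As a hedged illustration of what a deep bottleneck feature extractor looks like structurally, the sketch below forward-propagates acoustic frames through a feedforward network and returns the activations of a narrow hidden layer. The layer sizes are hypothetical and the weights are random stand-ins; in the paper's setting they would come from a DNN trained for speech recognition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 39-dim acoustic frames -> wide hidden layers with a
# narrow 40-unit "bottleneck" in the middle. Weights are random here; in practice
# they come from a DNN trained on a phone/senone classification task.
sizes = [39, 1024, 1024, 40, 1024]          # the 40-unit layer is the bottleneck
weights = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
BOTTLENECK_LAYER = 3                        # index of the bottleneck hidden layer

def deep_bottleneck_features(frames):
    """Forward-propagate acoustic frames and return activations of the
    bottleneck layer as the frame-level feature representation."""
    h = np.asarray(frames, dtype=float)
    for i, (w, b) in enumerate(zip(weights, biases), start=1):
        h = np.maximum(0.0, h @ w + b)      # ReLU hidden layers
        if i == BOTTLENECK_LAYER:
            return h                        # stop at the bottleneck
    return h

utterance = rng.standard_normal((200, 39))  # 200 frames of 39-dim features
dbf = deep_bottleneck_features(utterance)
print(dbf.shape)                            # (200, 40) compact frame representation
```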
A Binaural Grouping Model for Predicting Speech Intelligibility in Multitalker Environments.
Mi, Jing; Colburn, H Steven
2016-10-03
Spatially separating speech maskers from target speech often leads to a large intelligibility improvement. Modeling this phenomenon has long been of interest to binaural-hearing researchers for uncovering brain mechanisms and for improving signal-processing algorithms in hearing-assistive devices. Much of the previous binaural modeling work focused on the unmasking enabled by binaural cues at the periphery, and little quantitative modeling has been directed toward the grouping or source-separation benefits of binaural processing. In this article, we propose a binaural model that focuses on grouping, specifically on the selection of time-frequency units that are dominated by signals from the direction of the target. The proposed model uses Equalization-Cancellation (EC) processing with a binary decision rule to estimate a time-frequency binary mask. EC processing is carried out to cancel the target signal and the energy change between the EC input and output is used as a feature that reflects target dominance in each time-frequency unit. The processing in the proposed model requires little computational resources and is straightforward to implement. In combination with the Coherence-based Speech Intelligibility Index, the model is applied to predict the speech intelligibility data measured by Marrone et al. The predicted speech reception threshold matches the pattern of the measured data well, even though the predicted intelligibility improvements relative to the colocated condition are larger than some of the measured data, which may reflect the lack of internal noise in this initial version of the model. © The Author(s) 2016.
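A minimal sketch of the mask-estimation idea described above: cancel the target with EC processing, measure the energy drop in each time-frequency unit, and keep units where the drop is large (target-dominated). It assumes a frontal target so that equalization needs no interaural gain or delay, and the threshold value is hypothetical; this is not the published model's implementation.

```python
import numpy as np

def ec_binary_mask(left_tf, right_tf, threshold_db=3.0):
    """Estimate a time-frequency binary mask from the energy change produced by
    Equalization-Cancellation processing aimed at the target direction.

    left_tf, right_tf : complex STFT matrices (freq x time) at the two ears.
    Assumes a frontal target, so equalization is trivial (no gain or delay)
    and cancellation is a simple subtraction of the two ear signals.
    """
    energy_in = 0.5 * (np.abs(left_tf) ** 2 + np.abs(right_tf) ** 2)
    ec_out = left_tf - right_tf                  # cancels a frontal target
    energy_out = np.abs(ec_out) ** 2
    eps = 1e-12
    energy_change_db = 10.0 * np.log10((energy_in + eps) / (energy_out + eps))
    # Units where cancelling the target removes a lot of energy are
    # target-dominated and are kept in the mask.
    return energy_change_db > threshold_db

# Hypothetical example: a target-dominated unit (nearly identical at both ears)
# and a masker-dominated unit (decorrelated across ears).
L = np.array([[1.0 + 0.0j, 0.9 + 0.3j]])
R = np.array([[0.98 + 0.02j, -0.5 + 0.7j]])
print(ec_binary_mask(L, R))   # [[ True False]]
```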
Davidson, Lisa S; Geers, Ann E; Brenner, Christine
2010-10-01
Updated cochlear implant technology and optimized fitting can have a substantial impact on speech perception. The effects of upgrades in processor technology and aided thresholds on word recognition at soft input levels and sentence recognition in noise were examined. We hypothesized that updated speech processors and lower aided thresholds would allow improved recognition of soft speech without compromising performance in noise. 109 teenagers who had used a Nucleus 22 cochlear implant since preschool were tested with their current speech processor(s) (101 unilateral and 8 bilateral): 13 used the Spectra, 22 the ESPrit 22, 61 the ESPrit 3G, and 13 the Freedom. The Lexical Neighborhood Test (LNT) was administered at 70 and 50 dB SPL, and the Bamford-Kowal-Bench (BKB) sentences were administered in quiet and in noise. Aided thresholds were obtained for frequency-modulated tones from 250 to 4,000 Hz. Results were analyzed using repeated measures analysis of variance. Aided thresholds for the Freedom/3G group were significantly lower (better) than for the Spectra/Sprint group. LNT scores at 50 dB SPL were significantly higher for the Freedom/3G group. No significant differences between the two groups were found for the LNT at 70 dB SPL or for sentences in quiet or in noise. Adolescents using updated processors that allowed for aided detection thresholds of 30 dB HL or better performed best at soft levels. The BKB-in-noise results suggest that greater access to soft speech does not compromise listening in noise.
Spatial Learning Using Locomotion Interface to Virtual Environment
ERIC Educational Resources Information Center
Patel, K. K.; Vij, S.
2012-01-01
The inability to navigate independently and interact with the wider world is one of the most significant handicaps that can be caused by blindness, second only to the inability to communicate through reading and writing. Many difficulties are encountered when visually impaired people (VIP) need to visit new and unknown places. Current speech or…
ERIC Educational Resources Information Center
Kolehmainen, Leena; Skaffari, Janne
2016-01-01
This article serves as an introduction to a collection of four articles on multilingual practices in speech and writing, exploring both contemporary and historical sources. It not only introduces the articles but also discusses the scope and definitions of code-switching, attitudes towards multilingual interaction and, most pertinently, the…
Prediction, Performance, and Promise: Perspective on Time-Shortened Degree Programs.
ERIC Educational Resources Information Center
Smart, John M., Ed.; Howard, Toni A., Ed.
Among the papers and presentations are: the keynote speech (E. Alden Dunham); the quality baccalaureate myth (Richard Giardina); the high school/college interface and time-shortening (panel presentation); restructuring the baccalaureate: a follow-up study (Robert Bersi); a point of view (Richard Meisler); more options: less time? (DeVere E.…
The Function of Gesture in Lexically Focused L2 Instructional Conversations
ERIC Educational Resources Information Center
Smotrova, Tetyana; Lantolf, James P.
2013-01-01
The purpose of the present study is to investigate the mediational function of the gesture-speech interface in the instructional conversation that emerged as teachers attempted to explain the meaning of English words to their students in two EFL classrooms in the Ukraine. Its analytical framework is provided by Vygotsky's sociocultural psychology…
Wang, Nancy X. R.; Olson, Jared D.; Ojemann, Jeffrey G.; Rao, Rajesh P. N.; Brunton, Bingni W.
2016-01-01
Fully automated decoding of human activities and intentions from direct neural recordings is a tantalizing challenge in brain-computer interfacing. Implementing Brain Computer Interfaces (BCIs) outside carefully controlled experiments in laboratory settings requires adaptive and scalable strategies with minimal supervision. Here we describe an unsupervised approach to decoding neural states from naturalistic human brain recordings. We analyzed continuous, long-term electrocorticography (ECoG) data recorded over many days from the brain of subjects in a hospital room, with simultaneous audio and video recordings. We discovered coherent clusters in high-dimensional ECoG recordings using hierarchical clustering and automatically annotated them using speech and movement labels extracted from audio and video. To our knowledge, this represents the first time techniques from computer vision and speech processing have been used for natural ECoG decoding. Interpretable behaviors were decoded from ECoG data, including moving, speaking and resting; the results were assessed by comparison with manual annotation. Discovered clusters were projected back onto the brain revealing features consistent with known functional areas, opening the door to automated functional brain mapping in natural settings. PMID:27148018
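The unsupervised decode-then-annotate pipeline described above can be sketched as: cluster windowed neural feature vectors without labels, then name each cluster by the majority behavior label obtained from the audio/video annotations. The data below are synthetic, and Ward-linkage hierarchical clustering is one assumed choice, not necessarily the variant used in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from collections import Counter

rng = np.random.default_rng(1)

# Hypothetical data: one feature vector (e.g., band power per electrode) per
# time window, plus a behavior label per window extracted from audio/video.
features = rng.standard_normal((300, 20))
features[100:200] += 2.0                       # pretend a distinct "speaking" state
behavior = np.array(["rest"] * 100 + ["speak"] * 100 + ["move"] * 100)

# Unsupervised step: hierarchical (Ward) clustering of the neural windows.
Z = linkage(features, method="ward")
clusters = fcluster(Z, t=3, criterion="maxclust")

# Annotation step: name each discovered cluster by its majority behavior label.
for c in np.unique(clusters):
    labels_in_cluster = behavior[clusters == c]
    name, count = Counter(labels_in_cluster).most_common(1)[0]
    purity = count / len(labels_in_cluster)
    print(f"cluster {c}: majority label '{name}' ({purity:.0%} of windows)")
```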
Rosemann, Stephanie; Thiel, Christiane M
2018-07-15
Hearing loss is associated with difficulties in understanding speech, especially under adverse listening conditions. In these situations, seeing the speaker improves speech intelligibility in hearing-impaired participants. On the neuronal level, previous research has shown cross-modal plastic reorganization in the auditory cortex following hearing loss leading to altered processing of auditory, visual and audio-visual information. However, how reduced auditory input effects audio-visual speech perception in hearing-impaired subjects is largely unknown. We here investigated the impact of mild to moderate age-related hearing loss on processing audio-visual speech using functional magnetic resonance imaging. Normal-hearing and hearing-impaired participants performed two audio-visual speech integration tasks: a sentence detection task inside the scanner and the McGurk illusion outside the scanner. Both tasks consisted of congruent and incongruent audio-visual conditions, as well as auditory-only and visual-only conditions. We found a significantly stronger McGurk illusion in the hearing-impaired participants, which indicates stronger audio-visual integration. Neurally, hearing loss was associated with an increased recruitment of frontal brain areas when processing incongruent audio-visual, auditory and also visual speech stimuli, which may reflect the increased effort to perform the task. Hearing loss modulated both the audio-visual integration strength measured with the McGurk illusion and brain activation in frontal areas in the sentence task, showing stronger integration and higher brain activation with increasing hearing loss. Incongruent compared to congruent audio-visual speech revealed an opposite brain activation pattern in left ventral postcentral gyrus in both groups, with higher activation in hearing-impaired participants in the incongruent condition. Our results indicate that already mild to moderate hearing loss impacts audio-visual speech processing accompanied by changes in brain activation particularly involving frontal areas. These changes are modulated by the extent of hearing loss. Copyright © 2018 Elsevier Inc. All rights reserved.
Subcortical processing of speech regularities underlies reading and music aptitude in children
2011-01-01
Background: Neural sensitivity to acoustic regularities supports fundamental human behaviors such as hearing in noise and reading. Although the failure to encode acoustic regularities in ongoing speech has been associated with language and literacy deficits, how auditory expertise, such as the expertise that is associated with musical skill, relates to the brainstem processing of speech regularities is unknown. An association between musical skill and neural sensitivity to acoustic regularities would not be surprising given the importance of repetition and regularity in music. Here, we aimed to define relationships between the subcortical processing of speech regularities, music aptitude, and reading abilities in children with and without reading impairment. We hypothesized that, in combination with auditory cognitive abilities, neural sensitivity to regularities in ongoing speech provides a common biological mechanism underlying the development of music and reading abilities. Methods: We assessed auditory working memory and attention, music aptitude, reading ability, and neural sensitivity to acoustic regularities in 42 school-aged children with a wide range of reading ability. Neural sensitivity to acoustic regularities was assessed by recording brainstem responses to the same speech sound presented in predictable and variable speech streams. Results: Through correlation analyses and structural equation modeling, we reveal that music aptitude and literacy both relate to the extent of subcortical adaptation to regularities in ongoing speech as well as to auditory working memory and attention. Relationships between music and speech processing are specifically driven by performance on a musical rhythm task, underscoring the importance of rhythmic regularity for both language and music. Conclusions: These data indicate common brain mechanisms underlying reading and music abilities that relate to how the nervous system responds to regularities in auditory input. Definition of common biological underpinnings for music and reading supports the usefulness of music for promoting child literacy, with the potential to improve reading remediation. PMID:22005291
Repetition across Successive Sentences Facilitates Young Children's Word Learning
ERIC Educational Resources Information Center
Schwab, Jessica F.; Lew-Williams, Casey
2016-01-01
Young children who hear more child-directed speech (CDS) tend to have larger vocabularies later in childhood, but the specific characteristics of CDS underlying this link are currently underspecified. The present study sought to elucidate how the structure of language input boosts learning by investigating whether repetition of object labels in…
Implicit Language Learning: Adults' Ability to Segment Words in Norwegian
ERIC Educational Resources Information Center
Kittleson, Megan M.; Aguilar, Jessica M.; Tokerud, Gry Line; Plante, Elena; Asbjornsen, Arve E.
2010-01-01
Previous language learning research reveals that the statistical properties of the input offer sufficient information to allow listeners to segment words from fluent speech in an artificial language. The current pair of studies uses a natural language to test the ecological validity of these findings and to determine whether a listener's language…
ERIC Educational Resources Information Center
D'Mello, Sidney K.; Dowell, Nia; Graesser, Arthur
2011-01-01
There is the question of whether learning differs when students speak versus type their responses when interacting with intelligent tutoring systems with natural language dialogues. Theoretical bases exist for three contrasting hypotheses. The "speech facilitation" hypothesis predicts that spoken input will "increase" learning,…
Ambiguity in Speaking Chemistry and Other STEM Content: Educational Implications
ERIC Educational Resources Information Center
Isaacson, Mick D.; Michaels, Michelle
2015-01-01
Ambiguity in speech is a possible barrier to the acquisition of knowledge for students who have print disabilities (such as blindness, visual impairments, and some specific learning disabilities) and rely on auditory input for learning. Chemistry appears to have considerable potential for being spoken ambiguously and may be a barrier to accessing…
ERIC Educational Resources Information Center
Kabadayi, Abdulkadir
2006-01-01
Language, as is known, is acquired under certain conditions: rapid and sequential brain maturation and cognitive development, the need to exchange information and to control others' actions, and an exposure to appropriate speech input. This research aims at analyzing preschoolers' overgeneralizations of the object labeling process in different…
Audio Frequency Analysis in Mobile Phones
ERIC Educational Resources Information Center
Aguilar, Horacio Munguía
2016-01-01
A new experiment using mobile phones is proposed in which its audio frequency response is analyzed using the audio port for inputting external signal and getting a measurable output. This experiment shows how the limited audio bandwidth used in mobile telephony is the main cause of the poor speech quality in this service. A brief discussion is…
He Said, She Said: Effects of Bilingualism on Cross-Talker Word Recognition in Infancy
ERIC Educational Resources Information Center
Singh, Leher
2018-01-01
The purpose of the current study was to examine effects of bilingual language input on infant word segmentation and on talker generalization. In the present study, monolingually and bilingually exposed infants were compared on their abilities to recognize familiarized words in speech and to maintain generalizable representations of familiarized…
Are Some Parents' Interaction Styles Associated with Richer Grammatical Input?
ERIC Educational Resources Information Center
Fitzgerald, Colleen E.; Hadley, Pamela A.; Rispoli, Matthew
2013-01-01
Purpose: Evidence for tense marking in child-directed speech varies both across languages (Guasti, 2002; Legate & Yang, 2007) and across speakers of a single language (Hadley, Rispoli, Fitzgerald, & Bahnsen, 2011). The purpose of this study was to understand how parent interaction styles and register use overlap with the tense-marking…
Rapid and automatic speech-specific learning mechanism in human neocortex.
Kimppa, Lilli; Kujala, Teija; Leminen, Alina; Vainio, Martti; Shtyrov, Yury
2015-09-01
A unique feature of human communication system is our ability to rapidly acquire new words and build large vocabularies. However, its neurobiological foundations remain largely unknown. In an electrophysiological study optimally designed to probe this rapid formation of new word memory circuits, we employed acoustically controlled novel word-forms incorporating native and non-native speech sounds, while manipulating the subjects' attention on the input. We found a robust index of neurolexical memory-trace formation: a rapid enhancement of the brain's activation elicited by novel words during a short (~30min) perceptual exposure, underpinned by fronto-temporal cortical networks, and, importantly, correlated with behavioural learning outcomes. Crucially, this neural memory trace build-up took place regardless of focused attention on the input or any pre-existing or learnt semantics. Furthermore, it was found only for stimuli with native-language phonology, but not for acoustically closely matching non-native words. These findings demonstrate a specialised cortical mechanism for rapid, automatic and phonology-dependent formation of neural word memory circuits. Copyright © 2015. Published by Elsevier Inc.
Ultrasonic speech translator and communications system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Akerman, M.A.; Ayers, C.W.; Haynes, H.D.
1996-07-23
A wireless communication system undetectable by radio frequency methods for converting audio signals, including human voice, to electronic signals in the ultrasonic frequency range, transmitting the ultrasonic signal by way of acoustical pressure waves across a carrier medium, including gases, liquids, or solids, and reconverting the ultrasonic acoustical pressure waves back to the original audio signal. The ultrasonic speech translator and communication system includes an ultrasonic transmitting device and an ultrasonic receiving device. The ultrasonic transmitting device accepts as input an audio signal such as human voice input from a microphone or tape deck. The ultrasonic transmitting device frequency modulates an ultrasonic carrier signal with the audio signal producing a frequency modulated ultrasonic carrier signal, which is transmitted via acoustical pressure waves across a carrier medium such as gases, liquids or solids. The ultrasonic receiving device converts the frequency modulated ultrasonic acoustical pressure waves to a frequency modulated electronic signal, demodulates the audio signal from the ultrasonic carrier signal, and conditions the demodulated audio signal to reproduce the original audio signal at its output. 7 figs.
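The core signal path described in this patent abstract, frequency-modulating an audio signal onto an ultrasonic carrier and recovering it at the receiver, can be sketched as a digital simulation. Sample rate, carrier frequency, and frequency deviation below are hypothetical, and the demodulation step is simplified for brevity:

```python
import numpy as np

fs = 192_000                     # sample rate high enough for an ultrasonic carrier
t = np.arange(0, 0.05, 1 / fs)   # 50 ms of signal
audio = 0.5 * np.sin(2 * np.pi * 440 * t)      # stand-in for a voice signal

fc = 40_000                      # ultrasonic carrier frequency (Hz)
kf = 8_000                       # frequency deviation per unit amplitude (Hz)

# Frequency modulation: instantaneous phase is the running sum of fc + kf*audio.
phase = 2 * np.pi * np.cumsum(fc + kf * audio) / fs
fm_signal = np.cos(phase)        # what would drive the ultrasonic emitter

# Demodulation sketch: recover the instantaneous frequency from the phase
# derivative (computed here directly from the known phase for brevity).
inst_freq = np.diff(phase) * fs / (2 * np.pi)
recovered = (inst_freq - fc) / kf              # approximates the original audio
print(np.max(np.abs(recovered - audio[1:])) < 0.05)   # True: close match
```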
Voice recognition products-an occupational risk for users with ULDs?
Williams, N R
2003-10-01
Voice recognition systems (VRS) allow speech to be converted directly into text, which appears on the screen of a computer, and to direct equipment to perform specific functions. Suggested applications are many and varied, including increasing efficiency in the reporting of radiographs, allowing directed surgery and enabling individuals with upper limb disorders (ULDs) who cannot use other input devices, such as keyboards and mice, to carry out word processing and other activities. Aim: This paper describes four cases of vocal dysfunction related to the use of such software, which have been identified from the database of the Voice and Speech Laboratory of the Massachusetts Eye and Ear Infirmary (MEEI). The database was searched using the key words 'voice recognition' and four cases were identified from a total of 4800. In all cases, the VRS was supplied to assist individuals with ULDs who could not use conventional input devices. Case reports illustrate the time of onset and the symptoms experienced. The cases illustrate the need for risk assessment and consideration of the ergonomic aspects of voice use prior to such adaptations being used, particularly in those who already experience work-related ULDs.
Music and hearing aids--an introduction.
Chasin, Marshall
2012-09-01
Modern digital hearing aids have provided improved fidelity over those of earlier decades for speech. The same, however, cannot be said for music. Most modern hearing aids have a limitation of their "front end," which comprises the analog-to-digital (A/D) converter. For a number of reasons, the spectral nature of music as an input to a hearing aid is beyond the optimal operating conditions of the "front end" components. Amplified music tends to be of rather poor fidelity. Once the music signal is distorted, no amount of software manipulation that occurs later in the circuitry can improve things. The solution is not a software issue. Some characteristics of music that make it difficult to be transduced without significant distortion include an increased sound level relative to that of speech, and the crest factor, the difference in dB between the instantaneous peak of a signal and its RMS value. Clinical strategies and technical innovations have helped to improve the fidelity of amplified music and these include a reduction of the level of the input that is presented to the A/D converter.
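The crest factor mentioned above follows directly from its definition, the difference in dB between a signal's instantaneous peak and its RMS value. A small sketch with synthetic signals (a steady tone versus one with added transients):

```python
import numpy as np

def crest_factor_db(signal):
    """Crest factor: difference in dB between the instantaneous peak of a
    signal and its RMS value."""
    signal = np.asarray(signal, dtype=float)
    peak = np.max(np.abs(signal))
    rms = np.sqrt(np.mean(signal ** 2))
    return 20.0 * np.log10(peak / rms)

fs = 16_000
t = np.arange(0, 1.0, 1 / fs)
sine = np.sin(2 * np.pi * 440 * t)          # steady tone: crest factor ~ 3 dB
spiky = np.copy(sine)
spiky[::400] += 4.0                         # add sharp transients, as live music can
print(f"sine:  {crest_factor_db(sine):.1f} dB")
print(f"spiky: {crest_factor_db(spiky):.1f} dB")   # noticeably larger crest factor
```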
Ultrasonic speech translator and communications system
Akerman, M. Alfred; Ayers, Curtis W.; Haynes, Howard D.
1996-01-01
A wireless communication system undetectable by radio frequency methods for converting audio signals, including human voice, to electronic signals in the ultrasonic frequency range, transmitting the ultrasonic signal by way of acoustical pressure waves across a carrier medium, including gases, liquids, or solids, and reconverting the ultrasonic acoustical pressure waves back to the original audio signal. The ultrasonic speech translator and communication system (20) includes an ultrasonic transmitting device (100) and an ultrasonic receiving device (200). The ultrasonic transmitting device (100) accepts as input (115) an audio signal such as human voice input from a microphone (114) or tape deck. The ultrasonic transmitting device (100) frequency modulates an ultrasonic carrier signal with the audio signal producing a frequency modulated ultrasonic carrier signal, which is transmitted via acoustical pressure waves across a carrier medium such as gases, liquids or solids. The ultrasonic receiving device (200) converts the frequency modulated ultrasonic acoustical pressure waves to a frequency modulated electronic signal, demodulates the audio signal from the ultrasonic carrier signal, and conditions the demodulated audio signal to reproduce the original audio signal at its output (250).
NASA Astrophysics Data System (ADS)
Meiyanti, R.; Subandi, A.; Fuqara, N.; Budiman, M. A.; Siahaan, A. P. U.
2018-03-01
A singer does not just recite the lyrics of a song, but also uses particular sound techniques to make it more beautiful. In singing technique, females have more diverse voice registers than males. There are many registers of the human voice, but the voice registers used while singing include, among others, Chest Voice, Head Voice, Falsetto, and Vocal Fry. Research on speech recognition based on the female voice registers used in singing technique was built using Borland Delphi 7.0. The speech recognition process was performed on recorded voice samples as input and also in real time. Voice input results in weight energy values based on calculations using the Hankel Transformation method and Macdonald Functions. The results showed that the accuracy of the system depends on the accuracy of the sound technique that is trained and tested: the average recognition rate for recorded voice registers reached 48.75 percent, while the average recognition rate for voice registers in real time reached 57 percent.
Weisleder, Adriana; Waxman, Sandra R.
2010-01-01
Recent analyses have revealed that child-directed speech contains distributional regularities that could, in principle, support young children's discovery of distinct grammatical categories (noun, verb, adjective). In particular, a distributional unit known as the frequent frame appears to be especially informative (Mintz, 2003). However, analyses have focused almost exclusively on the distributional information available in English. Because languages differ considerably in how the grammatical forms are marked within utterances, the scarcity of cross-linguistic evidence represents an unfortunate gap. We therefore advance the developmental evidence by analyzing the distributional information available in frequent frames across two languages (Spanish and English), across sentence positions (phrase medial and phrase final), and across grammatical forms (noun, verb, adjective). We selected six parent-child corpora from the CHILDES database (3 English; 3 Spanish), and analyzed the input when children were 2;6 years or younger. In each language, frequent frames did indeed offer systematic cues to grammatical category assignment. We also identify differences in the accuracy of these frames across languages, sentence positions, and grammatical classes. PMID:19698207
Weisleder, Adriana; Waxman, Sandra R
2010-11-01
Recent analyses have revealed that child-directed speech contains distributional regularities that could, in principle, support young children's discovery of distinct grammatical categories (noun, verb, adjective). In particular, a distributional unit known as the frequent frame appears to be especially informative (Mintz, 2003). However, analyses have focused almost exclusively on the distributional information available in English. Because languages differ considerably in how the grammatical forms are marked within utterances, the scarcity of cross-linguistic evidence represents an unfortunate gap. We therefore advance the developmental evidence by analyzing the distributional information available in frequent frames across two languages (Spanish and English), across sentence positions (phrase medial and phrase final), and across grammatical forms (noun, verb, adjective). We selected six parent-child corpora from the CHILDES database (three English; three Spanish), and analyzed the input when children were aged 2;6 or younger. In each language, frequent frames did indeed offer systematic cues to grammatical category assignment. We also identify differences in the accuracy of these frames across languages, sentence positions and grammatical classes.
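A frequent frame, as used in the two abstracts above, is a pair of words that frequently co-occur with exactly one word intervening (A_x_B); the intervening words tend to fall into a single grammatical category. A toy sketch with invented child-directed utterances (not the CHILDES corpora or the authors' procedure):

```python
from collections import Counter, defaultdict

# Hypothetical child-directed utterances (tokenized).
utterances = [
    "you want to eat it now",
    "you have to go it is late",
    "do you want to drink it",
    "you need to sleep it is time",
]

frame_fillers = defaultdict(Counter)
for utt in utterances:
    words = utt.split()
    # A frame is a pair of words (A, B) co-occurring with exactly one word
    # intervening: A _x_ B.
    for a, x, b in zip(words, words[1:], words[2:]):
        frame_fillers[(a, b)][x] += 1

# Rank frames by how often they occur; the fillers of high-frequency frames
# tend to cluster into a single grammatical category (here, verbs in "to_x_it").
for frame, fillers in sorted(frame_fillers.items(),
                             key=lambda kv: -sum(kv[1].values()))[:3]:
    a, b = frame
    print(f"{a}_x_{b}: {dict(fillers)}")
```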
Yoon, Sung Hoon; Nam, Kyoung Won; Yook, Sunhyun; Cho, Baek Hwan; Jang, Dong Pyo; Hong, Sung Hwa; Kim, In Young
2017-03-01
In an effort to improve hearing aid users' satisfaction, recent studies on trainable hearing aids have attempted to implement one or two environmental factors into training. However, it would be more beneficial to train the device based on the owner's personal preferences across a broader range of environmental acoustic conditions. Our study aimed at developing a trainable hearing aid algorithm that can reflect the user's individual preferences across a more extensive set of environmental acoustic conditions (ambient sound level, listening situation, and degree of noise suppression) and at evaluating the perceptual benefit of the proposed algorithm. Ten normal hearing subjects participated in this study. Each subject trained the algorithm to their personal preference, and the trained data were used to record test sounds in three different settings, which were then used to evaluate the perceptual benefit of the proposed algorithm by performing the Comparison Mean Opinion Score test. Statistical analysis revealed that of the 10 subjects, four showed significant differences in amplification constant settings between the noise-only and speech-in-noise situations (P < 0.05), and one subject also showed a significant difference between the speech-only and speech-in-noise situations (P < 0.05). Additionally, every subject preferred different β settings for beamforming at all input sound levels. The positive findings from this study suggest that the proposed algorithm has the potential to improve hearing aid users' personal satisfaction under various ambient situations.
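One hedged way to picture a trainable scheme of the kind described above is a preference store keyed by listening situation and ambient-level bucket, updated from the user's adjustments and queried at runtime. Class name, bucket boundaries, and gain values below are all hypothetical, not the proposed algorithm itself:

```python
from collections import defaultdict

class TrainablePreferences:
    """Minimal sketch of storing a user's preferred amplification constant per
    acoustic condition (listening situation x ambient sound level bucket)."""

    def __init__(self, default_gain_db=10.0):
        self.default = default_gain_db
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    @staticmethod
    def _level_bucket(level_db_spl):
        return "low" if level_db_spl < 55 else "medium" if level_db_spl < 75 else "high"

    def train(self, situation, level_db_spl, preferred_gain_db):
        key = (situation, self._level_bucket(level_db_spl))
        self.sums[key] += preferred_gain_db
        self.counts[key] += 1

    def gain(self, situation, level_db_spl):
        key = (situation, self._level_bucket(level_db_spl))
        if self.counts[key] == 0:
            return self.default            # untrained condition: fall back
        return self.sums[key] / self.counts[key]

prefs = TrainablePreferences()
prefs.train("speech_in_noise", 68, preferred_gain_db=14.0)
prefs.train("speech_in_noise", 70, preferred_gain_db=16.0)
prefs.train("noise_only", 80, preferred_gain_db=6.0)
print(prefs.gain("speech_in_noise", 72))   # 15.0, the trained preference
print(prefs.gain("speech_only", 60))       # 10.0, falls back to the default
```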