The 2016 NIST Speaker Recognition Evaluation
2017-08-20
The 2016 NIST Speaker Recognition Evaluation Seyed Omid Sadjadi1,∗, Timothée Kheyrkhah1,†, Audrey Tong1, Craig Greenberg1, Douglas Reynolds2, Elliot...recent in an ongoing series of speaker recognition evaluations (SRE) to foster research in robust text-independent speaker recognition, as well as...online evaluation platform, a fixed training data condition, more variability in test segment duration (uniformly distributed between 10s and 60s
Statistical Evaluation of Biometric Evidence in Forensic Automatic Speaker Recognition
NASA Astrophysics Data System (ADS)
Drygajlo, Andrzej
Forensic speaker recognition is the process of determining if a specific individual (suspected speaker) is the source of a questioned voice recording (trace). This paper aims at presenting forensic automatic speaker recognition (FASR) methods that provide a coherent way of quantifying and presenting recorded voice as biometric evidence. In such methods, the biometric evidence consists of the quantified degree of similarity between speaker-dependent features extracted from the trace and speaker-dependent features extracted from recorded speech of a suspect. The interpretation of recorded voice as evidence in the forensic context presents particular challenges, including within-speaker (within-source) variability and between-speakers (between-sources) variability. Consequently, FASR methods must provide a statistical evaluation which gives the court an indication of the strength of the evidence given the estimated within-source and between-sources variabilities. This paper reports on the first ENFSI evaluation campaign through a fake case, organized by the Netherlands Forensic Institute (NFI), as an example, where an automatic method using the Gaussian mixture models (GMMs) and the Bayesian interpretation (BI) framework were implemented for the forensic speaker recognition task.
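To make the likelihood-ratio evaluation concrete, here is a minimal sketch, assuming Gaussian score distributions and invented scores; real FASR systems typically use GMM scoring and kernel density estimates rather than these stand-ins.

```python
# Hedged sketch of the Bayesian interpretation (BI) framework described above.
# All scores are illustrative; the Gaussian model of the score distributions
# is an assumption made for brevity.
import numpy as np
from scipy.stats import norm

within_scores = np.array([2.1, 2.4, 1.9, 2.2, 2.0])   # suspect vs. own control recordings (C)
between_scores = np.array([0.3, 0.7, 0.5, 0.1, 0.4])  # potential population (P) vs. suspect model

evidence_score = 1.8  # similarity of the questioned recording (trace) to the suspect model

# Strength of evidence: LR = p(E | same source) / p(E | different sources)
lr = (norm.pdf(evidence_score, within_scores.mean(), within_scores.std(ddof=1))
      / norm.pdf(evidence_score, between_scores.mean(), between_scores.std(ddof=1)))
print(f"likelihood ratio: {lr:.2f}")  # LR > 1 supports the same-source hypothesis
```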
Speaker recognition with temporal cues in acoustic and electric hearing
NASA Astrophysics Data System (ADS)
Vongphoe, Michael; Zeng, Fan-Gang
2005-08-01
Natural spoken language processing includes not only speech recognition but also identification of the speaker's gender, age, emotional, and social status. Our purpose in this study is to evaluate whether temporal cues are sufficient to support both speech and speaker recognition. Ten cochlear-implant and six normal-hearing subjects were presented with vowel tokens spoken by three men, three women, two boys, and two girls. In one condition, the subject was asked to recognize the vowel. In the other condition, the subject was asked to identify the speaker. Extensive training was provided for the speaker recognition task. Normal-hearing subjects achieved nearly perfect performance in both tasks. Cochlear-implant subjects achieved good performance in vowel recognition but poor performance in speaker recognition. The level of the cochlear implant performance was functionally equivalent to normal performance with eight spectral bands for vowel recognition but only to one band for speaker recognition. These results show a disassociation between speech and speaker recognition with primarily temporal cues, highlighting the limitation of current speech processing strategies in cochlear implants. Several methods, including explicit encoding of fundamental frequency and frequency modulation, are proposed to improve speaker recognition for current cochlear implant users.
NASA Astrophysics Data System (ADS)
Kayasith, Prakasith; Theeramunkong, Thanaruk
It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using available standard assessment methods based on human perception. This paper presents an automated approach to assessing the speech quality of a dysarthric speaker with cerebral palsy. With the consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce a consistent speech signal for a certain word and distinguishable speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate for an individual dysarthric speaker before exhaustive implementation of an automatic speech recognition system for that speaker. The effectiveness of Ψ as a predictor of speech recognition rate is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.
``The perceptual bases of speaker identity'' revisited
NASA Astrophysics Data System (ADS)
Voiers, William D.
2003-10-01
A series of experiments begun 40 years ago [W. D. Voiers, J. Acoust. Soc. Am. 36, 1065-1073 (1964)] was concerned with identifying the perceived voice traits (PVTs) on which human recognition of voices depends. It culminated in the development of a voice taxonomy based on 20 PVTs and a set of highly reliable rating scales for classifying voices with respect to those PVTs. The development of a perceptual voice taxonomy was motivated by the need for a practical method of evaluating speaker recognizability in voice communication systems. The Diagnostic Speaker Recognition Test (DSRT) evaluates the effects of systems on speaker recognizability as reflected in changes in the inter-listener reliability of voice ratings on the 20 PVTs. The DSRT thus provides a qualitative, as well as quantitative, evaluation of the effects of a system on speaker recognizability. A fringe benefit of this project is PVT rating data for a sample of 680 voices. [Work partially supported by USAFRL.]
Speaker normalization for Chinese vowel recognition in cochlear implants.
Luo, Xin; Fu, Qian-Jie
2005-07-01
Because of the limited spectro-temporal resolution associated with cochlear implants, implant patients often have greater difficulty with multitalker speech recognition. The present study investigated whether multitalker speech recognition can be improved by applying speaker normalization techniques to cochlear implant speech processing. Multitalker Chinese vowel recognition was tested with normal-hearing Chinese-speaking subjects listening to a 4-channel cochlear implant simulation, with and without speaker normalization. For each subject, speaker normalization was referenced to the speaker that produced the best recognition performance under conditions without speaker normalization. To match the remaining speakers to this "optimal" output pattern, the overall frequency range of the analysis filter bank was adjusted for each speaker according to the ratio of the mean third formant frequency values between the specific speaker and the reference speaker. Results showed that speaker normalization provided a small but significant improvement in subjects' overall recognition performance. After speaker normalization, subjects' patterns of recognition performance across speakers changed, demonstrating the potential for speaker-dependent effects with the proposed normalization technique.
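As a rough illustration of the normalization step, the sketch below rescales the analysis filter-bank range by the ratio of mean third-formant (F3) values; the band edges, formant values, and the direction of the ratio are illustrative assumptions, not values from the study.

```python
# Minimal sketch of F3-ratio speaker normalization, assuming hypothetical values.
def normalized_band_edges(low_hz, high_hz, speaker_f3, reference_f3):
    """Scale the overall filter-bank frequency range by the mean-F3 ratio."""
    ratio = speaker_f3 / reference_f3   # direction of the ratio is an assumption
    return low_hz * ratio, high_hz * ratio

# Hypothetical values: reference ("optimal") speaker F3 = 2800 Hz, new speaker F3 = 2500 Hz.
low, high = normalized_band_edges(200.0, 7000.0, speaker_f3=2500.0, reference_f3=2800.0)
print(f"adjusted analysis range: {low:.0f}-{high:.0f} Hz")
```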
Experimental study on GMM-based speaker recognition
NASA Astrophysics Data System (ADS)
Ye, Wenxing; Wu, Dapeng; Nucci, Antonio
2010-04-01
Speaker recognition plays a very important role in the field of biometric security. In order to improve recognition performance, many pattern recognition techniques have been explored in the literature. Among these techniques, the Gaussian Mixture Model (GMM) has proved to be an effective statistical model for speaker recognition and is used in most state-of-the-art speaker recognition systems. The GMM is used to represent the 'voice print' of a speaker by modeling the spectral characteristics of the speaker's speech signals. In this paper, we implement a speaker recognition system that consists of preprocessing, Mel-Frequency Cepstrum Coefficient (MFCC) based feature extraction, and GMM-based classification. We test our system with the TIDIGITS data set (325 speakers) and our own recordings of more than 200 speakers; our system achieves a 100% correct recognition rate. Moreover, we also test our system under the scenario that training samples are from one language but test samples are from a different language; our system again achieves a 100% correct recognition rate, which indicates that our system is language independent.
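A minimal sketch of this MFCC-plus-GMM pipeline, using scikit-learn's GaussianMixture and random arrays as stand-ins for MFCC frames (the actual system's preprocessing and parameters are not reproduced here):

```python
# Illustrative per-speaker GMM identification: one GMM per speaker, decision by
# maximum average log-likelihood over the test utterance's frames.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = {"spk_a": rng.normal(0.0, 1.0, (500, 13)),   # stand-ins for MFCC frames
         "spk_b": rng.normal(0.8, 1.2, (500, 13))}

models = {spk: GaussianMixture(n_components=8, covariance_type="diag",
                               random_state=0).fit(feats)
          for spk, feats in train.items()}

test_utterance = rng.normal(0.8, 1.2, (200, 13))
scores = {spk: m.score(test_utterance) for spk, m in models.items()}  # mean log-lik per frame
print(max(scores, key=scores.get))  # identified speaker: "spk_b"
```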
DOE Office of Scientific and Technical Information (OSTI.GOV)
McClanahan, Richard; De Leon, Phillip L.
2014-08-20
The majority of state-of-the-art speaker recognition (SR) systems utilize speaker models that are derived from an adapted universal background model (UBM) in the form of a Gaussian mixture model (GMM). This is true for GMM supervector systems, joint factor analysis systems, and most recently i-vector systems. In all of these systems, the posterior probability and sufficient statistics calculations represent a computational bottleneck in both enrollment and testing. We propose a multi-layered hash system, employing a tree-structured GMM–UBM which uses Runnalls' Gaussian mixture reduction technique, in order to reduce the number of these calculations. Moreover, with this tree-structured hash, we can trade off a reduction in computation against a corresponding degradation in equal error rate (EER). As an example, we reduce this computation by a factor of 15× while incurring less than 10% relative degradation in EER (or 0.3% absolute EER) when evaluated on NIST 2010 speaker recognition evaluation (SRE) telephone data.
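The following sketch illustrates the gating idea behind such a tree-structured hash under strong simplifications: parent Gaussians are formed by crude block averaging rather than Runnalls' reduction, and full likelihoods are computed only for children of the best-scoring parents. All parameters are invented.

```python
# Hedged sketch of tree-structured Gaussian selection to cut per-frame cost.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
dim, n_children, group = 4, 64, 8
means = rng.normal(size=(n_children, dim))
covs = np.ones((n_children, dim))            # diagonal covariances

# Parent layer: children merged in blocks (a crude stand-in for mixture reduction).
parent_means = means.reshape(-1, group, dim).mean(axis=1)

def top_component(frame, n_best=2):
    """Score parents first; evaluate only children under the best parents."""
    parent_ll = [multivariate_normal.logpdf(frame, m, np.eye(dim)) for m in parent_means]
    best = np.argsort(parent_ll)[-n_best:]
    idx = np.concatenate([np.arange(b * group, (b + 1) * group) for b in best])
    child_ll = [multivariate_normal.logpdf(frame, means[i], np.diag(covs[i])) for i in idx]
    return idx[np.argmax(child_ll)]          # only 16 of 64 Gaussians evaluated per frame

print(top_component(rng.normal(size=dim)))
```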
Analysis of human scream and its impact on text-independent speaker verification.
Hansen, John H L; Nandwana, Mahesh Kumar; Shokouhi, Navid
2017-04-01
A scream is defined as a sustained, high-energy vocalization that lacks phonological structure. This lack of phonological structure is what distinguishes a scream from other forms of loud vocalization, such as a "yell." This study investigates the acoustic aspects of screams and addresses those that are known to prevent standard speaker identification systems from recognizing the identity of screaming speakers. It is well established that speaker variability due to changes in vocal effort and the Lombard effect contributes to degraded performance in automatic speech systems (e.g., speech recognition, speaker identification, and diarization). However, previous research in the general area of speaker variability has concentrated on human speech production, whereas less is known about non-speech vocalizations. The UT-NonSpeech corpus is developed here to investigate speaker verification from scream samples. This study considers a detailed analysis in terms of fundamental frequency, spectral peak shift, frame energy distribution, and spectral tilt. It is shown that traditional speaker recognition based on the Gaussian mixture model-universal background model framework is unreliable when evaluated with screams.
Hybrid Speaker Recognition Using Universal Acoustic Model
NASA Astrophysics Data System (ADS)
Nishimura, Jun; Kuroda, Tadahiro
We propose a novel speaker recognition approach using a speaker-independent universal acoustic model (UAM) for sensornet applications. In sensornet applications such as "Business Microscope", interactions among knowledge workers in an organization can be visualized by sensing face-to-face communication using wearable sensor nodes. In conventional studies, speakers are detected by comparing the energy of input speech signals among the nodes. However, there are often synchronization errors among the nodes which degrade the speaker recognition performance. By focusing on properties of the speaker's acoustic channel, the UAM can provide robustness against such synchronization errors. The overall speaker recognition accuracy is improved by combining the UAM with the energy-based approach. For 0.1 s speech inputs and 4 subjects, a speaker recognition accuracy of 94% is achieved for synchronization errors of less than 100 ms.
Botti, F; Alexander, A; Drygajlo, A
2004-12-02
This paper deals with a procedure to compensate for mismatched recording conditions in forensic speaker recognition, using statistical score normalization. Bayesian interpretation of the evidence in forensic automatic speaker recognition depends on three sets of recordings in order to perform forensic casework: reference (R) and control (C) recordings of the suspect, and a potential population database (P), as well as a questioned recording (QR). The requirement of similar recording conditions between the suspect control database (C) and the questioned recording (QR) is often not satisfied in real forensic cases. The aim of this paper is to investigate a score normalization procedure, based on an adaptation of the Test-normalization (T-norm) [2] technique used in the speaker verification domain, to compensate for the mismatch. The Polyphone IPSC-02 database and ASPIC (an automatic speaker recognition system developed by EPFL and IPS-UNIL in Lausanne, Switzerland) were used to test the normalization procedure. Experimental results for three different recording condition scenarios are presented using Tippett plots, and the effect of the compensation on the evaluation of the strength of the evidence is discussed.
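For reference, the underlying T-norm computation is a simple standardization of the raw score by impostor-cohort statistics; the sketch below uses invented numbers and does not reproduce the paper's specific adaptation to forensic conditions.

```python
# Minimal T-norm sketch: center and scale a raw score by the score
# distribution of the same test segment against a cohort of impostor models.
import numpy as np

def t_norm(raw_score, cohort_scores):
    """T-norm: standardize by the impostor-cohort score distribution."""
    cohort = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort.mean()) / cohort.std(ddof=1)

# Illustrative numbers: QR scored against the suspect model and a P-database cohort.
print(t_norm(1.9, [0.4, 0.6, 0.2, 0.5, 0.3]))
```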
Schelinski, Stefanie; Riedel, Philipp; von Kriegstein, Katharina
2014-12-01
In auditory-only conditions, for example when we listen to someone on the phone, it is essential to quickly and accurately recognize what is said (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned by a video showing their face and three others were learned in a matched control condition without face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, performance in speech recognition was improved for speakers known by face in comparison to speakers learned in the matched control condition without face. The ASD group lacked such a performance benefit. For the ASD group, auditory-only speech recognition was even worse for speakers known by face compared to speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group independent of whether the speakers were learned with or without face. Two additional visual experiments showed that the ASD group performed worse in lip-reading, whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms. Further, they indicate that in ASD, speaker-specific dynamic visual information is not available to optimize auditory-only speech recognition.
Data Selection for Within-Class Covariance Estimation
2016-09-08
recognition performance. While developers have typically exploited the vast archive of speaker labeled data available from earlier NIST evaluations...utterances from a large population of speakers. Fortunately, participants in NIST evaluations have access to a vast repository of legacy data from earlier...previous NIST evaluations. Training data for the UBM and T-matrix was obtained from the NIST Switchboard 2 phases 2-5 [12] and SRE04/05/06 utterances
Fifty years of progress in speech and speaker recognition
NASA Astrophysics Data System (ADS)
Furui, Sadaoki
2004-10-01
Speech and speaker recognition technology has made very significant progress in the past 50 years. The progress can be summarized by the following changes: (1) from template matching to corpus-based statistical modeling, e.g., HMM and n-grams, (2) from filter bank/spectral resonance to cepstral features (cepstrum + Δcepstrum + ΔΔcepstrum), (3) from heuristic time-normalization to DTW/DP matching, (4) from "distance"-based to likelihood-based methods, (5) from maximum likelihood to discriminative approaches, e.g., MCE/GPD and MMI, (6) from isolated word to continuous speech recognition, (7) from small vocabulary to large vocabulary recognition, (8) from context-independent units to context-dependent units for recognition, (9) from clean speech to noisy/telephone speech recognition, (10) from single speaker to speaker-independent/adaptive recognition, (11) from monologue to dialogue/conversation recognition, (12) from read speech to spontaneous speech recognition, (13) from recognition to understanding, (14) from single-modality (audio signal only) to multimodal (audio/visual) speech recognition, (15) from hardware recognizer to software recognizer, and (16) from no commercial application to many practical commercial applications. Most of these advances have taken place in both the fields of speech recognition and speaker recognition. The majority of technological changes have been directed toward the purpose of increasing robustness of recognition, including many other additional important techniques not noted above.
NASA Astrophysics Data System (ADS)
Wang, Hongcui; Kawahara, Tatsuya
CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it still remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or the ASR grammar network. However, this approach quickly runs into a trade-off between coverage of errors and increased perplexity. To solve the problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, achieving both better coverage of errors and smaller perplexity, and resulting in a significant improvement in ASR accuracy.
Automatic Intention Recognition in Conversation Processing
ERIC Educational Resources Information Center
Holtgraves, Thomas
2008-01-01
A fundamental assumption of many theories of conversation is that comprehension of a speaker's utterance involves recognition of the speaker's intention in producing that remark. However, the nature of intention recognition is not clear. One approach is to conceptualize a speaker's intention in terms of speech acts [Searle, J. (1969). "Speech…
Cost-sensitive learning for emotion robust speaker recognition.
Li, Dongdong; Yang, Yingchun; Dai, Weihui
2014-01-01
In the field of information security, voice is one of the most important parts of biometrics. In particular, with the development of voice communication over the Internet and telephone systems, huge voice data resources have become accessible. In speaker recognition, a voiceprint can be applied as the unique password for a user to prove his or her identity. However, speech with various emotions can cause an unacceptably high error rate and degrade the performance of a speaker recognition system. This paper addresses this problem by introducing a cost-sensitive learning technique that reweights the probability of test affective utterances at the pitch envelope level, which effectively enhances robustness in emotion-dependent speaker recognition. Based on this technique, a new recognition system architecture and its components are proposed. An experiment conducted on the Mandarin Affective Speech Corpus shows an 8% improvement in identification rate over traditional speaker recognition.
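The sketch below shows one schematic way such emotion-conditional reweighting could enter the decision, as an additive penalty on per-speaker scores; the penalty form and all values are assumptions, not the paper's actual pitch-envelope-level reweighting.

```python
# Hedged sketch of cost-sensitive adjustment of speaker scores for an
# affective test utterance. Costs, posteriors, and scores are illustrative.
import numpy as np

log_liks = {"spk_a": -41.2, "spk_b": -39.8}          # baseline per-speaker scores

# Posterior over the utterance's emotion (e.g., from pitch-envelope features).
emotion_post = np.array([0.2, 0.7, 0.1])             # [neutral, angry, happy]

# Hypothetical learned cost of trusting each speaker's model per emotion.
costs = {"spk_a": np.array([0.0, 0.4, 0.2]),
         "spk_b": np.array([0.0, 2.1, 0.6])}

adjusted = {spk: ll - float(emotion_post @ costs[spk]) for spk, ll in log_liks.items()}
print(max(adjusted, key=adjusted.get))               # decision after reweighting
```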
Speaker diarization system on the 2007 NIST rich transcription meeting recognition evaluation
NASA Astrophysics Data System (ADS)
Sun, Hanwu; Nwe, Tin Lay; Koh, Eugene Chin Wei; Bin, Ma; Li, Haizhou
2007-09-01
This paper presents a speaker diarization system developed at the Institute for Infocomm Research (I2R) for the NIST Rich Transcription 2007 (RT-07) evaluation task. We describe in detail our primary approaches for speaker diarization under the Multiple Distant Microphones (MDM) condition in the conference room scenario. Our proposed system consists of six modules: (1) a normalized least-mean-square (NLMS) adaptive filter for speaker direction estimation via Time Difference of Arrival (TDOA); (2) initial speaker clustering via a two-stage TDOA histogram distribution quantization approach; (3) multiple-microphone speaker data alignment via GCC-PHAT Time Delay Estimation (TDE) among all the distant microphone channel signals; (4) a speaker clustering algorithm based on a GMM modeling approach; (5) non-speech removal via a speech/non-speech verification mechanism; and (6) silence removal via a "Double-Layer Windowing" (DLW) method. Our system achieves an error rate of 31.02% on the 2006 Spring (RT-06s) MDM evaluation task and a competitive overall error rate of 15.32% on the NIST Rich Transcription 2007 (RT-07) MDM evaluation task.
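Module (3) relies on GCC-PHAT, which whitens the cross-power spectrum so that only phase information drives the correlation peak. A self-contained sketch with a synthetic delayed signal (signal parameters are illustrative):

```python
# GCC-PHAT time-delay estimation between two channels.
import numpy as np

def gcc_phat(sig, ref, fs):
    """Return the estimated delay (seconds) of `sig` relative to `ref`."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(S / (np.abs(S) + 1e-12), n=n)       # phase transform weighting
    cc = np.concatenate((cc[-(n // 2):], cc[: n // 2 + 1]))  # center zero lag
    return (np.argmax(np.abs(cc)) - n // 2) / fs

fs = 16000
t = np.arange(2048) / fs
ref = np.sin(2 * np.pi * 440 * t) * np.hanning(2048)
sig = np.roll(ref, 40)                                    # simulate a 40-sample delay
print(gcc_phat(sig, ref, fs))                             # ~0.0025 s
```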
Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.
ERIC Educational Resources Information Center
Makhoul, John I.
The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. A fixed vocabulary of 55 syllables is used in the recognition system which contains 11 stops and fricatives and five tense vowels. The results of an experiment on speaker…
Memory for syntax despite amnesia.
Ferreira, Victor S; Bock, Kathryn; Wilson, Michael P; Cohen, Neal J
2008-09-01
Syntactic persistence is a tendency for speakers to reproduce sentence structures independently of accompanying meanings, words, or sounds. The memory mechanisms behind syntactic persistence are not fully understood. Although some properties of syntactic persistence suggest a role for procedural memory, current evidence suggests that procedural memory (unlike declarative memory) does not maintain the abstract, relational features that are inherent to syntactic structures. In a study evaluating the contribution of procedural memory to syntactic persistence, patients with anterograde amnesia and matched control speakers reproduced prime sentences with different syntactic structures; reproduced 0, 1, 6, or 10 neutral sentences; then spontaneously described pictures that elicited the primed structures; and finally made recognition judgments for the prime sentences. Amnesic and control speakers showed significant and equivalent syntactic persistence, despite the amnesic speakers' profoundly impaired recognition memory for the primes. Thus, syntax is maintained by procedural-memory mechanisms. This result reveals that procedural memory is capable of supporting abstract, relational knowledge.
NASA Astrophysics Data System (ADS)
Tovarek, Jaromir; Partila, Pavol
2017-05-01
This article discusses speaker identification for improving the security of communication between law enforcement units. The main task of this research was to develop a text-independent speaker identification system that can be used for real-time recognition. The system is designed for identification in an open set, meaning the unknown speaker can be anyone. Communication itself is secured, but we have to check the authorization of the communicating parties: we have to decide whether the unknown speaker is authorized for the given action. Calls are recorded by an IP telephony server, and these recordings are then evaluated using classification. If the system determines that the speaker is not authorized, it sends a warning message to the administrator. This message can indicate, for example, a stolen phone or another unusual situation. The administrator then performs the appropriate actions. Our proposed system uses a multilayer neural network for classification, consisting of three layers (input layer, hidden layer, and output layer). The number of neurons in the input layer corresponds to the length of the speech feature vector, and the output layer represents the classified speakers. The artificial neural network classifies the speech signal frame by frame, but the final decision is made over the complete record; this rule substantially increases classification accuracy. Input data for the neural network are thirteen Mel-frequency cepstral coefficients, which describe the behavior of the vocal tract and are among the features most used for speaker recognition. Parameters for training, testing, and validation were extracted from recordings of authorized users. Recording conditions for the training data correspond to the real traffic of the system (sampling frequency, bit rate). The main benefit of this research is a system for text-independent speaker identification applied to secure communication between law enforcement units.
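A minimal sketch of this frame-by-frame classification with a decision over the whole record, using scikit-learn's MLPClassifier and random stand-ins for the 13 MFCC coefficients (the open-set rejection threshold is omitted for brevity):

```python
# Hedged sketch: an MLP labels 13-dimensional frames; the record-level
# decision is a vote over all frame labels. Features are synthetic stand-ins.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (400, 13)), rng.normal(1, 1, (400, 13))])
y = np.array([0] * 400 + [1] * 400)                   # two authorized speakers

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                    random_state=0).fit(X, y)

record = rng.normal(1, 1, (150, 13))                  # frames of one complete call
frame_labels = clf.predict(record)
speaker = np.bincount(frame_labels).argmax()          # final decision over the record
print(speaker)
```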
Chun, Audrey; Reinhardt, Joann P; Ramirez, Mildred; Ellis, Julie M; Silver, Stephanie; Burack, Orah; Eimicke, Joseph P; Cimarolli, Verena; Teresi, Jeanne A
2017-12-01
To examine agreement between Minimum Data Set clinician ratings and researcher assessments of depression among ethnically diverse nursing home residents using the 9-item Patient Health Questionnaire. Although depression is common among nursing home residents, its recognition remains a challenge. Observational baseline data from a longitudinal intervention study. Sample of 155 residents from 12 long-term care units in one US facility; 50 were interviewed in Spanish. Convergence between clinician and researcher ratings was examined for (i) self-report capacity, (ii) suicidal ideation, (iii) at least moderate depression, (iv) Patient Health Questionnaire severity scores. Experiences by clinical raters using the depression assessment were analysed. The intraclass correlation coefficient was used to examine concordance and Cohen's kappa to examine agreement between clinicians and researchers. Moderate agreement (κ = 0.52) was observed in determination of capacity and poor to fair agreement in reporting suicidal ideation (κ = 0.10-0.37) across time intervals. Poor agreement was observed in classification of at least moderate depression (κ = -0.02 to 0.24), lower than the maximum kappa obtainable (0.58-0.85). Eight assessors indicated problems assessing Spanish-speaking residents. Among Spanish speakers, researchers identified 16% with Patient Health Questionnaire scores of 10 or greater, and 14% with thoughts of self-harm, whilst clinicians identified 6% and 0%, respectively. This study advances the field of depression recognition in long-term care by identification of possible challenges in assessing Spanish speakers. Use of the Patient Health Questionnaire requires further investigation, particularly among non-English speakers. Depression screening for ethnically diverse nursing home residents is required, as underreporting of depression and suicidal ideation among Spanish speakers may result in lack of depression recognition and referral for evaluation and treatment. Training in depression recognition is imperative to improve the recognition, evaluation and treatment of depression in older people living in nursing homes.
Word recognition materials for native speakers of Taiwan Mandarin.
Nissen, Shawn L; Harris, Richard W; Dukes, Alycia
2008-06-01
To select, digitally record, evaluate, and psychometrically equate word recognition materials that can be used to measure the speech perception abilities of native speakers of Taiwan Mandarin in quiet. Frequently used bisyllabic words produced by male and female talkers of Taiwan Mandarin were digitally recorded and subsequently evaluated using 20 native listeners with normal hearing at 10 intensity levels (-5 to 40 dB HL) in increments of 5 dB. Using logistic regression, 200 words with the steepest psychometric slopes were divided into 4 lists and 8 half-lists that were relatively equivalent in psychometric function slope. To increase auditory homogeneity of the lists, the intensity of words in each list was digitally adjusted so that the threshold of each list was equal to the midpoint between the mean thresholds of the male and female half-lists. Digital recordings of the word recognition lists and the associated clinical instructions are available on CD upon request.
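The psychometric-equating step can be illustrated by fitting a logistic function to percent-correct scores across presentation levels, from which slope and 50%-correct threshold are read off; the data points below are invented, and the study's actual regression details may differ.

```python
# Hedged sketch of fitting a logistic psychometric function per word list.
import numpy as np
from scipy.optimize import curve_fit

def logistic(level, threshold, slope):
    """Proportion correct as a function of presentation level (dB HL)."""
    return 1.0 / (1.0 + np.exp(-slope * (level - threshold)))

levels = np.arange(-5, 45, 5, dtype=float)            # -5 to 40 dB HL in 5 dB steps
p_correct = np.array([0.0, 0.05, 0.1, 0.3, 0.55, 0.8, 0.9, 0.95, 1.0, 1.0])

(threshold, slope), _ = curve_fit(logistic, levels, p_correct, p0=[15.0, 0.3])
print(f"threshold {threshold:.1f} dB HL, slope {slope:.2f} per dB")
```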
An automatic speech recognition system with speaker-independent identification support
NASA Astrophysics Data System (ADS)
Caranica, Alexandru; Burileanu, Corneliu
2015-02-01
The novelty of this work lies in the application of an open-source research software toolkit (CMU Sphinx) to train, build, and evaluate a speech recognition system, with speaker-independent support, for voice-controlled hardware applications. Moreover, we propose to use the trained acoustic model to decode offline voice commands on embedded hardware, such as an ARMv6 low-cost SoC, the Raspberry Pi. This type of single-board computer, mainly used for educational and research activities, can serve as a proof of concept for a software and hardware stack for low-cost voice automation systems.
Talker and accent variability effects on spoken word recognition
NASA Astrophysics Data System (ADS)
Nyang, Edna E.; Rogers, Catherine L.; Nishi, Kanae
2003-04-01
A number of studies have shown that words in a list are recognized less accurately in noise and with longer response latencies when they are spoken by multiple talkers, rather than a single talker. These results have been interpreted as support for an exemplar-based model of speech perception, in which it is assumed that detailed information regarding the speaker's voice is preserved in memory and used in recognition, rather than being eliminated via normalization. In the present study, the effects of varying both accent and talker are investigated using lists of words spoken by (a) a single native English speaker, (b) six native English speakers, (c) three native English speakers and three Japanese-accented English speakers. Twelve /hVd/ words were mixed with multi-speaker babble at three signal-to-noise ratios (+10, +5, and 0 dB) to create the word lists. Native English-speaking listeners' percent-correct recognition for words produced by native English speakers across the three talker conditions (single talker native, multi-talker native, and multi-talker mixed native and non-native) and three signal-to-noise ratios will be compared to determine whether sources of speaker variability other than voice alone add to the processing demands imposed by simple (i.e., single accent) speaker variability in spoken word recognition.
Speaker information affects false recognition of unstudied lexical-semantic associates.
Luthra, Sahil; Fox, Neal P; Blumstein, Sheila E
2018-05-01
Recognition of and memory for a spoken word can be facilitated by a prior presentation of that word spoken by the same talker. However, it is less clear whether this speaker congruency advantage generalizes to facilitate recognition of unheard related words. The present investigation employed a false memory paradigm to examine whether information about a speaker's identity in items heard by listeners could influence the recognition of novel items (critical intruders) phonologically or semantically related to the studied items. In Experiment 1, false recognition of semantically associated critical intruders was sensitive to speaker information, though only when subjects attended to talker identity during encoding. Results from Experiment 2 also provide some evidence that talker information affects the false recognition of critical intruders. Taken together, the present findings indicate that indexical information is able to contact the lexical-semantic network to affect the processing of unheard words.
Artificially intelligent recognition of Arabic speaker using voice print-based local features
NASA Astrophysics Data System (ADS)
Mahmood, Awais; Alsulaiman, Mansour; Muhammad, Ghulam; Akram, Sheeraz
2016-11-01
Local features for any pattern recognition system are based on information extracted locally. In this paper, a local feature extraction technique was developed. The feature is extracted in the time-frequency plane by taking a moving average along the diagonal directions of the time-frequency plane. It captures time-frequency events, producing a unique pattern for each speaker that can be viewed as a voice print of the speaker; hence, we refer to this technique as a voice print-based local feature. The proposed feature was compared to other features, including the mel-frequency cepstral coefficient (MFCC), for speaker recognition using two different databases. One of the databases used in the comparison is a subset of an LDC database consisting of two short sentences uttered by 182 speakers. The proposed feature attained a 98.35% recognition rate, compared to 96.7% for MFCC, on the LDC subset.
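A rough sketch of the diagonal moving-average idea on a spectrogram-like matrix follows; the averaging length, the two directions, and the random input are illustrative assumptions rather than the paper's exact definition.

```python
# Hedged sketch: moving averages along the main and anti-diagonal directions
# of a time-frequency (T x F) plane, yielding a local "voice print" pattern.
import numpy as np

def diagonal_moving_average(tf_plane, length=3):
    """Average along both diagonal directions of a T x F matrix."""
    t_dim, f_dim = tf_plane.shape
    out = np.zeros((2, t_dim - length + 1, f_dim - length + 1))
    idx = np.arange(length)
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            out[0, i, j] = tf_plane[i + idx, j + idx].mean()                 # main diagonal
            out[1, i, j] = tf_plane[i + idx, j + length - 1 - idx].mean()    # anti-diagonal
    return out

spec = np.abs(np.random.default_rng(3).normal(size=(100, 64)))  # stand-in spectrogram
print(diagonal_moving_average(spec).shape)
```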
Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions
NASA Astrophysics Data System (ADS)
Wang, Longbiao; Minami, Kazue; Yamamoto, Kazumasa; Nakagawa, Seiichi
In this paper, we investigate the effectiveness of phase information for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods have been based on MFCCs, even in noisy conditions. For MFCCs, which dominantly capture vocal tract information, only the magnitude of the Fourier transform of time-domain speech frames is used, and phase information has been ignored. The phase information is expected to be highly complementary to MFCCs because it includes rich voice source information. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, we proposed a phase information extraction method that normalizes the variation in phase depending on the clipping position of the input speech, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with low energy/signal-to-noise ratio (SNR), and noisy-speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database, with stationary/non-stationary noise added, were used to evaluate our proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech, and the individual result of the phase information was even better than that of MFCCs in many cases with clean-speech training models. By deleting unreliable frames (frames having low energy/SNR), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error reduction rate was about 30%-60% compared with the standard MFCC-based method.
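The normalization idea rests on a standard DFT fact: shifting the analysis window adds a phase term that is linear in frequency, so anchoring the phase of a chosen base frequency cancels the clipping-position offset. A hedged sketch (the base bin choice and windowing are assumptions; the paper's exact formulation may differ):

```python
# Hedged sketch of clipping-position-normalized phase extraction.
import numpy as np

def normalized_phase(frame, base_bin=8):
    """Phase spectrum with the base bin's phase anchored to zero."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    phase = np.angle(spectrum)
    bins = np.arange(len(phase))
    # A time shift adds a phase term linear in frequency, so subtracting a
    # frequency-scaled copy of the base bin's phase cancels the shift offset.
    shifted = phase - (bins / base_bin) * phase[base_bin]
    return np.angle(np.exp(1j * shifted))            # wrap back to (-pi, pi]

frame = np.random.default_rng(4).normal(size=512)
print(normalized_phase(frame)[:5])
```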
Suominen, Hanna; Johnson, Maree; Zhou, Liyuan; Sanchez, Paula; Sirel, Raul; Basilakis, Jim; Hanlen, Leif; Estival, Dominique; Dawson, Linda; Kelly, Barbara
2015-01-01
Objective: We study the use of speech recognition and information extraction to generate drafts of Australian nursing-handover documents. Methods: Speech recognition correctness and clinicians' preferences were evaluated using 15 recorder–microphone combinations, six documents, three speakers, Dragon Medical 11, and five survey/interview participants. Information extraction correctness evaluation used 260 documents, six-class classification for each word, two annotators, and the CRF++ conditional random field toolkit. Results: A noise-cancelling lapel microphone with a digital voice recorder gave the best correctness (79%). This microphone was also the most preferred option by all but one participant. Although the participants liked the small size of this recorder, their preference was for tablets that can also be used for document proofing and sign-off, among other tasks. Accented speech was harder to recognize than native language, and a male speaker was detected better than a female speaker. Information extraction was excellent in filtering out irrelevant text (85% F1) and identifying text relevant to two classes (87% and 70% F1). Similarly to the annotators' disagreements, there was confusion between the remaining three classes, which explains the modest 62% macro-averaged F1. Discussion: We present evidence for the feasibility of speech recognition and information extraction to support clinicians in entering text and unlock its content for computerized decision-making and surveillance in healthcare. Conclusions: The benefits of this automation include storing all information; making the drafts available and accessible almost instantly to everyone with authorized access; and avoiding information loss, delays, and misinterpretations inherent to using a ward clerk or transcription services. PMID:25336589
Speaker Recognition Using Real vs. Synthetic Parallel Data for DNN Channel Compensation
2016-08-18
Speaker Recognition Using Real vs Synthetic Parallel Data for DNN Channel Compensation Fred Richardson, Michael Brandstein, Jennifer Melot and...denoising DNNs has been demonstrated for several speech technologies such as ASR and speaker recognition. This paper compares the use of real...AVG and POOL min DCFs). In all cases, the telephone channel performance on SRE10 is improved by the denoising DNNs with the real Mixer 1 and 2
How Psychological Stress Affects Emotional Prosody.
Paulmann, Silke; Furnes, Desire; Bøkenes, Anne Ming; Cozzolino, Philip J
2016-01-01
We explored how experimentally induced psychological stress affects the production and recognition of vocal emotions. In Study 1a, we demonstrate that sentences spoken by stressed speakers are judged by naïve listeners as sounding more stressed than sentences uttered by non-stressed speakers. In Study 1b, negative emotions produced by stressed speakers are generally less well recognized than the same emotions produced by non-stressed speakers. Multiple mediation analyses suggest this poorer recognition of negative stimuli was due to a mismatch between the variation of volume voiced by speakers and the range of volume expected by listeners. Together, this suggests that the stress level of the speaker affects judgments made by the receiver. In Study 2, we demonstrate that participants who were induced with a feeling of stress before carrying out an emotional prosody recognition task performed worse than non-stressed participants. Overall, findings suggest detrimental effects of induced stress on interpersonal sensitivity.
Building Searchable Collections of Enterprise Speech Data.
ERIC Educational Resources Information Center
Cooper, James W.; Viswanathan, Mahesh; Byron, Donna; Chan, Margaret
The study has applied speech recognition and text-mining technologies to a set of recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, a number of post-processing…
The Development of Word Recognition in a Second Language.
ERIC Educational Resources Information Center
Muljani, D.; Koda, Keiko; Moates, Danny R.
1998-01-01
A study investigated differences in English word recognition between native speakers of Indonesian (an alphabetic language) and Chinese (a logographic language) learning English as a Second Language. Results largely confirmed the hypothesis that an alphabetic first language would predict better word recognition in speakers of an alphabetic language,…
Voice Recognition Software Accuracy with Second Language Speakers of English.
ERIC Educational Resources Information Center
Coniam, D.
1999-01-01
Explores the potential of the use of voice-recognition technology with second-language speakers of English. Involves the analysis of the output produced by a small group of very competent second-language subjects reading a text into the voice recognition software Dragon Systems "Dragon NaturallySpeaking." (Author/VWL)
Simulation of talking faces in the human brain improves auditory speech recognition
von Kriegstein, Katharina; Dogan, Özgür; Grüter, Martina; Giraud, Anne-Lise; Kell, Christian A.; Grüter, Thomas; Kleinschmidt, Andreas; Kiebel, Stefan J.
2008-01-01
Human face-to-face communication is essentially audiovisual. Typically, people talk to us face-to-face, providing concurrent auditory and visual input. Understanding someone is easier when there is visual input, because visual cues like mouth and tongue movements provide complementary information about speech content. Here, we hypothesized that, even in the absence of visual input, the brain optimizes both auditory-only speech and speaker recognition by harvesting speaker-specific predictions and constraints from distinct visual face-processing areas. To test this hypothesis, we performed behavioral and neuroimaging experiments in two groups: subjects with a face recognition deficit (prosopagnosia) and matched controls. The results show that observing a specific person talking for 2 min improves subsequent auditory-only speech and speaker recognition for this person. In both prosopagnosics and controls, behavioral improvement in auditory-only speech recognition was based on an area typically involved in face-movement processing. Improvement in speaker recognition was only present in controls and was based on an area involved in face-identity processing. These findings challenge current unisensory models of speech processing, because they show that, in auditory-only speech, the brain exploits previously encoded audiovisual correlations to optimize communication. We suggest that this optimization is based on speaker-specific audiovisual internal models, which are used to simulate a talking face. PMID:18436648
Distant Speech Recognition Using a Microphone Array Network
NASA Astrophysics Data System (ADS)
Nakano, Alberto Yoshihiro; Nakagawa, Seiichi; Yamamoto, Kazumasa
In this work, spatial information consisting of the position and orientation angle of an acoustic source is estimated by an artificial neural network (ANN). The estimated position of a speaker in an enclosed space is used to refine the estimated time delays for a delay-and-sum beamformer, thus enhancing the output signal. The orientation angle, in turn, is used to restrict the lexicon used in the recognition phase, assuming that the speaker faces a particular direction while speaking. To compensate for the effect of the transmission channel within a short frame analysis window, a new cepstral mean normalization (CMN) method based on a Gaussian mixture model (GMM) is investigated; it shows better performance than conventional CMN for short utterances. The performance of the proposed method is evaluated through Japanese digit/command recognition experiments.
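The beamforming step itself is standard delay-and-sum: each channel is advanced by its estimated delay and the aligned channels are averaged. A toy sketch with integer-sample delays (the ANN-based delay estimation is not reproduced; signals and delays are synthetic):

```python
# Minimal delay-and-sum beamformer over synthetic multichannel signals.
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Align channels by integer sample delays and average them."""
    out = np.zeros_like(channels[0], dtype=float)
    for ch, d in zip(channels, delays_samples):
        out += np.roll(ch, -d)                 # advance each channel by its delay
    return out / len(channels)

rng = np.random.default_rng(5)
clean = rng.normal(size=4000)
mics = [np.roll(clean, d) + 0.3 * rng.normal(size=4000) for d in (0, 3, 7, 12)]
enhanced = delay_and_sum(mics, (0, 3, 7, 12))
print(np.corrcoef(clean, enhanced)[0, 1])      # closer to 1 than any single channel
```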
Bridge Health Monitoring Using a Machine Learning Strategy
DOT National Transportation Integrated Search
2017-01-01
The goal of this project was to cast the SHM problem within a statistical pattern recognition framework. Techniques borrowed from speaker recognition, particularly speaker verification, were used as this discipline deals with problems very similar to...
Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers
NASA Astrophysics Data System (ADS)
Caballero Morales, Santiago Omar; Cox, Stephen J.
2009-12-01
Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.
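The sketch below conveys the confusion-matrix idea in miniature: candidate words are rescored by the speaker's phoneme confusion probabilities combined with a word prior, and the best explanation of the recognized phoneme string wins. It assumes equal-length sequences and invented probabilities, whereas the paper's metamodels and WFST cascade perform full alignment at the phonetic, word, and language levels.

```python
# Hedged sketch of confusion-matrix-based error correction (noisy-channel view).
confusion = {("t", "t"): 0.6, ("t", "d"): 0.3,      # P(heard | intended), hypothetical
             ("d", "d"): 0.5, ("d", "t"): 0.45,
             ("a", "a"): 0.9, ("a", "o"): 0.05}
lexicon = {"ta": ["t", "a"], "da": ["d", "a"]}      # toy phonemic lexicon
lm = {"ta": 0.3, "da": 0.7}                         # unigram word priors

def best_word(heard):
    """Pick the word that best explains the recognized phoneme string."""
    scores = {}
    for word, phones in lexicon.items():
        p = lm[word]
        for intended, h in zip(phones, heard):
            p *= confusion.get((intended, h), 1e-4)
        scores[word] = p
    return max(scores, key=scores.get)

print(best_word(["t", "a"]))   # "d" is often realized as "t", so "da" still wins
```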
NASA Technical Reports Server (NTRS)
Simpson, C. A.
1985-01-01
In the present study of the responses of pairs of pilots to aircraft warning classification tasks using an isolated word, speaker-dependent speech recognition system, the induced stress was manipulated by means of different scoring procedures for the classification task and by the inclusion of a competitive manual control task. Both speech patterns and recognition accuracy were analyzed, and recognition errors were recorded by type for an isolated word speaker-dependent system and by an offline technique for a connected word speaker-dependent system. While errors increased with task loading for the isolated word system, there was no such effect for task loading in the case of the connected word system.
NASA Astrophysics Data System (ADS)
Poock, G. K.; Martin, B. J.
1984-02-01
This was an applied investigation examining the ability of a speech recognition system to recognize speakers' inputs when the speakers were under different stress levels. Subjects were asked to speak to a voice recognition system under three conditions: (1) normal office environment, (2) emotional stress, and (3) perceptual-motor stress. Results indicate a definite relationship between voice recognition system performance and the type of low stress reference patterns used to achieve recognition.
Open-set speaker identification with diverse-duration speech data
NASA Astrophysics Data System (ADS)
Karadaghi, Rawande; Hertlein, Heinz; Ariyaeeinia, Aladdin
2015-05-01
This paper is concerned with an important category of applications of open-set speaker identification in criminal investigation, which involves operating with speech of short and varied duration. The study presents investigations into the adverse effects of such operating conditions on the accuracy of open-set speaker identification, based on both GMM-UBM and i-vector approaches. The experiments are conducted using a protocol developed for the identification task, based on the NIST speaker recognition evaluation corpus of 2008. In order to closely cover the real-world operating conditions in the considered application area, the study includes experiments with various combinations of training and testing data duration. The paper details the characteristics of the experimental investigations conducted and provides a thorough analysis of the results obtained.
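Whatever the front end (GMM-UBM or i-vector), the open-set decision reduces to a max-score comparison against a threshold so that out-of-set speakers can be rejected; a minimal sketch with invented scores:

```python
# Open-set identification decision rule: identify the best enrolled speaker,
# then accept or reject against a threshold. Scores are illustrative.
def open_set_identify(scores, threshold):
    """Return the best-scoring enrolled speaker, or None if out of set."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

scores = {"enrolled_1": -0.42, "enrolled_2": 0.18, "enrolled_3": -0.05}
print(open_set_identify(scores, threshold=0.10))   # "enrolled_2"
print(open_set_identify(scores, threshold=0.30))   # None: rejected as unknown
```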
Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.
Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina
2015-07-01
It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. These results are in line with the 'auditory-visual view' of auditory speech perception, which assumes that auditory speech recognition is optimized by using predictions from previously encoded speaker-specific audio-visual internal models.
Recognition of speaker-dependent continuous speech with KEAL
NASA Astrophysics Data System (ADS)
Mercier, G.; Bigorgne, D.; Miclet, L.; Le Guennec, L.; Querre, M.
1989-04-01
A description of the speaker-dependent continuous speech recognition system KEAL is given. An unknown utterance is recognized by means of the following procedures: acoustic analysis, phonetic segmentation and identification, and word and sentence analysis. The combination of feature-based, speaker-independent coarse phonetic segmentation with speaker-dependent statistical classification techniques is one of the main design features of the acoustic-phonetic decoder. The lexical access component is essentially based on a statistical dynamic programming technique which aims at matching a phonemic lexical entry containing various phonological forms against a phonetic lattice. Sentence recognition is achieved by use of a context-free grammar and a parsing algorithm derived from Earley's parser. A speaker adaptation module allows some of the system parameters to be adjusted by matching known utterances with their acoustic representations. The task to be performed, described by its vocabulary and its grammar, is given as a parameter of the system. Continuously spoken sentences extracted from a 'pseudo-Logo' language are analyzed and results are presented.
Yoneyama, Kiyoko; Munson, Benjamin
2017-02-01
We examined whether the influence of listeners' language proficiency on L2 speech recognition is affected by the structure of the lexicon. This experiment examined the effect of word frequency (WF) and phonological neighborhood density (PND) on word recognition in native speakers of English and second-language (L2) speakers of English whose first language was Japanese. The stimuli included English words produced by a native speaker of English and English words produced by a native speaker of Japanese (i.e., with Japanese-accented English). The experiment was inspired by the finding of Imai, Flege, and Walley [(2005). J. Acoust. Soc. Am. 117, 896-907] that the influence of talker accent on speech intelligibility for L2 learners of English whose L1 is Spanish varies as a function of words' PND. In the current study, significant interactions between stimulus accentedness and listener group on the accuracy and speed of spoken word recognition were found, as were significant effects of PND and WF on word-recognition accuracy. However, no significant three-way interaction among stimulus talker, listener group, and PND on either measure was found. Results are discussed in light of recent findings on cross-linguistic differences in the nature of the effects of PND on L2 phonological and lexical processing.
NASA Astrophysics Data System (ADS)
Anagnostopoulos, Christos Nikolaos; Vovoli, Eftichia
An emotion recognition framework based on sound processing could improve services in human-computer interaction. Various quantitative speech features obtained from sound processing of acted speech were tested as to whether they are sufficient to discriminate between seven emotions. Multilayered perceptrons were trained to classify gender and emotions on the basis of a 24-input vector, which provides information about the prosody of the speaker over the entire sentence using statistics of sound features. Several experiments were performed and the results are presented analytically. Emotion recognition was successful when speakers and utterances were "known" to the classifier. However, severe misclassifications occurred in the utterance-independent framework. Nevertheless, the proposed feature vector achieved promising results for utterance-independent recognition of high- and low-arousal emotions.
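A rough sketch of how such an utterance-level prosodic input vector might be assembled, assuming librosa for pitch and energy tracking; the exact 24 features of the study are not specified here, so the statistics below are generic stand-ins rather than the authors' feature set.

    import numpy as np
    import librosa

    def prosody_features(path):
        """Utterance-level statistics of pitch and energy (illustrative only)."""
        y, sr = librosa.load(path, sr=None)
        f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)  # frame-wise pitch track
        rms = librosa.feature.rms(y=y)[0]              # frame-wise energy
        stats = lambda x: [np.mean(x), np.std(x), np.min(x), np.max(x)]
        return np.array(stats(f0) + stats(rms))        # feeds the MLP input layer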
Bilingual Computerized Speech Recognition Screening for Depression Symptoms
ERIC Educational Resources Information Center
Gonzalez, Gerardo; Carter, Colby; Blanes, Erika
2007-01-01
The Voice-Interactive Depression Assessment System (VIDAS) is a computerized speech recognition application for screening depression based on the Center for Epidemiological Studies--Depression scale in English and Spanish. Study 1 included 50 English and 47 Spanish speakers. Study 2 involved 108 English and 109 Spanish speakers. Participants…
Kreitewolf, Jens; Friederici, Angela D; von Kriegstein, Katharina
2014-11-15
Hemispheric specialization for linguistic prosody is a controversial issue. While it is commonly assumed that linguistic prosody and emotional prosody are preferentially processed in the right hemisphere, neuropsychological work directly comparing processes of linguistic prosody and emotional prosody suggests a predominant role of the left hemisphere for linguistic prosody processing. Here, we used two functional magnetic resonance imaging (fMRI) experiments to clarify the role of left and right hemispheres in the neural processing of linguistic prosody. In the first experiment, we sought to confirm previous findings showing that linguistic prosody processing compared to other speech-related processes predominantly involves the right hemisphere. Unlike previous studies, we controlled for stimulus influences by employing a prosody and speech task using the same speech material. The second experiment was designed to investigate whether a left-hemispheric involvement in linguistic prosody processing is specific to contrasts between linguistic prosody and emotional prosody or whether it also occurs when linguistic prosody is contrasted against other non-linguistic processes (i.e., speaker recognition). Prosody and speaker tasks were performed on the same stimulus material. In both experiments, linguistic prosody processing was associated with activity in temporal, frontal, parietal and cerebellar regions. Activation in temporo-frontal regions showed differential lateralization depending on whether the control task required recognition of speech or speaker: recognition of linguistic prosody predominantly involved right temporo-frontal areas when it was contrasted against speech recognition; when contrasted against speaker recognition, recognition of linguistic prosody predominantly involved left temporo-frontal areas. The results show that linguistic prosody processing involves functions of both hemispheres and suggest that recognition of linguistic prosody is based on an inter-hemispheric mechanism which exploits both a right-hemispheric sensitivity to pitch information and a left-hemispheric dominance in speech processing.
Robust Recognition of Loud and Lombard speech in the Fighter Cockpit Environment
1988-08-01
the latter as inter-speaker variability. According to Zue [Z85], inter-speaker variabilities can be attributed to sociolinguistic background, dialect..." Journal of the Acoustical Society of America, Vol. 50, 1971. [At74] B. S. Atal, "Linear prediction for speaker identification," Journal of the Acoustical Society of America, Vol. 55, 1974. [B77] B. Beek, E. P. Neuberg, and D. C. Hodge, "An Assessment of the Technology of Automatic Speech Recognition for
2015-10-01
Scoring, Gaussian Backend, etc.) as shown in Fig. 39. The methods in this domain also emphasized the ability to perform data purification for both...investigation using the same infrastructure was undertaken to explore Lombard effect "flavor" detection for improved speaker ID. The presence of...dimension selection and compared to a common N-gram frequency based selection. 2.1.2: Exploration on NN/DBN backend: Since Deep Neural Networks (DNN) have
Speaker Linking and Applications using Non-Parametric Hashing Methods
2016-09-08
clustering method based on hashing—canopy-clustering. We apply this method to a large corpus of speaker recordings, demonstrate performance tradeoffs...and compare to other hashing methods. Index Terms: speaker recognition, clustering, hashing, locality sensitive hashing. 1. Introduction We assume...speaker in our corpus. Second, given a QBE method, how can we perform speaker clustering—each cluster should be a single speaker, and a cluster should
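As a hedged illustration of the hashing idea referenced above, the sketch below buckets fixed-length speaker embeddings with random-hyperplane locality-sensitive hashing so that candidate links only form within a bucket; the embedding dimensionality and bit count are assumptions, and the paper's canopy-clustering variant is not reproduced.

    import numpy as np

    rng = np.random.default_rng(0)

    def lsh_signature(vec, planes):
        """Signs of projections onto random hyperplanes -> compact hash key."""
        return tuple(bool(b) for b in (planes @ vec) > 0)

    n_dims, n_bits = 400, 16
    planes = rng.standard_normal((n_bits, n_dims))

    embeddings = rng.standard_normal((1000, n_dims))  # placeholder corpus
    buckets = {}
    for idx, emb in enumerate(embeddings):
        buckets.setdefault(lsh_signature(emb, planes), []).append(idx)
    # Pairwise speaker comparisons are now restricted to same-bucket recordings,
    # avoiding the O(n^2) all-pairs scan over a large corpus.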
Schall, Sonja; von Kriegstein, Katharina
2014-01-01
It has been proposed that internal simulation of the talking face of visually-known speakers facilitates auditory speech recognition. One prediction of this view is that brain areas involved in auditory-only speech comprehension interact with visual face-movement sensitive areas, even under auditory-only listening conditions. Here, we test this hypothesis using connectivity analyses of functional magnetic resonance imaging (fMRI) data. Participants (17 normal participants, 17 developmental prosopagnosics) first learned six speakers via brief voice-face or voice-occupation training (<2 min/speaker). This was followed by an auditory-only speech recognition task and a control task (voice recognition) involving the learned speakers' voices in the MRI scanner. As hypothesized, we found that, during speech recognition, familiarity with the speaker's face increased the functional connectivity between the face-movement sensitive posterior superior temporal sulcus (STS) and an anterior STS region that supports auditory speech intelligibility. There was no difference between normal participants and prosopagnosics. This was expected because previous findings have shown that both groups use the face-movement sensitive STS to optimize auditory-only speech comprehension. Overall, the present findings indicate that learned visual information is integrated into the analysis of auditory-only speech and that this integration results from the interaction of task-relevant face-movement and auditory speech-sensitive areas.
American or British? L2 Speakers' Recognition and Evaluations of Accent Features in English
ERIC Educational Resources Information Center
Carrie, Erin; McKenzie, Robert M.
2018-01-01
Recent language attitude research has attended to the processes involved in identifying and evaluating spoken language varieties. This article investigates the ability of second-language learners of English in Spain (N = 71) to identify Received Pronunciation (RP) and General American (GenAm) speech and their perceptions of linguistic variation…
NASA Astrophysics Data System (ADS)
Kamiński, K.; Dobrowolski, A. P.
2017-04-01
The paper presents the architecture and the results of optimization of selected elements of an Automatic Speaker Recognition (ASR) system that uses Gaussian Mixture Models (GMM) in the classification process. Optimization covered the selection of individual features, using a genetic algorithm, and the parameters of the Gaussian distributions used to describe individual voices. The developed system was tested to evaluate the impact on speaker identification effectiveness of the compression methods used in, among others, landline, mobile, and VoIP telephony. Results were also presented on the effectiveness of speaker identification at specific levels of noise in the speech signal and in the presence of other disturbances that can appear during phone calls, which made it possible to specify the range of applications of the presented ASR system.
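A minimal sketch of closed-set speaker identification with one GMM per enrolled voice, in the spirit of the classifier described above; the feature extraction, mixture size, and synthetic data are assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_speaker_models(features_by_speaker, n_components=16):
        """Fit one diagonal-covariance GMM per speaker on (frames, dims) arrays."""
        models = {}
        for speaker, feats in features_by_speaker.items():
            gmm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag", random_state=0)
            models[speaker] = gmm.fit(feats)
        return models

    def identify(models, test_feats):
        """Return the speaker whose model gives the highest log-likelihood."""
        return max(models, key=lambda spk: models[spk].score(test_feats))

    rng = np.random.default_rng(0)
    data = {"spk_a": rng.normal(0, 1, (200, 13)),   # stand-ins for MFCC frames
            "spk_b": rng.normal(1, 1, (200, 13))}
    print(identify(train_speaker_models(data), rng.normal(1, 1, (50, 13))))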
The Use of Voice Cues for Speaker Gender Recognition in Cochlear Implant Recipients
ERIC Educational Resources Information Center
Meister, Hartmut; Fürsen, Katrin; Streicher, Barbara; Lang-Roth, Ruth; Walger, Martin
2016-01-01
Purpose: The focus of this study was to examine the influence of fundamental frequency (F0) and vocal tract length (VTL) modifications on speaker gender recognition in cochlear implant (CI) recipients for different stimulus types. Method: Single words and sentences were manipulated using isolated or combined F0 and VTL cues. Using an 11-point…
Bent, Tessa; Holt, Rachael Frush
2018-02-01
Children's ability to understand speakers with a wide range of dialects and accents is essential for efficient language development and communication in a global society. Here, the impact of regional dialect and foreign-accent variability on children's speech understanding was evaluated in both quiet and noisy conditions. Five- to seven-year-old children (n = 90) and adults (n = 96) repeated sentences produced by three speakers with different accents (American English, British English, and Japanese-accented English) in quiet or noisy conditions. Adults had no difficulty understanding any speaker in quiet conditions. Their performance declined for the nonnative speaker with a moderate amount of noise; their performance only substantially declined for the British English speaker (i.e., below 93% correct) when their understanding of the American English speaker was also impeded. In contrast, although children showed accurate word recognition for the American and British English speakers in quiet conditions, they had difficulty understanding the nonnative speaker even under ideal listening conditions. With a moderate amount of noise, their perception of British English speech declined substantially and their ability to understand the nonnative speaker was particularly poor. These results suggest that although school-aged children can understand unfamiliar native dialects under ideal listening conditions, their ability to recognize words in these dialects may be highly susceptible to the influence of environmental degradation. Fully adult-like word identification for speakers with unfamiliar accents and dialects may exhibit a protracted developmental trajectory.
DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS
2015-05-29
DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS Elliot Singer and Douglas Reynolds Massachusetts Institute of...development data is assumed to be unavailable. The method is based on a generalization of data whitening used in association with i-vector length...normalization and utilizes a library of whitening transforms trained at system development time using strictly out-of-domain data. The approach is
A language-familiarity effect for speaker discrimination without comprehension.
Fleming, David; Giordano, Bruno L; Caldara, Roberto; Belin, Pascal
2014-09-23
The influence of language familiarity upon speaker identification is well established, to such an extent that it has been argued that "Human voice recognition depends on language ability" [Perrachione TK, Del Tufo SN, Gabrieli JDE (2011) Science 333(6042):595]. However, 7-mo-old infants discriminate speakers of their mother tongue better than they do foreign speakers [Johnson EK, Westrek E, Nazzi T, Cutler A (2011) Dev Sci 14(5):1002-1011] despite their limited speech comprehension abilities, suggesting that speaker discrimination may rely on familiarity with the sound structure of one's native language rather than the ability to comprehend speech. To test this hypothesis, we asked Chinese and English adult participants to rate speaker dissimilarity in pairs of sentences in English or Mandarin that were first time-reversed to render them unintelligible. Even in these conditions a language-familiarity effect was observed: Both Chinese and English listeners rated pairs of native-language speakers as more dissimilar than foreign-language speakers, despite their inability to understand the material. Our data indicate that the language familiarity effect is not based on comprehension but rather on familiarity with the phonology of one's native language. This effect may stem from a mechanism analogous to the "other-race" effect in face recognition.
Processing Lexical and Speaker Information in Repetition and Semantic/Associative Priming
ERIC Educational Resources Information Center
Lee, Chao-Yang; Zhang, Yu
2018-01-01
The purpose of this study is to investigate the interaction between processing lexical and speaker-specific information in spoken word recognition. The specific question is whether repetition and semantic/associative priming is reduced when the prime and target are produced by different speakers. In Experiment 1, the prime and target were repeated…
Bilingual Language Switching: Production vs. Recognition
Mosca, Michela; de Bot, Kees
2017-01-01
This study aims at assessing how bilinguals select words in the appropriate language in production and recognition while minimizing interference from the non-appropriate language. Two prominent models are considered which assume that when one language is in use, the other is suppressed. The Inhibitory Control (IC) model suggests that, in both production and recognition, the amount of inhibition on the non-target language is greater for the stronger compared to the weaker language. In contrast, the Bilingual Interactive Activation (BIA) model proposes that, in language recognition, the amount of inhibition on the weaker language is greater than on the stronger one. To investigate whether bilingual language production and recognition can be accounted for by a single model of bilingual processing, we tested a group of native speakers of Dutch (L1) who were advanced speakers of English (L2) in a bilingual recognition and production task. Specifically, language switching costs were measured while participants performed a lexical decision (recognition) and a picture naming (production) task involving language switching. Results suggest that while in language recognition the amount of inhibition applied to the non-appropriate language increases along with its dominance as predicted by the IC model, in production the amount of inhibition applied to the non-relevant language is not related to language dominance, but rather it may be modulated by speakers' unconscious strategies to foster the weaker language. This difference indicates that bilingual language recognition and production might rely on different processing mechanisms and cannot be accounted for within one of the existing models of bilingual language processing. PMID:28638361
A speech-controlled environmental control system for people with severe dysarthria.
Hawley, Mark S; Enderby, Pam; Green, Phil; Cunningham, Stuart; Brownsell, Simon; Carmichael, James; Parker, Mark; Hatzis, Athanassios; O'Neill, Peter; Palmer, Rebecca
2007-06-01
Automatic speech recognition (ASR) can provide a rapid means of controlling electronic assistive technology. Off-the-shelf ASR systems function poorly for users with severe dysarthria because of the increased variability of their articulations. We have developed a limited vocabulary speaker dependent speech recognition application which has greater tolerance to variability of speech, coupled with a computerised training package which assists dysarthric speakers to improve the consistency of their vocalisations and provides more data for recogniser training. These applications, and their implementation as the interface for a speech-controlled environmental control system (ECS), are described. The results of field trials to evaluate the training program and the speech-controlled ECS are presented. The user-training phase increased the recognition rate from 88.5% to 95.4% (p<0.001). Recognition rates were good for people with even the most severe dysarthria in everyday usage in the home (mean word recognition rate 86.9%). Speech-controlled ECS were less accurate (mean task completion accuracy 78.6% versus 94.8%) but were faster to use than switch-scanning systems, even taking into account the need to repeat unsuccessful operations (mean task completion time 7.7s versus 16.9s, p<0.001). It is concluded that a speech-controlled ECS is a viable alternative to switch-scanning systems for some people with severe dysarthria and would lead, in many cases, to more efficient control of the home.
Automatic speech recognition technology development at ITT Defense Communications Division
NASA Technical Reports Server (NTRS)
White, George M.
1977-01-01
An assessment of the applications of automatic speech recognition to defense communication systems is presented. Future research efforts include investigations into the following areas: (1) dynamic programming; (2) recognition of speech degraded by noise; (3) speaker independent recognition; (4) large vocabulary recognition; (5) word spotting and continuous speech recognition; and (6) isolated word recognition.
Speech recognition: Acoustic-phonetic knowledge acquisition and representation
NASA Astrophysics Data System (ADS)
Zue, Victor W.
1988-09-01
The long-term research goal is to develop and implement speaker-independent continuous speech recognition systems. It is believed that the proper utilization of speech-specific knowledge is essential for such advanced systems. This research is thus directed toward the acquisition, quantification, and representation of acoustic-phonetic and lexical knowledge, and the application of this knowledge to speech recognition algorithms. In addition, we are exploring new speech recognition alternatives based on artificial intelligence and connectionist techniques. We developed a statistical model for predicting the acoustic realization of stop consonants in various positions in the syllable template. A unification-based grammatical formalism was developed for incorporating this model into the lexical access algorithm. We provided an information-theoretic justification for the hierarchical structure of the syllable template. We analyzed segmental duration for vowels and fricatives in continuous speech. Based on contextual information, we developed durational models for vowels and fricatives that account for over 70 percent of the variance, using data from multiple, unknown speakers. We rigorously evaluated the ability of human spectrogram readers to identify stop consonants spoken by many talkers and in a variety of phonetic contexts. Incorporating the declarative knowledge used by the readers, we developed a knowledge-based system for stop identification. We achieved system performance comparable to that of the readers.
Speaker Recognition Using Real vs. Synthetic Parallel Data for DNN Channel Compensation
2016-09-08
Speaker Recognition Using Real vs Synthetic Parallel Data for DNN Channel Compensation Fred Richardson, Michael Brandstein, Jennifer Melot, and...DNNs trained with real Mixer 2 multichannel data perform only slightly better than DNNs trained with synthetic multichannel data for microphone SR on...Mixer 6. Large reductions in pooled error rates of 50% EER and 30% min DCF are achieved using DNNs trained on real Mixer 2 data. Nearly the same
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
1990-09-01
These conference proceedings have been prepared in support of the US Nuclear Regulatory Commission's Security Training Symposium on "Meeting the Challenge: Firearms and Explosives Recognition and Detection," November 28 through 30, 1989, in Bethesda, Maryland. This document contains the edited transcripts of the guest speakers. It also contains some of the speakers' formal papers that were distributed and some of the slides that were shown at the symposium (Appendix A).
Early Detection of Severe Apnoea through Voice Analysis and Automatic Speaker Recognition Techniques
NASA Astrophysics Data System (ADS)
Fernández, Ruben; Blanco, Jose Luis; Díaz, David; Hernández, Luis A.; López, Eduardo; Alcázar, José
This study is part of an on-going collaborative effort between the medical and the signal processing communities to promote research on applying voice analysis and Automatic Speaker Recognition techniques (ASR) for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based diagnosis could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we present and discuss the possibilities of using generative Gaussian Mixture Models (GMMs), generally used in ASR systems, to model distinctive apnoea voice characteristics (i.e. abnormal nasalization). Finally, we present experimental findings regarding the discriminative power of speaker recognition techniques applied to severe apnoea detection. We have achieved an 81.25% correct classification rate, which is very promising and underpins the interest in this line of inquiry.
Optimization of multilayer neural network parameters for speaker recognition
NASA Astrophysics Data System (ADS)
Tovarek, Jaromir; Partila, Pavol; Rozhon, Jan; Voznak, Miroslav; Skapa, Jan; Uhrin, Dominik; Chmelikova, Zdenka
2016-05-01
This article discusses the impact of multilayer neural network parameters on speaker identification. The main task of speaker identification is to find a specific person in a known set of speakers, i.e., to determine whether the voice of an unknown speaker (wanted person) belongs to a group of reference speakers from the voice database. One of the requirements was to develop a text-independent system, which means classifying the wanted person regardless of content and language. A multilayer neural network was used for speaker identification in this research. An artificial neural network (ANN) needs parameters such as the activation function of neurons, the steepness of the activation functions, the learning rate, the maximum number of iterations, and the numbers of neurons in the hidden and output layers. ANN accuracy and validation time are directly influenced by the parameter settings, and different roles require different settings. Identification accuracy and ANN validation time were evaluated with the same input data but different parameter settings. The goal was to find parameters for the neural network with the highest precision and shortest validation time. The input data of the neural network are Mel-frequency cepstral coefficients (MFCC), parameters that describe the properties of the vocal tract. Audio samples were recorded for all speakers in a laboratory environment. The training, testing, and validation data sets were split 70%, 15%, and 15%. The result of the research described in this article is a different parameter setting for the multilayer neural network for four speakers.
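A compact sketch of the pipeline described above, with an MLP classifying per-frame cepstral features and a 70/15/15 split; the synthetic arrays merely stand in for the MFCC inputs, and the layer size and learning rate are illustrative assumptions rather than the optimized settings found in the study.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Synthetic frames standing in for MFCC vectors of four speakers.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(i, 1.0, (300, 13)) for i in range(4)])
    y = np.repeat(np.arange(4), 300)

    # 70/15/15 train/test/validation split, as in the study.
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
    X_te, X_va, y_te, y_va = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)

    clf = MLPClassifier(hidden_layer_sizes=(32,), activation="logistic",
                        learning_rate_init=0.01, max_iter=300, random_state=0)
    clf.fit(X_tr, y_tr)
    print("validation accuracy:", clf.score(X_va, y_va))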
NASA Technical Reports Server (NTRS)
Wolf, Jared J.
1977-01-01
The following research was discussed: (1) speech signal processing; (2) automatic speech recognition; (3) continuous speech understanding; (4) speaker recognition; (5) speech compression; (6) subjective and objective evaluation of speech communication systems; (7) measurement of the intelligibility and quality of speech when degraded by noise or other masking stimuli; (8) speech synthesis; (9) instructional aids for second-language learning and for training of the deaf; and (10) investigation of speech correlates of psychological stress. Experimental psychology, control systems, and human factors engineering, which are often relevant to the proper design and operation of speech systems, are also described.
The Search for Common Ground: Part I. Lexical Performance by Linguistically Diverse Learners.
ERIC Educational Resources Information Center
Windsor, Jennifer; Kohnert, Kathryn
2004-01-01
This study examines lexical performance by 3 groups of linguistically diverse school-age learners: English-only speakers with primary language impairment (LI), typical English-only speakers (EO), and typical bilingual Spanish-English speakers (BI). The accuracy and response time (RT) of 100 8- to 13-year-old children in word recognition and…
Phoneme Error Pattern by Heritage Speakers of Spanish on an English Word Recognition Test.
Shi, Lu-Feng
2017-04-01
Heritage speakers acquire their native language from home use in their early childhood. As the native language is typically a minority language in the society, these individuals receive their formal education in the majority language and eventually develop greater competency with the majority than their native language. To date, there have not been specific research attempts to understand word recognition by heritage speakers. It is not clear if and to what degree we may infer from evidence based on bilingual listeners in general. This preliminary study investigated how heritage speakers of Spanish perform on an English word recognition test and analyzed their phoneme errors. A prospective, cross-sectional, observational design was employed. Twelve normal-hearing adult Spanish heritage speakers (four men, eight women, 20-38 yr old) participated in the study. Their language background was obtained through the Language Experience and Proficiency Questionnaire. Nine English monolingual listeners (three men, six women, 20-41 yr old) were also included for comparison purposes. Listeners were presented with 200 Northwestern University Auditory Test No. 6 words in quiet. They repeated each word orally and in writing. Their responses were scored by word, word-initial consonant, vowel, and word-final consonant. Performance was compared between groups with Student's t test or analysis of variance. Group-specific error patterns were primarily descriptive, but intergroup comparisons were made using 95% or 99% confidence intervals for proportional data. The two groups of listeners yielded comparable scores when their responses were examined by word, vowel, and final consonant. However, heritage speakers of Spanish misidentified significantly more word-initial consonants and had significantly more difficulty with initial /p, b, h/ than their monolingual peers. The two groups yielded similar patterns for vowel and word-final consonants, but heritage speakers made significantly fewer errors with /e/ and more errors with word-final /p, k/. Data reported in the present study lead to a twofold conclusion. On the one hand, normal-hearing heritage speakers of Spanish may misidentify English phonemes in patterns different from those of English monolingual listeners. Not all phoneme errors can be readily understood by comparing Spanish and English phonology, suggesting that Spanish heritage speakers differ in performance from other Spanish-English bilingual listeners. On the other hand, the absolute number of errors and the error pattern of most phonemes were comparable between English monolingual listeners and Spanish heritage speakers, suggesting that audiologists may assess word recognition in quiet in the same way for these two groups of listeners, if diagnosis is based on words, not phonemes.
NASA Astrophysics Data System (ADS)
Kuroki, Hayato; Ino, Shuichi; Nakano, Satoko; Hori, Kotaro; Ifukube, Tohru
The authors of this paper have been studying a real-time speech-to-caption system using speech recognition technology with a "repeat-speaking" method. In this system, a "repeat-speaker" listens to a lecturer's voice and then speaks the lecturer's utterances back into a speech recognition computer. The resulting system showed that caption accuracy is about 97% for Japanese-to-Japanese conversion and that the conversion time from voice to captions is about 4 seconds for English-to-English conversion in some international conferences, although achieving this level of performance was costly. In human communication, speech understanding depends not only on verbal information but also on non-verbal information such as the speaker's gestures and face and mouth movements. The authors therefore explored displaying captions and images of the speaker's face movements in a suitable way to achieve higher comprehension, after briefly storing the information in a computer. In this paper, we investigate the relationship of display sequence and display timing between captions that contain speech recognition errors and the speaker's face movement images. The results show that the sequence "display the caption before the speaker's face image" improves comprehension of the captions. The sequence "display both simultaneously" shows an improvement of only a few percent, and the sequence "display the speaker's face image before the caption" shows almost no change. In addition, the sequence "display the caption 1 second before the speaker's face image" shows the most significant improvement of all the conditions.
NASA Astrophysics Data System (ADS)
Yellen, H. W.
1983-03-01
Literature pertaining to Voice Recognition abounds with information relevant to the assessment of transitory speech recognition devices. In the past, engineering requirements have dictated the path this technology followed. But other factors do exist that influence recognition accuracy. This thesis explores the impact of Human Factors on the successful recognition of speech, principally addressing the differences or variability among users. A Threshold Technology T-600 was used with a 100-utterance vocabulary to test 44 subjects. A statistical analysis was conducted on 5 generic categories of Human Factors: Occupational, Operational, Psychological, Physiological, and Personal. How the equipment is trained and the experience level of the speaker were found to be key characteristics influencing recognition accuracy. To a lesser extent, computer experience, time of week, accent, vital capacity and rate of air flow, and speaker cooperativeness and anxiety were found to affect overall error rates.
The Development of the Speaker Independent ARM Continuous Speech Recognition System
1992-01-01
spoken airborne reconnaissance reports using a speech recognition system based on phoneme-level hidden Markov models (HMMs). Previous versions of the ARM...will involve automatic selection from multiple model sets, corresponding to different speaker types, and that the most rudimentary partition of a...The vocabulary size for the ARM task is 497 words. These words are related to the phoneme-level symbols corresponding to the models in the model set
Speaker Verification Using SVM
2010-11-01
application the required resources are provided by the phone itself. Speaker recognition can be used in many areas, like: • homeland security: airport security, strengthening the national borders, travel documents, visas; • enterprise-wide network security infrastructures; • secure electronic
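A bare-bones illustration of SVM-based speaker verification: a binary SVM is trained on target versus impostor embeddings and scores a trial utterance. The embeddings, kernel, and threshold are assumptions made for the sketch; the report's actual feature pipeline is not shown.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    target = rng.normal(0.5, 1.0, (50, 40))       # target-speaker embeddings
    impostors = rng.normal(-0.5, 1.0, (200, 40))  # background speakers

    X = np.vstack([target, impostors])
    y = np.array([1] * len(target) + [0] * len(impostors))
    svm = SVC(kernel="rbf", probability=True).fit(X, y)

    trial = rng.normal(0.5, 1.0, (1, 40))          # claimed-identity test utterance
    accept = svm.predict_proba(trial)[0, 1] > 0.5  # threshold tunable per application
    print("accept claim:", accept)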
Tone classification of syllable-segmented Thai speech based on multilayer perceptron
NASA Astrophysics Data System (ADS)
Satravaha, Nuttavudh; Klinkhachorn, Powsiri; Lass, Norman
2002-05-01
Thai is a monosyllabic tonal language that uses tone to convey lexical information about the meaning of a syllable. Thus, to completely recognize a spoken Thai syllable, a speech recognition system not only has to recognize a base syllable but also must correctly identify a tone. Hence, tone classification of Thai speech is an essential part of a Thai speech recognition system. Thai has five distinctive tones ("mid," "low," "falling," "high," and "rising") and each tone is represented by a single fundamental frequency (F0) pattern. However, several factors, including tonal coarticulation, stress, intonation, and speaker variability, affect the F0 pattern of a syllable in continuous Thai speech. In this study, an efficient method for tone classification of syllable-segmented Thai speech, which incorporates the effects of tonal coarticulation, stress, and intonation, was developed, together with a method to perform automatic syllable segmentation. Acoustic parameters were used as the main discriminating parameters. The F0 contour of a segmented syllable was normalized by using a z-score transformation before being presented to a tone classifier. The proposed system was evaluated on 920 test utterances spoken by 8 speakers, achieving a recognition rate of 91.36%.
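The z-score normalization step mentioned above can be written in a few lines; the contour values here are invented for illustration, but the transform shows how speaker-level pitch offsets and ranges are removed before tone classification.

    import numpy as np

    def zscore_contour(f0_hz):
        """Normalize a syllable's F0 contour to zero mean and unit variance."""
        f0 = np.asarray(f0_hz, dtype=float)
        return (f0 - f0.mean()) / f0.std()

    falling_tone = [220, 235, 230, 210, 180, 150]  # Hz, illustrative contour
    print(zscore_contour(falling_tone))  # shape is kept, speaker offset removed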
CNN: a speaker recognition system using a cascaded neural network.
Zaki, M; Ghalwash, A; Elkouny, A A
1996-05-01
The main emphasis of this paper is to present an approach for combining supervised and unsupervised neural network models for the task of speaker recognition. To enhance the overall operation and performance of recognition, the proposed strategy integrates the two techniques, forming one global model called the cascaded model. We first present a simple conventional technique based on the distance measured between a test vector and a reference vector for different speakers in the population. This particular distance metric has the property of weighting down the components in those directions along which the intraspeaker variance is large. The reason for presenting this method is to clarify the discrepancy in performance between the conventional and neural network approaches. We then introduce the idea of using an unsupervised learning technique, represented by the winner-take-all model, as a means of recognition. Based on several tests that were conducted, and in order to enhance the performance of this model when dealing with noisy patterns, we preceded it with a supervised learning model, the pattern association model, which acts as a filtration stage. This work includes both the design and implementation of the conventional and neural network approaches to recognize the speakers' templates, which are introduced to the system via a voice master card and preprocessed before extracting the features used in recognition. The conclusion indicates that the system performance with the neural network is better than that of the conventional approach, achieving a smooth degradation with respect to noisy patterns and higher performance with respect to noise-free patterns.
Factor analysis of auto-associative neural networks with application in speaker verification.
Garimella, Sri; Hermansky, Hynek
2013-04-01
Auto-associative neural network (AANN) is a fully connected feed-forward neural network, trained to reconstruct its input at its output through a hidden compression layer, which has fewer nodes than the dimensionality of the input. AANNs are used to model speakers in speaker verification, where a speaker-specific AANN model is obtained by adapting (or retraining) the universal background model (UBM) AANN, an AANN trained on multiple held-out speakers, using the corresponding speaker's data. When the amount of speaker data is limited, this adaptation procedure may lead to overfitting as all the parameters of the UBM-AANN are adapted. In this paper, we introduce and develop the factor analysis theory of AANNs to alleviate this problem. We hypothesize that only the weight matrix connecting the last nonlinear hidden layer and the output layer is speaker-specific, and further restrict it to a common low-dimensional subspace during adaptation. The subspace is learned using large amounts of development data and is held fixed during adaptation. Thus, only the coordinates in the subspace, also known as an i-vector, need to be estimated using speaker-specific data. The update equations are derived for learning both the common low-dimensional subspace and the i-vectors corresponding to speakers in the subspace. The resultant i-vector representation is used as a feature for the probabilistic linear discriminant analysis model. The proposed system shows promising results on the NIST-08 speaker recognition evaluation (SRE), and yields a 23% relative improvement in equal error rate over the previously proposed weighted least squares-based subspace AANN system. The experiments on NIST-10 SRE confirm that these improvements are consistent and generalize across datasets.
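As a toy stand-in for the AANN verification idea (not the paper's factor-analysis adaptation), the sketch below trains an autoencoder-style network through a narrow hidden layer and uses reconstruction error as the match score; the data, layer sizes, and thresholds are assumptions.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(2)
    frames_a = rng.normal(0, 1, (500, 20))  # placeholder frames for speaker A

    # Reconstruct the input at the output through an 8-node compression layer.
    aann = MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                        max_iter=2000, random_state=0)
    aann.fit(frames_a, frames_a)

    def reconstruction_error(model, x):
        return float(np.mean((model.predict(x) - x) ** 2))

    same = reconstruction_error(aann, rng.normal(0, 1, (100, 20)))
    other = reconstruction_error(aann, rng.normal(2, 1, (100, 20)))
    print(same < other)  # the modelled speaker reconstructs with lower error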
Evaluating deep learning architectures for Speech Emotion Recognition.
Fayek, Haytham M; Lech, Margaret; Cavedon, Lawrence
2017-08-01
Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances.
NASA Astrophysics Data System (ADS)
Adhi Pradana, Wisnu; Adiwijaya; Novia Wisesty, Untari
2018-03-01
The Support Vector Machine, commonly called SVM, is one method that can be used to classify data. SVM separates data from 2 different classes with a hyperplane. In this study, a system was built using SVM to develop Arabic speech recognition. In the development of the system, 2 kinds of speakers were tested: dependent speakers and independent speakers. The system achieves an accuracy of 85.32% for dependent speakers and 61.16% for independent speakers.
STS-41 Voice Command System Flight Experiment Report
NASA Technical Reports Server (NTRS)
Salazar, George A.
1981-01-01
This report presents the results of the Voice Command System (VCS) flight experiment on the five-day STS-41 mission. Two mission specialists, Bill Shepherd and Bruce Melnick, used the speaker-dependent system to evaluate the operational effectiveness of using voice to control a spacecraft system. In addition, data were gathered to analyze the effects of microgravity on speech recognition performance.
ERIC Educational Resources Information Center
Sachtleben, Annette; Denny, Heather
2012-01-01
Following the recent interest in the teaching of pragmatics and the recognition of its importance for both cross-cultural communication and new speakers of an additional language, the authors carried out an action research project to evaluate the effectiveness of a new approach to the teaching of pragmatics. This involved the use of semiauthentic…
"Who" is saying "what"? Brain-based decoding of human voice and speech.
Formisano, Elia; De Martino, Federico; Bonte, Milene; Goebel, Rainer
2008-11-07
Can we decipher speech content ("what" is being said) and speaker identity ("who" is saying it) from observations of brain activity of a listener? Here, we combine functional magnetic resonance imaging with a data-mining algorithm and retrieve what and whom a person is listening to from the neural fingerprints that speech and voice signals elicit in the listener's auditory cortex. These cortical fingerprints are spatially distributed and insensitive to acoustic variations of the input so as to permit the brain-based recognition of learned speech from unknown speakers and of learned voices from previously unheard utterances. Our findings unravel the detailed cortical layout and computational properties of the neural populations at the basis of human speech recognition and speaker identification.
Semantic Ambiguity Effects in L2 Word Recognition.
Ishida, Tomomi
2018-06-01
The present study examined ambiguity effects in second language (L2) word recognition. Previous studies on first language (L1) lexical processing have observed that ambiguous words are recognized faster and more accurately than unambiguous words in lexical decision tasks. In this research, L1 and L2 speakers of English were asked whether a letter string on a computer screen was an English word or not. An ambiguity advantage was found for both groups, and greater ambiguity effects were found for the non-native speaker group when compared to the native speaker group. The findings imply that the larger ambiguity advantage for L2 processing is due to L2 speakers' slower response times, which allow adequate feedback activation from the semantic level to the orthographic level.
Dmitrieva, E S; Gel'man, V Ia
2011-01-01
The listener-specific features of recognition of different emotional intonations (positive, negative, and neutral) of male and female speakers in the presence or absence of background noise were studied in 49 adults aged 20-79 years. In all listeners, noise produced the most pronounced decrease in recognition accuracy for the positive emotional intonation ("joy") as compared to the other intonations, whereas it did not influence the recognition accuracy of "anger" in 65-79-year-old listeners. Higher recognition rates for noisy signals were observed for emotional intonations expressed by female speakers. Acoustic characteristics of noisy and clear speech signals underlying the perception of emotional prosody were identified for adult listeners of different age and gender.
Development of equally intelligible Telugu sentence-lists to test speech recognition in noise.
Tanniru, Kishore; Narne, Vijaya Kumar; Jain, Chandni; Konadath, Sreeraj; Singh, Niraj Kumar; Sreenivas, K J Ramadevi; K, Anusha
2017-09-01
To develop sentence lists in the Telugu language for the assessment of the speech recognition threshold (SRT) in the presence of background noise, through identification of the mean signal-to-noise ratio required to attain a 50% sentence recognition score (SRTn). This study was conducted in three phases. The first phase involved the selection and recording of Telugu sentences. In the second phase, 20 lists, each consisting of 10 sentences of equal intelligibility, were formulated using a numerical optimisation procedure. In the third phase, the SRTn of the developed lists was estimated using adaptive procedures on individuals with normal hearing. A total of 68 native Telugu speakers with normal hearing participated in the study. Of these, 18 (including the speakers) took part in various subjective measures in the first phase, 20 performed sentence/word recognition in noise in the second phase, and 30 participated in the list-equivalency procedures in the third phase. In all, 15 lists of comparable difficulty were formulated as test material. The mean SRTn across these lists corresponded to -2.74 dB SNR (SD = 0.21). The developed sentence lists provide a valid and reliable tool to measure SRTn in native Telugu speakers.
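The SRTn estimate rests on an adaptive procedure that converges on the 50% point of the psychometric function. Below is a minimal one-down/one-up staircase run against a simulated listener; the step size, starting SNR, and logistic listener model are assumptions, not the authors' exact protocol.

    import numpy as np

    rng = np.random.default_rng(3)

    def simulated_listener(snr_db, srt=-2.7, slope=0.5):
        """Return True if the sentence is repeated correctly at this SNR."""
        p_correct = 1.0 / (1.0 + np.exp(-slope * (snr_db - srt)))
        return rng.random() < p_correct

    snr, step, track = 0.0, 2.0, []
    for _ in range(30):
        track.append(snr)
        snr += -step if simulated_listener(snr) else step  # down on correct, up on error

    print("estimated SRTn (dB SNR):", round(np.mean(track[10:]), 2))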
Current trends in small vocabulary speech recognition for equipment control
NASA Astrophysics Data System (ADS)
Doukas, Nikolaos; Bardis, Nikolaos G.
2017-09-01
Speech recognition systems allow human-machine communication to acquire an intuitive nature that approaches the simplicity of inter-human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker-independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence benefit significantly from robust voice-operated control components, as they would facilitate interaction with users and render it much more reliable in times of crisis. This paper presents the current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity, and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
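A toy version of the state-machine idea mentioned above: each state restricts which small-vocabulary commands are valid next, which both simplifies recognition and rejects spurious commands. The states and command names are invented for illustration.

    STATES = {
        "idle":    {"power on": "ready"},
        "ready":   {"start": "running", "power off": "idle"},
        "running": {"stop": "ready"},
    }

    def step(state, command):
        """Advance the machine; commands not valid in this state are ignored."""
        return STATES[state].get(command, state)

    state = "idle"
    for cmd in ["power on", "start", "stop", "power off"]:
        state = step(state, cmd)
        print(cmd, "->", state)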
NASA Astrophysics Data System (ADS)
Mosko, J. D.; Stevens, K. N.; Griffin, G. R.
1983-08-01
Acoustical analyses were conducted of words produced by four speakers in a motion stress-inducing situation. The aim of the analyses was to document the kinds of changes that occur in the vocal utterances of speakers who are exposed to motion stress and to comment on the implications of these results for the design and development of voice interactive systems. The speakers differed markedly in the types and magnitudes of the changes that occurred in their speech. For some speakers, the stress-inducing experimental condition caused an increase in fundamental frequency, changes in the pattern of vocal fold vibration, shifts in vowel production, and changes in the relative amplitudes of sounds containing turbulence noise. All speakers showed greater variability in the experimental condition than in the more relaxed control situation. The variability was manifested in the acoustical characteristics of individual phonetic elements, particularly in unstressed syllables. The kinds of changes and variability observed serve to emphasize the limitations of speech recognition systems based on template matching of patterns that are stored in the system during a training phase. There is a need for a better understanding of these phonetic modifications and for developing ways of incorporating knowledge about these changes within a speech recognition system.
2002-06-07
Continue to Develop and Refine Emerging Technology • Some of the emerging biometric devices, such as iris scans, facial recognition systems, and speaker verification systems. (976301)
A Limited-Vocabulary, Multi-Speaker Automatic Isolated Word Recognition System.
ERIC Educational Resources Information Center
Paul, James E., Jr.
Techniques for automatic recognition of isolated words are investigated, and a computer simulation of a word recognition system is effected. Considered in detail are data acquisition and digitizing, word detection, amplitude and time normalization, short-time spectral estimation including spectral windowing, spectral envelope approximation,…
Uskul, Ayse K; Paulmann, Silke; Weick, Mario
2016-02-01
Listeners have to pay close attention to a speaker's tone of voice (prosody) during daily conversations. This is particularly important when trying to infer the emotional state of the speaker. Although a growing body of research has explored how emotions are processed from speech in general, little is known about how psychosocial factors such as social power can shape the perception of vocal emotional attributes. Thus, the present studies explored how social power affects emotional prosody recognition. In a correlational study (Study 1) and an experimental study (Study 2), we show that high power is associated with lower accuracy in emotional prosody recognition than low power. These results, for the first time, suggest that individuals experiencing high or low power perceive emotional tone of voice differently.
Connected word recognition using a cascaded neuro-computational model
NASA Astrophysics Data System (ADS)
Hoya, Tetsuya; van Leeuwen, Cees
2016-10-01
We propose a novel framework for processing a continuous speech stream that contains a varying number of words, as well as non-speech periods. Speech samples are segmented into word-tokens and non-speech periods. An augmented version of an earlier-proposed, cascaded neuro-computational model is used for recognising individual words within the stream. Simulation studies using both a multi-speaker-dependent and speaker-independent digit string database show that the proposed method yields a recognition performance comparable to that obtained by a benchmark approach using hidden Markov models with embedded training.
Zäske, Romi; Awwad Shiekh Hasan, Bashar; Belin, Pascal
2017-09-01
Listeners can recognize newly learned voices from previously unheard utterances, suggesting the acquisition of high-level speech-invariant voice representations during learning. Using functional magnetic resonance imaging (fMRI) we investigated the anatomical basis underlying the acquisition of voice representations for unfamiliar speakers independent of speech, and their subsequent recognition among novel voices. Specifically, listeners studied voices of unfamiliar speakers uttering short sentences and subsequently classified studied and novel voices as "old" or "new" in a recognition test. To investigate "pure" voice learning, i.e., independent of sentence meaning, we presented German sentence stimuli to non-German speaking listeners. To disentangle stimulus-invariant and stimulus-dependent learning, during the test phase we contrasted a "same sentence" condition in which listeners heard speakers repeating the sentences from the preceding study phase, with a "different sentence" condition. Voice recognition performance was above chance in both conditions although, as expected, performance was higher for same than for different sentences. During study phases activity in the left inferior frontal gyrus (IFG) was related to subsequent voice recognition performance and same versus different sentence condition, suggesting an involvement of the left IFG in the interactive processing of speaker and speech information during learning. Importantly, at test reduced activation for voices correctly classified as "old" compared to "new" emerged in a network of brain areas including temporal voice areas (TVAs) of the right posterior superior temporal gyrus (pSTG), as well as the right inferior/middle frontal gyrus (IFG/MFG), the right medial frontal gyrus, and the left caudate. This effect of voice novelty did not interact with sentence condition, suggesting a role of temporal voice-selective areas and extra-temporal areas in the explicit recognition of learned voice identity, independent of speech content.
Gender Differences in the Recognition of Vocal Emotions
Lausen, Adi; Schacht, Annekathrin
2018-01-01
The conflicting findings from the few studies conducted with regard to gender differences in the recognition of vocal expressions of emotion have left the exact nature of these differences unclear. Several investigators have argued that a comprehensive understanding of gender differences in vocal emotion recognition can only be achieved by replicating these studies while accounting for influential factors such as stimulus type, gender-balanced samples, number of encoders, decoders, and emotional categories. This study aimed to account for these factors by investigating whether emotion recognition from vocal expressions differs as a function of both listeners' and speakers' gender. A total of N = 290 participants were randomly and equally allocated to two groups. One group listened to words and pseudo-words, while the other group listened to sentences and affect bursts. Participants were asked to categorize the stimuli with respect to the expressed emotions in a fixed-choice response format. Overall, females were more accurate than males when decoding vocal emotions; however, when testing for specific emotions, these differences were small in magnitude. Speakers' gender had a significant impact on how listeners judged emotions from the voice. The group listening to words and pseudo-words had higher identification rates for emotions spoken by male than by female actors, whereas in the group listening to sentences and affect bursts the identification rates were higher when emotions were uttered by female than male actors. The mixed pattern for emotion-specific effects, however, indicates that, in the vocal channel, the reliability of emotion judgments is not systematically influenced by speakers' gender and the related stereotypes of emotional expressivity. Together, these results extend previous findings by showing effects of listeners' and speakers' gender on the recognition of vocal emotions. They stress the importance of distinguishing these factors to explain recognition ability in the processing of emotional prosody. PMID:29922202
Hantke, Simone; Weninger, Felix; Kurle, Richard; Ringeval, Fabien; Batliner, Anton; Mousa, Amr El-Desoky; Schuller, Björn
2016-01-01
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient. PMID:27176486
ICPR-2016 - International Conference on Pattern Recognition
Conference-page snippet (partially recoverable): invited-speaker sessions, including one on "…Learning for Scene Understanding", and discussions on recent advances in the fields of Pattern Recognition, Machine Learning, and Computer Vision. ICPR 2016 paper awards included the Best Piero Zamperoni Student Paper for "…-Paced Dictionary Learning for Cross-Domain Retrieval and Recognition" (Xu, Dan; Song, Jingkuan; Alameda…).
Unsupervised real-time speaker identification for daily movies
NASA Astrophysics Data System (ADS)
Li, Ying; Kuo, C.-C. Jay
2002-07-01
The problem of identifying speakers for movie content analysis is addressed in this paper. While most previous work on speaker identification was carried out in a supervised mode using pure audio data, more robust results can be obtained in real time by integrating knowledge from multiple media sources in an unsupervised mode. In this work, both audio and visual cues are employed and subsequently combined in a probabilistic framework to identify speakers. In particular, audio information is used to identify speakers with a maximum likelihood (ML)-based approach, while visual information is used to distinguish speakers by detecting and recognizing their talking faces, based on face detection/recognition and mouth tracking techniques. Moreover, to accommodate speakers' acoustic variations over time, we update their models on the fly by adapting to their newly contributed speech data. Encouraging results have been achieved through extensive experiments, which show the promise of the proposed audiovisual unsupervised speaker identification system.
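As a rough illustration of the probabilistic audio-visual combination described above, here is a minimal late-fusion sketch. The per-speaker log-likelihoods are hypothetical inputs that a real system would obtain from the ML-based audio model and the talking-face module; the weighting scheme is an assumption, not the paper's exact formulation.

```python
import numpy as np

def fuse_scores(audio_loglik, visual_loglik, alpha=0.6):
    """Late fusion: weighted sum of per-speaker log-likelihoods.

    alpha weights the audio stream; (1 - alpha) weights the visual stream.
    Returns the index of the most likely speaker and the fused scores.
    """
    fused = alpha * audio_loglik + (1.0 - alpha) * visual_loglik
    return int(np.argmax(fused)), fused

audio = np.array([-12.3, -9.8, -15.1])   # hypothetical log p(audio | speaker_i)
visual = np.array([-4.2, -3.9, -8.0])    # hypothetical log p(face/mouth | speaker_i)
best, fused = fuse_scores(audio, visual)
print("identified speaker index:", best)
```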
Do Listeners Store in Memory a Speaker's Habitual Utterance-Final Phonation Type?
Bőhm, Tamás; Shattuck-Hufnagel, Stefanie
2009-01-01
Earlier studies report systematic differences across speakers in the occurrence of utterance-final irregular phonation; the work reported here investigated whether human listeners remember this speaker-specific information and can access it when necessary (a prerequisite for using this cue in speaker recognition). Listeners personally familiar with the voices of the speakers were presented with pairs of speech samples: one with the original and the other with transformed final phonation type. Asked to select the member of the pair that was closer to the talker's voice, most listeners tended to choose the unmanipulated token (even though they judged them to sound essentially equally natural). This suggests that utterance-final pitch period irregularity is part of the mental representation of individual speaker voices, although this may depend on the individual speaker and listener to some extent. PMID:19776665
L2 Word Recognition: Influence of L1 Orthography on Multi-syllabic Word Recognition.
Hamada, Megumi
2017-10-01
L2 reading research suggests that L1 orthographic experience influences L2 word recognition. Nevertheless, the findings on multi-syllabic words in English are still limited, despite the fact that the vast majority of words are multi-syllabic. This study investigated whether L1 orthography influences the recognition of multi-syllabic words, focusing on the position of an embedded word. The participants were Arabic ESL learners, Chinese ESL learners, and native speakers of English. The task was a word search task, in which the participants identified a target word embedded in a pseudoword at the initial, middle, or final position. The search accuracy and speed indicated that all groups showed a strong preference for the initial position. The accuracy data further indicated group differences: the Arabic group showed higher accuracy in the final position than in the middle, the Chinese group showed the opposite pattern, and the native speakers showed no difference between the two positions. The findings suggest that L2 multi-syllabic word recognition involves unique processes.
Automatic speech recognition research at NASA-Ames Research Center
NASA Technical Reports Server (NTRS)
Coler, Clayton R.; Plummer, Robert P.; Huff, Edward M.; Hitchcock, Myron H.
1977-01-01
A trainable acoustic pattern recognizer manufactured by Scope Electronics is presented. The voice command system (VCS) encodes speech by sampling 16 bandpass filters with center frequencies in the range from 200 to 5000 Hz. Variations in speaking rate are compensated for by a compression algorithm that subdivides each utterance into eight subintervals in such a way that the amount of spectral change within each subinterval is the same. The recorded filter values within each subinterval are then reduced to a 15-bit representation, giving a 120-bit encoding for each utterance. The VCS incorporates a simple recognition algorithm that uses five training samples of each word in a vocabulary of up to 24 words. A recognition rate of approximately 85 percent correct for untrained speakers and 94 percent correct for trained speakers was not considered adequate for flight systems use. Therefore, the built-in recognition algorithm was disabled, and the VCS was modified to transmit the 120-bit encodings to an external computer for recognition.
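The rate-compensation step lends itself to a short illustration. Below is a minimal sketch under the assumption that the utterance is available as a sequence of 16-dimensional filter-bank frames; the equal-spectral-change segmentation follows the description above, while the exact 15-bit coding is not specified in the summary and is therefore omitted.

```python
import numpy as np

def equal_change_segments(frames, n_segments=8):
    """frames: (T, 16) array of bandpass-filter outputs per time frame."""
    # frame-to-frame spectral change, then its cumulative sum (length T)
    change = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(change)])
    # boundaries placed at equal fractions of the total spectral change
    targets = np.linspace(0.0, cum[-1], n_segments + 1)
    bounds = np.searchsorted(cum, targets)
    bounds[0], bounds[-1] = 0, len(frames)
    # average the filter values within each subinterval (guard against empty slices)
    return np.stack([frames[bounds[i]:max(bounds[i] + 1, bounds[i + 1])].mean(axis=0)
                     for i in range(n_segments)])

utterance = np.abs(np.random.randn(120, 16))   # stand-in filter-bank frames
print(equal_change_segments(utterance).shape)  # -> (8, 16)
```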
Discriminative analysis of lip motion features for speaker identification and speech-reading.
Cetingül, H Ertan; Yemez, Yücel; Erzin, Engin; Tekalp, A Murat
2006-10-01
There have been several studies that jointly use audio, lip intensity, and lip geometry information for speaker identification and speech-reading applications. This paper proposes using explicit lip motion information, instead of or in addition to lip intensity and/or geometry information, for speaker identification and speech-reading within a unified feature selection and discrimination analysis framework, and addresses two important issues: (1) is using explicit lip motion information useful, and (2) if so, what are the best lip motion features for these two applications? The best lip motion features for speaker identification are considered to be those that result in the highest discrimination of individual speakers in a population, whereas for speech-reading, the best features are those providing the highest phoneme/word/phrase recognition rate. Several lip motion feature candidates have been considered, including dense motion features within a bounding box about the lip, lip contour motion features, and combinations of these with lip shape features. Furthermore, a novel two-stage spatial and temporal discrimination analysis is introduced to select the best lip motion features for speaker identification and speech-reading applications. Experimental results using a hidden-Markov-model-based recognition system indicate that using explicit lip motion information provides additional performance gains in both applications, and lip motion features prove more valuable in the speech-reading application.
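To make the selection idea concrete, here is a minimal sketch that ranks hypothetical candidate lip-motion feature sets by cross-validated speaker discriminability; it uses plain linear discriminant analysis as a stand-in for the paper's more elaborate two-stage spatial/temporal analysis, and all feature arrays are random placeholders.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
y = rng.integers(0, 10, size=400)  # hypothetical speaker labels
candidates = {                     # hypothetical candidate feature sets
    "dense_motion": rng.normal(size=(400, 32)),
    "contour_motion": rng.normal(size=(400, 16)),
    "contour_plus_shape": rng.normal(size=(400, 24)),
}
# rank each candidate set by mean cross-validated classification accuracy
for name, X in candidates.items():
    score = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {score:.3f}")
```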
Alternative Speech Communication System for Persons with Severe Speech Disorders
NASA Astrophysics Data System (ADS)
Selouani, Sid-Ahmed; Sidi Yakoub, Mohammed; O'Shaughnessy, Douglas
2009-12-01
Assistive speech-enabled systems are proposed to help both French- and English-speaking persons with various speech disorders. The proposed assistive systems use automatic speech recognition (ASR) and speech synthesis in order to enhance the quality of communication. These systems aim at improving the intelligibility of pathologic speech, making it as natural as possible and close to the original voice of the speaker. The resynthesized utterances use new basic units, a new concatenation algorithm, and a grafting technique to correct poorly pronounced phonemes. The ASR responses are uttered by the new speech synthesis system in order to convey an intelligible message to listeners. Experiments involving four American speakers with severe dysarthria and two Acadian French speakers with sound substitution disorders (SSDs) were carried out to demonstrate the efficiency of the proposed methods. Improvements in the Perceptual Evaluation of Speech Quality (PESQ) value of 5% and of more than 20% are achieved by the speech synthesis systems that deal with SSDs and dysarthria, respectively.
Parametric Representation of the Speaker's Lips for Multimodal Sign Language and Speech Recognition
NASA Astrophysics Data System (ADS)
Ryumin, D.; Karpov, A. A.
2017-05-01
In this article, we propose a new method for the parametric representation of the human lip region. The functional diagram of the method is described, and implementation details with an explanation of its key stages and features are given. The results of automatic detection of the regions of interest are illustrated. The processing speed of the method on several computers of differing performance is reported. This universal method allows applying the parametric representation of the speaker's lips to tasks in biometrics, computer vision, machine learning, and automatic recognition of faces, sign language elements, and audio-visual speech, including lip-reading.
The Resolution of Visual Noise in Word Recognition
ERIC Educational Resources Information Center
Pae, Hye K.; Lee, Yong-Won
2015-01-01
This study examined lexical processing in English by native speakers of Korean and Chinese, compared to that of native speakers of English, using normal, alternated, and inverse fonts. Sixty-four adult students participated in a lexical decision task. The findings demonstrated similarities and differences in accuracy and latency among the three L1…
Multimodal fusion of polynomial classifiers for automatic person recognition
NASA Astrophysics Data System (ADS)
Broun, Charles C.; Zhang, Xiaozheng
2001-03-01
With the prevalence of the information age, privacy and personalization are at the forefront of today's society. As such, biometrics are viewed as essential components of current evolving technological systems. Consumers demand unobtrusive and non-invasive approaches. In our previous work, we have demonstrated a speaker verification system that meets these criteria. However, there are additional constraints for fielded systems. The required recognition transactions are often performed in adverse environments and across diverse populations, necessitating robust solutions. There are two significant problem areas in current-generation speaker verification systems. The first is the difficulty in acquiring clean audio signals in all environments without encumbering the user with a head-mounted close-talking microphone. Second, unimodal biometric systems do not work with a significant percentage of the population. To combat these issues, multimodal techniques are being investigated to improve system robustness to environmental conditions, as well as to improve overall accuracy across the population. We propose a multimodal approach that builds on our current state-of-the-art speaker verification technology. In order to maintain the transparent nature of the speech interface, we focus on optical sensing technology to provide the additional modality, giving us an audio-visual person recognition system. For the audio domain, we use our existing speaker verification system. For the visual domain, we focus on lip motion. This is chosen, rather than static face or iris recognition, because it provides dynamic information about the individual. In addition, the lip dynamics can aid speech recognition to provide liveness testing. The visual processing method makes use of both color and edge information, combined within a Markov random field (MRF) framework, to localize the lips. Geometric features are extracted and input to a polynomial classifier for the person recognition process. A late integration approach, based on a probabilistic model, is employed to combine the two modalities. The system is tested on the XM2VTS database combined with additive white Gaussian noise (AWGN) in the audio domain over a range of signal-to-noise ratios.
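For the visual channel, a polynomial classifier can be realized as a polynomial expansion of the geometric lip features followed by a linear decision rule. The sketch below illustrates that general pattern with hypothetical placeholder features and identity labels; it is one standard realization, not the authors' exact classifier.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))        # hypothetical geometric lip features per clip
y = rng.integers(0, 5, size=200)     # hypothetical client identity labels

poly_clf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=True),  # polynomial term expansion
    LogisticRegression(max_iter=1000),                # linear decision on expanded terms
)
poly_clf.fit(X, y)
print("training accuracy:", poly_clf.score(X, y))
```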
ERIC Educational Resources Information Center
Young, Victoria; Mihailidis, Alex
2010-01-01
Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…
Calandruccio, Lauren; Bradlow, Ann R; Dhar, Sumitrajit
2014-04-01
Masking release for an English sentence-recognition task in the presence of foreign-accented English speech compared with native-accented English speech was reported in Calandruccio et al. (2010a). The masking release appeared to increase as the masker intelligibility decreased. However, it could not be ruled out that spectral differences between the speech maskers were influencing the significant differences observed. The purpose of the current experiment was to minimize spectral differences between speech maskers to determine how various amounts of linguistic information within competing speech affect masking release. A mixed-model design with within-subject (four two-talker speech maskers) and between-subject (listener group) factors was conducted. Speech maskers included native-accented English speech and high-intelligibility, moderate-intelligibility, and low-intelligibility Mandarin-accented English. Normalizing the long-term average speech spectra of the maskers to each other minimized spectral differences between the masker conditions. Three listener groups were tested, including monolingual English speakers with normal hearing, nonnative English speakers with normal hearing, and monolingual English speakers with hearing loss. The nonnative English speakers were from various native-language backgrounds, not including Mandarin (or any other Chinese dialect). Listeners with hearing loss had symmetric, mild sloping to moderate sensorineural hearing loss. Listeners were asked to repeat back sentences that were presented in the presence of four different two-talker speech maskers. Responses were scored based on the key words within the sentences (100 key words per masker condition). A mixed-model regression analysis was used to analyze the difference in performance scores between the masker conditions and listener groups. Monolingual English speakers with normal hearing benefited when the competing speech signal was foreign-accented compared with native-accented, allowing for improved speech recognition. Various levels of intelligibility across the foreign-accented speech maskers did not influence results. Neither the nonnative English-speaking listeners with normal hearing nor the monolingual English speakers with hearing loss benefited from masking release when the masker was changed from native-accented to foreign-accented English. Slight modifications between the target and the masker speech allowed monolingual English speakers with normal hearing to improve their recognition of native-accented English, even when the competing speech was highly intelligible. Further research is needed to determine which modifications within the competing speech signal caused the Mandarin-accented English to be less effective with respect to masking. Determining the influences within the competing speech that make it less effective as a masker, or determining why monolingual normal-hearing listeners can take advantage of these differences, could help improve speech recognition for those with hearing loss in the future. American Academy of Audiology.
The SRI NIST 2010 Speaker Recognition Evaluation System (PREPRINT)
2011-01-01
Report snippet (partially recoverable): …of several subsystems with the use of adequate side information gives a 35% improvement on the standard telephone condition… signal-to-noise ratio and amount of detected speech serve as side information. The SRI submissions were among the best-performing systems in SRE10.
Crossmodal and incremental perception of audiovisual cues to emotional speech.
Barkhuysen, Pashiera; Krahmer, Emiel; Swerts, Marc
2010-01-01
In this article we report on two experiments about the perception of audiovisual cues to emotional speech. The article addresses two questions: (1) how do visual cues to emotion from a speaker's face relate to auditory cues, and (2) what is the recognition speed for various facial cues to emotion? Both experiments are based on tests with video clips of emotional utterances collected via a variant of the well-known Velten method. More specifically, we recorded speakers who displayed positive or negative emotions, which were congruent or incongruent with the (emotional) lexical content of the uttered sentence. The first experiment is a perception experiment in which Czech participants, who do not speak Dutch, rated the perceived emotional state of Dutch speakers in a bimodal (audiovisual) or a unimodal (audio-only or vision-only) condition. It was found that incongruent emotional speech leads to significantly more extreme perceived-emotion scores than congruent emotional speech, where the difference between congruent and incongruent emotional speech is larger for the negative than for the positive conditions. Interestingly, the largest overall differences between congruent and incongruent emotions were found for the audio-only condition, which suggests that posing an incongruent emotion has a particularly strong effect on the spoken realization of emotions. The second experiment uses a gating paradigm to test the recognition speed for various emotional expressions from a speaker's face. In this experiment participants were presented with the same clips as in Experiment 1, but this time vision-only. The clips were shown in successive segments (gates) of increasing duration. Results show that participants are surprisingly accurate in their recognition of the various emotions, as they already reach high recognition scores in the first gate (after only 160 ms). Interestingly, the recognition scores rise faster for positive than for negative conditions. Finally, the gating results suggest that incongruent emotions are perceived as more intense than congruent emotions, as the former receive more extreme recognition scores than the latter, already after a short period of exposure.
Second language experience modulates word retrieval effort in bilinguals: evidence from pupillometry
Schmidtke, Jens
2014-01-01
Bilingual speakers often have less language experience compared to monolinguals as a result of speaking two languages and/or a later age of acquisition of the second language. This may result in weaker and less precise phonological representations of words in memory, which may cause greater retrieval effort during spoken word recognition. To gauge retrieval effort, the present study compared the effects of word frequency, neighborhood density (ND), and level of English experience by testing monolingual English speakers and native Spanish speakers who differed in their age of acquisition of English (early/late). In the experimental paradigm, participants heard English words and matched them to one of four pictures while the pupil size, an indication of cognitive effort, was recorded. Overall, both frequency and ND effects could be observed in the pupil response, indicating that lower frequency and higher ND were associated with greater retrieval effort. Bilingual speakers showed an overall delayed pupil response and a larger ND effect compared to the monolingual speakers. The frequency effect was the same in early bilinguals and monolinguals but was larger in late bilinguals. Within the group of bilingual speakers, higher English proficiency was associated with an earlier pupil response in addition to a smaller frequency and ND effect. These results suggest that greater retrieval effort associated with bilingualism may be a consequence of reduced language experience rather than constitute a categorical bilingual disadvantage. Future avenues for the use of pupillometry in the field of spoken word recognition are discussed. PMID:24600428
Schall, Sonja; von Kriegstein, Katharina
2014-01-01
It has been proposed that internal simulation of the talking face of visually-known speakers facilitates auditory speech recognition. One prediction of this view is that brain areas involved in auditory-only speech comprehension interact with visual face-movement sensitive areas, even under auditory-only listening conditions. Here, we test this hypothesis using connectivity analyses of functional magnetic resonance imaging (fMRI) data. Participants (17 normal participants, 17 developmental prosopagnosics) first learned six speakers via brief voice-face or voice-occupation training (<2 min/speaker). This was followed by an auditory-only speech recognition task and a control task (voice recognition) involving the learned speakers’ voices in the MRI scanner. As hypothesized, we found that, during speech recognition, familiarity with the speaker’s face increased the functional connectivity between the face-movement sensitive posterior superior temporal sulcus (STS) and an anterior STS region that supports auditory speech intelligibility. There was no difference between normal participants and prosopagnosics. This was expected because previous findings have shown that both groups use the face-movement sensitive STS to optimize auditory-only speech comprehension. Overall, the present findings indicate that learned visual information is integrated into the analysis of auditory-only speech and that this integration results from the interaction of task-relevant face-movement and auditory speech-sensitive areas. PMID:24466026
Speaker emotion recognition: from classical classifiers to deep neural networks
NASA Astrophysics Data System (ADS)
Mezghani, Eya; Charfeddine, Maha; Nicolas, Henri; Ben Amar, Chokri
2018-04-01
Speaker emotion recognition is considered among the most challenging tasks of recent years. In fact, automatic systems for security, medicine, or education can be improved by considering the affective state of speech. In this paper, a twofold approach for speech emotion classification is proposed: first, a relevant set of features is adopted; second, numerous supervised training techniques, involving classic methods as well as deep learning, are evaluated. Experimental results indicate that deep architectures can improve classification performance on two affective databases, the Berlin Dataset of Emotional Speech and the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset.
Cross-cultural emotional prosody recognition: evidence from Chinese and British listeners.
Paulmann, Silke; Uskul, Ayse K
2014-01-01
This cross-cultural study of emotional tone-of-voice recognition tests the in-group advantage hypothesis (Elfenbein & Ambady, 2002) employing a quasi-balanced design. Individuals of Chinese and British background were asked to recognise pseudosentences produced by Chinese and British native speakers, displaying one of seven emotions (anger, disgust, fear, happy, neutral tone of voice, sad, and surprise). Findings reveal that emotional displays were recognised at rates higher than predicted by chance; however, members of each cultural group were more accurate in recognising the displays communicated by a member of their own cultural group than by a member of the other cultural group. Moreover, the evaluation of error matrices indicates that both culture groups relied on similar mechanisms when recognising emotional displays from the voice. Overall, the study reveals evidence for both universal and culture-specific principles in vocal emotion recognition.
Advancements in robust algorithm formulation for speaker identification of whispered speech
NASA Astrophysics Data System (ADS)
Fan, Xing
Whispered speech is an alternative speech production mode to neutral speech, used intentionally by talkers in natural conversational scenarios to protect privacy and to avoid certain content being overheard or made public. Due to the profound differences between whispered and neutral speech in the production mechanism, and the absence of whispered adaptation data, the performance of speaker identification systems trained with neutral speech degrades significantly. This dissertation therefore focuses on developing a robust closed-set speaker recognition system for whispered speech using no or limited whispered adaptation data from non-target speakers. The dissertation proposes the concept of "high"/"low" performance whispered data for the purpose of speaker identification. A variety of acoustic properties are identified that contribute to the quality of whispered data. An acoustic analysis is also conducted to compare the phoneme/speaker dependency of the differences between whispered and neutral data in the feature domain. The observations from this acoustic analysis are new in this area and also serve as guidance for developing robust speaker identification systems for whispered speech. The dissertation further proposes two systems for speaker identification of whispered speech. One system focuses on front-end processing: a two-dimensional feature space is proposed to search for "low"-quality whispered utterances, and separate feature mapping functions are applied to vowels and consonants, respectively, in order to retain the speaker information shared between whispered and neutral speech. The other system focuses on speech-mode-independent model training: the proposed method generates pseudo-whispered features from neutral features by using the statistical information contained in a whispered universal background model (UBM) trained on extra whispered data collected from non-target speakers. Four modeling methods are proposed for the transformation estimation used to generate the pseudo-whispered features. Both systems demonstrate a significant improvement over the baseline system on the evaluation data. This dissertation has therefore contributed to a scientific understanding of the differences between whispered and neutral speech, as well as improved front-end processing and modeling methods for speaker identification of whispered speech. Such advancements will ultimately contribute to improving the robustness of speech processing systems.
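As background for the UBM- and GMM-based modeling this dissertation builds on, the following is a minimal sketch of closed-set speaker identification with per-speaker Gaussian mixture models. The data are random stand-ins, and the pseudo-whisper feature transformation is not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# hypothetical MFCC-like training frames for four enrolled speakers
train = {s: rng.normal(loc=s, size=(500, 20)) for s in range(4)}
# one diagonal-covariance GMM per enrolled speaker
models = {s: GaussianMixture(n_components=8, covariance_type="diag").fit(X)
          for s, X in train.items()}

test = rng.normal(loc=2, size=(200, 20))  # frames from an unknown speaker
# closed-set decision: pick the model with the highest average log-likelihood
scores = {s: m.score(test) for s, m in models.items()}
print("identified speaker:", max(scores, key=scores.get))
```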
ELF on a Mushroom: The Overnight Growth in English as a Lingua Franca
ERIC Educational Resources Information Center
Sowden, Colin
2012-01-01
In an effort to curtail native-speaker dominance of global English, and in recognition of the growing role of the language among non-native speakers from different first-language backgrounds, some academics have been urging the teaching of English as a Lingua Franca (ELF). Although at first this proposal seems to offer a plausible alternative to…
Yu, Chengzhu; Hansen, John H L
2017-03-01
Human physiology has evolved to accommodate environmental conditions, including temperature, pressure, and air chemistry unique to Earth. However, the environment in space varies significantly compared to that on Earth and, therefore, variability is expected in astronauts' speech production mechanism. In this study, the variations of astronaut voice characteristics during the NASA Apollo 11 mission are analyzed. Specifically, acoustical features such as fundamental frequency and phoneme formant structure that are closely related to the speech production system are studied. For a further understanding of astronauts' vocal tract spectrum variation in space, a maximum likelihood frequency warping based analysis is proposed to detect the vocal tract spectrum displacement during space conditions. The results from fundamental frequency, formant structure, as well as vocal spectrum displacement indicate that astronauts change their speech production mechanism when in space. Moreover, the experimental results for astronaut voice identification tasks indicate that current speaker recognition solutions are highly vulnerable to astronaut voice production variations in space conditions. Future recommendations from this study suggest that successful applications of speaker recognition during extended space missions require robust speaker modeling techniques that could effectively adapt to voice production variation caused by diverse space conditions.
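The acoustic measurements named above, fundamental frequency and formant structure, can be illustrated with a short generic recipe: YIN for the F0 track and LPC root locations for rough formant candidates. This is not the paper's maximum-likelihood frequency-warping analysis; "utterance.wav" is a hypothetical file, and real formant analyses work frame-by-frame on voiced segments rather than on a whole utterance.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical recording

# fundamental frequency track via the YIN algorithm
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
print("median F0 (Hz):", np.median(f0))

# crude formant candidates: roots of an LPC polynomial fit to the signal
a = librosa.lpc(y, order=int(2 + sr / 1000))
roots = [r for r in np.roots(a) if np.imag(r) > 0]    # keep upper-half-plane roots
freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
freqs = [f for f in freqs if f > 90]                  # discard near-DC roots
print("first three formant candidates (Hz):", freqs[:3])
```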
Dai, Chuanfu; Zhao, Zeqi; Zhang, Duo; Lei, Guanxiong
2018-01-01
Background The aim of this study was to explore the value of the spectral ripple discrimination test in speech recognition evaluation among a deaf (post-lingual) Mandarin-speaking population in China following cochlear implantation. Material/Methods The study included 23 Mandarin-speaking adult subjects with normal hearing (normal-hearing group) and 17 deaf adults who were former Mandarin speakers, with cochlear implants (cochlear implantation group). The normal-hearing subjects were divided into men (n=10) and women (n=13). The spectral ripple discrimination thresholds between the groups were compared, and the correlation between spectral ripple discrimination thresholds and Mandarin speech recognition rates in the cochlear implantation group was studied. Results Spectral ripple discrimination thresholds did not correlate with age (r=−0.19; p=0.22), and there was no significant difference in spectral ripple discrimination thresholds between the male and female groups (p=0.654). Spectral ripple discrimination thresholds of deaf adults with cochlear implants were significantly correlated with monosyllabic recognition rates (r=0.84; p=0.000). Conclusions In a Mandarin-speaking population, spectral ripple discrimination thresholds of normal-hearing individuals were unaffected by both gender and age. Spectral ripple discrimination thresholds were correlated with Mandarin monosyllabic recognition rates of Mandarin-speaking post-lingually deaf adults with cochlear implants. The spectral ripple discrimination test is a promising method for speech recognition evaluation in adults following cochlear implantation in China. PMID:29806954
Shin, Young Hoon; Seo, Jiwon
2016-10-29
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.
Boost OCR accuracy using iVector based system combination approach
NASA Astrophysics Data System (ADS)
Peng, Xujun; Cao, Huaigu; Natarajan, Prem
2015-01-01
Optical character recognition (OCR) is a challenging task because most existing preprocessing approaches are sensitive to writing style, writing material, noise, and image resolution; thus, a single recognition system cannot address all factors of real document images. In this paper, we describe an approach to combining diverse recognition systems using iVector-based features, a method originally developed in the field of speaker verification. Prior to system combination, document images are preprocessed and text line images are extracted with different approaches for each system; an iVector is derived from a high-dimensional supervector for each text line and is used to predict OCR accuracy. We merge hypotheses from multiple recognition systems according to the overlap ratio and the predicted OCR score of the text line images. We present evaluation results on an Arabic document database, where the proposed method is compared against the single best OCR system using the word error rate (WER) metric.
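The selection step can be illustrated with a toy sketch: for each text line, keep the hypothesis whose predicted quality is highest. The predict_quality stub below is a hypothetical stand-in for the iVector-based accuracy predictor, the example strings are invented, and the paper's overlap-ratio handling is omitted.

```python
def predict_quality(hypothesis):
    # hypothetical stand-in for the iVector-based OCR-accuracy predictor;
    # a toy heuristic that prefers lines of roughly twenty characters
    return -abs(len(hypothesis) - 20)

def combine(hypotheses_per_line):
    """hypotheses_per_line: list of lists, one hypothesis per OCR system."""
    return [max(hyps, key=predict_quality) for hyps in hypotheses_per_line]

lines = [["recogniton resuls", "recognition results"],
         ["speaker verification", "speaker verifcation"]]
print(combine(lines))
```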
Shahamiri, Seyed Reza; Salim, Siti Salwah Binti
2014-09-01
Automatic speech recognition (ASR) can be very helpful for speakers who suffer from dysarthria, a neurological disability that damages the control of the motor speech articulators. Although a few attempts have been made to apply ASR technologies to dysarthric speakers, previous studies show that such ASR systems have not attained an adequate level of performance. In this study, a dysarthric multi-networks speech recognizer (DM-NSR) model is provided using a realization of the multi-views, multi-learners approach called multi-nets artificial neural networks, which tolerates the variability of dysarthric speech. In particular, the DM-NSR model employs several ANNs (as learners) to approximate the likelihood of ASR vocabulary words and to deal with the complexity of dysarthric speech. The proposed DM-NSR approach was presented in both speaker-dependent and speaker-independent paradigms. In order to highlight the performance of the proposed model over legacy models, multi-views, single-learner versions of the DM-NSR were also provided and their efficiencies compared in detail. Moreover, a comparison between the prominent dysarthric ASR methods and the proposed one is provided. The results show that the DM-NSR improved the recognition rate by up to 24.67% and reduced the error rate by up to 8.63% over the reference model.
Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.
Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti
2015-01-01
Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.
Noise-immune multisensor transduction of speech
NASA Astrophysics Data System (ADS)
Viswanathan, Vishu R.; Henry, Claudia M.; Derr, Alan G.; Roucos, Salim; Schwartz, Richard M.
1986-08-01
Two types of configurations of multiple sensors were developed, tested, and evaluated in a speech recognition application for robust performance in high levels of acoustic background noise: one type combines the individual sensor signals to provide a single speech-signal input, and the other provides several parallel inputs. For single-input systems, several configurations of multiple sensors were developed and tested. Results from formal speech intelligibility and quality tests in simulated fighter-aircraft cockpit noise show that each of the two-sensor configurations tested outperforms the constituent individual sensors in high noise. Also presented are results comparing the performance of two-sensor configurations and individual sensors in speaker-dependent, isolated-word speech recognition tests performed using a commercial recognizer (Verbex 4000) in simulated fighter-aircraft cockpit noise.
Dickinson, Ann-Marie; Baker, Richard; Siciliano, Catherine; Munro, Kevin J
2014-10-01
To identify which training approach, if any, is most effective for improving perception of frequency-compressed speech, a between-subject design using repeated measures was employed. Forty young adults with normal hearing were randomly allocated to one of four groups: a training group (sentence or consonant) or a control group (passive exposure or test-only). Test and training material differed in terms of material and speaker. On average, sentence training and passive exposure led to significantly improved sentence recognition (11.0% and 11.7%, respectively) compared with the consonant training group (2.5%) and test-only group (0.4%), whilst consonant training led to significantly improved consonant recognition (8.8%) compared with the sentence training group (1.9%), passive exposure group (2.8%), and test-only group (0.8%). Sentence training led to improved sentence recognition, whilst consonant training led to improved consonant recognition; this suggests that learning transferred between speakers and material but not stimuli. Passive exposure to sentence material led to an improvement in sentence recognition equivalent to the gains from active training, which suggests that it may be possible to adapt passively to frequency-compressed speech.
Processing of Acoustic Cues in Lexical-Tone Identification by Pediatric Cochlear-Implant Recipients
ERIC Educational Resources Information Center
Peng, Shu-Chen; Lu, Hui-Ping; Lu, Nelson; Lin, Yung-Song; Deroche, Mickael L. D.; Chatterjee, Monita
2017-01-01
Purpose: The objective was to investigate acoustic cue processing in lexical-tone recognition by pediatric cochlear-implant (CI) recipients who are native Mandarin speakers. Method: Lexical-tone recognition was assessed in pediatric CI recipients and listeners with normal hearing (NH) in 2 tasks. In Task 1, participants identified naturally…
L2 Gender Facilitation and Inhibition in Spoken Word Recognition
ERIC Educational Resources Information Center
Behney, Jennifer N.
2011-01-01
This dissertation investigates the role of grammatical gender facilitation and inhibition in second language (L2) learners' spoken word recognition. Native speakers of languages that have grammatical gender are sensitive to gender marking when hearing and recognizing a word. Gender facilitation refers to when a given noun that is preceded by an…
ERP Evidence of Hemispheric Independence in Visual Word Recognition
ERIC Educational Resources Information Center
Nemrodov, Dan; Harpaz, Yuval; Javitt, Daniel C.; Lavidor, Michal
2011-01-01
This study examined the capability of the left hemisphere (LH) and the right hemisphere (RH) to perform a visual recognition task independently as formulated by the Direct Access Model (Fernandino, Iacoboni, & Zaidel, 2007). Healthy native Hebrew speakers were asked to categorize nouns and non-words (created from nouns by transposing two middle…
Cai, Zhenguang G; Gilbert, Rebecca A; Davis, Matthew H; Gaskell, M Gareth; Farrar, Lauren; Adler, Sarah; Rodd, Jennifer M
2017-11-01
Speech carries accent information relevant to determining the speaker's linguistic and social background. A series of web-based experiments demonstrate that accent cues can modulate access to word meaning. In Experiments 1-3, British participants were more likely to retrieve the American dominant meaning (e.g., hat meaning of "bonnet") in a word association task if they heard the words in an American than a British accent. In addition, results from a speeded semantic decision task (Experiment 4) and sentence comprehension task (Experiment 5) confirm that accent modulates on-line meaning retrieval such that comprehension of ambiguous words is easier when the relevant word meaning is dominant in the speaker's dialect. Critically, neutral-accent speech items, created by morphing British- and American-accented recordings, were interpreted in a similar way to accented words when embedded in a context of accented words (Experiment 2). This finding indicates that listeners do not use accent to guide meaning retrieval on a word-by-word basis; instead they use accent information to determine the dialectic identity of a speaker and then use their experience of that dialect to guide meaning access for all words spoken by that person. These results motivate a speaker-model account of spoken word recognition in which comprehenders determine key characteristics of their interlocutor and use this knowledge to guide word meaning access. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Gay- and Lesbian-Sounding Auditory Cues Elicit Stereotyping and Discrimination.
Fasoli, Fabio; Maass, Anne; Paladino, Maria Paola; Sulpizio, Simone
2017-07-01
The growing body of literature on the recognition of sexual orientation from voice ("auditory gaydar") is silent on the cognitive and social consequences of having a gay-/lesbian- versus heterosexual-sounding voice. We investigated this issue in four studies (overall N = 276), conducted in the Italian language, in which heterosexual listeners were exposed to single-sentence voice samples of gay/lesbian and heterosexual speakers. In all four studies, listeners were found to make gender-typical inferences about the traits and preferences of heterosexual speakers, but gender-atypical inferences about those of gay or lesbian speakers. Behavioral intention measures showed that listeners considered lesbian and gay speakers less suitable for a leadership position, and male (but not female) listeners distanced themselves from gay speakers. Together, this research demonstrates that having a gay-/lesbian- rather than heterosexual-sounding voice has tangible consequences for stereotyping and discrimination.
On the Development of Speech Resources for the Mixtec Language
2013-01-01
The Mixtec language is one of the main native languages in Mexico. In general, due to urbanization, discrimination, and limited attempts to promote the culture, the native languages are disappearing. Most of the information available about the Mixtec language is in written form, as in dictionaries, which, although including examples of how to pronounce Mixtec words, are not as reliable as listening to the correct pronunciation of a native speaker. Formal acoustic resources, such as speech corpora, are almost non-existent for Mixtec, and no speech technologies are known to have been developed for it. This paper presents the development of the following resources for the Mixtec language: (1) a speech database of traditional narratives of the Mixtec culture spoken by a native speaker (labelled at the phonetic and orthographic levels by means of spectral analysis) and (2) a native speaker-adaptive automatic speech recognition (ASR) system (trained with the speech database) integrated with a Mixtec-to-Spanish/Spanish-to-Mixtec text translator. The speech database, although small and limited to a single variant, was reliable enough to build the multiuser speech application, which achieved a mean recognition/translation performance of up to 94.36% in experiments with non-native speakers (the target users). PMID:23710134
When speaker identity is unavoidable: Neural processing of speaker identity cues in natural speech.
Tuninetti, Alba; Chládková, Kateřina; Peter, Varghese; Schiller, Niels O; Escudero, Paola
2017-11-01
Speech sound acoustic properties vary largely across speakers and accents. When perceiving speech, adult listeners normally disregard non-linguistic variation caused by speaker or accent differences, in order to comprehend the linguistic message, e.g. to correctly identify a speech sound or a word. Here we tested whether the process of normalizing speaker and accent differences, facilitating the recognition of linguistic information, is found at the level of neural processing, and whether it is modulated by the listeners' native language. In a multi-deviant oddball paradigm, native and nonnative speakers of Dutch were exposed to naturally-produced Dutch vowels varying in speaker, sex, accent, and phoneme identity. Unexpectedly, the analysis of mismatch negativity (MMN) amplitudes elicited by each type of change shows a large degree of early perceptual sensitivity to non-linguistic cues. This finding on perception of naturally-produced stimuli contrasts with previous studies examining the perception of synthetic stimuli wherein adult listeners automatically disregard acoustic cues to speaker identity. The present finding bears relevance to speech normalization theories, suggesting that at an unattended level of processing, listeners are indeed sensitive to changes in fundamental frequency in natural speech tokens. Copyright © 2017 Elsevier Inc. All rights reserved.
DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1
NASA Astrophysics Data System (ADS)
Garofolo, J. S.; Lamel, L. F.; Fisher, W. M.; Fiscus, J. G.; Pallett, D. S.
1993-02-01
The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing 8 major dialect divisions of American English, each speaking 10 phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence. This release of TIMIT contains several improvements over the prototype CD-ROM released in December 1988: (1) the full 630-speaker corpus, (2) checked and corrected transcriptions, (3) word-alignment transcriptions, (4) NIST SPHERE-headered waveform files and header manipulation software, (5) a phonemic dictionary, (6) new test and training subsets balanced for dialectal and phonetic coverage, and (7) more extensive documentation.
Factors Impacting Recognition of False Collocations by Speakers of English as L1 and L2
ERIC Educational Resources Information Center
Makinina, Olga
2017-01-01
Currently there is a general uncertainty about what makes collocations (i.e., fixed word combinations with specific, not easily interpreted relations between their components) hard for ESL learners to master, and about how to improve collocation recognition and learning process. This study explored and designed a comparative classification of…
ERIC Educational Resources Information Center
Khateb, Asaid; Khateb-Abdelgani, Manal; Taha, Haitham Y.; Ibrahim, Raphiq
2014-01-01
This study aimed at assessing the effects of letter connectivity in Arabic on visual word recognition. For this purpose, reaction times (RTs) and accuracy scores were collected from ninety third-, sixth-, and ninth-grade native Arabic speakers during a lexical decision task, using fully connected (Cw), partially connected (PCw) and…
Detailed Phonetic Labeling of Multi-language Database for Spoken Language Processing Applications
2015-03-01
Report snippet (partially recoverable): …one noisy test condition contains about 60 interfering speakers as well as background music in a bar, under clean-training/noisy-testing settings… a recognition system for Mandarin was developed and tested; character recognition rates as high as 88% were obtained, using approximately 40 training… (the remainder of the snippet is table-of-contents residue, e.g., "Tool_ComputeFeat.m").
Semantic Ambiguity Effects in L2 Word Recognition
ERIC Educational Resources Information Center
Ishida, Tomomi
2018-01-01
The present study examined the ambiguity effects in second language (L2) word recognition. Previous studies on first language (L1) lexical processing have observed that ambiguous words are recognized faster and more accurately than unambiguous words on lexical decision tasks. In this research, L1 and L2 speakers of English were asked whether a…
Using Automatic Speech Recognition Technology with Elicited Oral Response Testing
ERIC Educational Resources Information Center
Cox, Troy L.; Davies, Randall S.
2012-01-01
This study examined the use of automatic speech recognition (ASR) scored elicited oral response (EOR) tests to assess the speaking ability of English language learners. It also examined the relationship between ASR-scored EOR and other language proficiency measures and the ability of the ASR to rate speakers without bias to gender or native…
ERIC Educational Resources Information Center
Calandruccio, Lauren; Zhou, Haibo
2014-01-01
Purpose: To examine whether improved speech recognition during linguistically mismatched target-masker experiments is due to linguistic unfamiliarity of the masker speech or linguistic dissimilarity between the target and masker speech. Method: Monolingual English speakers (n = 20) and English-Greek simultaneous bilinguals (n = 20) listened to…
Video indexing based on image and sound
NASA Astrophysics Data System (ADS)
Faudemay, Pascal; Montacie, Claude; Caraty, Marie-Jose
1997-10-01
Video indexing is a major challenge for both scientific and economic reasons. Information extraction can sometimes be easier from the sound channel than from the image channel. We first present a multi-channel and multi-modal query interface, to query sound, image, and script through 'pull' and 'push' queries. We then summarize the segmentation phase, which needs information from the image channel. Detection of critical segments is proposed; it should speed up both automatic and manual indexing. We then present an overview of the information extraction phase. Information can be extracted from the sound channel through speaker recognition, vocal dictation with unconstrained vocabularies, and script alignment with speech. We present experimental results for these various techniques. Speaker recognition methods were tested on the TIMIT and NTIMIT databases. Vocal dictation was experimented with on newspaper sentences spoken by several speakers. Script alignment was tested on part of a cartoon movie, 'Ivanhoe'. For good-quality sound segments, error rates are low enough for use in indexing applications. Major issues are the processing of sound segments with noise or music, and performance improvement through the use of appropriate, low-cost architectures or networks of workstations.
Onojima, Takayuki; Kitajo, Keiichi; Mizuhara, Hiroaki
2017-01-01
Neural oscillation is attracting attention as an underlying mechanism for speech recognition. Speech intelligibility is enhanced by the synchronization of speech rhythms and slow neural oscillation, which is typically observed in human scalp electroencephalography (EEG). In addition to the effect of neural oscillation, it has been proposed that speech recognition is enhanced by the identification of a speaker's motor signals, which are used for speech production. To verify the relationship between the effect of neural oscillation and motor cortical activity, we measured scalp EEG, and simultaneous EEG and functional magnetic resonance imaging (fMRI), during a speech recognition task in which participants were required to recognize spoken words embedded in noise. We propose an index to quantitatively evaluate the effect of EEG phase on behavioral performance. The results showed that the delta and theta EEG phase before speech input modulated the participants' response times in the speech recognition task. The simultaneous EEG-fMRI experiment showed that slow EEG activity was correlated with motor cortical activity. These results suggest that the effect of the slow oscillatory phase is associated with the activity of the motor cortex during speech recognition.
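A standard way to obtain the pre-stimulus delta/theta phase used in analyses like the one above is band-pass filtering followed by the Hilbert transform. The sketch below is generic, not the study's exact index; the sampling rate, band edges, and speech-onset time are illustrative assumptions, and the EEG trace is synthetic.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 250.0                                # sampling rate in Hz (assumed)
t = np.arange(0, 4, 1 / fs)
eeg = np.random.randn(t.size)             # synthetic stand-in for one EEG channel

# theta band (4-8 Hz); the delta band would use e.g. 1-4 Hz instead
b, a = butter(4, [4.0, 8.0], btype="bandpass", fs=fs)
theta = filtfilt(b, a, eeg)               # zero-phase band-pass filtering
phase = np.angle(hilbert(theta))          # instantaneous phase in radians

onset = int(2.0 * fs)                     # hypothetical speech onset at t = 2 s
print("theta phase just before speech onset:", phase[onset - 1])
```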
Parker, Mark; Cunningham, Stuart; Enderby, Pam; Hawley, Mark; Green, Phil
2006-01-01
The STARDUST project developed robust computer speech recognizers for use by eight people with severe dysarthria and concomitant physical disability to access assistive technologies. Speaker-independent computer speech recognizers trained with normal speech are of limited functional use to those with severe dysarthria, owing to limited and inconsistent proximity to "normal" articulatory patterns. Severe dysarthric output may also be characterized by a small set of distinguishable phonetic tokens, making the acoustic differentiation of target words difficult. Speaker-dependent computer speech recognition using hidden Markov models was achieved by the identification of robust phonetic elements within the individual speaker output patterns. A new system of speech training using computer-generated visual and auditory feedback reduced the inconsistent production of key phonetic tokens over time.
NASA Astrophysics Data System (ADS)
Palaniswamy, Sumithra; Duraisamy, Prakash; Alam, Mohammad Showkat; Yuan, Xiaohui
2012-04-01
Automatic speech processing systems are widely used in everyday life, for example in mobile communication, speech and speaker recognition, and assistance for the hearing impaired. In speech communication systems, the quality and intelligibility of speech are of utmost importance for ease and accuracy of information exchange. To obtain a speech signal that is intelligible and more pleasant to listen to, noise reduction is essential. In this paper a new Time Adaptive Discrete Bionic Wavelet Thresholding (TADBWT) scheme is proposed. The proposed technique uses the Daubechies mother wavelet to achieve better enhancement of speech from additive non-stationary noises which occur in real life, such as street noise and factory noise. Due to the integration of a human auditory system model into the wavelet transform, the bionic wavelet transform (BWT) has great potential for speech enhancement and may open a new path in speech processing. In the proposed technique, a discrete BWT is first applied to noisy speech to derive the TADBWT coefficients. Then the adaptive nature of the BWT is captured by introducing a time-varying linear factor which updates the coefficients at each scale over time. This approach has shown better performance than existing algorithms at lower input SNR due to modified soft level-dependent thresholding on the time-adaptive coefficients. The objective and subjective test results confirmed the competency of the TADBWT technique. The effectiveness of the proposed technique is also evaluated on a speaker recognition task under noisy environments. The recognition results show that the TADBWT technique yields better performance when compared to alternate methods, specifically at lower input SNR.
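The core of such a scheme can be sketched as follows; PyWavelets offers no bionic wavelet transform, so a standard Daubechies decomposition stands in for the BWT, and the linear time-varying factor shown is an assumption for illustration only:

    # Level-dependent soft thresholding with a time-varying scale factor.
    import numpy as np
    import pywt

    def denoise(noisy, wavelet="db8", level=5):
        coeffs = pywt.wavedec(noisy, wavelet, level=level)
        out = [coeffs[0]]                      # keep approximation as-is
        for d in coeffs[1:]:
            sigma = np.median(np.abs(d)) / 0.6745      # robust noise estimate
            thr = sigma * np.sqrt(2 * np.log(len(d)))  # universal threshold
            factor = np.linspace(1.2, 0.8, len(d))     # assumed time adaptation
            out.append(np.sign(d) * np.maximum(np.abs(d) - thr * factor, 0.0))
        return pywt.waverec(out, wavelet)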
ERIC Educational Resources Information Center
Ryba, Ken; McIvor, Tom; Shakir, Maha; Paez, Di
2006-01-01
This study examined continuous automated speech recognition in the university lecture theatre. The participants were both native speakers of English (L1) and English as a second language students (L2) enrolled in an information systems course (Total N=160). After an initial training period, an L2 lecturer in information systems delivered three…
Influences of High and Low Variability on Infant Word Recognition
ERIC Educational Resources Information Center
Singh, Leher
2008-01-01
Although infants begin to encode and track novel words in fluent speech by 7.5 months, their ability to recognize words is somewhat limited at this stage. In particular, when the surface form of a word is altered, by changing the gender or affective prosody of the speaker, infants begin to falter at spoken word recognition. Given that natural…
Identification and tracking of particular speaker in noisy environment
NASA Astrophysics Data System (ADS)
Sawada, Hideyuki; Ohkado, Minoru
2004-10-01
Humans are able to exchange information smoothly using voice under different situations, such as a noisy environment in a crowd or with plural speakers present. We are able to detect the position of a sound source in 3D space, extract a particular sound from mixed sounds, and recognize who is talking. By realizing this mechanism with a computer, new applications can be presented for recording sound with high quality by reducing noise, presenting a clarified sound, and realizing microphone-free speech recognition by extracting a particular sound. This paper introduces realtime detection and identification of a particular speaker in a noisy environment using a microphone array, based on the location of the speaker and individual voice characteristics. The study will be applied to develop an adaptive auditory system for a mobile robot which collaborates with a factory worker.
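A standard building block for locating a speaker with a microphone array is the time difference of arrival between microphone pairs; the paper's exact method is not given in the abstract, so the GCC-PHAT estimator below is only an illustrative sketch:

    # Time-difference-of-arrival between two microphones via GCC-PHAT.
    import numpy as np

    def gcc_phat_delay(sig, ref, fs):
        n = 2 * max(len(sig), len(ref))
        S = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
        cc = np.fft.irfft(S / (np.abs(S) + 1e-12), n)     # phase transform
        cc = np.concatenate((cc[-n // 2:], cc[:n // 2]))  # center zero lag
        return (np.argmax(np.abs(cc)) - n // 2) / fs      # delay in seconds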
2012-03-01
with each SVM discriminating between a pair of the N total speakers in the data set. The N(N - 1)/2 classifiers then vote on the final...classification of a test sample. The Random Forest classifier is an ensemble classifier that votes amongst decision trees generated with each node using...Forest vote, and the effects of overtraining will be mitigated by the fact that each decision tree is overtrained differently (due to the random
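The pairwise voting described in this snippet can be sketched as follows, assuming scikit-learn and precomputed per-utterance feature vectors (feature extraction is out of scope):

    # One-vs-one SVM voting over N speakers: N(N - 1)/2 pairwise models.
    from itertools import combinations
    import numpy as np
    from sklearn.svm import SVC

    def train_pairwise(X, y):
        models = {}
        for a, b in combinations(np.unique(y), 2):   # N(N - 1)/2 pairs
            mask = (y == a) | (y == b)
            models[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])
        return models

    def predict_by_vote(models, x):
        votes = [m.predict(x.reshape(1, -1))[0] for m in models.values()]
        labels, counts = np.unique(votes, return_counts=True)
        return labels[np.argmax(counts)]             # majority of pairwise votes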
Voice input/output capabilities at Perception Technology Corporation
NASA Technical Reports Server (NTRS)
Ferber, Leon A.
1977-01-01
Condensed resumes of key company personnel at the Perception Technology Corporation are presented. The staff possesses expertise in speech recognition, speech synthesis, speaker authentication, and language identification. The capabilities of hardware and software engineers are also included.
Channel Compensation for Speaker Recognition using MAP Adapted PLDA and Denoising DNNs
2016-06-21
improvement has been the availability of large quantities of speaker-labeled data from telephone recordings. For new data applications, such as audio from...microphone channels to the telephone channel. Audio files were rejected if the alignment process failed. At the end of the process a total of 873...Microphone 01 AT3035 (Audio-Technica Studio Mic) 02 MX418S (Shure Gooseneck Mic) 03 Crown PZM Soundgrabber II 04 AT Pro45 (Audio-Technica Hanging Mic
ERIC Educational Resources Information Center
Nober, E. Harris; Seymour, Harry N.
In order to investigate the possible consequences of dialectal differences in the classroom setting relative to the low-income black and white first-grade child and the prospective white middle-class teacher, 25 black and 25 white university listeners yielded speech recognition scores for 48 black and 48 white five-year-old urban school-children…
A Prerequisite to L1 Homophone Effects in L2 Spoken-Word Recognition
ERIC Educational Resources Information Center
Nakai, Satsuki; Lindsay, Shane; Ota, Mitsuhiko
2015-01-01
When both members of a phonemic contrast in L2 (second language) are perceptually mapped to a single phoneme in one's L1 (first language), L2 words containing a member of that contrast can spuriously activate L2 words in spoken-word recognition. For example, upon hearing cattle, Dutch speakers of English are reported to experience activation…
Razza, Sergio; Zaccone, Monica; Meli, Aannalisa; Cristofari, Eliana
2017-12-01
Children affected by hearing loss can experience difficulties in challenging and noisy environments even when deafness is corrected by cochlear implant (CI) devices. These patients have a selective attention deficit in multiple listening conditions. At present, the most effective ways to improve the performance of speech recognition in noise consist of providing CI processors with noise reduction algorithms and of providing patients with bilateral CIs. The aim of this study was to compare speech performance in noise, across increasing noise levels, in CI recipients using two kinds of wireless remote-microphone radio systems that use digital radio frequency transmission: the Roger Inspiro accessory and the Cochlear Wireless Mini Microphone accessory. Eleven young users of the Nucleus Cochlear CP910 CI were studied. The signal-to-noise ratio at a speech reception threshold (SRT) value of 50% was measured in different conditions for each patient: with the CI only, with the Roger, or with the Mini Mic accessory. The effect of applying the SNR-noise reduction (NR) algorithm in each of these conditions was also assessed. The tests were performed with the subject positioned in front of the main speaker, at a distance of 2.5 m. Another two speakers were positioned at 3.5 m. The main speaker presented disyllabic words at 65 dB. A babble noise signal was delivered through the other speakers, with variable intensity. The use of both wireless remote microphones improved the SRT results, and both systems improved speech performance. The gain was higher with the Mini Mic system (SRT = -4.76) than with the Roger system (SRT = -3.01). The addition of the NR algorithm did not further improve the results statistically. There is significant improvement in speech recognition results with both wireless digital remote microphone accessories, in particular with the Mini Mic system when used with the CP910 processor. The use of a remote microphone accessory surpasses the benefit of applying the NR algorithm.
Lexical constraints in second language learning: Evidence on grammatical gender in German
BOBB, SUSAN C.; KROLL, JUDITH F.; JACKSON, CARRIE N.
2015-01-01
The present study asked whether or not the apparent insensitivity of second language (L2) learners to grammatical gender violations reflects an inability to use grammatical information during L2 lexical processing. Native German speakers and English speakers with intermediate to advanced L2 proficiency in German performed a translation-recognition task. On critical trials, an incorrect translation was presented that either matched or mismatched the grammatical gender of the correct translation. Results show interference for native German speakers in conditions in which the incorrect translation matched the gender of the correct translation. Native English speakers, regardless of German proficiency, were insensitive to the gender mismatch. In contrast, these same participants were correctly able to assign gender to critical items. These findings suggest a dissociation between explicit knowledge and the ability to use that information under speeded processing conditions and demonstrate the difficulty of L2 gender processing at the lexical level.
Speaker Recognition Through NLP and CWT Modeling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown-VanHoozer, S.A.; Kercel, S.W.; Tucker, R.W.
The objective of this research is to develop a system capable of identifying speakers on wiretaps from a large database (>500 speakers) with a short search time duration (<30 seconds), and with better than 90% accuracy. Much previous research in speaker recognition has led to algorithms that produced encouraging preliminary results, but were overwhelmed when applied to populations of more than a dozen or so different speakers. The authors are investigating a solution to the "large population" problem by seeking two completely different kinds of characterizing features. These features are extracted using the techniques of Neuro-Linguistic Programming (NLP) and the continuous wavelet transform (CWT). NLP extracts precise neurological, verbal and non-verbal information, and assimilates the information into useful patterns. These patterns are based on specific cues demonstrated by each individual, and provide ways of determining congruency between verbal and non-verbal cues. The primary NLP modalities are characterized through word spotting (verbal predicate cues, e.g., see, sound, feel, etc.) while the secondary modalities would be characterized through the speech transcription used by the individual. This has the practical effect of reducing the size of the search space, and greatly speeding up the process of identifying an unknown speaker. The wavelet-based line of investigation concentrates on using vowel phonemes and non-verbal cues, such as tempo. The rationale for concentrating on vowels is that there are a limited number of vowel phonemes, and at least one of them usually appears in even the shortest of speech segments. Using the fast CWT algorithm, the details of both the formant frequency and the glottal excitation characteristics can be easily extracted from voice waveforms. The differences in the glottal excitation waveforms as well as the formant frequency are evident in the CWT output. More significantly, the CWT reveals significant detail of the glottal excitation waveform.
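The CWT step can be illustrated with PyWavelets; a Morlet wavelet and a linear scale grid are assumptions standing in for the paper's fast CWT:

    # Scalogram of a vowel segment: formant bands show up as ridges, and
    # low-frequency rows carry the periodicity of the glottal excitation.
    import numpy as np
    import pywt

    def vowel_scalogram(frame, fs, n_scales=64):
        scales = np.arange(1, n_scales + 1)
        coeffs, freqs = pywt.cwt(frame, scales, "morl", sampling_period=1.0 / fs)
        return np.abs(coeffs), freqs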
Speaker-sensitive emotion recognition via ranking: Studies on acted and spontaneous speech
Cao, Houwei; Verma, Ragini; Nenkova, Ani
2015-01-01
We introduce a ranking approach for emotion recognition which naturally incorporates information about the general expressivity of speakers. We demonstrate that our approach leads to substantial gains in accuracy compared to conventional approaches. We train ranking SVMs for individual emotions, treating the data from each speaker as a separate query, and combine the predictions from all rankers to perform multi-class prediction. The ranking method provides two natural benefits. It captures speaker specific information even in speaker-independent training/testing conditions. It also incorporates the intuition that each utterance can express a mix of possible emotion and that considering the degree to which each emotion is expressed can be productively exploited to identify the dominant emotion. We compare the performance of the rankers and their combination to standard SVM classification approaches on two publicly available datasets of acted emotional speech, Berlin and LDC, as well as on spontaneous emotional data from the FAU Aibo dataset. On acted data, ranking approaches exhibit significantly better performance compared to SVM classification both in distinguishing a specific emotion from all others and in multi-class prediction. On the spontaneous data, which contains mostly neutral utterances with a relatively small portion of less intense emotional utterances, ranking-based classifiers again achieve much higher precision in identifying emotional utterances than conventional SVM classifiers. In addition, we discuss the complementarity of conventional SVM and ranking-based classifiers. On all three datasets we find dramatically higher accuracy for the test items on whose prediction the two methods agree compared to the accuracy of individual methods. Furthermore on the spontaneous data the ranking and standard classification are complementary and we obtain marked improvement when we combine the two classifiers by late-stage fusion.
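A rough sketch of the ranking scheme, using the standard pairwise-difference reduction of ranking to binary classification in place of a dedicated ranking-SVM package; the per-speaker query construction follows the description above, and everything else is an assumption:

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_ranker(X, y, speakers, emotion):
        # Pairwise differences within each speaker "query": utterances of
        # the target emotion should outrank all other utterances.
        diffs, signs = [], []
        for s in np.unique(speakers):
            q = speakers == s
            pos, neg = X[q & (y == emotion)], X[q & (y != emotion)]
            for p in pos:
                for m in neg:
                    diffs.extend([p - m, m - p])
                    signs.extend([1, -1])
        return LinearSVC().fit(np.array(diffs), np.array(signs))

    def predict_emotion(rankers, x):
        # rankers: dict mapping emotion -> fitted ranker; highest margin wins.
        scores = {e: r.decision_function(x.reshape(1, -1))[0]
                  for e, r in rankers.items()}
        return max(scores, key=scores.get)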
The search for common ground: Part I. Lexical performance by linguistically diverse learners.
Windsor, Jennifer; Kohnert, Kathryn
2004-08-01
This study examines lexical performance by 3 groups of linguistically diverse school-age learners: English-only speakers with primary language impairment (LI), typical English-only speakers (EO), and typical bilingual Spanish-English speakers (BI). The accuracy and response time (RT) of 100 8- to 13-year-old children in word recognition and picture-naming tasks were analyzed. Within each task, stimulus difficulty was manipulated to include very easy stimuli (words that were high frequency/had an early age of acquisition in English) and more difficult stimuli (words of low frequency/late age of acquisition [AOA]). There was no difference among groups in real-word recognition accuracy or RT; all 3 groups showed lower accuracy with low-frequency words. In picture naming, all 3 groups showed a longer RT for words with a late AOA, although AOA had a disproportionate negative impact on BI performance. The EO group was faster and more accurate than both LI and BI groups in conditions with later acquired stimuli. Results are discussed in terms of quantitative differences separating EO children from the other 2 groups and qualitative similarities linking monolingual children with and without LI.
Improving Speaker Recognition by Biometric Voice Deconstruction.
Mazaira-Fernandez, Luis Miguel; Álvarez-Marquina, Agustín; Gómez-Vilda, Pedro
2015-01-01
Person identification, especially in critical environments, has always been a subject of great interest. However, it has gained a new dimension in a world threatened by a new kind of terrorism that uses social networks (e.g., YouTube) to broadcast its message. In this new scenario, classical identification methods (such as fingerprints or face recognition) have been forcedly replaced by alternative biometric characteristics such as voice, as sometimes this is the only feature available. The present study benefits from the advances achieved during last years in understanding and modeling voice production. The paper hypothesizes that a gender-dependent characterization of speakers combined with the use of a set of features derived from the components, resulting from the deconstruction of the voice into its glottal source and vocal tract estimates, will enhance recognition rates when compared to classical approaches. A general description about the main hypothesis and the methodology followed to extract the gender-dependent extended biometric parameters is given. Experimental validation is carried out both on a highly controlled acoustic condition database, and on a mobile phone network recorded under non-controlled acoustic conditions.
ERIC Educational Resources Information Center
Ashrapova, Alsu; Alendeeva, Svetlana
2014-01-01
This article is the result of a study of the influence of English and German on the Russian language during English learning, based on lexical borrowings in the field of economics. This paper discusses the use and recognition of borrowings from the English and German languages by Russian native speakers. The use of lexical borrowings from…
Robust recognition of loud and Lombard speech in the fighter cockpit environment
NASA Astrophysics Data System (ADS)
Stanton, Bill J., Jr.
1988-08-01
There are a number of challenges associated with incorporating speech recognition technology into the fighter cockpit. One of the major problems is the wide range of variability in the pilot's voice that can result from changing levels of stress and workload. Increasing the training set to include abnormal speech is not an attractive option because of the innumerable conditions that would have to be represented and the inordinate amount of time needed to collect such a training set. A more promising approach is to study subsets of abnormal speech that have been produced under controlled cockpit conditions, with the purpose of characterizing reliable shifts that occur relative to normal speech. Such was the aim of this research. Analyses were conducted for 18 features on 17671 phoneme tokens across eight speakers for normal, loud, and Lombard speech. It was discovered that there was a consistent migration of energy in the sonorants. This discovery of reliable energy shifts led to the development of a method to reduce or eliminate these shifts in the Euclidean distances between LPC log magnitude spectra. This method significantly improved recognition performance on loud and Lombard speech. Discrepancies in recognition error rates between normal and abnormal speech were reduced by approximately 50 percent for all eight speakers combined.
Neave-DiToro, Dorothy; Rubinstein, Adrienne; Neuman, Arlene C
2017-05-01
Limited attention has been given to the effects of classroom acoustics at the college level. Many studies have reported that nonnative speakers of English are more likely to be affected by poor room acoustics than native speakers. An important question is how classroom acoustics affect speech perception of nonnative college students. The combined effect of noise and reverberation on the speech recognition performance of college students who differ in age of English acquisition was evaluated under conditions simulating classrooms with reverberation times (RTs) close to ANSI recommended RTs. A mixed design was used in this study. Thirty-six native and nonnative English-speaking college students with normal hearing, ages 18-28 yr, participated. Two groups of nine native participants (native monolingual [NM] and native bilingual) and two groups of nine nonnative participants (nonnative early and nonnative late) were evaluated in noise under three reverberant conditions (0.3, 0.6, and 0.8 sec). A virtual test paradigm was used, which represented a signal reaching a student at the back of a classroom. Speech recognition in noise was measured using the Bamford-Kowal-Bench Speech-in-Noise (BKB-SIN) test, and the signal-to-noise ratio required for correct repetition of 50% of the key words in the stimulus sentences (SNR-50) was obtained for each group in each reverberant condition. A mixed-design analysis of variance was used to determine statistical significance as a function of listener group and RT. SNR-50 was significantly higher for nonnative listeners as compared to native listeners, and a more favorable SNR-50 was needed as RT increased. The most dramatic effect on SNR-50 was found in the group with later acquisition of English, whereas the impact of early introduction of a second language was subtler. At the ANSI standard's maximum recommended RT (0.6 sec), all groups except the NM group exhibited a mild signal-to-noise ratio (SNR) loss. At the 0.8 sec RT, all groups exhibited a mild SNR loss. Acoustics in the classroom are an important consideration for nonnative speakers who are proficient in English and enrolled in college. To address the need for a clearer speech signal by nonnative students (and for all students), universities should follow ANSI recommendations, as well as minimize background noise in occupied classrooms. Behavioral/instructional strategies should be considered to address factors that cannot be compensated for through acoustic design.
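The SNR-50 measure used here can be obtained, in the simplest case, by interpolating percent-correct scores across the tested signal-to-noise ratios; this sketch assumes roughly monotonic data rather than the BKB-SIN's own scoring procedure:

    import numpy as np

    def snr50(snrs_db, percent_correct):
        # Sort by performance so np.interp sees increasing x-values.
        order = np.argsort(percent_correct)
        return float(np.interp(50.0,
                               np.asarray(percent_correct, float)[order],
                               np.asarray(snrs_db, float)[order]))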
Shattuck-Hufnagel, S.; Choi, J. Y.; Moro-Velázquez, L.; Gómez-García, J. A.
2017-01-01
Although a large number of acoustic indicators have already been proposed in the literature to evaluate the hypokinetic dysarthria of people with Parkinson’s Disease, the goal of this work is to identify and interpret new reliable and complementary articulatory biomarkers that could be applied to predict/evaluate Parkinson’s Disease from a diadochokinetic test, contributing to the possibility of a further multidimensional analysis of the speech of parkinsonian patients. The new biomarkers proposed are based on the kinetic behaviour of the envelope trace, which is directly linked with the articulatory dysfunctions introduced by the disease since the early stages. The interest of these new articulatory indicators lies in their ease of identification and interpretation, and their potential to be translated into computer-based automatic methods to screen the disease from the speech. Throughout this paper, the accuracy provided by these acoustic kinetic biomarkers is compared with that obtained with a baseline system based on speaker identification techniques. Results show accuracies around 85% that are in line with those obtained with complex state-of-the-art speaker recognition techniques, but with an easier physical interpretation, which opens the possibility of transfer to a clinical setting.
[Vocal recognition in dental and oral radiology].
La Fianza, A; Giorgetti, S; Marelli, P; Campani, R
1993-10-01
Speech reporting benefits from units which can recognize sentences in any natural language in real time. The use of this method in the everyday practice of radiology departments shows its possible application fields. We used the speech recognition method to report orthopantomographic exams, in order to evaluate the advantages the method offers to the management and quality of reporting exams which are difficult to fit into other closed computed reporting systems. Both speech recognition and the conventional reporting method (tape recording and typewriting) were used to report 760 orthopantomographs. The average time needed to make the report, the legibility (or Flesch) index, as adapted for the Italian language, and finally a clinical index (the subjective opinion of 4 odontostomatologists) were evaluated for each exam, with both techniques. Moreover, errors in speech reporting (crude, human and overall errors) were also evaluated. The advantages of speech reporting consisted in the shorter time needed for the report to become available (2.24 vs 2.99 minutes) (p < 0.0005), in the improved Flesch index (30.62 vs 28.9) and in the clinical index. The data obtained from speech reporting in odontostomatologic radiology were useful not only to reduce the mean reporting time of orthopantomographic exams but also to improve report quality by reducing both grammar and transmission mistakes. However, the basic condition for such results to be obtained is the speaker's skill in making a good report.
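For reference, the classic Flesch reading-ease computation is shown below; the Italian adaptation used in the study re-weights these coefficients, so the English-language constants are only illustrative:

    def flesch_reading_ease(n_words, n_sentences, n_syllables):
        return (206.835
                - 1.015 * (n_words / n_sentences)   # average sentence length
                - 84.6 * (n_syllables / n_words))   # average word length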
Vocal Identity Recognition in Autism Spectrum Disorder.
Lin, I-Fan; Yamada, Takashi; Komine, Yoko; Kato, Nobumasa; Kato, Masaharu; Kashino, Makio
2015-01-01
Voices can convey information about a speaker. When forming an abstract representation of a speaker, it is important to extract relevant features from acoustic signals that are invariant to the modulation of these signals. This study investigated the way in which individuals with autism spectrum disorder (ASD) recognize and memorize vocal identity. The ASD group and control group performed similarly in a task when asked to choose the name of the newly-learned speaker based on his or her voice, and the ASD group outperformed the control group in a subsequent familiarity test when asked to discriminate the previously trained voices and untrained voices. These findings suggest that individuals with ASD recognized and memorized voices as well as the neurotypical individuals did, but they categorized voices in a different way: individuals with ASD categorized voices quantitatively based on the exact acoustic features, while neurotypical individuals categorized voices qualitatively based on the acoustic patterns correlated to the speakers' physical and mental properties.
Casas, Rachel Nichole; Gonzales, Edlin; Aldana-Aragón, Eréndira; Lara-Muñoz, María del Carmen; Kopelowicz, Alex; Andrews, Laura; López, Steven Regeser
2015-01-01
Lack of knowledge about psychosis, a condition oftentimes associated with serious mental illness, may contribute to disparities in mental health service use. Psychoeducational interventions aimed at improving psychosis literacy have attracted significant attention recently, but few have focused on the growing numbers of ethnic and linguistic minorities in countries with large immigrant populations, such as the United States. This paper reports on two studies designed to evaluate the effectiveness of a DVD version of La CLAve, a psychoeducational program that aims to increase psychosis literacy among Spanish-speaking Latinos. Study 1 is a randomized control study to test directly the efficacy of a DVD version of La CLAve for Spanish-speakers across a range of educational backgrounds. Fifty-seven medical students and 68 community residents from Mexico were randomly assigned to view either La CLAve or a psychoeducational program of similar length regarding caregiving. Study 2 employed a single-subjects design to evaluate the effectiveness of the DVD presentation when administered by a community mental health educator. Ninety-three Spanish-speakers from San Diego, California completed assessments both before and after receiving the DVD training. Results from these two studies indicate that the DVD version of La CLAve is capable of producing a range of psychosis literacy gains for Spanish-speakers in both the United States and Mexico, even when administered by a community worker. Thus, it has potential for widespread dissemination and use among underserved communities of Spanish-speaking Latinos and for minimizing disparities in mental health service use, particularly as it relates to insufficient knowledge of psychosis.
Multilevel Analysis in Analyzing Speech Data
ERIC Educational Resources Information Center
Guddattu, Vasudeva; Krishna, Y.
2011-01-01
The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…
NASA Astrophysics Data System (ADS)
Costache, G. N.; Gavat, I.
2004-09-01
Along with the aggressive growth of the amount of digital data available (text, audio samples, digital photos and digital movies, joined all in the multimedia domain), the need for classification, recognition and retrieval of this kind of data has become very important. In this paper a system structure to handle multimedia data from a recognition perspective is presented. The main processing steps realized for the multimedia objects of interest are: first, parameterization by analysis, in order to obtain a description based on features, forming the parameter vector; second, classification, generally with a hierarchical structure, to make the necessary decisions. For audio signals, both speech and music, the derived perceptual features are the mel-cepstral coefficients (MFCC) and the perceptual linear predictive (PLP) coefficients. For images, the derived features are the geometric parameters of the speaker's mouth. The hierarchical classifier generally consists of a clustering stage, based on Kohonen Self-Organizing Maps (SOM), and a final stage based on a powerful classification algorithm called Support Vector Machines (SVM). The system, in specific variants, is applied with good results in two tasks: the first is bimodal speech recognition, which fuses features obtained from the speech signal with features obtained from the speaker's image; the second is music retrieval from a large music database.
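The audio front end of such a system might look like the sketch below, which pools mel-cepstral features per utterance for the final SVM stage; librosa is an assumed dependency and the SOM clustering stage is omitted:

    import numpy as np
    import librosa
    from sklearn.svm import SVC

    def utterance_features(path, n_mfcc=13):
        y, sr = librosa.load(path, sr=None)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # Usage sketch:
    # X = np.stack([utterance_features(p) for p in paths])
    # clf = SVC(kernel="rbf").fit(X, labels)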
Effect of Vowel Context on the Recognition of Initial Consonants in Kannada.
Kalaiah, Mohan Kumar; Bhat, Jayashree S
2017-09-01
The present study was carried out to investigate the effect of vowel context on the recognition of Kannada consonants in quiet for young adults. A total of 17 young adults with normal hearing in both ears participated in the study. The stimuli included consonant-vowel syllables, spoken by 12 native speakers of Kannada. Consonant recognition task was carried out as a closed-set (fourteen-alternative forced-choice). The present study showed an effect of vowel context on the perception of consonants. Maximum consonant recognition score was obtained in the /o/ vowel context, followed by the /a/ and /u/ vowel contexts, and then the /e/ context. Poorest consonant recognition score was obtained in the vowel context /i/. Vowel context has an effect on the recognition of Kannada consonants, and the vowel effect was unique for Kannada consonants.
Carlyon, Robert P; Monstrey, Jolijn; Deeks, John M; Macherey, Olivier
2014-12-01
To evaluate a speech-processing strategy in which the lowest frequency channel is conveyed using an asymmetric pulse shape and "phantom stimulation", where current is injected into one intra-cochlear electrode and where the return current is shared between an intra-cochlear and an extra-cochlear electrode. This strategy is expected to provide more selective excitation of the cochlear apex, compared to a standard strategy where the lowest-frequency channel is conveyed by symmetric pulses in monopolar mode. In both strategies all other channels were conveyed by monopolar stimulation. Within-subjects comparison between the two strategies. Four experiments: (1) discrimination between the strategies, controlling for loudness differences, (2) consonant identification, (3) recognition of lowpass-filtered sentences in quiet, (4) sentence recognition in the presence of a competing speaker. Eight users of the Advanced Bionics CII/Hi-Res 90k cochlear implant. Listeners could easily discriminate between the two strategies but no consistent differences in performance were observed. The proposed method does not improve speech perception, at least in the short term.
Benefits for Voice Learning Caused by Concurrent Faces Develop over Time.
Zäske, Romi; Mühl, Constanze; Schweinberger, Stefan R
2015-01-01
Recognition of personally familiar voices benefits from the concurrent presentation of the corresponding speakers' faces. This effect of audiovisual integration is most pronounced for voices combined with dynamic articulating faces. However, it is unclear if learning unfamiliar voices also benefits from audiovisual face-voice integration or, alternatively, is hampered by attentional capture of faces, i.e., "face-overshadowing". In six study-test cycles we compared the recognition of newly-learned voices following unimodal voice learning vs. bimodal face-voice learning with either static (Exp. 1) or dynamic articulating faces (Exp. 2). Voice recognition accuracies significantly increased for bimodal learning across study-test cycles while remaining stable for unimodal learning, as reflected in numerical costs of bimodal relative to unimodal voice learning in the first two study-test cycles and benefits in the last two cycles. This was independent of whether faces were static images (Exp. 1) or dynamic videos (Exp. 2). In both experiments, slower reaction times to voices previously studied with faces compared to voices only may result from visual search for faces during memory retrieval. A general decrease of reaction times across study-test cycles suggests facilitated recognition with more speaker repetitions. Overall, our data suggest two simultaneous and opposing mechanisms during bimodal face-voice learning: while attentional capture of faces may initially impede voice learning, audiovisual integration may facilitate it thereafter.
Wilson, Richard H
2015-04-01
In 1940, a cooperative effort by the radio networks and Bell Telephone produced the volume unit (vu) meter that has been the mainstay instrument for monitoring the level of speech signals in commercial broadcasting and research laboratories. With the use of computers, today the amplitude of signals can be quantified easily using the root mean square (rms) algorithm. Researchers had previously reported that amplitude estimates of sentences and running speech were 4.8 dB higher when measured with a vu meter than when calculated with rms. This study addresses the vu-rms relation as applied to the carrier phrase and target word paradigm used to assess word-recognition abilities, the premise being that by definition the word-recognition paradigm is a special and different case from that described previously. The purpose was to evaluate the vu and rms amplitude relations for the carrier phrases and target words commonly used to assess word-recognition abilities. In addition, the relations with the target words between rms level and recognition performance were examined. Descriptive and correlational. Two recorded versions of the Northwestern University Auditory Test No. 6 were evaluated, the Auditec of St. Louis (Auditec) male speaker and the Department of Veterans Affairs (VA) female speaker. Using both visual and auditory cues from a waveform editor, the temporal onsets and offsets were defined for each carrier phrase and each target word. The rms amplitudes for those segments then were computed and expressed in decibels with reference to the maximum digitization range. The data were maintained for each of the four Northwestern University Auditory Test No. 6 word lists. Descriptive analyses were used with linear regressions used to evaluate the reliability of the measurement technique and the relation between the rms levels of the target words and recognition performances. Although there was a 1.3 dB difference between the calibration tones, the mean levels of the carrier phrases for the two recordings were -14.8 dB (Auditec) and -14.1 dB (VA) with standard deviations <1 dB. For the target words, the mean amplitudes were -19.9 dB (Auditec) and -18.3 dB (VA) with standard deviations ranging from 1.3 to 2.4 dB. The mean durations for the carrier phrases of both recordings were 593-594 msec, with the mean durations of the target words a little different, 509 msec (Auditec) and 528 msec (VA). Random relations were observed between the recognition performances and rms levels of the target words. Amplitude and temporal data for the individual words are provided. The rms levels of the carrier phrases closely approximated (±1 dB) the rms levels of the calibration tones, both of which were set to 0 vu (dB). The rms levels of the target words were 5-6 dB below the levels of the carrier phrases and were substantially more variable than the levels of the carrier phrases. The relation between the rms levels of the target words and recognition performances on the words was random.
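The rms measurement used throughout can be written directly; the int16 full-scale value below is an assumption about the digitization format:

    import numpy as np

    def rms_db_re_full_scale(samples, full_scale=32768.0):
        # rms level of one segmented phrase or word, in dB re the
        # maximum digitization range.
        rms = np.sqrt(np.mean(np.square(samples.astype(float))))
        return 20.0 * np.log10(rms / full_scale)

A segment at the calibration-tone level would then read near the tone's own dB value, consistent with the carrier-phrase results reported above.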
How Captain Amerika uses neural networks to fight crime
NASA Technical Reports Server (NTRS)
Rogers, Steven K.; Kabrisky, Matthew; Ruck, Dennis W.; Oxley, Mark E.
1994-01-01
Artificial neural network models can make amazing computations. These models are explained along with their application in problems associated with fighting crime. Specific problems addressed are identification of people using face recognition, speaker identification, and fingerprint and handwriting analysis (biometric authentication).
On the Time Course of Vocal Emotion Recognition
Pell, Marc D.; Kotz, Sonja A.
2011-01-01
How quickly do listeners recognize emotions from a speaker's voice, and does the time course for recognition vary by emotion type? To address these questions, we adapted the auditory gating paradigm to estimate how much vocal information is needed for listeners to categorize five basic emotions (anger, disgust, fear, sadness, happiness) and neutral utterances produced by male and female speakers of English. Semantically-anomalous pseudo-utterances (e.g., The rivix jolled the silling) conveying each emotion were divided into seven gate intervals according to the number of syllables that listeners heard from sentence onset. Participants (n = 48) judged the emotional meaning of stimuli presented at each gate duration interval, in a successive, blocked presentation format. Analyses looked at how recognition of each emotion evolves as an utterance unfolds and estimated the “identification point” for each emotion. Results showed that anger, sadness, fear, and neutral expressions are recognized more accurately at short gate intervals than happiness, and particularly disgust; however, as speech unfolds, recognition of happiness improves significantly towards the end of the utterance (and fear is recognized more accurately than other emotions). When the gate associated with the emotion identification point of each stimulus was calculated, data indicated that fear (M = 517 ms), sadness (M = 576 ms), and neutral (M = 510 ms) expressions were identified from shorter acoustic events than the other emotions. These data reveal differences in the underlying time course for conscious recognition of basic emotions from vocal expressions, which should be accounted for in studies of emotional speech processing.
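The "identification point" can be operationalized in several ways; one common choice, assumed here for illustration, is the earliest gate from which all longer gates are also judged correctly:

    def identification_point(correct_by_gate, gate_ms):
        # correct_by_gate: booleans for gates 1..7; gate_ms: their durations.
        for g in range(len(correct_by_gate)):
            if all(correct_by_gate[g:]):
                return gate_ms[g]
        return None   # emotion never reliably identified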
Lu, Lingxi; Bao, Xiaohan; Chen, Jing; Qu, Tianshu; Wu, Xihong; Li, Liang
2018-05-01
Under a noisy "cocktail-party" listening condition with multiple people talking, listeners can use various perceptual/cognitive unmasking cues to improve recognition of the target speech against informational speech-on-speech masking. One potential unmasking cue is the emotion expressed in a speech voice, by means of certain acoustical features. However, it was unclear whether emotionally conditioning a target-speech voice that has none of the typical acoustical features of emotions (i.e., an emotionally neutral voice) can be used by listeners for enhancing target-speech recognition under speech-on-speech masking conditions. In this study we examined the recognition of target speech against a two-talker speech masker both before and after the emotionally neutral target voice was paired with a loud female screaming sound that has a marked negative emotional valence. The results showed that recognition of the target speech (especially the first keyword in a target sentence) was significantly improved by emotionally conditioning the target speaker's voice. Moreover, the emotional unmasking effect was independent of the unmasking effect of the perceived spatial separation between the target speech and the masker. Also, (skin conductance) electrodermal responses became stronger after emotional learning when the target speech and masker were perceptually co-located, suggesting an increase of listening efforts when the target speech was informationally masked. These results indicate that emotionally conditioning the target speaker's voice does not change the acoustical parameters of the target-speech stimuli, but the emotionally conditioned vocal features can be used as cues for unmasking target speech.
Improving language models for radiology speech recognition.
Paulett, John M; Langlotz, Curtis P
2009-02-01
Speech recognition systems have become increasingly popular as a means to produce radiology reports, for reasons both of efficiency and of cost. However, the suboptimal recognition accuracy of these systems can affect the productivity of the radiologists creating the text reports. We analyzed a database of over two million de-identified radiology reports to determine the strongest determinants of word frequency. Our results showed that body site and imaging modality had a similar influence on the frequency of words and of three-word phrases as did the identity of the speaker. These findings suggest that the accuracy of speech recognition systems could be significantly enhanced by further tailoring their language models to body site and imaging modality, which are readily available at the time of report creation.
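A minimal sketch of such tailoring: condition word statistics on body site and imaging modality and smooth them, so that each report context gets its own unigram model (the field names and smoothing constant are assumptions, not the authors' implementation):

    from collections import Counter, defaultdict

    class ConditionalUnigram:
        def __init__(self):
            self.counts = defaultdict(Counter)  # (site, modality) -> counts
            self.totals = Counter()

        def add_report(self, site, modality, words):
            self.counts[(site, modality)].update(words)
            self.totals[(site, modality)] += len(words)

        def prob(self, site, modality, word, vocab_size, alpha=1.0):
            c = self.counts[(site, modality)][word]
            total = self.totals[(site, modality)]
            return (c + alpha) / (total + alpha * vocab_size)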
Speech to Text Translation for Malay Language
NASA Astrophysics Data System (ADS)
Al-khulaidi, Rami Ali; Akmeliawati, Rini
2017-11-01
The speech recognition system is a front-end and back-end process that receives an audio signal uttered by a speaker and converts it into a text transcription. Speech systems can be used in several fields, including therapeutic technology, education, social robotics and computer entertainment. Our system is proposed for control tasks, where speed of performance and response matter, since the system should integrate with other control platforms, such as voice-controlled robots. This brings a need for flexible platforms that can be easily edited to match the functionality of the surroundings, unlike software programs such as MATLAB and Phoenix that require recorded audio and multiple training passes for every entry. In this paper, a speech recognition system for the Malay language is implemented using Microsoft Visual Studio C#. Ninety (90) Malay phrases were tested by ten (10) speakers of both genders in different contexts. The result shows that the overall accuracy (calculated from the confusion matrix) is satisfactory, at 92.69%.
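The reported overall accuracy follows directly from the confusion matrix, with correct recognitions on the diagonal:

    import numpy as np

    def overall_accuracy(cm):
        # cm[i, j]: phrase i spoken, phrase j recognized.
        cm = np.asarray(cm, dtype=float)
        return np.trace(cm) / cm.sum()   # correct trials / all trials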
Mark My Words: Tone of Voice Changes Affective Word Representations in Memory
Schirmer, Annett
2010-01-01
The present study explored the effect of speaker prosody on the representation of words in memory. To this end, participants were presented with a series of words and asked to remember the words for a subsequent recognition test. During study, words were presented auditorily with an emotional or neutral prosody, whereas during test, words were presented visually. Recognition performance was comparable for words studied with emotional and neutral prosody. However, subsequent valence ratings indicated that study prosody changed the affective representation of words in memory. Compared to words with neutral prosody, words with sad prosody were later rated as more negative and words with happy prosody were later rated as more positive. Interestingly, the participants' ability to remember study prosody failed to predict this effect, suggesting that changes in word valence were implicit and associated with initial word processing rather than word retrieval. Taken together these results identify a mechanism by which speakers can have sustained effects on listener attitudes towards word referents.
Constraints on the Transfer of Perceptual Learning in Accented Speech
Eisner, Frank; Melinger, Alissa; Weber, Andrea
2013-01-01
The perception of speech sounds can be re-tuned through a mechanism of lexically driven perceptual learning after exposure to instances of atypical speech production. This study asked whether this re-tuning is sensitive to the position of the atypical sound within the word. We investigated perceptual learning using English voiced stop consonants, which are commonly devoiced in word-final position by Dutch learners of English. After exposure to a Dutch learner’s productions of devoiced stops in word-final position (but not in any other positions), British English (BE) listeners showed evidence of perceptual learning in a subsequent cross-modal priming task, where auditory primes with devoiced final stops (e.g., “seed”, pronounced [siːtʰ]), facilitated recognition of visual targets with voiced final stops (e.g., SEED). In Experiment 1, this learning effect generalized to test pairs where the critical contrast was in word-initial position, e.g., auditory primes such as “town” facilitated recognition of visual targets like DOWN. Control listeners, who had not heard any stops by the speaker during exposure, showed no learning effects. The generalization to word-initial position did not occur when participants had also heard correctly voiced, word-initial stops during exposure (Experiment 2), and when the speaker was a native BE speaker who mimicked the word-final devoicing (Experiment 3). The readiness of the perceptual system to generalize a previously learned adjustment to other positions within the word thus appears to be modulated by distributional properties of the speech input, as well as by the perceived sociophonetic characteristics of the speaker. The results suggest that the transfer of pre-lexical perceptual adjustments that occur through lexically driven learning can be affected by a combination of acoustic, phonological, and sociophonetic factors.
Free Field Word recognition test in the presence of noise in normal hearing adults.
Almeida, Gleide Viviani Maciel; Ribas, Angela; Calleros, Jorge
In ideal listening situations, subjects with normal hearing can easily understand speech, as can many subjects who have a hearing loss. To present the validation of the Word Recognition Test in a Free Field in the Presence of Noise in normal-hearing adults. The sample consisted of 100 healthy adults over 18 years of age with normal hearing. After pure tone audiometry, a speech recognition test was applied in a free-field condition with monosyllables and disyllables, with standardized material, in three listening situations: optimal listening condition (no noise), with a signal-to-noise ratio of 0 dB, and with a signal-to-noise ratio of -10 dB. For these tests, a calibrated free-field environment was arranged in which speech was presented to the subject being tested from two speakers located at 45°, and noise from a third speaker located at 180°. All participants had speech audiometry results in the free field between 88% and 100% in the three listening situations. The Word Recognition Test in Free Field in the Presence of Noise proved to be easy to organize and apply. The results of the test validation suggest that individuals with normal hearing should get between 88% and 100% of the stimuli correct. The test can be an important tool in measuring noise interference on speech perception abilities.
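The 0 dB and -10 dB conditions can be realized by scaling the noise against fixed-level speech; this sketch assumes equal-length, time-aligned signals:

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        ps = np.mean(np.square(speech))
        pn = np.mean(np.square(noise))
        gain = np.sqrt(ps / (pn * 10.0 ** (snr_db / 10.0)))
        return speech + gain * noise   # noise power set snr_db below speech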
NASA Astrophysics Data System (ADS)
Iqbal, Asim; Farooq, Umar; Mahmood, Hassan; Asad, Muhammad Usman; Khan, Akrama; Atiq, Hafiz Muhammad
2010-02-01
A self-teaching image processing and voice recognition based system is developed to educate visually impaired children, chiefly in their primary education. The system comprises a computer, a vision camera, an ear speaker, and a microphone. The camera, attached to the computer, is mounted on the ceiling opposite the desk on which the book is placed (at the required angle). Sample images and voices, in the form of instructions and commands for English and Urdu alphabets, numeric digits, operators, and shapes, are stored in a database. A blind child first reads an embossed character (object) with the fingers and then speaks the answer (the name of the character, its shape, etc.) into the microphone. On receiving the child's voice command through the microphone, an image is taken by the camera and processed by a MATLAB® program, developed with the Image Acquisition and Image Processing toolboxes, which generates a response or the required set of instructions for the child via the ear speaker, resulting in the self-education of a visually impaired child. A speech recognition program, developed in MATLAB® with the Data Acquisition and Signal Processing toolboxes, records and processes the child's commands.
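As an editorial illustration of the loop just described (not from the original paper, which used MATLAB®), the following Python sketch shows the capture-match-respond cycle; the camera, recognizer, and speech-output interfaces are hypothetical, and templates are assumed to be same-sized grayscale arrays.

    # Illustrative stand-in for the described MATLAB pipeline; device APIs are hypothetical.
    import numpy as np

    def match_template(image, templates):
        """Return the label of the stored sample image most similar to `image`."""
        def ncc(a, b):
            # normalized cross-correlation as a simple similarity measure
            a = (a - a.mean()) / (a.std() + 1e-9)
            b = (b - b.mean()) / (b.std() + 1e-9)
            return float((a * b).mean())
        return max(templates, key=lambda label: ncc(image, templates[label]))

    # One teaching round (hypothetical device objects):
    #   spoken = recognizer.listen()              # child's spoken answer
    #   image = camera.read()                     # frame of the embossed character
    #   answer = match_template(image, templates) # what the character actually is
    #   speak("Correct" if spoken == answer else "Try again: this is " + answer)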
MARTI: man-machine animation real-time interface
NASA Astrophysics Data System (ADS)
Jones, Christian M.; Dlay, Satnam S.
1997-05-01
The research introduces MARTI (man-machine animation real-time interface) for the realization of natural human-machine interfacing. The system uses simple vocal sound-tracks of human speakers to provide lip synchronization of computer graphical facial models. We present novel research in a number of engineering disciplines, including speech recognition, facial modeling, and computer animation. This interdisciplinary research utilizes the latest hybrid connectionist/hidden Markov model speech recognition system to provide very accurate phone recognition and timing for speaker-independent continuous speech, and expands on knowledge from the animation industry in the development of accurate facial models and automated animation. The research has many real-world applications, including a highly accurate and 'natural' man-machine interface to assist user interactions with computer systems and communication with one another using human idiosyncrasies; a complete special-effects and animation toolbox providing automatic lip synchronization without the normal constraints of headsets, joysticks, and skilled animators; compression of video data to well below standard telecommunication channel bandwidth for video communications and multimedia systems; assistance in speech training and aids for the handicapped; and player interaction for video gaming and virtual worlds. MARTI has introduced a previously unseen level of realism to man-machine interfacing and special-effect animation.
A voice-input voice-output communication aid for people with severe speech impairment.
Hawley, Mark S; Cunningham, Stuart P; Green, Phil D; Enderby, Pam; Palmer, Rebecca; Sehgal, Siddharth; O'Neill, Peter
2013-01-01
A new form of augmentative and alternative communication (AAC) device for people with severe speech impairment, the voice-input voice-output communication aid (VIVOCA), is described. The VIVOCA recognizes the disordered speech of the user and builds messages, which are converted into synthetic speech. System development was carried out employing user-centered design and development methods, which identified and refined key requirements for the device. A novel methodology for building small-vocabulary, speaker-dependent automatic speech recognizers with reduced amounts of training data was applied. Experiments showed that this method is successful in generating good recognition performance (mean accuracy 96%) on highly disordered speech, even when recognition perplexity is increased. The selected message-building technique traded off various factors, including speed of message construction and range of available message outputs. The VIVOCA was evaluated in a field trial by individuals with moderate to severe dysarthria, which confirmed that they can make use of the device to produce intelligible speech output from disordered speech input. The trial highlighted some issues that limit the performance and usability of the device in real usage situations, with mean recognition accuracy of 67% in these circumstances. These limitations will be addressed in future work.
Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review
NASA Astrophysics Data System (ADS)
Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH
2017-09-01
This paper reviews the state of the art in automatic speech recognition (ASR) based approaches for the speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from a speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR-based solutions are increasingly being researched for speech and language therapy. ASR is a technology that converts human speech into transcript text by matching it against the system's library. This is particularly useful in speech rehabilitation therapies, as it provides an accurate, real-time evaluation of speech input from an individual with a speech disorder. ASR-based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback on their mistakes. However, the accuracy of ASR depends on many factors, such as phoneme recognition, speech continuity, speaker and environmental differences, as well as our depth of knowledge of human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.
Speaker-independent phoneme recognition with a binaural auditory image model
NASA Astrophysics Data System (ADS)
Francis, Keith Ivan
1997-09-01
This dissertation presents phoneme recognition techniques based on a binaural fusion of outputs of the auditory image model and subsequent azimuth-selective phoneme recognition in a noisy environment. Background information concerning speech variations, phoneme recognition, current binaural fusion techniques, and auditory modeling issues is explained. The research is constrained to sources in the frontal azimuthal plane of a simulated listener. A new method based on coincidence detection of neural activity patterns from the auditory image model of Patterson is used for azimuth-selective phoneme recognition. The method is tested at various levels of noise, and the results are reported in contrast to binaural fusion methods based on various forms of correlation, to demonstrate the potential of coincidence-based binaural phoneme recognition. This method overcomes the smearing of fine speech detail typical of correlation-based methods. Nevertheless, coincidence is able to measure the similarity of left and right inputs and fuse them into useful feature vectors for phoneme recognition in noise.
Development of panel loudspeaker system: design, evaluation and enhancement.
Bai, M R; Huang, T
2001-06-01
Panel speakers are investigated in terms of structural vibration and acoustic radiation. A panel speaker primarily consists of a panel and an inertia exciter. Contrary to conventional speakers, flexural resonance is encouraged so that the panel vibrates as randomly as possible. Simulation tools are developed to facilitate the system integration of panel speakers. In particular, electro-mechanical analogy, finite element analysis, and the fast Fourier transform are employed to predict panel vibration and acoustic radiation. Design procedures are also summarized. In order to compare panel speakers with conventional speakers, experimental investigations were undertaken to evaluate the frequency response, directional response, sensitivity, efficiency, and harmonic distortion of both. The results revealed that the panel speakers suffered from a problem of sensitivity and efficiency. To alleviate the problem, a woofer using electronic compensation based on the H2 model-matching principle is utilized to supplement the bass response. As the results indicate, significant improvement over the panel speaker alone was achieved by the combined panel-woofer system.
Translation Ambiguity but Not Word Class Predicts Translation Performance
ERIC Educational Resources Information Center
Prior, Anat; Kroll, Judith F.; Macwhinney, Brian
2013-01-01
We investigated the influence of word class and translation ambiguity on cross-linguistic representation and processing. Bilingual speakers of English and Spanish performed translation production and translation recognition tasks on nouns and verbs in both languages. Words either had a single translation or more than one translation. Translation…
Scenario-Based Spoken Interaction with Virtual Agents
ERIC Educational Resources Information Center
Morton, Hazel; Jack, Mervyn A.
2005-01-01
This paper describes a CALL approach which integrates software for speaker independent continuous speech recognition with embodied virtual agents and virtual worlds to create an immersive environment in which learners can converse in the target language in contextualised scenarios. The result is a self-access learning package: SPELL (Spoken…
Inferring Speaker Affect in Spoken Natural Language Communication
ERIC Educational Resources Information Center
Pon-Barry, Heather Roberta
2013-01-01
The field of spoken language processing is concerned with creating computer programs that can understand human speech and produce human-like speech. Regarding the problem of understanding human speech, there is currently growing interest in moving beyond speech recognition (the task of transcribing the words in an audio stream) and towards…
Event identification by acoustic signature recognition
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dress, W.B.; Kercel, S.W.
1995-07-01
Many events of interest to the security community produce acoustic emissions that are, in principle, identifiable as to cause. Some obvious examples are gunshots, breaking glass, takeoffs and landings of small aircraft, vehicular engine noises, footsteps (high frequencies when on gravel, very low frequencies when on soil), and voices (whispers to shouts). We are investigating wavelet-based methods to extract unique features of such events for classification and identification. We also discuss methods of classification and pattern recognition specifically tailored for acoustic signatures obtained by wavelet analysis. The paper is divided into three parts: completed work, work in progress, and future applications. The completed phase has led to the successful recognition of aircraft types on landing and takeoff. Both small aircraft (twin-engine turboprop) and large (commercial airliners) were included in the study. The project considered the design of a small, field-deployable, inexpensive device. The techniques developed during the aircraft identification phase were then adapted to a multispectral electromagnetic interference monitoring device now deployed in a nuclear power plant. This is a general-purpose wavelet analysis engine, spanning 14 octaves, and can be adapted for other specific tasks. Work in progress is focused on applying the methods previously developed to speaker identification. Some of the problems to be overcome include recognition of sounds as voice patterns as distinct from possible background noises (e.g., music), as well as identification of the speaker from a short-duration voice sample. A generalization of the completed work and the work in progress is a device capable of classifying any number of acoustic events, particularly quasi-stationary events such as engine noises and voices and singular events such as gunshots and breaking glass. We show examples of both kinds of events and discuss their recognition likelihood.
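To make the wavelet feature idea concrete, here is a minimal Python sketch assuming the PyWavelets package and a nearest-centroid classifier; it illustrates the general approach, not the authors' 14-octave field device.

    # Minimal sketch: per-band wavelet energies as an acoustic signature.
    import numpy as np
    import pywt

    def wavelet_band_energies(signal, wavelet="db4", levels=6):
        """Relative energy in each wavelet decomposition band (roughly one octave per level)."""
        coeffs = pywt.wavedec(signal, wavelet, level=levels)
        energies = np.array([np.sum(c ** 2) for c in coeffs])
        return energies / energies.sum()

    def classify(signature, centroids):
        """Nearest-centroid classification; `centroids` maps event label -> mean signature."""
        return min(centroids, key=lambda label: np.linalg.norm(signature - centroids[label]))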
Lee, Soomin; Katsuura, Tetsuo; Shimomura, Yoshihiro
2011-01-01
In recent years, a new type of speaker called the parametric speaker has been used to generate highly directional sound, and these speakers are now commercially available. In our previous study, we verified that the burden of the parametric speaker on endocrine function was lower than that of a general speaker. However, nothing has yet been demonstrated about the effects of distances shorter than 2.6 m between parametric speakers and the human body. Therefore, we investigated the effect of distance on endocrinological function and subjective evaluation. Nine male subjects participated in this study. They completed three consecutive sessions: a 20-min quiet period as a baseline, a 30-min mental task period with general speakers or parametric speakers, and a 20-min recovery period. We measured salivary cortisol and chromogranin A (CgA) concentrations. Furthermore, subjects took the Kwansei-gakuin Sleepiness Scale (KSS) test before and after the task, and a sound quality evaluation test after it. Four experiments, combining speaker condition (general speaker or parametric speaker) with distance condition (0.3 m or 1.0 m), were conducted at the same time of day on separate days. We used three-way repeated-measures ANOVA (speaker factor × distance factor × time factor) to examine the effects of the parametric speaker. We found that endocrinological function did not differ significantly between the speaker conditions or the distance conditions. The results also showed that the physiological burden increased with time, independent of speaker condition and distance condition.
Methods and apparatus for non-acoustic speech characterization and recognition
Holzrichter, John F.
1999-01-01
By simultaneously recording EM wave reflections and acoustic speech information, the positions and velocities of the speech organs as speech is articulated can be defined for each acoustic speech unit. Well-defined time frames and feature vectors describing the speech, to the degree required, can be formed. Such feature vectors can uniquely characterize the speech unit being articulated in each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit recognition, and organ mechanical parameters can be determined.
Robust Speaker Authentication Based on Combined Speech and Voiceprint Recognition
NASA Astrophysics Data System (ADS)
Malcangi, Mario
2009-08-01
Personal authentication is becoming increasingly important in many applications that have to protect proprietary data. Passwords and personal identification numbers (PINs) prove not to be robust enough to ensure that unauthorized people do not use them. Biometric authentication technology may offer a secure, convenient, accurate solution but sometimes fails due to its intrinsically fuzzy nature. This research aims to demonstrate that combining two basic speech processing methods, voiceprint identification and speech recognition, can provide a very high degree of robustness, especially if fuzzy decision logic is used.
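The combination argued for here can be pictured as score-level fusion with simple fuzzy-style rules. The sketch below is illustrative only: it assumes a voiceprint-identification score and a spoken-passphrase match score, each normalized to [0, 1], and the thresholds are placeholders rather than values from the paper.

    # Illustrative fused decision: accept only when the two cues support each other.
    def authenticate(voiceprint_score, passphrase_score, strong=0.8, weak=0.5):
        if voiceprint_score >= strong and passphrase_score >= strong:
            return "accept"
        # one strong cue may compensate for a moderately weak one
        if min(voiceprint_score, passphrase_score) >= weak and \
           max(voiceprint_score, passphrase_score) >= strong:
            return "accept"
        return "reject"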
ERIC Educational Resources Information Center
Kibishi, Hiroshi; Hirabayashi, Kuniaki; Nakagawa, Seiichi
2015-01-01
In this paper, we propose a statistical evaluation method of pronunciation proficiency and intelligibility for presentations made in English by native Japanese speakers. We statistically analyzed the actual utterances of speakers to find combinations of acoustic and linguistic features with high correlation between the scores estimated by the…
Neural Processing of Vocal Emotion and Identity
ERIC Educational Resources Information Center
Spreckelmeyer, Katja N.; Kutas, Marta; Urbach, Thomas; Altenmuller, Eckart; Munte, Thomas F.
2009-01-01
The voice is a marker of a person's identity which allows individual recognition even if the person is not in sight. Listening to a voice also affords inferences about the speaker's emotional state. Both these types of personal information are encoded in characteristic acoustic feature patterns analyzed within the auditory cortex. In the present…
Embedding speech into virtual realities
NASA Technical Reports Server (NTRS)
Bohn, Christian-Arved; Krueger, Wolfgang
1993-01-01
In this work a speaker-independent speech recognition system is presented which is suitable for implementation in Virtual Reality applications. The use of an artificial neural network in connection with a special compression of the acoustic input leads to a system that is robust, fast, easy to use, and needs no additional hardware besides common VR equipment.
Speaker-dependent Multipitch Tracking Using Deep Neural Networks
2015-01-01
connections through time. Studies have shown that RNNs are good at modeling sequential data like handwriting [12] and speech [26]. We plan to explore RNNs in ...
Perceiving and Remembering Events Cross-Linguistically: Evidence from Dual-Task Paradigms
ERIC Educational Resources Information Center
Trueswell, John C.; Papafragou, Anna
2010-01-01
What role does language play during attention allocation in perceiving and remembering events? We recorded adults' eye movements as they studied animated motion events for a later recognition task. We compared native speakers of two languages that use different means of expressing motion (Greek and English). In Experiment 1, eye movements revealed…
2010-12-01
discovered that the NSA is concerned about speaker recognition being vulnerable to man-in-the-middle (MITM) attacks. The professional could tailor an MITM ... with the results of the test against the MITM threat. The Collective Acquisition framework comprises powerful search techniques found in the CRC ...
Effects of Speed of Word Processing on Semantic Access: The Case of Bilingualism
ERIC Educational Resources Information Center
Martin, Clara D.; Costa, Albert; Dering, Benjamin; Hoshino, Noriko; Wu, Yan Jing; Thierry, Guillaume
2012-01-01
Bilingual speakers generally manifest slower word recognition than monolinguals. We investigated the consequences of the word processing speed on semantic access in bilinguals. The paradigm involved a stream of English words and pseudowords presented in succession at a constant rate. English-Welsh bilinguals and English monolinguals were asked to…
Fast Morphological Effects in First and Second Language Word Recognition
ERIC Educational Resources Information Center
Diependaele, Kevin; Dunabeitia, Jon Andoni; Morris, Joanna; Keuleers, Emmanuel
2011-01-01
In three experiments we compared the performance of native English speakers to that of Spanish-English and Dutch-English bilinguals on a masked morphological priming lexical decision task. The results do not show significant differences across the three experiments. In line with recent meta-analyses, we observed a graded pattern of facilitation…
Chinese-Mandarin: Basic Course. Volume VII: Lessons 72-79.
ERIC Educational Resources Information Center
Defense Language Inst., Monterey, CA.
This is the seventh of 16 volumes of audiolingual classroom instruction in Mandarin Chinese. The course is designed to train native English speakers to Level 3 Foreign Service Institute proficiency in comprehension and speaking, and to Level 2 proficiency in reading and writing Mandarin. Facility in the use and recognition of Chinese characters is…
Chinese-Mandarin: Basic Course. Volume IX: Lessons 88-95.
ERIC Educational Resources Information Center
Defense Language Inst., Monterey, CA.
This is the ninth of 16 volumes of audiolingual classroom instruction in Mandarin Chinese. The course is designed to train native English speakers to Level 3 Foreign Service Institute proficiency in comprehension and speaking, and to Level 2 proficiency in reading and writing Mandarin. Facility in the use and recognition of Chinese characters is…
Chinese-Mandarin: Basic Course. Volume VIII: Lessons 80-87.
ERIC Educational Resources Information Center
Defense Language Inst., Monterey, CA.
This is the eighth of 16 volumes of audiolingual classroom instruction in Mandarin Chinese. The course is designed to train native English speakers to Level 3 Foreign Service Institute proficiency in comprehension and speaking, and to Level 2 proficiency in reading and writing Mandarin. Facility in the use and recognition of Chinese characters is…
The Influence of Anticipation of Word Misrecognition on the Likelihood of Stuttering
ERIC Educational Resources Information Center
Brocklehurst, Paul H.; Lickley, Robin J.; Corley, Martin
2012-01-01
This study investigates whether the experience of stuttering can result from the speaker's anticipation of his words being misrecognized. Twelve adults who stutter (AWS) repeated single words into what appeared to be an automatic speech-recognition system. Following each iteration of each word, participants provided a self-rating of whether they…
Sardelis, Stephanie; Drew, Joshua A.
2016-01-01
The scientific community faces numerous challenges in achieving gender equality among its participants. One method of highlighting the contributions made by female scientists is through their selection as featured speakers in symposia held at the conferences of professional societies. Because they are specially invited, symposia speakers obtain a prestigious platform from which to display their scientific research, which can elevate the recognition of female scientists. We investigated the number of female symposium speakers in two professional societies (the Society of Conservation Biology (SCB) from 1999 to 2015, and the American Society of Ichthyologists and Herpetologists (ASIH) from 2005 to 2015), in relation to the number of female symposium organizers. Overall, we found that 36.4% of symposia organizers and 31.7% of symposia speakers were women at the Society of Conservation Biology conferences, while 19.1% of organizers and 28% of speakers were women at the American Society of Ichthyologists and Herpetologists conferences. For each additional female organizer at the SCB and ASIH conferences, there was an average increase of 95% and 70%, respectively, in the number of female speakers. As such, we found a significant positive relationship between the number of women organizing a symposium and the number of women speaking in that symposium. We did not, however, find a significant increase in the number of women speakers or organizers per symposium over time at either conference, suggesting a need for revitalized efforts to diversify our scientific societies. To further those ends, we suggest facilitating gender equality in professional societies by removing barriers to participation, including assisting with travel, making conferences child-friendly, and developing thorough, mandatory Codes of Conduct for all conferences. PMID:27467580
Speech Breathing in Speakers Who Use an Electrolarynx
ERIC Educational Resources Information Center
Bohnenkamp, Todd A.; Stowell, Talena; Hesse, Joy; Wright, Simon
2010-01-01
Speakers who use an electrolarynx following a total laryngectomy no longer require pulmonary support for speech. Subsequently, chest wall movements may be affected; however, chest wall movements in these speakers are not well defined. The purpose of this investigation was to evaluate speech breathing in speakers who use an electrolarynx during…
Tension between scientific certainty and meaning complicates communication of IPCC reports
NASA Astrophysics Data System (ADS)
Hollin, G. J. S.; Pearce, W.
2015-08-01
Here we demonstrate that speakers at the press conference for the publication of the IPCC's Fifth Assessment Report (Working Group 1; ref. ) attempted to make the documented level of certainty of anthropogenic global warming (AGW) more meaningful to the public. Speakers attempted to communicate this through reference to short-term temperature increases. However, when journalists enquired about the similarly short 'pause' in global temperature increase, the speakers dismissed the relevance of such timescales, thus becoming incoherent as to 'what counts' as scientific evidence for AGW. We call this the 'IPCC's certainty trap'. This incoherence led to confusion within the press conference and subsequent condemnation in the media. The speakers were well intentioned in their attempts to communicate the public implications of the report, but these attempts threatened to erode their scientific credibility. In this instance, the certainty trap was the result of the speakers' failure to acknowledge the tensions between scientific and public meanings. Avoiding the certainty trap in the future will require a nuanced accommodation of uncertainties and a recognition that rightful demands for scientific credibility need to be balanced with public and political dialogue about the things we value and the actions we take to protect those things.
Google Home: smart speaker as environmental control unit.
Noda, Kenichiro
2017-08-23
Environmental control units (ECU) are devices or systems that allow a person to control appliances in their home or work environment. Such a system can be utilized by clients with physical and/or functional disabilities to enhance their ability to control their environment, to promote independence, and to improve their quality of life. Over the last several years, there has been an emergence of several inexpensive, commercially available, voice-activated smart speakers on the market, such as Google Home and Amazon Echo. These smart speakers are equipped with far-field microphones that support voice recognition and allow completely hands-free operation for various purposes, including playing music, information retrieval, and, most importantly, environmental control. Clients with disability could utilize these features to turn the unit into a simple ECU that is completely voice activated and wirelessly connected to appliances. Smart speakers, with their ease of setup, low cost, and versatility, may be a more affordable and accessible alternative to the traditional ECU. Implications for Rehabilitation: Environmental control units (ECU) enable independence for physically and functionally disabled clients, and reduce the burden and frequency of demands on carers. Traditional ECU can be costly and may require clients to learn specialized skills to use. Smart speakers have the potential to be used as a new-age ECU by overcoming these barriers, and can be used by a wider range of clients.
Adaptation to novel accents by toddlers
White, Katherine S.; Aslin, Richard N.
2010-01-01
Word recognition is a balancing act: listeners must be sensitive to phonetic detail to avoid confusing similar words, yet, at the same time, be flexible enough to adapt to phonetically variable pronunciations, such as those produced by speakers of different dialects or by non-native speakers. Recent work has demonstrated that young toddlers are sensitive to phonetic detail during word recognition; pronunciations that deviate from the typical phonological form lead to a disruption of processing. However, it is not known whether young word learners show the flexibility that is characteristic of adult word recognition. The present study explores whether toddlers can adapt to artificial accents in which there is a vowel category shift with respect to the native language. 18–20-month-olds heard mispronunciations of familiar words (e.g., vowels were shifted from [a] to [æ]: “dog” pronounced as “dag”). In test, toddlers were tolerant of mispronunciations if they had recently been exposed to the same vowel shift, but not if they had been exposed to standard pronunciations or other vowel shifts. The effects extended beyond particular items heard in exposure to words sharing the same vowels. These results indicate that, like adults, toddlers show flexibility in their interpretation of phonological detail. Moreover, they suggest that effects of top-down knowledge on the reinterpretation of phonological detail generalize across the phono-lexical system. PMID:21479106
SAM: speech-aware applications in medicine to support structured data entry.
Wormek, A. K.; Ingenerf, J.; Orthner, H. F.
1997-01-01
In the last two years, improvements in speech recognition technology have directed the medical community's interest to porting and using such innovations in clinical systems. The acceptance of speech recognition systems in clinical domains increases with recognition speed, a large medical vocabulary, high accuracy, continuous speech recognition, and speaker independence. Although some commercial speech engines approach these requirements, the greatest benefit can be achieved by adapting a speech recognizer to a specific medical application. The goals of our work are, first, to develop a speech-aware core component that is able to establish connections to speech recognition engines of different vendors; this is realized in SAM. Second, with applications based on SAM, we want to support the physician in his/her routine clinical care activities. Within the STAMP project (STAndardized Multimedia report generator in Pathology), we extend SAM by combining a structured data entry approach with speech recognition technology. Another speech-aware application, in the field of diabetes care, is connected to a terminology server. The server delivers a controlled vocabulary which can be used for speech recognition. PMID:9357730
Influence of encoding focus and stereotypes on source monitoring event-related-potentials.
Leynes, P Andrew; Nagovsky, Irina
2016-01-01
Source memory, memory for the origin of a memory, can be influenced by stereotypes and the information of focus during encoding processes. Participants studied words from two different speakers (male or female) using self-focus or other-focus encoding. Source judgments for the speaker's voice and event-related potentials (ERPs) were recorded during test. Self-focus encoding increased dependence on stereotype information and the Late Posterior Negativity (LPN). The results link the LPN with an increase in systematic decision processes, such as consulting prior knowledge to support an episodic memory judgment. In addition, other-focus encoding increased conditional source judgments and resulted in weaker old/new recognition relative to self-focus encoding. The putative correlate of recollection (LPC) was absent during this condition, which was taken as evidence that recollection of partial information supported source judgments. Collectively, the results suggest that other-focus encoding changes source monitoring processing by altering the weight of specific memory features. Copyright © 2015 Elsevier B.V. All rights reserved.
Speech processing using maximum likelihood continuity mapping
Hogden, John E.
2000-01-01
Speech processing is obtained that, given a probabilistic mapping between static speech sounds and pseudo-articulator positions, allows sequences of speech sounds to be mapped to smooth sequences of pseudo-articulator positions. In addition, a method for learning a probabilistic mapping between static speech sounds and pseudo-articulator position is described. The method for learning the mapping between static speech sounds and pseudo-articulator position uses a set of training data composed only of speech sounds. The said speech processing can be applied to various speech analysis tasks, including speech recognition, speaker recognition, speech coding, speech synthesis, and voice mimicry.
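The continuity idea at the heart of this approach can be illustrated with a small dynamic-programming sketch: given per-frame log-probabilities of the observed sound at each discretized pseudo-articulator position, pick the maximum-likelihood path subject to a smoothness penalty. This is an editorial illustration of the principle, not Hogden's algorithm.

    # Smooth maximum-likelihood path over discretized pseudo-articulator positions.
    import numpy as np

    def smooth_path(loglik, penalty=1.0):
        """loglik: (T, K) array of log P(sound_t | position k); returns a position path."""
        T, K = loglik.shape
        pos = np.arange(K)
        trans = -penalty * (pos[None, :] - pos[:, None]) ** 2  # discourage large jumps
        cost = loglik[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            scores = cost[:, None] + trans     # [from k', to k]
            back[t] = scores.argmax(axis=0)
            cost = scores.max(axis=0) + loglik[t]
        path = [int(cost.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]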
Tone Attrition in Mandarin Speakers of Varying English Proficiency
Creel, Sarah C.
2017-01-01
Purpose: The purpose of this study was to determine whether the degree of dominance of Mandarin–English bilinguals' languages affects phonetic processing of tone content in their native language, Mandarin. Method: We tested 72 Mandarin–English bilingual college students with a range of language-dominance profiles in the 2 languages and ages of acquisition of English. Participants viewed 2 photographs at a time while hearing a familiar Mandarin word referring to 1 photograph. The names of the 2 photographs diverged in tone, vowels, or both. Word recognition was evaluated using clicking accuracy, reaction times, and an online recognition measure (gaze) and was compared across the 3 conditions. Results: Relative proficiency in English was correlated with reduced word-recognition success in tone-disambiguated trials, but not in vowel-disambiguated trials, across all 3 dependent measures. This selective attrition for tone content emerged even though all bilinguals had learned Mandarin from birth. Lengthy experience with English thus weakened tone use. Conclusions: This finding has implications for the question of the extent to which bilinguals' 2 phonetic systems interact. It suggests that bilinguals may not process pitch information language-specifically and that processing strategies from the dominant language may affect phonetic processing in the nondominant language, even when the latter was learned natively. PMID:28124064
Robust audio-visual speech recognition under noisy audio-video conditions.
Stewart, Darryl; Seymour, Rowan; Pass, Adrian; Ming, Ji
2014-02-01
This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audio-visual speech recognition in environments where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements and can be used alongside many of the other approaches that have been proposed in the literature for this problem. For evaluation we used the large XM2VTS database for speaker-independent audio-visual speech recognition. The extensive tests include both clean and corrupted utterances, with corruption added to the video and/or audio streams using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison to another well-known dynamic stream weighting approach, and also compared to any fixed-weight integration approach, both in clean conditions and when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams, and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.
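In the spirit of MWSP, the stream weight can be chosen frame by frame so as to maximize the best weighted class posterior, with no noise measurements required. The sketch below illustrates that idea under simplified assumptions (a single audio weight searched over a coarse grid); it is not the paper's exact formulation.

    # Per-frame stream weighting: pick the weight giving the most confident posterior.
    import numpy as np

    def fuse_frame(p_audio, p_video, weights=np.linspace(0.0, 1.0, 11)):
        """p_audio, p_video: class posterior vectors for one frame (each sums to 1)."""
        best_post, best_w = None, None
        for w in weights:
            combined = (p_audio ** w) * (p_video ** (1.0 - w))
            combined = combined / combined.sum()   # renormalize to a distribution
            if best_post is None or combined.max() > best_post.max():
                best_post, best_w = combined, w
        return best_post, best_w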
Quantity Recognition among Speakers of an Anumeric Language
ERIC Educational Resources Information Center
Everett, Caleb; Madora, Keren
2012-01-01
Recent research has suggested that the Piraha, an Amazonian tribe with a number-less language, are able to match quantities greater than 3 if the matching task does not require recall or spatial transposition. This finding contravenes previous work among the Piraha. In this study, we re-tested the Pirahas' performance in the crucial one-to-one…
The Storage and Processing of Morphologically Complex Words in L2 Spanish
ERIC Educational Resources Information Center
Foote, Rebecca
2017-01-01
Research with native speakers indicates that, during word recognition, regularly inflected words undergo parsing that segments them into stems and affixes. In contrast, studies with learners suggest that this parsing may not take place in L2. This study's research questions are: Do L2 Spanish learners store and process regularly inflected,…
ERIC Educational Resources Information Center
Liu, Fang; Xu, Yi; Patel, Aniruddh D.; Francart, Tom; Jiang, Cunmei
2012-01-01
This study examined whether "melodic contour deafness" (insensitivity to the direction of pitch movement) in congenital amusia is associated with specific types of pitch patterns (discrete versus gliding pitches) or stimulus types (speech syllables versus complex tones). Thresholds for identification of pitch direction were obtained using discrete…
ERIC Educational Resources Information Center
Treurniet, William
A study applied artificial neural networks, trained with the back-propagation learning algorithm, to modelling phonemes extracted from the DARPA TIMIT multi-speaker, continuous speech data base. A number of proposed network architectures were applied to the phoneme classification task, ranging from the simple feedforward multilayer network to more…
ERIC Educational Resources Information Center
Malins, Jeffrey G.; Joanisse, Marc F.
2012-01-01
We investigated the influences of phonological similarity on the time course of spoken word processing in Mandarin Chinese. Event related potentials were recorded while adult native speakers of Mandarin ("N" = 19) judged whether auditory words matched or mismatched visually presented pictures. Mismatching words were of the following…
Experimental Pragmatics and What Is Said: A Response to Gibbs and Moise.
ERIC Educational Resources Information Center
Nicolle, Steve; Clark, Billy
1999-01-01
This study attempted to replicate Gibbs and Moise's (1997) experiments regarding the recognition of a distinction between what is said and what is implicated. Results showed that, under certain conditions, subjects selected implicatures when asked to select the paraphrase best reflecting what a speaker has said. Suggests that results can be explained with the…
Costs and Effects of Dual-Language Immersion in the Portland Public Schools
ERIC Educational Resources Information Center
Steele, Jennifer L.; Slater, Robert; Li, Jennifer; Zamarro, Gema; Miller, Trey
2015-01-01
Though it is estimated that about half of the world's population is bilingual, the estimate for the United States is well below 20% (Grosjean, 2010). Amid growing recognition of the need for second language skills to facilitate international commerce and national security and to enhance learning opportunities for non-native speakers of English,…
ERIC Educational Resources Information Center
Tsurutani, Chiharu
2012-01-01
Foreign-accented speakers are generally regarded as less educated, less reliable and less interesting than native speakers and tend to be associated with cultural stereotypes of their country of origin. This discrimination against foreign accents has, however, been discussed mainly using accented English in English-speaking countries. This study…
Emotion Analysis of Telephone Complaints from Customer Based on Affective Computing.
Gong, Shuangping; Dai, Yonghui; Ji, Jun; Wang, Jinzhao; Sun, Hai
2015-01-01
Customer complaints have become important feedback for modern enterprises seeking to improve their product and service quality as well as customer loyalty. As one of the most commonly used channels for customer complaints, telephone communication carries rich emotional information in speech, which provides valuable resources for perceiving customer satisfaction and studying complaint-handling skills. This paper studies the characteristics of telephone complaint speech and proposes an analysis method based on affective computing technology, which can recognize the dynamic changes of customer emotions from the conversations between the service staff and the customer. The recognition process includes speaker recognition, emotional feature parameter extraction, and dynamic emotion recognition. Experimental results show that this method is effective and can reach high recognition rates for happy and angry states. It has been successfully applied to operation quality and service administration in a telecom and Internet service company. PMID:26633967
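The three recognition stages named above can be sketched in miniature. The fragment below is illustrative only, assuming the librosa and scikit-learn libraries, telephone-band audio, and placeholder file paths and labels; it shows an utterance-level feature extractor of the general kind used for emotion classification.

    # Utterance-level prosodic/spectral features feeding a simple emotion classifier.
    import librosa
    import numpy as np
    from sklearn.svm import SVC

    def emotion_features(path):
        y, sr = librosa.load(path, sr=8000)           # telephone-band audio
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        energy = librosa.feature.rms(y=y)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                               [energy.mean(), energy.std()]])

    # X = np.stack([emotion_features(p) for p in segment_paths])  # labeled call segments
    # clf = SVC().fit(X, labels)                                  # e.g., happy / angry / neutral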
V2S: Voice to Sign Language Translation System for Malaysian Deaf People
NASA Astrophysics Data System (ADS)
Mean Foong, Oi; Low, Tang Jung; La, Wai Wan
The process of learning and understanding sign language may be cumbersome to some, and therefore this paper proposes a solution to this problem by providing a voice (English language) to sign language translation system using speech and image processing techniques. Speech processing, which includes speech recognition, is the study of recognizing the words being spoken regardless of who the speaker is. This project uses template-based recognition as its main approach, in which the V2S system first needs to be trained with speech patterns based on some generic spectral parameter set. These spectral parameter sets are then stored as templates in a database. The system performs the recognition process by matching the parameter set of the input speech with the stored templates and finally displays the sign language in video format. Empirical results show that the system has an 80.3% recognition rate.
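Template-based recognition of this kind is commonly implemented with dynamic time warping (DTW) over per-frame spectral feature sequences (e.g., MFCCs). The sketch below is a generic illustration under that assumption, not the V2S implementation.

    # DTW distance between two feature sequences, and template lookup.
    import numpy as np

    def dtw_distance(a, b):
        """a: (Ta, D) and b: (Tb, D) feature sequences; returns the alignment cost."""
        Ta, Tb = len(a), len(b)
        D = np.full((Ta + 1, Tb + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, Ta + 1):
            for j in range(1, Tb + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[Ta, Tb]

    def recognize(utterance, templates):
        """Return the word whose stored template aligns best with the utterance."""
        return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))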
The development of cross-cultural recognition of vocal emotion during childhood and adolescence.
Chronaki, Georgia; Wigelsworth, Michael; Pell, Marc D; Kotz, Sonja A
2018-06-14
Humans have an innate set of emotions that are recognised universally. However, emotion recognition also depends on socio-cultural rules. Although adults recognise vocal emotions universally, they identify emotions more accurately in their native language. We examined developmental trajectories of universal vocal emotion recognition in children. Eighty native English speakers completed a vocal emotion recognition task in their native language (English) and in foreign languages (Spanish, Chinese, and Arabic), with stimuli expressing anger, happiness, sadness, fear, and neutrality. Emotion recognition was compared across 8-to-10-year-olds, 11-to-13-year-olds, and adults. Measures of behavioural and emotional problems were also taken. Results showed that although emotion recognition was above chance for all languages, native English-speaking children were more accurate at recognising vocal emotions in their native language. There was a larger improvement in recognising vocal emotion in the native language during adolescence. Vocal anger recognition did not improve with age for the non-native languages. This is the first study to demonstrate the universality of vocal emotion recognition in children whilst supporting an "in-group advantage" for more accurate recognition in the native language. The findings highlight the role of experience in emotion recognition, have implications for child development in modern multicultural societies, and address important theoretical questions about the nature of emotions.
Speaker and Observer Perceptions of Physical Tension during Stuttering.
Tichenor, Seth; Leslie, Paula; Shaiman, Susan; Yaruss, J Scott
2017-01-01
Speech-language pathologists routinely assess physical tension during evaluation of those who stutter. If speakers experience tension that is not visible to clinicians, then judgments of severity may be inaccurate. This study addressed this potential discrepancy by comparing judgments of tension by people who stutter and expert clinicians to determine if clinicians could accurately identify the speakers' experience of physical tension. Ten adults who stutter were audio-video recorded in two speaking samples. Two board-certified specialists in fluency evaluated the samples using the Stuttering Severity Instrument-4 and a checklist adapted for this study. Speakers rated their tension using the same forms, and then discussed their experiences in a qualitative interview so that themes related to physical tension could be identified. The degree of tension reported by speakers was higher than that observed by specialists. Tension in parts of the body that were less visible to the observer (chest, abdomen, throat) was reported more by speakers than by specialists. The thematic analysis revealed that speakers' experience of tension changes over time and that these changes may be related to speakers' acceptance of stuttering. The lack of agreement between speaker and specialist perceptions of tension suggests that using self-reports is a necessary component for supporting the accurate diagnosis of tension in stuttering. © 2018 S. Karger AG, Basel.
Seeing a singer helps comprehension of the song's lyrics.
Jesse, Alexandra; Massaro, Dominic W
2010-06-01
When listening to speech, we often benefit when also seeing the speaker's face. If this advantage is not domain specific for speech, the recognition of sung lyrics should also benefit from seeing the singer's face. By independently varying the sight and sound of the lyrics, we found a substantial comprehension benefit of seeing a singer. This benefit was robust across participants, lyrics, and repetition of the test materials. This benefit was much larger than the benefit for sung lyrics obtained in previous research, which had not provided the visual information normally present in singing. Given that the comprehension of sung lyrics benefits from seeing the singer, just like speech comprehension benefits from seeing the speaker, both speech and music perception appear to be multisensory processes.
ERIC Educational Resources Information Center
Gilbert, Harvey R.; Ferrand, Carole T.
1987-01-01
Respirometric quotients (RQ), the ratio of oral air volume expended to total air volume expended, were obtained from the oral and nasal airflow productions of 10 speakers with cleft palate, with and without their prosthetic appliances, and of 10 normal speakers. Cleft palate speakers without their appliances exhibited the lowest RQ values. (Author/DB)
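In symbols, writing V_oral for the oral air volume expended and V_total for the total volume expended (the abstract implies that the total is the sum of oral and nasal volumes):

    \mathrm{RQ} = \frac{V_{\mathrm{oral}}}{V_{\mathrm{total}}} = \frac{V_{\mathrm{oral}}}{V_{\mathrm{oral}} + V_{\mathrm{nasal}}}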
Continuing Medical Education Speakers with High Evaluation Scores Use more Image-based Slides.
Ferguson, Ian; Phillips, Andrew W; Lin, Michelle
2017-01-01
Although continuing medical education (CME) presentations are common across the health professions, it is unknown whether slide design is independently associated with audience evaluations of the speaker. Based on the conceptual framework of Mayer's theory of multimedia learning, this study aimed to determine whether image use and text density in presentation slides are associated with overall speaker evaluations. This retrospective analysis of six sequential CME conferences (two annual emergency medicine conferences over a three-year period) used a mixed linear regression model to assess whether post-conference speaker evaluations were associated with image fraction (the percentage of image-based slides per presentation) and text density (the number of words per slide). A total of 105 unique lectures were given by 49 faculty members, and 1,222 evaluations (70.1% response rate) were available for analysis. On average, 47.4% (SD=25.36) of slides had at least one educationally relevant image (image fraction). Image fraction significantly predicted higher overall evaluation scores [F(1, 100.676)=6.158, p=0.015] in the mixed linear regression model. The mean (SD) text density was 25.61 (8.14) words/slide but was not a significant predictor [F(1, 86.293)=0.55, p=0.815]. Of note, the individual speaker [χ2(1)=2.952, p=0.003] and speaker seniority [F(3, 59.713)=4.083, p=0.011] significantly predicted higher scores. This is the first published study to date assessing the link between slide design and CME speaker evaluations by an audience of practicing clinicians. The incorporation of images was associated with higher evaluation scores, in alignment with Mayer's theory of multimedia learning. Contrary to this theory, however, text density showed no significant association, suggesting that these scores may be multifactorial. Professional development efforts should focus on teaching best practices in both slide design and presentation skills.
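A model of the kind described (fixed effects for the slide-design measures, a random effect for the individual speaker) can be sketched with statsmodels; the file name and column names below are placeholders, not the study's dataset.

    # Mixed linear model: evaluation score vs. slide design, grouped by speaker.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("evaluations.csv")   # hypothetical: one row per evaluation
    model = smf.mixedlm("score ~ image_fraction + text_density",
                        data=df, groups=df["speaker"])
    result = model.fit()
    print(result.summary())               # fixed-effect estimates and p-values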
Watch what you say, your computer might be listening: A review of automated speech recognition
NASA Technical Reports Server (NTRS)
Degennaro, Stephen V.
1991-01-01
Spoken language is the most convenient and natural means by which people interact with each other and is, therefore, a promising candidate for human-machine interactions. Speech also offers an additional channel for hands-busy applications, complementing the use of motor output channels for control. Current speech recognition systems vary considerably across a number of important characteristics, including vocabulary size, speaking mode, training requirements for new speakers, robustness to acoustic environments, and accuracy. Algorithmically, these systems range from rule-based techniques through more probabilistic or self-learning approaches such as hidden Markov modeling and neural networks. This tutorial begins with a brief summary of the relevant features of current speech recognition systems and the strengths and weaknesses of the various algorithmic approaches.
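Of the approaches mentioned, hidden Markov modeling is the easiest to show compactly: the forward algorithm scores how well an observation sequence fits a word model. The toy sketch below (discrete observations, dense matrices) is purely illustrative.

    # Forward algorithm: P(observation sequence | HMM).
    import numpy as np

    def forward(pi, A, B, obs):
        """pi: (N,) initial state probs; A: (N, N) transition probs;
        B: (N, M) emission probs; obs: sequence of observation-symbol indices."""
        alpha = pi * B[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
        return alpha.sum()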
Nirme, Jens; Haake, Magnus; Lyberg Åhlander, Viveka; Brännström, Jonas; Sahlén, Birgitta
2018-04-05
Seeing a speaker's face facilitates speech recognition, particularly under noisy conditions. Evidence for how it might affect comprehension of the content of the speech is more sparse. We investigated how children's listening comprehension is affected by multi-talker babble noise, with or without presentation of a digitally animated virtual speaker, and whether successful comprehension is related to performance on a test of executive functioning. We performed a mixed-design experiment with 55 (34 female) participants (8- to 9-year-olds), recruited from Swedish elementary schools. The children were presented with four different narratives, each in one of four conditions: audio-only presentation in a quiet setting, audio-only presentation in noisy setting, audio-visual presentation in a quiet setting, and audio-visual presentation in a noisy setting. After each narrative, the children answered questions on the content and rated their perceived listening effort. Finally, they performed a test of executive functioning. We found significantly fewer correct answers to explicit content questions after listening in noise. This negative effect was only mitigated to a marginally significant degree by audio-visual presentation. Strong executive function only predicted more correct answers in quiet settings. Altogether, our results are inconclusive regarding how seeing a virtual speaker affects listening comprehension. We discuss how methodological adjustments, including modifications to our virtual speaker, can be used to discriminate between possible explanations to our results and contribute to understanding the listening conditions children face in a typical classroom.
Evaluation of speaker de-identification based on voice gender and age conversion
NASA Astrophysics Data System (ADS)
Přibil, Jiří; Přibilová, Anna; Matoušek, Jindřich
2018-03-01
Two basic tasks are covered in this paper. The first consists in the design and practical testing of a new method for voice de-identification that changes the apparent age and/or gender of a speaker by multi-segmental frequency scale transformation combined with prosody modification. The second task is aimed at verifying the applicability of a classifier based on Gaussian mixture models (GMM) to detect the original Czech and Slovak speakers after voice de-identification has been applied. The performed experiments confirm the functionality of the developed gender and age conversion for all selected types of de-identification, which can be objectively evaluated by the GMM-based open-set classifier. The original speaker detection accuracy was also compared for sentences uttered by German and English speakers, showing the language independence of the proposed method.
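A GMM-based open-set classifier of the kind used for evaluation here can be sketched with scikit-learn: one mixture model per enrolled speaker, plus a log-likelihood threshold for rejecting voices that match no enrolled model. Feature extraction, the component count, and the threshold are assumptions of the sketch, not the paper's settings.

    # Open-set speaker detection with per-speaker Gaussian mixture models.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def enroll(features_by_speaker, n_components=16):
        """features_by_speaker: speaker -> (n_frames, n_dims) feature array."""
        return {spk: GaussianMixture(n_components).fit(feats)
                for spk, feats in features_by_speaker.items()}

    def detect(models, test_features, threshold):
        """Best-matching enrolled speaker, or None (open set) if below threshold."""
        scores = {spk: m.score(test_features) for spk, m in models.items()}  # avg log-lik
        best = max(scores, key=scores.get)
        return best if scores[best] >= threshold else None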
"That Sounds So Cooool": Entanglements of Children, Digital Tools, and Literacy Practices
ERIC Educational Resources Information Center
Toohey, Kelleen; Dagenais, Diane; Fodor, Andreea; Hof, Linda; Nuñez, Omar; Singh, Angelpreet; Schulze, Liz
2015-01-01
Many observers have argued that minority language speakers often have difficulty with school-based literacy and that the poorer school achievement of such learners occurs at least partly as a result of these difficulties. At the same time, many have argued for a recognition of the multiple literacies required for citizens in a 21st century world.…
Pitch-Based Segregation of Reverberant Speech
2005-02-01
speaker recognition in real environments, audio information retrieval and hearing prosthesis. Second, although binaural listening improves the ... intelligibility of target speech under anechoic conditions (Bronkhorst, 2000), this binaural advantage is largely eliminated by reverberation (Plomp, 1976 ... Brown and Cooke, 1994; Wang and Brown, 1999; Hu and Wang, 2004) as well as in binaural separation (e.g., Roman et al., 2003; Palomaki et al., 2004
Facilitating Comprehension of Non-Native English Speakers during Lectures in English with STR-Texts
ERIC Educational Resources Information Center
Shadiev, Rustam; Wu, Ting-Ting; Huang, Yueh-Min
2018-01-01
We provided texts generated by speech-to-text recognition (STR) technology for non-native English speaking students during lectures in English in order to test whether STR-texts were useful for enhancing students' comprehension of lectures. To this end, we carried out an experiment in which 60 participants were randomly assigned to a control group…
ERIC Educational Resources Information Center
Scott Instruments Corp., Denton, TX.
This project was designed to develop techniques for adding low-cost speech synthesis to educational software. Four tasks were identified for the study: (1) select a microcomputer with a built-in analog-to-digital converter that is currently being used in educational environments; (2) determine the feasibility of implementing expansion and playback…
Crossmodal and Incremental Perception of Audiovisual Cues to Emotional Speech
ERIC Educational Resources Information Center
Barkhuysen, Pashiera; Krahmer, Emiel; Swerts, Marc
2010-01-01
In this article we report on two experiments about the perception of audiovisual cues to emotional speech. The article addresses two questions: (1) how do visual cues from a speaker's face to emotion relate to auditory cues, and (2) what is the recognition speed for various facial cues to emotion? Both experiments reported below are based on tests…
Transitivity, Space, and Hand: The Spatial Grounding of Syntax
ERIC Educational Resources Information Center
Boiteau, Timothy W.; Almor, Amit
2017-01-01
Previous research has linked the concept of number and other ordinal series to space via a spatially oriented mental number line. In addition, it has been shown that in visual scene recognition and production, speakers of a language with a left-to-right orthography respond faster to and tend to draw images in which the agent of an action is…
Vogel, Bastian D; Brück, Carolin; Jacob, Heike; Eberle, Mark; Wildgruber, Dirk
2016-07-07
Impaired interpretation of nonverbal emotional cues in patients with schizophrenia has been reported in several studies, and a clinical relevance of these deficits for social functioning has been assumed. However, it is unclear to what extent the impairments depend on specific emotions or specific channels of nonverbal communication. Here, the effect of cue modality and emotional categories on accuracy of emotion recognition was evaluated in 21 patients with schizophrenia and compared to a healthy control group (n = 21). To this end, dynamic stimuli comprising speakers of both genders in three different sensory modalities (auditory, visual and audiovisual) and five emotional categories (happy, alluring, neutral, angry and disgusted) were used. Patients with schizophrenia were found to be impaired in emotion recognition in comparison to the control group across all stimuli. Considering specific emotions, more severe deficits were revealed in the recognition of alluring stimuli and less severe deficits in the recognition of disgusted stimuli as compared to all other emotions. Regarding cue modality, the extent of the impairment in emotion recognition did not significantly differ between auditory and visual cues across all emotional categories. However, patients with schizophrenia showed significantly more severe disturbances for vocal as compared to facial cues when sexual interest is expressed (alluring stimuli), whereas more severe disturbances for facial as compared to vocal cues were observed when happiness or anger is expressed. Our results confirmed that perceptual impairments can be observed for vocal as well as facial cues conveying various social and emotional connotations. The observed differences in severity of impairments, with the most severe deficits for alluring expressions, might be related to specific difficulties in recognizing the complex social emotional information of interpersonal intentions as compared to "basic" emotional states. Therefore, future studies evaluating perception of nonverbal cues should consider a broader range of social and emotional signals beyond basic emotions, including attitudes and interpersonal intentions. Identifying specific domains of social perception particularly prone to misunderstandings in patients with schizophrenia might allow for a refinement of interventions aiming at improving social functioning.
NASA Technical Reports Server (NTRS)
1973-01-01
The development, construction, and test of a 100-word vocabulary near real time word recognition system are reported. Included are reasonable replacement of any one or all 100 words in the vocabulary, rapid learning of a new speaker, storage and retrieval of training sets, verbal or manual single word deletion, continuous adaptation with verbal or manual error correction, on-line verification of vocabulary as spoken, system modes selectable via verification display keyboard, relationship of classified word to neighboring word, and a versatile input/output interface to accommodate a variety of applications.
EFL Teachers' Responses to L2 Writing.
ERIC Educational Resources Information Center
Chang, Yuh-Fang
This study investigated differences in the product and process of evaluating second language compositions by Taiwanese speakers of English. It examined whether such factors as language background (native English speaker versus native Chinese speaker), academic discipline, and educational background affected raters' scoring outcomes; whether rating…
Automated Intelligibility Assessment of Pathological Speech Using Phonological Features
NASA Astrophysics Data System (ADS)
Middag, Catherine; Martens, Jean-Pierre; Van Nuffelen, Gwen; De Bodt, Marc
2009-12-01
It is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. People have therefore put a lot of effort into the design of perceptual intelligibility rating tests. These tests usually have the drawback that they employ unnatural speech material (e.g., nonsense words) and that they cannot fully exclude errors due to listener bias. Therefore, there is a growing interest in the application of objective automatic speech recognition technology to automate the intelligibility assessment. Current research is headed towards the design of automated methods which can be shown to produce ratings that correspond well with those emerging from a well-designed and well-performed perceptual test. In this paper, a novel methodology that is built on previous work (Middag et al., 2008) is presented. It utilizes phonological features, automatic speech alignment based on acoustic models that were trained on normal speech, context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system reveals that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.
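The final stage of the methodology, a small trainable model mapping speaker features to an intelligibility rating, lends itself to a compact sketch. The code below cross-validates a ridge regression and reports the RMSE on the 0-100 scale; the feature matrix, corpus size, and choice of ridge regression are illustrative assumptions, not the authors' exact model.

```python
# A minimal sketch of the final prediction stage: mapping per-speaker
# feature vectors to perceived intelligibility with a small linear model.
# Extraction of phonological features is assumed to have happened already;
# names, dimensions, and the synthetic data are illustrative only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_speakers, n_features = 120, 30                # hypothetical corpus size
X = rng.normal(size=(n_speakers, n_features))   # speaker feature vectors
y = np.clip(60 + X[:, :5].sum(axis=1) * 4
            + rng.normal(scale=5, size=n_speakers), 0, 100)

model = Ridge(alpha=1.0)                        # small model, few samples
y_hat = cross_val_predict(model, X, y, cv=10)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))       # paper reports RMSE as low as 8
print(f"cross-validated RMSE: {rmse:.1f} (0-100 intelligibility scale)")
```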
Development of a Self-Report Tool to Evaluate Hearing Aid Outcomes among Chinese Speakers
ERIC Educational Resources Information Center
Wong, Lena L. N.; Hang, Na
2014-01-01
Purpose: This article reports on the development of a self-report tool--the Chinese Hearing Aid Outcomes Questionnaire (CHAOQ)--to evaluate hearing aid outcomes among Chinese speakers. Method: There were 4 phases to construct the CHAOQ and evaluate its psychometric properties. First, items were selected to evaluate a range of culturally relevant…
Attempting to "Increase Intake from the Input": Attention and Word Learning in Children with Autism.
Tenenbaum, Elena J; Amso, Dima; Righi, Giulia; Sheinkopf, Stephen J
2017-06-01
Previous work has demonstrated that social attention is related to early language abilities. We explored whether we can facilitate word learning among children with autism by directing attention to areas of the scene that have been demonstrated as relevant for successful word learning. We tracked eye movements to faces and objects while children watched videos of a woman teaching them new words. Test trials measured participants' recognition of these novel word-object pairings. Results indicate that for children with autism and typically developing children, pointing to the speaker's mouth while labeling a novel object impaired performance, likely because it distracted participants from the target object. In contrast, for children with autism, holding the object close to the speaker's mouth improved performance.
Inferring speaker attributes in adductor spasmodic dysphonia: ratings from unfamiliar listeners.
Isetti, Derek; Xuereb, Linnea; Eadie, Tanya L
2014-05-01
To determine whether unfamiliar listeners' perceptions of speakers with adductor spasmodic dysphonia (ADSD) differ from control speakers on the parameters of relative age, confidence, tearfulness, and vocal effort and are related to speaker-rated vocal effort or voice-specific quality of life. Twenty speakers with ADSD (including 6 speakers with ADSD plus tremor) and 20 age- and sex-matched controls provided speech recordings, completed a voice-specific quality-of-life instrument (Voice Handicap Index; Jacobson et al., 1997), and rated their own vocal effort. Twenty listeners evaluated speech samples for relative age, confidence, tearfulness, and vocal effort using rating scales. Listeners judged speakers with ADSD as sounding significantly older, less confident, more tearful, and more effortful than control speakers (p < .01). Increased vocal effort was strongly associated with decreased speaker confidence (rs = .88-.89) and sounding more tearful (rs = .83-.85). Self-rated speaker effort was moderately related (rs = .45-.52) to listener impressions. Listeners' perceptions of confidence and tearfulness were also moderately associated with higher Voice Handicap Index scores (rs = .65-.70). Unfamiliar listeners judge speakers with ADSD more negatively than control speakers, with judgments extending beyond typical clinical measures. The results have implications for counseling and understanding the psychosocial effects of ADSD.
Dysprosody and Stimulus Effects in Cantonese Speakers with Parkinson's Disease
ERIC Educational Resources Information Center
Ma, Joan K.-Y.; Whitehill, Tara; Cheung, Katherine S.-K.
2010-01-01
Background: Dysprosody is a common feature in speakers with hypokinetic dysarthria. However, speech prosody varies across different types of speech materials. This raises the question of what is the most appropriate speech material for the evaluation of dysprosody. Aims: To characterize the prosodic impairment in Cantonese speakers with…
U.S. Air Forces Escape and Evasion Society Recognition Act of 2014
Rep. Tsongas, Niki [D-MA-3
2014-05-20
House - 05/20/2014 Referred to the Committee on Financial Services, and in addition to the Committee on House Administration, for a period to be subsequently determined by the Speaker, in each case for consideration of such provisions as fall within the jurisdiction of the committee... (All Actions) Tracker: This bill has the status Introduced.
ERIC Educational Resources Information Center
Segal, Osnat; Kishon-Rabin, Liat
2017-01-01
Purpose: The stressed word in a sentence (narrow focus [NF]) conveys information about the intent of the speaker and is therefore important for processing spoken language and in social interactions. The ability of participants with severe-to-profound prelingual hearing loss to comprehend NF has rarely been investigated. The purpose of this study…
Spanish as the Second National Language of the United States: Fact, Future, Fiction, or Hope?
ERIC Educational Resources Information Center
Macías, Reynaldo F.
2014-01-01
The status of a language is very often described and measured by different factors, including the length of time it has been in use in a particular territory, the official recognition it has been given by governmental units, and the number and proportion of speakers. Spanish has a unique history and, some argue, a unique status in the contemporary…
Audiovisual speech facilitates voice learning.
Sheffert, Sonya M; Olson, Elizabeth
2004-02-01
In this research, we investigated the effects of voice and face information on the perceptual learning of talkers and on long-term memory for spoken words. In the first phase, listeners were trained over several days to identify voices from words presented auditorily or audiovisually. The training data showed that visual information about speakers enhanced voice learning, revealing cross-modal connections in talker processing akin to those observed in speech processing. In the second phase, the listeners completed an auditory or audiovisual word recognition memory test in which equal numbers of words were spoken by familiar and unfamiliar talkers. The data showed that words presented by familiar talkers were more likely to be retrieved from episodic memory, regardless of modality. Together, these findings provide new information about the representational code underlying familiar talker recognition and the role of stimulus familiarity in episodic word recognition.
ERIC Educational Resources Information Center
Köroglu, Zehra; Tüm, Gülden
2017-01-01
This study has been conducted to evaluate the TM usage in the MA theses written by the native speakers (NSs) of English and the Turkish speakers (TSs) of English. The purpose is to compare the TM usage in the introduction, results and discussion, and conclusion sections by both groups' randomly selected MA theses in the field of ELT between the…
Liu, Danzheng; Shi, Lu-Feng
2013-06-01
This study established the performance-intensity function for Beijing and Taiwan Mandarin bisyllabic word recognition tests in noise in native speakers of Wu Chinese. Effects of the test dialect and listeners' first language on psychometric variables (i.e., slope and 50%-correct threshold) were analyzed. Thirty-two normal-hearing Wu-speaking adults who used Mandarin since early childhood were compared to 16 native Mandarin-speaking adults. Both Beijing and Taiwan bisyllabic word recognition tests were presented at 8 signal-to-noise ratios (SNRs) in 4-dB steps (-12 dB to +16 dB). At each SNR, a half list (25 words) was presented in speech-spectrum noise to listeners' right ear. The order of the test, SNR, and half list was randomized across listeners. Listeners responded orally and in writing. Overall, the Wu-speaking listeners performed comparably to the Mandarin-speaking listeners on both tests. Compared to the Taiwan test, the Beijing test yielded a significantly lower threshold for both the Mandarin- and Wu-speaking listeners, as well as a significantly steeper slope for the Wu-speaking listeners. Both Mandarin tests can be used to evaluate Wu-speaking listeners. Of the 2, the Taiwan Mandarin test results in more comparable functions across listener groups. Differences in the performance-intensity function between listener groups and between tests indicate a first language and dialectal effect, respectively.
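The two psychometric variables analyzed above (the slope and the 50%-correct threshold) come from fitting a performance-intensity function to the percent-correct scores measured at the eight SNRs. As a rough illustration, the sketch below fits a logistic function with SciPy; the data points, the logistic form, and the fitting routine are stand-ins, not the study's actual analysis.

```python
# A hedged sketch of deriving the slope and 50%-correct threshold by
# fitting a logistic psychometric function to invented percent-correct
# scores at the eight SNRs used in the study.
import numpy as np
from scipy.optimize import curve_fit

snr = np.arange(-12, 17, 4)             # -12 to +16 dB SNR in 4-dB steps
pct_correct = np.array([5, 15, 35, 60, 80, 92, 97, 99], dtype=float)

def logistic(x, midpoint, slope):
    """Percent correct as a logistic function of SNR."""
    return 100.0 / (1.0 + np.exp(-slope * (x - midpoint)))

(threshold_50, slope), _ = curve_fit(logistic, snr, pct_correct, p0=[0.0, 0.5])
# the derivative of the logistic at its midpoint is 100*slope/4 %/dB
print(f"50%-correct threshold: {threshold_50:.1f} dB SNR")
print(f"slope at threshold:   {100 * slope / 4:.1f} %/dB")
```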
Working memory capacity may influence perceived effort during aided speech recognition in noise.
Rudner, Mary; Lunner, Thomas; Behrens, Thomas; Thorén, Elisabet Sundewall; Rönnberg, Jerker
2012-09-01
Recently there has been interest in using subjective ratings as a measure of perceived effort during speech recognition in noise. Perceived effort may be an indicator of cognitive load. Thus, subjective effort ratings during speech recognition in noise may covary both with signal-to-noise ratio (SNR) and individual cognitive capacity. The present study investigated the relation between subjective ratings of the effort involved in listening to speech in noise, speech recognition performance, and individual working memory (WM) capacity in hearing-impaired hearing aid users. In two experiments, participants with hearing loss rated perceived effort during aided speech perception in noise. Noise type and SNR were manipulated in both experiments, and in the second experiment hearing aid compression release settings were also manipulated. Speech recognition performance was measured along with WM capacity. There were 46 participants in all with bilateral mild to moderate sloping hearing loss. In Experiment 1 there were 16 native Danish speakers (eight women and eight men) with a mean age of 63.5 yr (SD = 12.1) and average pure tone (PT) threshold of 47.6 dB (SD = 9.8). In Experiment 2 there were 30 native Swedish speakers (19 women and 11 men) with a mean age of 70 yr (SD = 7.8) and average PT threshold of 45.8 dB (SD = 6.6). A visual analog scale (VAS) was used for effort rating in both experiments. In Experiment 1, effort was rated at individually adapted SNRs while in Experiment 2 it was rated at fixed SNRs. Speech recognition in noise performance was measured using adaptive procedures in both experiments with Dantale II sentences in Experiment 1 and Hagerman sentences in Experiment 2. WM capacity was measured using a letter-monitoring task in Experiment 1 and the reading span task in Experiment 2. In both experiments, there was a strong and significant relation between rated effort and SNR that was independent of individual WM capacity, whereas the relation between rated effort and noise type seemed to be influenced by individual WM capacity. Experiment 2 showed that hearing aid compression setting influenced rated effort. Subjective ratings of the effort involved in speech recognition in noise reflect SNRs, and individual cognitive capacity seems to influence relative rating of noise type. American Academy of Audiology.
2004-11-01
this paper we describe the systems developed by MITLL and used in DARPA EARS Rich Transcription Fall 2004 (RT-04F) speaker diarization evaluation...many types of audio sources, the focus of the DARPA EARS project and the NIST Rich Transcription evaluations is primarily speaker diarization...present or samples of any of the speakers. An overview of the general diarization problem and approaches can be found in [1]. In this paper, we
Recognizing speech in a novel accent: the motor theory of speech perception reframed.
Moulin-Frier, Clément; Arbib, Michael A
2013-08-01
The motor theory of speech perception holds that we perceive the speech of another in terms of a motor representation of that speech. However, when we have learned to recognize a foreign accent, it seems plausible that recognition of a word rarely involves reconstruction of the speech gestures of the speaker rather than the listener. To better assess the motor theory and this observation, we proceed in three stages. Part 1 places the motor theory of speech perception in a larger framework based on our earlier models of the adaptive formation of mirror neurons for grasping, and for viewing extensions of that mirror system as part of a larger system for neuro-linguistic processing, augmented by the present consideration of recognizing speech in a novel accent. Part 2 then offers a novel computational model of how a listener comes to understand the speech of someone speaking the listener's native language with a foreign accent. The core tenet of the model is that the listener uses hypotheses about the word the speaker is currently uttering to update probabilities linking the sound produced by the speaker to phonemes in the native language repertoire of the listener. This, on average, improves the recognition of later words. This model is neutral regarding the nature of the representations it uses (motor vs. auditory). It serves as a reference point for the discussion in Part 3, which proposes a dual-stream neuro-linguistic architecture to revisit claims for and against the motor theory of speech perception and the relevance of mirror neurons, and extracts some implications for the reframing of the motor theory.
Landwehr, Markus; Fürstenberg, Dirk; Walger, Martin; von Wedel, Hasso; Meister, Hartmut
2014-01-01
Advances in speech coding strategies and electrode array designs for cochlear implants (CIs) predominantly aim at improving speech perception. Current efforts are also directed at transmitting appropriate cues of the fundamental frequency (F0) to the auditory nerve with respect to speech quality, prosody, and music perception. The aim of this study was to examine the effects of various electrode configurations and coding strategies on speech intonation identification, speaker gender identification, and music quality rating. In six MED-EL CI users electrodes were selectively deactivated in order to simulate different insertion depths and inter-electrode distances when using the high definition continuous interleaved sampling (HDCIS) and fine structure processing (FSP) speech coding strategies. Identification of intonation and speaker gender was determined and music quality rating was assessed. For intonation identification HDCIS was robust against the different electrode configurations, whereas fine structure processing showed significantly worse results when a short electrode depth was simulated. In contrast, speaker gender recognition was not affected by electrode configuration or speech coding strategy. Music quality rating was sensitive to electrode configuration. In conclusion, the three experiments revealed different outcomes, even though they all addressed the reception of F0 cues. Rapid changes in F0, as seen with intonation, were the most sensitive to electrode configurations and coding strategies. In contrast, electrode configurations and coding strategies did not show large effects when F0 information was available over a longer time period, as seen with speaker gender. Music quality relies on additional spectral cues other than F0, and was poorest when a shallow insertion was simulated.
Calandruccio, Lauren; Zhou, Haibo
2014-01-01
Purpose To examine whether improved speech recognition during linguistically mismatched target–masker experiments is due to linguistic unfamiliarity of the masker speech or linguistic dissimilarity between the target and masker speech. Method Monolingual English speakers (n = 20) and English–Greek simultaneous bilinguals (n = 20) listened to English sentences in the presence of competing English and Greek speech. Data were analyzed using mixed-effects regression models to determine differences in English recognition performance between the 2 groups and 2 masker conditions. Results Results indicated that English sentence recognition for monolinguals and simultaneous English–Greek bilinguals improved when the masker speech changed from competing English to competing Greek speech. Conclusion The improvement in speech recognition that has been observed for linguistically mismatched target–masker experiments cannot be simply explained by the masker language being linguistically unknown or unfamiliar to the listeners. Listeners can improve their speech recognition in linguistically mismatched target–masker experiments even when the listener is able to obtain meaningful linguistic information from the masker speech. PMID:24167230
Agarwalla, Swapna; Sarma, Kandarpa Kumar
2016-06-01
Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human Computer Interaction (HCI). With assimilation of emerging concepts like big data and Internet of Things (IoT) as extended elements of HCI, ASR techniques are found to be passing through a paradigm shift. Of late, learning-based techniques have started to receive greater attention from research communities related to ASR owing to the fact that the former possess a natural ability to mimic biological behavior and in that way aid ASR modeling and processing. The current learning-based ASR techniques are found to be evolving further with incorporation of big data and IoT-like concepts. Here, in this paper, we report certain approaches based on machine learning (ML) used for extraction of relevant samples from big data space and apply them for ASR using certain soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms using raw speech, extracted features and frequency domain forms are considered. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, from a large storage, relevant samples are selected and assimilated. Next, a few conventional methods are used for feature extraction of a few selected types. The features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time. It is found that the proposed ML-based sentence extraction techniques and the composite feature set used with RNN as classifier outperform all other approaches. By using ANN in FF form as feature extractor, the performance of the system is evaluated and a comparison is made. Experimental results show that the application of big data samples has enhanced the learning of the ASR system. Further, the ANN-based sample and feature extraction techniques are found to be efficient enough to enable application of ML techniques in big data aspects as part of ASR systems. Copyright © 2015 Elsevier Ltd. All rights reserved.
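As a concrete, hedged illustration of the classifier component described above, the sketch below trains a small scikit-learn MLP on composite (spectral plus prosodic) feature vectors to separate dialect classes. All names, sizes, and the random feature values are placeholders, not the paper's Assamese data or architecture.

```python
# Illustrative sketch only: an MLP classifier over composite
# (spectral + prosodic) feature vectors for dialect classification.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_utts, n_spectral, n_prosodic = 600, 39, 6
X = rng.normal(size=(n_utts, n_spectral + n_prosodic))  # composite feature set
y = rng.integers(0, 4, size=n_utts)                     # four dialect classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=1)
mlp.fit(X_tr, y_tr)
print(f"held-out accuracy: {mlp.score(X_te, y_te):.2f}")  # ~chance on random data
```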
Chinese Attitudes towards Varieties of English: A Pre-Olympic Examination
ERIC Educational Resources Information Center
Xu, Wei; Wang, Yu; Case, Rod E.
2010-01-01
This study reports on findings of an investigation into Chinese students' attitudes towards varieties of English before the 2008 Beijing Olympic Games. One hundred and eight college students in mainland China evaluated six English speeches by two American English speakers, two British English speakers, and two Chinese English speakers for social…
Native- and Non-Native Speaking English Teachers in Vietnam: Weighing the Benefits
ERIC Educational Resources Information Center
Walkinshaw, Ian; Duong, Oanh Thi Hoang
2012-01-01
This paper examines a common belief that learners of English as a foreign language prefer to learn English from native-speaker teachers rather than non-native speakers of English. 50 Vietnamese learners of English evaluated the importance of native-speakerness compared with seven qualities valued in an English language teacher: teaching…
Deep bottleneck features for spoken language identification.
Jiang, Bing; Song, Yan; Wei, Si; Liu, Jun-Hua; McLoughlin, Ian Vince; Dai, Li-Rong
2014-01-01
A key problem in spoken language identification (LID) is to design effective representations which are specific to language information. For example, in recent years, representations based on both phonotactic and acoustic features have proven their effectiveness for LID. Although advances in machine learning have led to significant improvements, LID performance is still lacking, especially for short duration speech utterances. With the hypothesis that language information is weak and represented only latently in speech, and is largely dependent on the statistical properties of the speech content, existing representations may be insufficient. Furthermore they may be susceptible to the variations caused by different speakers, specific content of the speech segments, and background noise. To address this, we propose using Deep Bottleneck Features (DBF) for spoken LID, motivated by the success of Deep Neural Networks (DNN) in speech recognition. We show that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability. To evaluate the effectiveness of this, we design two acoustic models, termed DBF-TV and parallel DBF-TV (PDBF-TV), using a DBF based i-vector representation for each speech utterance. Results on NIST language recognition evaluation 2009 (LRE09) show significant improvements over state-of-the-art systems. By fusing the output of phonotactic and acoustic approaches, we achieve an EER of 1.08%, 1.89% and 7.01% for 30 s, 10 s and 3 s test utterances respectively. Furthermore, various DBF configurations have been extensively evaluated, and an optimal system proposed.
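The bottleneck idea lends itself to a short sketch: train a DNN with a narrow middle layer on frame-level targets, then discard the back-end and keep the bottleneck activations as the compact representation. The PyTorch code below is a minimal illustration under assumed layer sizes, input dimension, and targets; it is not the authors' exact configuration.

```python
# A minimal sketch of deep bottleneck feature (DBF) extraction: the narrow
# middle layer of a trained DNN yields a compact frame-level representation.
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, n_in=440, n_hidden=1024, n_bottleneck=43, n_targets=3000):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bottleneck),       # narrow bottleneck layer
        )
        self.back = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(n_bottleneck, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_targets),          # e.g. phone-state posteriors
        )

    def forward(self, x):
        return self.back(self.front(x))

    def extract_dbf(self, x):
        """After training, discard the back-end and keep bottleneck outputs."""
        with torch.no_grad():
            return self.front(x)

frames = torch.randn(16, 440)              # 16 stacked-context feature frames
dbf = BottleneckDNN().extract_dbf(frames)  # (16, 43) compact representation
print(dbf.shape)
```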
Lee, Kichol; Casali, John G
2016-01-01
To investigate the effect of controlled low-speed wind noise on the auditory situation awareness performance afforded by military hearing protection/enhancement devices (HPED) and tactical communication and protective systems (TCAPS). Recognition/identification and pass-through communications tasks were separately conducted under three wind conditions (0, 5, and 10 mph). Subjects wore two in-ear-type TCAPS, one earmuff-type TCAPS, a Combat Arms Earplug in its 'open' or pass-through setting, and an EB-15LE electronic earplug. Devices with electronic gain systems were tested under two gain settings: 'unity' and 'max'. Testing without any device (open ear) was conducted as a control. Ten subjects were recruited from the student population at Virginia Tech. Audiometric requirements were 25 dBHL or better at 500, 1000, 2000, 4000, and 8000 Hz in both ears. Performance on the interaction of communication task-by-device was significantly different only at the 0 mph wind speed. The between-device performance differences varied with azimuthal speaker locations. It is evident from this study that stable (non-gusting) wind speeds up to 10 mph did not significantly degrade recognition/identification task performance and pass-through communication performance of the group of HPEDs and TCAPS tested. However, the various devices performed differently as the test sound signal speaker location was varied, and it appears that physical as well as electronic features may have contributed to this directional result.
Speech information retrieval: a review
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hafen, Ryan P.; Henry, Michael J.
Audio is an information-rich component of multimedia. Information can be extracted from audio in a number of different ways, and thus there are several established audio signal analysis research fields. These fields include speech recognition, speaker recognition, audio segmentation and classification, and audio fingerprinting. The information that can be extracted from tools and methods developed in these fields can greatly enhance multimedia systems. In this paper, we present the current state of research in each of the major audio analysis fields. The goal is to introduce enough background for someone new in the field to quickly gain high-level understanding and to provide direction for further study.
Development of a Low-Cost, Noninvasive, Portable Visual Speech Recognition Program.
Kohlberg, Gavriel D; Gal, Ya'akov Kobi; Lalwani, Anil K
2016-09-01
Loss of speech following tracheostomy and laryngectomy severely limits communication to simple gestures and facial expressions that are largely ineffective. To facilitate communication in these patients, we seek to develop a low-cost, noninvasive, portable, and simple visual speech recognition program (VSRP) to convert articulatory facial movements into speech. A Microsoft Kinect-based VSRP was developed to capture spatial coordinates of lip movements and translate them into speech. The articulatory speech movements associated with 12 sentences were used to train an artificial neural network classifier. The accuracy of the classifier was then evaluated on a separate, previously unseen set of articulatory speech movements. The VSRP was successfully implemented and tested in 5 subjects. It achieved an accuracy rate of 77.2% (65.0%-87.6% for the 5 speakers) on a 12-sentence data set. The mean time to classify an individual sentence was 2.03 milliseconds (1.91-2.16). We have demonstrated the feasibility of a low-cost, noninvasive, portable VSRP based on Kinect to accurately predict speech from articulation movements in clinically trivial time. This VSRP could be used as a novel communication device for aphonic patients. © The Author(s) 2016.
ERIC Educational Resources Information Center
Nickerson, Catherine
2015-01-01
The impact of globalisation in the last 20 years has led to an overwhelming increase in the use of English as the medium through which many business people get their work done. As a result, the linguistic landscape within which we now operate as researchers and teachers has changed both rapidly and beyond all recognition. In the discussion below,…
Limited connected speech experiment
NASA Astrophysics Data System (ADS)
Landell, P. B.
1983-03-01
The purpose of this contract was to demonstrate that connected speech recognition (CSR) can be performed in real time on a vocabulary of one hundred words and to test the performance of the CSR system for twenty-five male and twenty-five female speakers. This report describes the contractor's real-time laboratory CSR system, the database and training software developed in accordance with the contract, and the results of the performance tests.
Combining Multiple Knowledge Sources for Speech Recognition
1988-09-15
Thus, the first is the (to clarify the pronunciation: TASSEAJ for the acronym TASA!) best adaptation sentence, the second sentence, when added...10 rapid adaptation sentences, and 15 spell-mode phrases. 6101 resource management SPEAKER-DEPENDENT DATABASE sentences were randomly...combining the smoothed phoneme models with the detailed context models. BYBLOS makes maximal use
ERIC Educational Resources Information Center
Ashwell, Tim; Elam, Jesse R.
2017-01-01
The ultimate aim of our research project was to use the Google Web Speech API to automate scoring of elicited imitation (EI) tests. However, in order to achieve this goal, we had to take a number of preparatory steps. We needed to assess how accurate this speech recognition tool is in recognizing native speakers' production of the test items; we…
Gisting Technique Development.
1981-12-01
furnished tapes (" Stonehenge " database) which were used for previous contracts. Recognition results for English male and female speakers are presented in...independent " Stonehenge " test data. A variety of options in generating word arrays were tried; the results below describe the most successful of these. The...time to carry out any quantitative tests, ............. Page 22 even the obvious one of retraining the " Stonehenge " English vocabulary on-line, we
Crossmodal plasticity in the fusiform gyrus of late blind individuals during voice recognition.
Hölig, Cordula; Föcker, Julia; Best, Anna; Röder, Brigitte; Büchel, Christian
2014-12-01
Blind individuals are trained in identifying other people through voices. In congenitally blind adults the anterior fusiform gyrus has been shown to be active during voice recognition. Such crossmodal changes have been associated with a superiority of blind adults in voice perception. The key question of the present functional magnetic resonance imaging (fMRI) study was whether visual deprivation that occurs in adulthood is followed by similar adaptive changes of the voice identification system. Late blind individuals and matched sighted participants were tested in a priming paradigm, in which two voice stimuli were subsequently presented. The prime (S1) and the target (S2) were either from the same speaker (person-congruent voices) or from two different speakers (person-incongruent voices). Participants had to classify the S2 as either coming from an old or a young person. Only in late blind but not in matched sighted controls, the activation in the anterior fusiform gyrus was modulated by voice identity: late blind volunteers showed an increase of the BOLD signal in response to person-incongruent compared with person-congruent trials. These results suggest that the fusiform gyrus adapts to input of a new modality even in the mature brain and thus demonstrate an adult type of crossmodal plasticity. Copyright © 2014 Elsevier Inc. All rights reserved.
Revisiting Speech Rate and Utterance Length Manipulations in Stuttering Speakers
ERIC Educational Resources Information Center
Blomgren, Michael; Goberman, Alexander M.
2008-01-01
The goal of this study was to evaluate stuttering frequency across a multidimensional (2 x 2) hierarchy of speech performance tasks. Specifically, this study examined the interaction between changes in length of utterance and levels of speech rate stability. Forty-four adult male speakers participated in the study (22 stuttering speakers and 22…
Gardiner, John M; Gregg, Vernon H; Karayianni, Irene
2006-03-01
We report four experiments in which a remember-know paradigm was combined with a response deadline procedure in order to assess memory awareness in fast, as compared with slow, recognition judgments. In the experiments, we also investigated the perceptual effects of study-test congruence, either for picture size or for speaker's voice, following either full or divided attention at study. These perceptual effects occurred in remembering with full attention and in knowing with divided attention, but they were uninfluenced by recognition speed, indicating that their occurrence in remembering or knowing depends more on conscious resources at encoding than on those at retrieval. The results have implications for theoretical accounts of remembering and knowing that assume that remembering is more consciously controlled and effortful, whereas knowing is more automatic and faster.
Li, Wenbo; Zhao, Sheng; Wu, Nan; Zhong, Junwen; Wang, Bo; Lin, Shizhe; Chen, Shuwen; Yuan, Fang; Jiang, Hulin; Xiao, Yongjun; Hu, Bin; Zhou, Jun
2017-07-19
Wearable active sensors have extensive applications in mobile biosensing and human-machine interaction but require good flexibility, high sensitivity, excellent stability, and self-powered feature. In this work, cellular polypropylene (PP) piezoelectret was chosen as the core material of a sensitivity-enhanced wearable active voiceprint sensor (SWAVS) to realize voiceprint recognition. By virtue of the dipole orientation control method, the air layers in the piezoelectret were efficiently utilized, and the current sensitivity was enhanced (from 1.98 pA/Hz to 5.81 pA/Hz at 115 dB). The SWAVS exhibited the superiorities of high sensitivity, accurate frequency response, and excellent stability. The voiceprint recognition system could make correct reactions to human voices by judging both the password and speaker. This study presented a voiceprint sensor with potential applications in noncontact biometric recognition and safety guarantee systems, promoting the progress of wearable sensor networks.
NASA Astrophysics Data System (ADS)
Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.
2009-12-01
This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.
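As a hedged sketch of the GMM pattern-recognition step described above, the code below fits one Gaussian mixture to spectral frames from healthy speakers and one to frames from apnoea patients, then classifies a test speaker by the average log-likelihood ratio. The data are synthetic stand-ins for the clinical corpus, and the two-class LLR setup is an assumption about the general approach, not the authors' exact system.

```python
# A sketch of GMM-based two-class detection on spectral frames;
# all data here are synthetic stand-ins for the clinical speech corpus.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
healthy_frames = rng.normal(0.0, 1.0, size=(2000, 13))  # e.g. cepstral frames
apnoea_frames = rng.normal(0.5, 1.2, size=(2000, 13))

gmm_healthy = GaussianMixture(n_components=8, random_state=2).fit(healthy_frames)
gmm_apnoea = GaussianMixture(n_components=8, random_state=2).fit(apnoea_frames)

test_frames = rng.normal(0.5, 1.2, size=(300, 13))      # frames from one speaker
# score() returns the mean per-frame log-likelihood under each model
llr = gmm_apnoea.score(test_frames) - gmm_healthy.score(test_frames)
print("severe apnoea" if llr > 0.0 else "healthy", f"(mean LLR = {llr:.2f})")
```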
Children's Sociolinguistic Evaluations of Nice Foreigners and Mean Americans
ERIC Educational Resources Information Center
Kinzler, Katherine D.; DeJesus, Jasmine M.
2013-01-01
Three experiments investigated 5- to 6-year-old monolingual English-speaking American children's sociolinguistic evaluations of others based on their accent (native, foreign) and social actions (nice, mean, neutral). In Experiment 1, children expressed social preferences for native-accented English speakers over foreign-accented speakers, and they…
Shin, Young Hoon; Seo, Jiwon
2016-01-01
People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing. PMID:27801867
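To make the role of the speech activity detector concrete, here is an illustrative sketch (not the authors' algorithm) that gates recognition on windows where motion-signal energy exceeds a noise-derived threshold; the synthetic "radar" signal, window length, and threshold factor are all assumptions.

```python
# Illustrative energy-gated speech activity detection on a synthetic
# motion signal; recognition would run only on the flagged windows.
import numpy as np

rng = np.random.default_rng(5)
fs = 200                                       # motion-signal samples per second
noise = rng.normal(scale=0.05, size=5 * fs)
motion = np.concatenate([noise[:2 * fs],
                         noise[2 * fs:3 * fs] + np.sin(np.linspace(0, 30, fs)),
                         noise[3 * fs:]])      # 1 s of "speech" motion in 5 s

win = fs // 10                                 # 100-ms analysis windows
energy = np.array([np.mean(motion[i:i + win] ** 2)
                   for i in range(0, len(motion) - win, win)])
threshold = 3.0 * np.median(energy)            # noise-floor-relative threshold
active = energy > threshold
print("speech windows:", np.flatnonzero(active))
```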
Analysis of wolves and sheep. Final report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hogden, J.; Papcun, G.; Zlokarnik, I.
1997-08-01
In evaluating speaker verification systems, asymmetries have been observed in the ease with which people are able to break into other people's voice locks. People who are good at breaking into voice locks are called wolves, and people whose locks are easy to break into are called sheep. (Goats are people who have a difficult time opening their own voice locks.) Analyses of speaker verification algorithms could be used to understand wolf/sheep asymmetries. Using the notion of a "speaker space", it is demonstrated that such asymmetries could arise even though the similarity of voice 1 to voice 2 is the same as the inverse similarity. This partially explains the wolf/sheep asymmetries, although there may be other factors. The speaker space can be computed from interspeaker similarity data using multidimensional scaling, and such a speaker space can be used to give a good approximation of the interspeaker similarities. The derived speaker space can be used to predict which of the enrolled speakers are likely to be wolves and which are likely to be sheep. However, a speaker must first enroll in the speaker key system and then be compared to each of the other speakers; a good estimate of a person's speaker space position could be obtained using only a speech sample.
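The speaker-space construction described above can be sketched briefly: multidimensional scaling turns an interspeaker dissimilarity matrix into low-dimensional coordinates, and a speaker's position can then be used to flag likely wolves and sheep. The code below is a rough illustration with invented similarities; the symmetrization and the mean-distance heuristic are assumptions, not the report's procedure.

```python
# A rough sketch of a "speaker space" from interspeaker similarities
# via multidimensional scaling; the similarity values are invented.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
n_speakers = 12
sim = rng.uniform(0.2, 1.0, size=(n_speakers, n_speakers))
sim = (sim + sim.T) / 2          # symmetrize (real data need not be symmetric)
np.fill_diagonal(sim, 1.0)
dissim = 1.0 - sim               # MDS expects dissimilarities

space = MDS(n_components=2, dissimilarity="precomputed", random_state=3)
coords = space.fit_transform(dissim)

# Heuristic: a speaker's mean distance to all others indexes how easily
# other voices fall near theirs (sheep-like) or far from everyone (outlier).
mean_dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1).mean(axis=1)
print("most sheep-like speaker:", int(np.argmin(mean_dist)))
```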
Pinheiro, Ana P; Rezaii, Neguine; Nestor, Paul G; Rauber, Andréia; Spencer, Kevin M; Niznikiewicz, Margaret
2016-02-01
During speech comprehension, multiple cues need to be integrated at a millisecond speed, including semantic information, as well as voice identity and affect cues. A processing advantage has been demonstrated for self-related stimuli when compared with non-self stimuli, and for emotional relative to neutral stimuli. However, very few studies investigated self-other speech discrimination and, in particular, how emotional valence and voice identity interactively modulate speech processing. In the present study we probed how the processing of words' semantic valence is modulated by speaker's identity (self vs. non-self voice). Sixteen healthy subjects listened to 420 prerecorded adjectives differing in voice identity (self vs. non-self) and semantic valence (neutral, positive and negative), while electroencephalographic data were recorded. Participants were instructed to decide whether the speech they heard was their own (self-speech condition), someone else's (non-self speech), or if they were unsure. The ERP results demonstrated interactive effects of speaker's identity and emotional valence on both early (N1, P2) and late (Late Positive Potential - LPP) processing stages: compared with non-self speech, self-speech with neutral valence elicited more negative N1 amplitude, self-speech with positive valence elicited more positive P2 amplitude, and self-speech with both positive and negative valence elicited more positive LPP. ERP differences between self and non-self speech occurred in spite of similar accuracy in the recognition of both types of stimuli. Together, these findings suggest that emotion and speaker's identity interact during speech processing, in line with observations of partially dependent processing of speech and speaker information. Copyright © 2016. Published by Elsevier Inc.
Congenital Amusia in Speakers of a Tone Language: Association with Lexical Tone Agnosia
ERIC Educational Resources Information Center
Nan, Yun; Sun, Yanan; Peretz, Isabelle
2010-01-01
Congenital amusia is a neurogenetic disorder that affects the processing of musical pitch in speakers of non-tonal languages like English and French. We assessed whether this musical disorder exists among speakers of Mandarin Chinese who use pitch to alter the meaning of words. Using the Montreal Battery of Evaluation of Amusia, we tested 117…
You had me at "Hello": Rapid extraction of dialect information from spoken words.
Scharinger, Mathias; Monahan, Philip J; Idsardi, William J
2011-06-15
Research on the neuronal underpinnings of speaker identity recognition has identified voice-selective areas in the human brain with evolutionary homologues in non-human primates who have comparable areas for processing species-specific calls. Most studies have focused on estimating the extent and location of these areas. In contrast, relatively few experiments have investigated the time-course of speaker identity, and in particular, dialect processing and identification by electro- or neuromagnetic means. We show here that dialect extraction occurs speaker-independently, pre-attentively and categorically. We used Standard American English and African-American English exemplars of 'Hello' in a magnetoencephalographic (MEG) Mismatch Negativity (MMN) experiment. The MMN as an automatic change detection response of the brain reflected dialect differences that were not entirely reducible to acoustic differences between the pronunciations of 'Hello'. Source analyses of the M100, an auditory evoked response to the vowels suggested additional processing in voice-selective areas whenever a dialect change was detected. These findings are not only relevant for the cognitive neuroscience of language, but also for the social sciences concerned with dialect and race perception. Copyright © 2011 Elsevier Inc. All rights reserved.
ERIC Educational Resources Information Center
Houston, Thomas Rappe, Jr.
A homophone is a word having the same pronunciation as another English word, but a different spelling. A list of 7,300 English homophones was compiled and used to construct two tests. Scores were obtained in these and on reference tests for J. P. Guilford's factors CMU, CSU, DMU, and DSU for 70 native speakers of midwestern American English from a…
ERIC Educational Resources Information Center
Vainio, Seppo; Anneli, Pajunen; Hyona, Jukka
2014-01-01
This study investigated the effect of the first language (L1) on the visual word recognition of inflected nouns in second language (L2) Finnish by native Russian and Chinese speakers. Case inflection is common in Russian and in Finnish but nonexistent in Chinese. Several models have been posited to describe L2 morphological processing. The unified…
Scalable Learning for Geostatistics and Speaker Recognition
2011-01-01
of prior knowledge of the model or due to improved robustness requirements). Both these methods have their own advantages and disadvantages. The use...application. If the data is well-correlated and low-dimensional, any prior knowledge available on the data can be used to build a parametric model. In the...absence of prior knowledge , non-parametric methods can be used. If the data is high-dimensional, PCA based dimensionality reduction is often the first
[Characteristics, advantages, and limits of matrix tests].
Brand, T; Wagener, K C
2017-03-01
Deterioration of communication abilities due to hearing problems is particularly relevant in listening situations with noise. Therefore, speech intelligibility tests in noise are required for audiological diagnostics and evaluation of hearing rehabilitation. This study analyzed the characteristics of matrix tests assessing the 50% speech recognition threshold in noise. What are their advantages and limitations? Matrix tests are based on a matrix of 50 words (10 five-word sentences with the same grammatical structure). In the standard setting, 20 sentences are presented using an adaptive procedure estimating the individual 50% speech recognition threshold in noise. At present, matrix tests in 17 different languages are available, with a high degree of international comparability among them. The German-language matrix test (OLSA, male speaker) has a reference 50% speech recognition threshold of -7.1 (± 1.1) dB SNR. Before using a matrix test for the first time, the test person has to become familiar with the basic speech material using two training lists. Thereafter, matrix tests produce constant results even if repeated many times. Matrix tests are suitable for users of hearing aids and cochlear implants, particularly for assessment of benefit during the fitting process. Matrix tests can be performed in closed form and consequently with non-native listeners, even if the experimenter does not speak the test person's native language. Short versions of matrix tests are available for listeners with a shorter memory span, e.g., children.
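A simplified sketch of the adaptive track can make the procedure concrete: present 20 sentences, lower the SNR after a correct response and raise it after an incorrect one, so the track converges near the 50% point. In the simulation below the listener is modeled by a logistic with a -7.1 dB midpoint (the OLSA reference value); real matrix tests adapt on word scores with shrinking step sizes, so the fixed 2-dB one-up/one-down rule here is a deliberate simplification.

```python
# A simplified simulation of an adaptive 50% SRT track over 20 sentences.
import numpy as np

rng = np.random.default_rng(4)
true_srt, slope = -7.1, 0.8           # listener's hidden psychometric function
snr, step, track = 0.0, 2.0, []

for _ in range(20):                   # 20 test sentences
    p_correct = 1.0 / (1.0 + np.exp(-slope * (snr - true_srt)))
    correct = rng.random() < p_correct
    track.append(snr)
    snr += -step if correct else step # one-up/one-down targets 50% correct

print(f"estimated SRT: {np.mean(track[8:]):.1f} dB SNR (true {true_srt} dB)")
```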
Mühler, Roland; Ziese, Michael; Rostalski, Dorothea
2009-01-01
The purpose of the study was to develop a speaker discrimination test for cochlear implant (CI) users. The speech material was drawn from the Oldenburg Logatome (OLLO) corpus, which contains 150 different logatomes read by 40 German and 10 French native speakers. The prototype test battery included 120 logatome pairs spoken by 5 male and 5 female speakers with balanced representations of the conditions 'same speaker' and 'different speaker'. Ten adult normal-hearing listeners and 12 adult postlingually deafened CI users were included in a study to evaluate the suitability of the test. The mean speaker discrimination score for the CI users was 67.3% correct and for the normal-hearing listeners 92.2% correct. A significant influence of voice gender and fundamental frequency difference on the speaker discrimination score was found in CI users as well as in normal-hearing listeners. Since the test results of the CI users were significantly above chance level and no ceiling effect was observed, we conclude that subsets of the OLLO corpus are very well suited to speaker discrimination experiments in CI users. Copyright 2008 S. Karger AG, Basel.
Sensory Intelligence for Extraction of an Abstract Auditory Rule: A Cross-Linguistic Study.
Guo, Xiao-Tao; Wang, Xiao-Dong; Liang, Xiu-Yuan; Wang, Ming; Chen, Lin
2018-02-21
In a complex linguistic environment, while speech sounds can greatly vary, some shared features are often invariant. These invariant features constitute so-called abstract auditory rules. Our previous study has shown that with auditory sensory intelligence, the human brain can automatically extract the abstract auditory rules in the speech sound stream, presumably serving as the neural basis for speech comprehension. However, whether the sensory intelligence for extraction of abstract auditory rules in speech is inherent or experience-dependent remains unclear. To address this issue, we constructed a complex speech sound stream using auditory materials in Mandarin Chinese, in which syllables had a flat lexical tone but differed in other acoustic features to form an abstract auditory rule. This rule was occasionally and randomly violated by the syllables with the rising, dipping or falling tone. We found that both Chinese and foreign speakers detected the violations of the abstract auditory rule in the speech sound stream at a pre-attentive stage, as revealed by the whole-head recordings of mismatch negativity (MMN) in a passive paradigm. However, MMNs peaked earlier in Chinese speakers than in foreign speakers. Furthermore, Chinese speakers showed different MMN peak latencies for the three deviant types, which paralleled recognition points. These findings indicate that the sensory intelligence for extraction of abstract auditory rules in speech sounds is innate but shaped by language experience. Copyright © 2018 IBRO. Published by Elsevier Ltd. All rights reserved.
Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
Holzrichter, J.F.; Ng, L.C.
1998-03-17
The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching. 35 figs.
Multifunctional microcontrollable interface module
NASA Astrophysics Data System (ADS)
Spitzer, Mark B.; Zavracky, Paul M.; Rensing, Noa M.; Crawford, J.; Hockman, Angela H.; Aquilino, P. D.; Girolamo, Henry J.
2001-08-01
This paper reports the development of a complete eyeglass- mounted computer interface system including display, camera and audio subsystems. The display system provides an SVGA image with a 20 degree horizontal field of view. The camera system has been optimized for face recognition and provides a 19 degree horizontal field of view. A microphone and built-in pre-amp optimized for voice recognition and a speaker on an articulated arm are included for audio. An important feature of the system is a high degree of adjustability and reconfigurability. The system has been developed for testing by the Military Police, in a complete system comprising the eyeglass-mounted interface, a wearable computer, and an RF link. Details of the design, construction, and performance of the eyeglass-based system are discussed.
On how the brain decodes vocal cues about speaker confidence.
Jiang, Xiaoming; Pell, Marc D
2015-05-01
In speech communication, listeners must accurately decode vocal cues that refer to the speaker's mental state, such as their confidence or 'feeling of knowing'. However, the time course and neural mechanisms associated with online inferences about speaker confidence are unclear. Here, we used event-related potentials (ERPs) to examine the temporal neural dynamics underlying a listener's ability to infer speaker confidence from vocal cues during speech processing. We recorded listeners' real-time brain responses while they evaluated statements wherein the speaker's tone of voice conveyed one of three levels of confidence (confident, close-to-confident, unconfident) or were spoken in a neutral manner. Neural responses time-locked to event onset show that the perceived level of speaker confidence could be differentiated at distinct time points during speech processing: unconfident expressions elicited a weaker P2 than all other expressions of confidence (or neutral-intending utterances), whereas close-to-confident expressions elicited a reduced negative response in the 330-500 msec and 550-740 msec time window. Neutral-intending expressions, which were also perceived as relatively confident, elicited a more delayed, larger sustained positivity than all other expressions in the 980-1270 msec window for this task. These findings provide the first piece of evidence of how quickly the brain responds to vocal cues signifying the extent of a speaker's confidence during online speech comprehension; first, a rough dissociation between unconfident and confident voices occurs as early as 200 msec after speech onset. At a later stage, further differentiation of the exact level of speaker confidence (i.e., close-to-confident, very confident) is evaluated via an inferential system to determine the speaker's meaning under current task settings. These findings extend three-stage models of how vocal emotion cues are processed in speech comprehension (e.g., Schirmer & Kotz, 2006) by revealing how a speaker's mental state (i.e., feeling of knowing) is simultaneously inferred from vocal expressions. Copyright © 2015 Elsevier Ltd. All rights reserved.
NASA Technical Reports Server (NTRS)
Dillon, Christina
2013-01-01
The goal of this project was to design, model, build, and test a flat panel speaker and frame for a spherical dome structure being made into a simulator. The simulator will be a test bed for evaluating an immersive environment for human interfaces. This project focused on the loud speakers and a sound diffuser for the dome. The rest of the team worked on an Ambisonics 3D sound system, video projection system, and multi-direction treadmill to create the most realistic scene possible. The main programs utilized in this project were Pro-E and COMSOL. Pro-E was used for creating detailed figures for the fabrication of a frame that held a flat panel loud speaker. The loud speaker was made from a thin sheet of Plexiglas and 4 acoustic exciters. COMSOL, a multiphysics finite element analysis simulator, was used to model and evaluate all stages of the loud speaker, frame, and sound diffuser. Acoustical testing measurements were utilized to create polar plots from the working prototype, which were then compared to the COMSOL simulations to select the optimal design for the dome. The final goal of the project was to install the flat panel loud speaker design, in addition to a sound diffuser, onto the wall of the dome. After running tests in COMSOL on various speaker configurations, including a warped Plexiglas version, the optimal speaker design included a flat piece of Plexiglas with a rounded frame to match the curvature of the dome. Eight of these loud speakers will be mounted into an inch and a half of high performance acoustic insulation, or Thinsulate, that will cover the inside of the dome. The following technical paper discusses these projects and explains the engineering processes used, knowledge gained, and the projected future goals of this project.
Visual speech influences speech perception immediately but not automatically.
Mitterer, Holger; Reinisch, Eva
2017-02-01
Two experiments examined the time course of the use of auditory and visual speech cues to spoken word recognition using an eye-tracking paradigm. Results of the first experiment showed that the use of visual speech cues from lipreading is reduced if concurrently presented pictures require a division of attentional resources. This reduction was evident even when listeners' eye gaze was on the speaker rather than the (static) pictures. Experiment 2 used a deictic hand gesture to foster attention to the speaker. At the same time, the visual processing load was reduced by keeping the visual display constant over a fixed number of successive trials. Under these conditions, the visual speech cues from lipreading were used. Moreover, the eye-tracking data indicated that visual information was used immediately and even earlier than auditory information. In combination, these data indicate that visual speech cues are not used automatically, but if they are used, they are used immediately.
Neural networks to classify speaker independent isolated words recorded in radio car environments
NASA Astrophysics Data System (ADS)
Alippi, C.; Simeoni, M.; Torri, V.
1993-02-01
Many applications, in particular those requiring nonlinear signal processing, have proved artificial neural networks (ANNs) to be invaluable tools for model-free estimation. The classifying abilities of ANNs are addressed by testing their performance in a speaker-independent word recognition application. A real-world case requiring the implementation of compact integrated devices is taken into account: the classification of isolated words in a radio car environment. A multispeaker database of isolated words was recorded in different environments. Data were first processed to determine the boundaries of each word and then to extract speech features, the latter accomplished by using cepstral coefficient representations, log area ratios, and filter bank techniques. Multilayer perceptron and adaptive vector quantization neural paradigms were tested to find a reasonable compromise between performance and network simplicity, a fundamental requirement for the implementation of compact, real-time neural devices.
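As a rough illustration of the cepstral front end mentioned above, the sketch below computes real-cepstrum features per frame. The frame length, hop size, and number of coefficients are assumed values for illustration, not those of the original system.

    import numpy as np

    def cepstral_features(signal, frame_len=256, hop=128, n_ceps=12):
        """Real-cepstrum features per frame: c = IDFT(log |DFT(frame)|)."""
        window = np.hamming(frame_len)
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * window
            spectrum = np.abs(np.fft.rfft(frame)) + 1e-10
            cepstrum = np.fft.irfft(np.log(spectrum))
            frames.append(cepstrum[1:n_ceps + 1])   # drop c0 (energy term)
        return np.array(frames)

    # One second of toy signal at 8 kHz -> (num_frames, 12) feature matrix,
    # which could then feed a multilayer perceptron classifier.
    x = np.random.default_rng(1).standard_normal(8000)
    feats = cepstral_features(x)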
Experiments on Urdu Text Recognition
NASA Astrophysics Data System (ADS)
Mukhtar, Omar; Setlur, Srirangaraj; Govindaraju, Venu
Urdu is a language spoken in the Indian subcontinent by an estimated 130-270 million speakers. At the spoken level, Urdu and Hindi are considered dialects of a single language because of shared vocabulary and the similarity in grammar. At the written level, however, Urdu is much closer to Arabic because it is written in Nastaliq, the calligraphic style of the Persian-Arabic script. Therefore, a speaker of Hindi can understand spoken Urdu but may not be able to read written Urdu because Hindi is written in Devanagari script, whereas an Arabic writer can read the written words but may not understand the spoken Urdu. In this chapter we present an overview of written Urdu. Prior research in handwritten Urdu OCR is very limited. We present (perhaps) the first system for recognizing handwritten Urdu words. On a data set of about 1300 handwritten words, we achieved an accuracy of 70% for the top choice, and 82% for the top three choices.
Jürgens, Rebecca; Grass, Annika; Drolet, Matthis; Fischer, Julia
Both in the performative arts and in emotion research, professional actors are assumed to be capable of delivering emotions comparable to spontaneous emotional expressions. This study examines the effects of acting training on vocal emotion depiction and recognition. We predicted that professional actors express emotions in a more realistic fashion than non-professional actors. However, professional acting training may lead to a particular speech pattern; this might account for vocal expressions by actors that are less comparable to authentic samples than the ones by non-professional actors. We compared 80 emotional speech tokens from radio interviews with 80 re-enactments by professional and inexperienced actors, respectively. We analyzed recognition accuracies for emotion and authenticity ratings and compared the acoustic structure of the speech tokens. Both play-acted conditions yielded similar recognition accuracies and possessed more variable pitch contours than the spontaneous recordings. However, professional actors exhibited signs of different articulation patterns compared to non-trained speakers. Our results indicate that for emotion research, emotional expressions by professional actors are not better suited than those from non-actors.
Automatic voice recognition using traditional and artificial neural network approaches
NASA Technical Reports Server (NTRS)
Botros, Nazeih M.
1989-01-01
The main objective of this research is to develop an algorithm for isolated-word recognition. This research is focused on digital signal analysis rather than linguistic analysis of speech. Feature extraction is carried out by applying a Linear Predictive Coding (LPC) algorithm of order 10. Continuous-word and speaker-independent recognition will be considered in a future study after this isolated-word research is accomplished. To examine the similarity between the reference and the training sets, two approaches are explored. The first implements traditional pattern recognition techniques, where a dynamic time warping algorithm is applied to align the two sets and the probability of matching is calculated by measuring the Euclidean distance between them. The second implements a backpropagation artificial neural net model with three layers as the pattern classifier. The adaptation rule implemented in this network is the generalized least mean square (LMS) rule. The first approach has been accomplished. A vocabulary of 50 words was selected and tested. The accuracy of the algorithm was found to be around 85 percent. The second approach is in progress at the present time.
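A minimal sketch of the first approach, dynamic time warping with Euclidean frame distances, is given below. The function name and the simple step pattern (no slope constraints) are assumptions for illustration.

    import numpy as np

    def dtw_distance(ref, test):
        """Dynamic time warping cost between two feature sequences
        (rows = frames), using Euclidean frame distances."""
        n, m = len(ref), len(test)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(ref[i - 1] - test[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

Each unknown utterance is then assigned the label of the reference template with the smallest DTW cost.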
Segment-based acoustic models for continuous speech recognition
NASA Astrophysics Data System (ADS)
Ostendorf, Mari; Rohlicek, J. R.
1993-07-01
This research aims to develop new and more accurate stochastic models for speaker-independent continuous speech recognition, by extending previous work in segment-based modeling and by introducing a new hierarchical approach to representing intra-utterance statistical dependencies. These techniques, which are more costly than traditional approaches because of the large search space associated with higher order models, are made feasible through rescoring a set of HMM-generated N-best sentence hypotheses. We expect these different modeling techniques to result in improved recognition performance over that achieved by current systems, which handle only frame-based observations and assume that these observations are independent given an underlying state sequence. In the fourth quarter of the project, we have completed the following: (1) ported our recognition system to the Wall Street Journal task, a standard task in the ARPA community; (2) developed an initial dependency-tree model of intra-utterance observation correlation; and (3) implemented baseline language model estimation software. Our initial results on the Wall Street Journal task are quite good and represent significantly improved performance over most HMM systems reporting on the Nov. 1992 5k vocabulary test set.
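The N-best rescoring idea can be sketched in a few lines. The interpolation weight and function names below are illustrative assumptions, not the authors' formulation: each HMM-generated hypothesis is rescored by the more expensive segment model and the list is re-ranked.

    def rescore_nbest(hypotheses, segment_model_score, weight=0.5):
        """Re-rank HMM-generated N-best hypotheses with a second model.

        hypotheses: list of (sentence, hmm_log_score) pairs.
        segment_model_score: callable returning a log score for a sentence.
        """
        rescored = [
            (sent, (1 - weight) * hmm_score + weight * segment_model_score(sent))
            for sent, hmm_score in hypotheses
        ]
        return max(rescored, key=lambda pair: pair[1])[0]

    # Toy usage with a stand-in scoring function:
    hyps = [("the cat sat", -120.3), ("the cat sap", -119.8)]
    best = rescore_nbest(hyps, lambda s: -10.0 * s.count("sap"))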
Yildiz, Izzet B.; von Kriegstein, Katharina; Kiebel, Stefan J.
2013-01-01
Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents—an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments. PMID:24068902
Speaker normalization and adaptation using second-order connectionist networks.
Watrous, R L
1993-01-01
A method for speaker normalization and adaptation using connectionist networks is developed. A speaker-specific linear transformation of observations of the speech signal is computed using second-order network units. Classification is accomplished by a multilayer feedforward network that operates on the normalized speech data. The network is adapted for a new talker by modifying the transformation parameters while leaving the classifier fixed. This is accomplished by backpropagating classification error through the classifier to the second-order transformation units. This method was evaluated for the classification of ten vowels for 76 speakers using the first two formant values of the Peterson-Barney data. The results suggest that rapid speaker adaptation resulting in high classification accuracy can be accomplished by this method.
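A minimal PyTorch sketch of this adaptation scheme follows. It approximates the second-order transformation units with a plain linear layer, and the layer sizes, learning rate, and step count are assumptions; only the structure (frozen classifier, trainable speaker transform, error backpropagated through both) reflects the abstract.

    import torch
    import torch.nn as nn

    # Frozen classifier trained on reference speakers
    # (2 formant inputs, 10 vowel classes).
    classifier = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 10))
    for p in classifier.parameters():
        p.requires_grad = False

    transform = nn.Linear(2, 2)   # speaker-specific normalization
    optimizer = torch.optim.SGD(transform.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    def adapt(formants, vowel_labels, steps=100):
        """Backpropagate classification error through the frozen
        classifier into the transform parameters only."""
        for _ in range(steps):
            optimizer.zero_grad()
            logits = classifier(transform(formants))
            loss = loss_fn(logits, vowel_labels)
            loss.backward()
            optimizer.step()

    # Toy usage with random stand-ins for F1/F2 values and labels:
    adapt(torch.randn(50, 2), torch.randint(0, 10, (50,)))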
Parallel Processing of Large Scale Microphone Arrays for Sound Capture
NASA Astrophysics Data System (ADS)
Jan, Ea-Ee.
1995-01-01
Performance of microphone sound pick up is degraded by deleterious properties of the acoustic environment, such as multipath distortion (reverberation) and ambient noise. The degradation becomes more prominent in a teleconferencing environment in which the microphone is positioned far away from the speaker. Moreover, the ideal teleconference should feel as easy and natural as face-to-face communication with another person. This suggests hands-free sound capture with no tether or encumbrance by hand-held or body-worn sound equipment. Microphone arrays for this application represent an appropriate approach. This research develops new microphone array and signal processing techniques for high quality hands-free sound capture in noisy, reverberant enclosures. The new techniques combine matched-filtering of individual sensors and parallel processing to provide acute spatial volume selectivity which is capable of mitigating the deleterious effects of noise interference and multipath distortion. The new method outperforms traditional delay-and-sum beamformers which provide only directional spatial selectivity. The research additionally explores truncated matched-filtering and random distribution of transducers to reduce complexity and improve sound capture quality. All designs are first established by computer simulation of array performance in reverberant enclosures. The simulation is achieved by a room model which can efficiently calculate the acoustic multipath in a rectangular enclosure up to a prescribed order of images. It also calculates the incident angle of the arriving signal. Experimental arrays were constructed and their performance was measured in real rooms. Real room data were collected in a hard-walled laboratory and a controllable variable acoustics enclosure of similar size, approximately 6 x 6 x 3 m. An extensive speech database was also collected in these two enclosures for future research on microphone arrays. The simulation results are shown to be consistent with the real room data. Localization of sound sources has been explored using cross-power spectrum time delay estimation and has been evaluated using real room data under slightly, moderately and highly reverberant conditions. To improve the accuracy and reliability of the source localization, an outlier detector that removes incorrect time delay estimates has been invented. To provide speaker selectivity for microphone array systems, a hands-free speaker identification system has been studied. A recently invented feature using selected spectrum information outperforms traditional recognition methods. Measured results demonstrate the capabilities of speaker selectivity from a matched-filtered array. In addition, simulation utilities, including matched-filtering processing of the array and hands-free speaker identification, have been implemented on the massively parallel nCube supercomputer. This parallel computation highlights the requirements for real-time processing of array signals.
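Cross-power spectrum time delay estimation between a microphone pair can be sketched as follows. This is a minimal numpy version with PHAT weighting, one common variant of the generalized cross-correlation family; the weighting choice and the toy signals are assumptions, not necessarily the dissertation's exact formulation.

    import numpy as np

    def cross_power_delay(x, y, fs):
        """Delay of y relative to x from the peak of the cross-power
        spectrum (generalized cross-correlation, PHAT weighting)."""
        n = 2 * len(x)
        X = np.fft.rfft(x, n)
        Y = np.fft.rfft(y, n)
        cross = np.conj(X) * Y
        cross /= np.abs(cross) + 1e-12          # PHAT whitening
        cc = np.fft.irfft(cross, n)
        k = int(np.argmax(np.abs(cc)))
        lag = k if k <= n // 2 else k - n       # unwrap circular lag
        return lag / fs

    # Toy check: y is x delayed by 40 samples.
    fs = 16000
    x = np.random.default_rng(2).standard_normal(4096)
    y = np.concatenate((np.zeros(40), x))[:4096]
    print(cross_power_delay(x, y, fs))          # ~ 40 / 16000 s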
Center for Neural Engineering: applications of pulse-coupled neural networks
NASA Astrophysics Data System (ADS)
Malkani, Mohan; Bodruzzaman, Mohammad; Johnson, John L.; Davis, Joel
1999-03-01
The pulse-coupled neural network (PCNN) is an oscillatory neural network model in which cells form groups, and groups form groupings among themselves, based on the synchronicity of oscillations; the output time series (the number of cells that fire at each input presentation) is also called an 'icon'. Recent work by Johnson and others demonstrated the functional capabilities of networks containing such elements for invariant feature extraction using intensity maps. The PCNN thus presents itself as a more biologically plausible model with solid functional potential. This paper presents a summary of several projects, and their results, in which we successfully applied the PCNN. In project one, the PCNN was applied to object recognition and classification through a robotic vision system. The features (icons) generated by the PCNN were then fed into a feedforward neural network for classification. In project two, we developed techniques for sensory data fusion. The PCNN algorithm was implemented and tested on a B14 mobile robot. PCNN-based features were extracted from images taken by the robot vision system and used in conjunction with a map generated by data fusion of the sonar and wheel encoder data for navigation of the mobile robot. In our third project, we applied the PCNN to speaker recognition. Spectrogram images of speech signals are fed into the PCNN to produce invariant feature icons, which are then fed into a feedforward neural network for speaker identification.
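A minimal numpy sketch of the icon computation follows; the linking kernel, decay constants, and gains are illustrative assumptions rather than the authors' settings, but the update equations are the standard PCNN form (feeding and linking compartments, multiplicative linking, and a dynamic threshold).

    import numpy as np
    from scipy.signal import convolve2d

    def pcnn_icons(image, steps=20, beta=0.3, vF=0.1, vL=0.2, vT=20.0,
                   aF=0.1, aL=0.3, aT=0.2):
        """Minimal PCNN: returns the 'icon' time series, i.e. the
        number of neurons firing at each iteration."""
        kernel = np.array([[0.5, 1.0, 0.5],
                           [1.0, 0.0, 1.0],
                           [0.5, 1.0, 0.5]])
        F = np.zeros_like(image, dtype=float)   # feeding compartment
        L = np.zeros_like(F)                    # linking compartment
        T = np.ones_like(F)                     # dynamic threshold
        Y = np.zeros_like(F)                    # pulse output
        icon = []
        for _ in range(steps):
            W = convolve2d(Y, kernel, mode='same')
            F = np.exp(-aF) * F + vF * W + image
            L = np.exp(-aL) * L + vL * W
            U = F * (1.0 + beta * L)            # internal activity
            Y = (U > T).astype(float)
            T = np.exp(-aT) * T + vT * Y        # raise threshold after firing
            icon.append(int(Y.sum()))
        return icon

    # e.g. feed a (toy) spectrogram image of a speech signal:
    spec = np.random.default_rng(3).random((64, 64))
    print(pcnn_icons(spec))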
ERIC Educational Resources Information Center
Alowaydhi, Wafa Hafez
2016-01-01
The current study aimed at standardizing the program of learning Arabic for non-native speakers in Saudi Electronic University according to certain standards of total quality. To achieve its purpose, the study adopted the descriptive analytical method. The author prepared a measurement tool for evaluating the electronic learning programs in light…
Non-native Listeners’ Recognition of High-Variability Speech Using PRESTO
Tamati, Terrin N.; Pisoni, David B.
2015-01-01
Background Natural variability in speech is a significant challenge to robust successful spoken word recognition. In everyday listening environments, listeners must quickly adapt and adjust to multiple sources of variability in both the signal and listening environments. High-variability speech may be particularly difficult to understand for non-native listeners, who have less experience with the second language (L2) phonological system and less detailed knowledge of sociolinguistic variation of the L2. Purpose The purpose of this study was to investigate the effects of high-variability sentences on non-native speech recognition and to explore the underlying sources of individual differences in speech recognition abilities of non-native listeners. Research Design Participants completed two sentence recognition tasks involving high-variability and low-variability sentences. They also completed a battery of behavioral tasks and self-report questionnaires designed to assess their indexical processing skills, vocabulary knowledge, and several core neurocognitive abilities. Study Sample Native speakers of Mandarin (n = 25) living in the United States recruited from the Indiana University community participated in the current study. A native comparison group consisted of scores obtained from native speakers of English (n = 21) in the Indiana University community taken from an earlier study. Data Collection and Analysis Speech recognition in high-variability listening conditions was assessed with a sentence recognition task using sentences from PRESTO (Perceptually Robust English Sentence Test Open-Set) mixed in 6-talker multitalker babble. Speech recognition in low-variability listening conditions was assessed using sentences from HINT (Hearing In Noise Test) mixed in 6-talker multitalker babble. Indexical processing skills were measured using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Vocabulary knowledge was assessed with the WordFam word familiarity test, and executive functioning was assessed with the BRIEF-A (Behavioral Rating Inventory of Executive Function – Adult Version) self-report questionnaire. Scores from the non-native listeners on behavioral tasks and self-report questionnaires were compared with scores obtained from native listeners tested in a previous study and were examined for individual differences. Results Non-native keyword recognition scores were significantly lower on PRESTO sentences than on HINT sentences. Non-native listeners’ keyword recognition scores were also lower than native listeners’ scores on both sentence recognition tasks. Differences in performance on the sentence recognition tasks between non-native and native listeners were larger on PRESTO than on HINT, although group differences varied by signal-to-noise ratio. The non-native and native groups also differed in the ability to categorize talkers by region of origin and in vocabulary knowledge. Individual non-native word recognition accuracy on PRESTO sentences in multitalker babble at more favorable signal-to-noise ratios was found to be related to several BRIEF-A subscales and composite scores. However, non-native performance on PRESTO was not related to regional dialect categorization, talker and gender discrimination, or vocabulary knowledge. Conclusions High-variability sentences in multitalker babble were particularly challenging for non-native listeners. 
Difficulty under high-variability testing conditions was related to lack of experience with the L2, especially L2 sociolinguistic information, compared with native listeners. Individual differences among the non-native listeners were related to weaknesses in core neurocognitive abilities affecting behavioral control in everyday life. PMID:25405842
Speech coding, reconstruction and recognition using acoustics and electromagnetic waves
DOE Office of Scientific and Technical Information (OSTI.GOV)
Holzrichter, J.F.; Ng, L.C.
The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced speech, as well as of combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function for each time frame. The formation of feature vectors defining all acoustic speech units over well-defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching. 35 figs.
Reilly, Kevin J.; Spencer, Kristie A.
2013-01-01
The current study investigated the processes responsible for selection of sounds and syllables during production of speech sequences in 10 adults with hypokinetic dysarthria from Parkinson’s disease, five adults with ataxic dysarthria, and 14 healthy control speakers. Speech production data from a choice reaction time task were analyzed to evaluate the effects of sequence length and practice on speech sound sequencing. Speakers produced sequences that were between one and five syllables in length over five experimental runs of 60 trials each. In contrast to the healthy speakers, speakers with hypokinetic dysarthria demonstrated exaggerated sequence length effects for both inter-syllable intervals (ISIs) and speech error rates. Conversely, speakers with ataxic dysarthria failed to demonstrate a sequence length effect on ISIs and were also the only group that did not exhibit practice-related changes in ISIs and speech error rates over the five experimental runs. The exaggerated sequence length effects in the hypokinetic speakers with Parkinson’s disease are consistent with an impairment of action selection during speech sequence production. The absence of length effects in the speakers with ataxic dysarthria is consistent with previous findings that indicate a limited capacity to buffer speech sequences in advance of their execution. In addition, the lack of practice effects in these speakers suggests that learning-related improvements in the production rate and accuracy of speech sequences involve processing by structures of the cerebellum. Together, the current findings inform models of serial control for speech in healthy speakers and support the notion that sequencing deficits contribute to speech symptoms in speakers with hypokinetic or ataxic dysarthria. In addition, these findings indicate that speech sequencing is differentially impaired in hypokinetic and ataxic dysarthria. PMID:24137121
An articulatorily constrained, maximum entropy approach to speech recognition and speech coding
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hogden, J.
Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.
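For reference, the standard HMM likelihood computation the author contrasts his approach with can be sketched with the scaled forward algorithm. The discrete-emission formulation and the toy two-state model below are illustrative assumptions.

    import numpy as np

    def forward_log_likelihood(pi, A, B, obs):
        """log P(obs | HMM) via the scaled forward algorithm.

        pi: (S,) initial state probabilities
        A:  (S, S) transitions, A[i, j] = P(state j | state i)
        B:  (S, K) emission probabilities over K discrete symbols
        obs: sequence of symbol indices
        """
        alpha = pi * B[:, obs[0]]
        s = alpha.sum()
        log_like = np.log(s)
        alpha = alpha / s                  # rescale to avoid underflow
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
            s = alpha.sum()
            log_like += np.log(s)
            alpha = alpha / s
        return log_like

    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(forward_log_likelihood(pi, A, B, [0, 1, 1, 0]))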
Rep. Johnson, Henry C. "Hank," Jr. [D-GA-4
2014-02-11
House - 02/11/2014 Referred to the Committee on Financial Services, and in addition to the Committee on House Administration, for a period to be subsequently determined by the Speaker, in each case for consideration of such provisions as fall within the jurisdiction of the committee...
2004-06-01
American Society for Cell Biology annual meeting, San Francisco, California, Dec. 13-17, 2003. Invited speaker in a symposium, "Recognition of estrogen receptor in..." University of Virginia Health Science Center, Charlottesville, 1999. M.I. Bahamonde, G.E. Mann, C. Vergara, R. Latorre, Acute activation of Maxi-K channels by ... intracellular calcium, Proc. Natl. Acad. Sci. USA 96 (1999) 4686-4691.
Developing Multi-Voice Speech Recognition Confidence Measures and Applying Them to AHLTA-Mobile
2011-05-01
target application, then only the phoneme models used in that application's command set need be adapted. For the purpose of the recorder app, I opted...and solve it. We also plan on creating a simplified civilian version of the recorder for iPhone and Android. Conclusion: First, speaker search...pushed forward to the field hospital before the injured soldier arrives. It is not onerous to play all of them. Trouble Shooting: You say "Blood
Quantitative analysis of professionally trained versus untrained voices.
Siupsinskiene, Nora
2003-01-01
The aim of this study was to compare healthy trained and untrained voices, as well as healthy and dysphonic trained voices, in adults using combined voice range profile and aerodynamic tests; to define normal-range limiting values of quantitative voice parameters; and to select the most informative quantitative voice parameters for separating healthy from dysphonic trained voices. Three groups of persons were evaluated. One hundred eighty-six healthy volunteers were divided into two groups according to voice training: the non-professional speakers group consisted of 106 persons with untrained voices (36 males and 70 females), and the professional speakers group of 80 persons with trained voices (21 males and 59 females). The clinical group consisted of 103 dysphonic professional speakers (23 males and 80 females) with various voice disorders. Eighteen quantitative voice parameters from the combined voice range profile (VRP) test were analyzed: 8 voice range profile parameters, 8 speaking voice parameters, overall vocal dysfunction degree, and coefficient of sound, plus the aerodynamic maximum phonation time. Analysis showed that healthy professional speakers demonstrated expanded vocal abilities in comparison to healthy non-professional speakers. Quantitative voice range profile parameters (pitch range, high frequency limit, area of high frequencies, and coefficient of sound) differed significantly between healthy professional and non-professional voices, and were more informative than speaking voice or aerodynamic parameters in reflecting voice training. Logistic stepwise regression revealed that the VRP area in high frequencies was sufficient to discriminate between healthy and dysphonic professional speakers for male subjects (overall discrimination accuracy: 81.8%), and a combination of three quantitative parameters (VRP high frequency limit, maximum voice intensity, and slope of the speaking curve) for female subjects (overall model discrimination accuracy: 75.4%). We concluded that quantitative voice assessment with selected parameters might be useful for evaluating voice education in healthy professional speakers as well as for detecting vocal dysfunction and evaluating rehabilitation effects in dysphonic professionals.
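The discrimination step can be sketched with scikit-learn. The feature values below are toy numbers, not the study's data, and plain (non-stepwise) logistic regression stands in for the stepwise procedure; the three columns follow the parameters named for the female model.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy rows: VRP high frequency limit (Hz), maximum voice
    # intensity (dB SPL), slope of the speaking curve.
    X = np.array([[1046.0, 95.0, -0.5],
                  [ 988.0, 93.0, -0.7],
                  [ 880.0, 90.0, -0.9],
                  [ 698.0, 86.0, -1.2],
                  [ 659.0, 84.0, -1.4],
                  [ 622.0, 83.0, -1.5]])
    y = np.array([0, 0, 0, 1, 1, 1])      # 0 = healthy, 1 = dysphonic

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba(X)[:, 1])     # estimated probability of dysphonia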
Speech Prosody Across Stimulus Types for Individuals with Parkinson's Disease.
K-Y Ma, Joan; Schneider, Christine B; Hoffmann, Rüdiger; Storch, Alexander
2015-01-01
Up to 89% of individuals with Parkinson's disease (PD) experience speech problems over the course of the disease. Speech prosody and intelligibility are two of the most affected areas in hypokinetic dysarthria. However, assessment of these areas could be problematic, as speech prosody and intelligibility may be affected by the type of speech materials employed. The aim was to comparatively explore the effects of different types of speech stimulus on speech prosody and intelligibility in PD speakers. Speech prosody and intelligibility of two groups of individuals with varying degrees of dysarthria resulting from PD were compared to those of a group of control speakers using sentence reading, passage reading, and monologue. Acoustic analysis, including measures of fundamental frequency (F0), intensity, and speech rate, was used to form a prosodic profile for each individual. Speech intelligibility was measured for the speakers with dysarthria using direct magnitude estimation. A difference in F0 variability between the speakers with dysarthria and the control speakers was observed only in the sentence reading task. A difference in average intensity level was observed between speakers with mild dysarthria and the control speakers. Additionally, there were stimulus effects on both intelligibility and the prosodic profile. The prosodic profile of PD speakers differed from that of the control speakers in the more structured tasks, and lower intelligibility was found in the less structured task. This highlights the value of both structured and natural stimuli for evaluating speech production in PD speakers.
Social dominance orientation, nonnative accents, and hiring recommendations.
Hansen, Karolina; Dovidio, John F
2016-10-01
Discrimination against nonnative speakers is widespread and largely socially acceptable. Nonnative speakers are evaluated negatively because accent is a sign that they belong to an outgroup and because understanding their speech requires unusual effort from listeners. The present research investigated intergroup bias, based on stronger support for hierarchical relations between groups (social dominance orientation [SDO]), as a predictor of hiring recommendations of nonnative speakers. In an online experiment using an adaptation of the thin-slices methodology, 65 U.S. adults (54% women; 80% White; mean age = 35.91 years, range = 18-67) heard a recording of a job applicant speaking with an Asian (Mandarin Chinese) or a Latino (Spanish) accent. Participants indicated how likely they would be to recommend hiring the speaker, answered questions about the text, and indicated how difficult it was to understand the applicant. Independent of objective comprehension, participants high in SDO reported that it was more difficult to understand a Latino speaker than an Asian speaker. SDO predicted hiring recommendations of the speakers, but this relationship was mediated by the perception that nonnative speakers were difficult to understand. This effect was stronger for speakers from lower status groups (Latinos relative to Asians) and was not related to objective comprehension. These findings suggest a cycle of prejudice toward nonnative speakers: Not only do perceptions of difficulty in understanding cause prejudice toward them, but also prejudice toward low-status groups can lead to perceived difficulty in understanding members of these groups. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
Common constraints limit Korean and English character recognition in peripheral vision.
He, Yingchen; Kwon, MiYoung; Legge, Gordon E
2018-01-01
The visual span refers to the number of adjacent characters that can be recognized in a single glance. It is viewed as a sensory bottleneck in reading for both normal and clinical populations. In peripheral vision, the visual span for English characters can be enlarged after training with a letter-recognition task. Here, we examined the transfer of training from Korean to English characters for a group of bilingual Korean native speakers. In the pre- and posttests, we measured visual spans for Korean characters and English letters. Training (1.5 hours × 4 days) consisted of repetitive visual-span measurements for Korean trigrams (strings of three characters). Our training enlarged the visual spans for Korean single characters and trigrams, and the benefit transferred to untrained English symbols. The improvement was largely due to a reduction of within-character and between-character crowding in Korean recognition, as well as between-letter crowding in English recognition. We also found a negative correlation between the size of the visual span and the average pattern complexity of the symbol set. Together, our results showed that the visual span is limited by common sensory (crowding) and physical (pattern complexity) factors regardless of the language script, providing evidence that the visual span reflects a universal bottleneck for text recognition.
Mobile Audience Response Systems at a Continuing Medical Education Conference.
Beaumont, Alexandra; Gousseau, Michael; Sommerfeld, Connor; Leitao, Darren; Gooi, Adrian
2017-01-01
Mobile audience response systems (mARS) are electronic systems allowing speakers to ask questions and audience members to respond anonymously and immediately on a screen, which enables learners to view their peers' responses as well as their own. mARS encourages increased interaction and active learning. This study aims to examine the perceptions of audience members and speakers towards the implementation of mARS at a national medical conference. mARS was implemented at the CSO Annual Meeting in Winnipeg 2015. Eleven presenters agreed to participate in the mARS trial. Both audience and presenters received instructions. Five-point Likert questions and short answer questions were emailed to all conference attendees and the data were evaluated. Twenty-seven participants responded: 23 audience members and 4 instructors. Overall, responders indicated improved attention, involvement, engagement and recognition of audience's understanding of topics with the use of mARS. mARS was perceived as easy to use, with clear instructions, and the majority of respondents expressed an interest in using mARS in more presentations and in future national medical conferences. Most respondents preferred lectures with mARS over lectures without mARS. Some negative feedback on mARS involved dissatisfaction with how some presenters implemented mARS into the workshops. Overall mARS was perceived positively, with the majority of respondents wanting mARS implemented in more national medical conferences. Future studies should look at how mARS can be used as an educational tool to help improve patient outcomes.
Long short-term memory for speaker generalization in supervised speech separation
Chen, Jitong; Wang, DeLiang
2017-01-01
Speech separation can be formulated as learning to estimate a time-frequency mask from acoustic features extracted from noisy speech. For supervised speech separation, generalization to unseen noises and unseen speakers is a critical issue. Although deep neural networks (DNNs) have been successful in noise-independent speech separation, DNNs are limited in modeling a large number of speakers. To improve speaker generalization, a separation model based on long short-term memory (LSTM) is proposed, which naturally accounts for temporal dynamics of speech. Systematic evaluation shows that the proposed model substantially outperforms a DNN-based model on unseen speakers and unseen noises in terms of objective speech intelligibility. Analyzing LSTM internal representations reveals that LSTM captures long-term speech contexts. It is also found that the LSTM model is more advantageous for low-latency speech separation: without future frames, it performs better than the DNN model with future frames. The proposed model represents an effective approach for speaker- and noise-independent speech separation. PMID:28679261
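A minimal PyTorch sketch of an LSTM-based mask estimator in this spirit is shown below. The feature dimension, layer sizes, and sigmoid output are assumptions for illustration, not the paper's configuration.

    import torch
    import torch.nn as nn

    class MaskEstimator(nn.Module):
        """LSTM mapping noisy spectral features to a time-frequency mask."""
        def __init__(self, n_freq=161, hidden=256):
            super().__init__()
            self.lstm = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
            self.out = nn.Linear(hidden, n_freq)

        def forward(self, noisy_mag):            # (batch, frames, n_freq)
            h, _ = self.lstm(noisy_mag)
            return torch.sigmoid(self.out(h))    # mask values in [0, 1]

    # A typical training target would be an ideal ratio mask,
    # clean / (clean + noise) magnitudes; the mask is then applied
    # elementwise to the noisy magnitudes at inference time.
    model = MaskEstimator()
    noisy = torch.rand(4, 100, 161)
    mask = model(noisy)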
NASA Technical Reports Server (NTRS)
Costanza, Bryan T.; Horne, William C.; Schery, S. D.; Babb, Alex T.
2011-01-01
The Aero-Physics Branch at NASA Ames Research Center utilizes a 32- by 48-inch subsonic wind tunnel for aerodynamics research. The feasibility of acquiring acoustic measurements with a phased microphone array was recently explored. Acoustic characterization of the wind tunnel was carried out with a floor-mounted 24-element array and two ceiling-mounted speakers. The minimum speaker level for accurate level measurement was evaluated for various tunnel speeds up to a Mach number of 0.15 and streamwise speaker locations. A variety of post-processing procedures, including conventional beamforming and deconvolutional processing such as TIDY, were used. The speaker measurements, with and without flow, were used to compare actual versus simulated in-flow speaker calibrations. Data for wind-off speaker sound and wind-on tunnel background noise were found valuable for predicting sound levels for which the speakers were detectable when the wind was on. Speaker sources were detectable 2 - 10 dB below the peak background noise level with conventional data processing. The effectiveness of background noise cross-spectral matrix subtraction was assessed and found to improve the detectability of test sound sources by approximately 10 dB over a wide frequency range.
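The background-noise cross-spectral matrix (CSM) subtraction idea can be sketched as follows; the steering vector, block count, and noise level are toy assumptions, and in practice the noise CSM comes from a separate wind-on, source-off run rather than the same data.

    import numpy as np

    def cross_spectral_matrix(snapshots):
        """Average cross-spectral matrix at one frequency bin;
        snapshots has shape (num_blocks, num_mics)."""
        return np.einsum('bi,bj->ij', snapshots, np.conj(snapshots)) / len(snapshots)

    rng = np.random.default_rng(4)
    mics, blocks = 24, 500
    steer = np.exp(2j * np.pi * rng.random(mics))   # toy steering vector
    amp = rng.standard_normal((blocks, 1))          # unit-power test source
    noise = (rng.standard_normal((blocks, mics))
             + 1j * rng.standard_normal((blocks, mics))) * 3.0

    csm_noise = cross_spectral_matrix(noise)              # background run
    csm_total = cross_spectral_matrix(amp * steer + noise)
    csm_clean = csm_total - csm_noise                     # subtracted CSM

    # Conventional beamformer output toward the known steering direction
    # recovers roughly the source power (~1.0) after subtraction.
    w = steer / mics
    print(np.real(np.conj(w) @ csm_clean @ w))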
ERIC Educational Resources Information Center
Thibault, Pierrette; Sankoff, Gillian
1999-01-01
Analyzes the reactions of francophone Montrealers (n=116) to the recorded speech of English speakers using French. Particular focus is on finding out which linguistic traits of speech triggered the judgments on the speakers' competence and to what extent the speakers met the judges' expectations with regard to their job suitability. (Author/VWL)
Using Avatars for Improving Speaker Identification in Captioning
NASA Astrophysics Data System (ADS)
Vy, Quoc V.; Fels, Deborah I.
Captioning is the main method for accessing television and film content by people who are deaf or hard-of-hearing. One major difficulty consistently identified by the community is that of knowing who is speaking particularly for an off screen narrator. A captioning system was created using a participatory design method to improve speaker identification. The final prototype contained avatars and a coloured border for identifying specific speakers. Evaluation results were very positive; however participants also wanted to customize various components such as caption and avatar location.
Multisensory speech perception in autism spectrum disorder: From phoneme to whole-word perception.
Stevenson, Ryan A; Baum, Sarah H; Segers, Magali; Ferber, Susanne; Barense, Morgan D; Wallace, Mark T
2017-07-01
Speech perception in noisy environments is boosted when a listener can see the speaker's mouth and integrate the auditory and visual speech information. Autistic children have a diminished capacity to integrate sensory information across modalities, which contributes to core symptoms of autism, such as impairments in social communication. We investigated the abilities of autistic and typically-developing (TD) children to integrate auditory and visual speech stimuli in various signal-to-noise ratios (SNR). Measurements of both whole-word and phoneme recognition were recorded. At the level of whole-word recognition, autistic children exhibited reduced performance in both the auditory and audiovisual modalities. Importantly, autistic children showed reduced behavioral benefit from multisensory integration with whole-word recognition, specifically at low SNRs. At the level of phoneme recognition, autistic children exhibited reduced performance relative to their TD peers in auditory, visual, and audiovisual modalities. However, and in contrast to their performance at the level of whole-word recognition, both autistic and TD children showed benefits from multisensory integration for phoneme recognition. In accordance with the principle of inverse effectiveness, both groups exhibited greater benefit at low SNRs relative to high SNRs. Thus, while autistic children showed typical multisensory benefits during phoneme recognition, these benefits did not translate to typical multisensory benefit of whole-word recognition in noisy environments. We hypothesize that sensory impairments in autistic children raise the SNR threshold needed to extract meaningful information from a given sensory input, resulting in subsequent failure to exhibit behavioral benefits from additional sensory information at the level of whole-word recognition. Autism Res 2017, 10: 1280-1290. © 2017 International Society for Autism Research, Wiley Periodicals, Inc.
Reading component skills of learners in adult basic education.
MacArthur, Charles A; Konold, Timothy R; Glutting, Joseph J; Alamprese, Judith A
2010-01-01
The purposes of this study were to investigate the reliability and construct validity of measures of reading component skills with a sample of adult basic education (ABE) learners, including both native and nonnative English speakers, and to describe the performance of those learners on the measures. Investigation of measures of reading components is needed because available measures were neither developed for nor normed on ABE populations or with nonnative speakers of English. The study included 486 students, 334 born or educated in the United States (native) and 152 not born or educated in the United States (nonnative) but who spoke English well enough to participate in English reading classes. All students had scores on 11 measures covering five constructs: decoding, word recognition, spelling, fluency, and comprehension. Confirmatory factor analysis (CFA) was used to test three models: a two-factor model with print and meaning factors; a three-factor model that separated out a fluency factor; and a five-factor model based on the hypothesized constructs. The five-factor model fit best. In addition, the CFA model fit both native and nonnative populations equally well without modification, showing that the tests measure the same constructs with the same accuracy for both groups. Group comparisons found no difference between the native and nonnative samples on word recognition, but the native sample scored higher on fluency and comprehension and lower on decoding than did the nonnative sample. Students with self-reported learning disabilities scored lower on all reading components. Differences by age and gender were also analyzed.
Preterm birth in the Inuit and First Nations populations of Québec, Canada, 1981-2008.
Auger, Nathalie; Fon Sing, Mélanie; Park, Alison L; Lo, Ernest; Trempe, Normand; Luo, Zhong-Cheng
2012-03-24
To evaluate preterm birth (PTB) for Inuit and First Nations vs. non-Indigenous populations in the province of Québec, Canada. Retrospective cohort study. We evaluated singleton live births for Québec residents, 1981-2008 (n = 2,310,466). Municipality of residence (Inuit-inhabited, First Nations-inhabited, rest of Québec) and language (Inuit, First Nations, French/English) were used to identify Inuit and First Nations births. The outcome was PTB (<37 completed weeks). Cox proportional hazards regression was employed to estimate hazard ratios (HR) and 95% confidence intervals (CI) of PTB, adjusting for maternal age, education, marital status, parity and birth year. PTB rates were higher for Inuit language speakers in Inuit-inhabited areas and the rest of Québec compared with French/English speakers in the rest of Québec, and disparities persisted over time. Relative to French/English speakers in the rest of Québec, Inuit language speakers in the rest of Québec had the highest risk of PTB (HR 1.98, 95% CI: 1.62-2.41). The risk was also elevated for Inuit language speakers in Inuit-inhabited areas, though to a lesser extent (HR 1.29, 95% CI: 1.18-1.41). In contrast, First Nations language speakers in First Nations-inhabited areas and the rest of Québec had similar or lower risks of PTB relative to French/English speakers in the rest of Québec. Inuit populations, especially those outside Inuit-inhabited areas, have persistently elevated risks of PTB, indicating a need for strategies to prevent PTB in this population.
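The survival-analysis protocol (Cox regression on gestational duration with PTB as the event, adjusted for maternal covariates) can be sketched with the lifelines package. The column names and the tiny data frame below are toy stand-ins, not study data, and only two of the adjustment covariates are shown.

    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.DataFrame({
        "weeks":   [36, 40, 39, 34, 35, 38, 41, 37, 33, 40],
        "ptb":     [1,  0,  0,  1,  1,  0,  0,  0,  1,  0],
        "inuit":   [1,  0,  1,  1,  0,  0,  0,  1,  1,  0],
        "mat_age": [24, 31, 28, 22, 35, 29, 33, 26, 21, 30],
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="weeks", event_col="ptb")
    cph.print_summary()    # hazard ratios with 95% confidence intervals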
Domain-specific impairment of source memory following a right posterior medial temporal lobe lesion.
Peters, Jan; Koch, Benno; Schwarz, Michael; Daum, Irene
2007-01-01
This single case analysis of memory performance in a patient with an ischemic lesion affecting posterior but not anterior right medial temporal lobe (MTL) indicates that source memory can be disrupted in a domain-specific manner. The patient showed normal recognition memory for gray-scale photos of objects (visual condition) and spoken words (auditory condition). While memory for visual source (texture/color of the background against which pictures appeared) was within the normal range, auditory source memory (male/female speaker voice) was at chance level, a performance pattern significantly different from the control group. This dissociation is consistent with recent fMRI evidence of anterior/posterior MTL dissociations depending upon the nature of source information (visual texture/color vs. auditory speaker voice). The findings are in good agreement with the view of dissociable memory processing by the perirhinal cortex (anterior MTL) and parahippocampal cortex (posterior MTL), depending upon the neocortical input that these regions receive. (c) 2007 Wiley-Liss, Inc.
Paying attention to attention in recognition memory: insights from models and electrophysiology.
Dubé, Chad; Payne, Lisa; Sekuler, Robert; Rotello, Caren M
2013-12-01
Reliance on remembered facts or events requires memory for their sources, that is, the contexts in which those facts or events were embedded. Understanding of source retrieval has been stymied by the fact that uncontrolled fluctuations of attention during encoding can cloud results of key importance to theoretical development. To address this issue, we combined electrophysiology (high-density electroencephalogram, EEG, recordings) with computational modeling of behavioral results. We manipulated subjects' attention to an auditory attribute, whether the source of individual study words was a male or female speaker. Posterior alpha-band (8-14 Hz) power in subjects' EEG increased after a cue to ignore the voice of the person who was about to speak. Receiver-operating-characteristic analysis validated our interpretation of oscillatory dynamics as a marker of attention to source information. With attention under experimental control, computational modeling showed unequivocally that memory for source (male or female speaker) reflected a continuous signal detection process rather than a threshold recollection process.
Professional Ethics for Astronomers
NASA Astrophysics Data System (ADS)
Marvel, K. B.
2005-05-01
There is a growing recognition that professional ethics is an important topic for all professional scientists, especially physical scientists. Situations at the National Laboratories have dramatically proven this point. Professional ethics is usually only considered important for the health sciences and the legal and medical professions. However, certain aspects of the day to day work of professional astronomers can be impacted by ethical issues. Examples include refereeing scientific papers, serving on grant panels or telescope allocation committees, submitting grant proposals, providing proper references in publications, proposals or talks and even writing recommendation letters for job candidates or serving on search committees. This session will feature several speakers on a variety of topics and provide time for questions and answers from the audience. Confirmed speakers include: Kate Kirby, Director, Institute for Theoretical Atomic and Molecular Physics - Professional Ethics in the Physical Sciences: An Overview; Rob Kennicutt, Astrophysical Journal Editor - Ethical Issues for Publishing Astronomers; Peggy Fischer, Office of the NSF Inspector General - Professional Ethics from the NSF Inspector General's Point of View.
A neuroimaging study of conflict during word recognition.
Riba, Jordi; Heldmann, Marcus; Carreiras, Manuel; Münte, Thomas F
2010-08-04
Using functional magnetic resonance imaging the neural activity associated with error commission and conflict monitoring in a lexical decision task was assessed. In a cohort of 20 native speakers of Spanish conflict was introduced by presenting words with high and low lexical frequency and pseudo-words with high and low syllabic frequency for the first syllable. Erroneous versus correct responses showed activation in the frontomedial and left inferior frontal cortex. A similar pattern was found for correctly classified words of low versus high lexical frequency and for correctly classified pseudo-words of high versus low syllabic frequency. Conflict-related activations for language materials largely overlapped with error-induced activations. The effect of syllabic frequency underscores the role of sublexical processing in visual word recognition and supports the view that the initial syllable mediates between the letter and word level.
Intonational Phrasing Is Constrained by Meaning, Not Balance
ERIC Educational Resources Information Center
Breen, Mara; Watson, Duane G.; Gibson, Edward
2011-01-01
This paper evaluates two classes of hypotheses about how people prosodically segment utterances: (1) meaning-based proposals, with a focus on Watson and Gibson's (2004) proposal, according to which speakers tend to produce boundaries before and after long constituents; and (2) balancing proposals, according to which speakers tend to produce…
Comparing Native and Non-Native Raters of US Federal Government Speaking Tests
ERIC Educational Resources Information Center
Brooks, Rachel Lunde
2013-01-01
Previous Language Testing research has largely reported that although many raters' characteristics affect their evaluations of language assessments (Reed & Cohen, 2001), being a native speaker or non-native speaker rater does not significantly affect final ratings (Kim, 2009). In Second Language Acquisition, some researchers conclude that…
ERIC Educational Resources Information Center
Fabiano-Smith, Leah; Hoffman, Katherine
2018-01-01
Purpose: Bilingual children whose phonological skills are evaluated using measures designed for monolingual English speakers are at risk for misdiagnosis of speech sound disorders (De Lamo White & Jin, 2011). Method: Forty-four children participated in this study: 15 typically developing monolingual English speakers, 7 monolingual English…
Quantitative Assessment of Interutterance Stability: Application to Dysarthria
ERIC Educational Resources Information Center
Cummins, Fred; Lowit, Anja; van Brenk, Frits
2014-01-01
Purpose: Following recent attempts to quantify articulatory impairment in speech, the present study evaluates the usefulness of a novel measure of motor stability to characterize dysarthria. Method: The study included 8 speakers with ataxic dysarthria (AD), 16 speakers with hypokinetic dysarthria (HD) as a result of Parkinson's disease, and…
Somatotype and Body Composition of Normal and Dysphonic Adult Speakers.
Franco, Débora; Fragoso, Isabel; Andrea, Mário; Teles, Júlia; Martins, Fernando
2017-01-01
Voice quality provides information about the anatomical characteristics of the speaker. Patterns of somatotype and body composition can provide essential knowledge to characterize the individuality of voice quality. The aim of this study was to verify whether there were significant differences in somatotype and body composition between normal and dysphonic speakers. Cross-sectional study. Anthropometric measurements were taken of a sample of 72 adult participants (40 normal speakers and 32 dysphonic speakers) according to International Society for the Advancement of Kinanthropometry standards, which allowed the calculation of the endomorphism, mesomorphism, and ectomorphism components, body density, body mass index, fat mass, percentage fat, and fat-free mass. Perceptual and acoustic evaluations, as well as nasoendoscopy, were used to assign speakers to the normal or dysphonic groups. There were no significant differences between normal and dysphonic speakers in the mean somatotype attitudinal distance and somatotype dispersion distance (despite marginally significant differences [P < 0.10] between groups on both measures) or in the mean vector of the somatotype components. Furthermore, no significant differences were found between groups in mean percentage fat, fat mass, fat-free mass, body density, or body mass index after controlling for sex. The findings suggest no significant differences in somatotype and body composition variables between normal and dysphonic speakers. Copyright © 2017 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
The non-trusty clown attack on model-based speaker recognition systems
NASA Astrophysics Data System (ADS)
Farrokh Baroughi, Alireza; Craver, Scott
2015-03-01
Biometric detectors for speaker identification commonly employ a statistical model of a subject's voice, such as a Gaussian mixture model (GMM), that combines multiple component means to improve detector performance. This allows a malicious insider to amend or append a component of a subject's statistical model so that a detector behaves normally except under a carefully engineered circumstance: an attacker can force a misclassification of his or her voice only when desired, by smuggling data into a database far in advance of an attack. Note that the attack is possible whenever the attacker has access to the database, even for a limited time, to modify the victim's model. We exhibit such an attack on a speaker identification system, in which an attacker forces a misclassification by speaking in an unusual voice, after replacing the least-weighted component of the victim's model with the most-weighted component of the attacker's unusual-voice model. The attacker makes his or her voice unusual during the attack because his or her normal voice model may already be in the database; by attacking with an unusual voice, the attacker retains the option of being recognized as himself or herself when talking normally, or as the victim when talking in the unusual manner. By attaching an appropriately weighted vector to a victim's model, we could impersonate all users in our simulations while avoiding unwanted false rejections.
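As a concrete illustration of the component-substitution step described in this abstract, the sketch below assumes diagonal-covariance GMM speaker models stored as plain (weights, means, variances) arrays; the function name and the optional weight boost are illustrative, not taken from the paper.

```python
# A sketch of the component-substitution attack described above, assuming
# diagonal-covariance GMM speaker models stored as (weights, means, variances)
# arrays. The function name and the optional weight boost are illustrative.
import numpy as np

def poison_model(victim, attacker, boost=1.0):
    """Return a copy of the victim GMM with one attacker component smuggled in."""
    w_v, mu_v, var_v = (a.copy() for a in victim)
    w_a, mu_a, var_a = attacker
    drop = np.argmin(w_v)                  # victim's least-weighted component
    keep = np.argmax(w_a)                  # attacker's most-weighted component
    mu_v[drop], var_v[drop] = mu_a[keep], var_a[keep]
    w_v[drop] = w_a[keep] * boost          # optionally up-weight the implant
    w_v /= w_v.sum()                       # re-normalize the mixture weights
    return w_v, mu_v, var_v

rng = np.random.default_rng(0)
def random_gmm(n_comp=8, dim=13):          # e.g., 13-dim cepstral features
    return (rng.dirichlet(np.ones(n_comp)),
            rng.normal(size=(n_comp, dim)),
            rng.uniform(0.5, 2.0, size=(n_comp, dim)))

poisoned = poison_model(random_gmm(), random_gmm())
```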
Congenital amusia in speakers of a tone language: association with lexical tone agnosia.
Nan, Yun; Sun, Yanan; Peretz, Isabelle
2010-09-01
Congenital amusia is a neurogenetic disorder that affects the processing of musical pitch in speakers of non-tonal languages like English and French. We assessed whether this musical disorder exists among speakers of Mandarin Chinese who use pitch to alter the meaning of words. Using the Montreal Battery of Evaluation of Amusia, we tested 117 healthy young Mandarin speakers with no self-declared musical problems and 22 individuals who reported musical difficulties and scored two standard deviations below the mean obtained by the Mandarin speakers without amusia. These 22 amusic individuals showed a similar pattern of musical impairment as did amusic speakers of non-tonal languages, by exhibiting a more pronounced deficit in melody than in rhythm processing. Furthermore, nearly half the tested amusics had impairments in the discrimination and identification of Mandarin lexical tones. Six showed marked impairments, displaying what could be called lexical tone agnosia, but had normal tone production. Our results show that speakers of tone languages such as Mandarin may experience musical pitch disorder despite early exposure to speech-relevant pitch contrasts. The observed association between the musical disorder and lexical tone difficulty indicates that the pitch disorder as defining congenital amusia is not specific to music or culture but is rather general in nature.
Wheat, Katherine L; Cornelissen, Piers L; Sack, Alexander T; Schuhmann, Teresa; Goebel, Rainer; Blomert, Leo
2013-05-01
Magnetoencephalography (MEG) has shown pseudohomophone priming effects at Broca's area (specifically pars opercularis of left inferior frontal gyrus and precentral gyrus; LIFGpo/PCG) within ∼100ms of viewing a word. This is consistent with Broca's area involvement in fast phonological access during visual word recognition. Here we used online transcranial magnetic stimulation (TMS) to investigate whether LIFGpo/PCG is necessary for (not just correlated with) visual word recognition by ∼100ms. Pulses were delivered to individually fMRI-defined LIFGpo/PCG in Dutch speakers 75-500ms after stimulus onset during reading and picture naming. Reading and picture naming reactions times were significantly slower following pulses at 225-300ms. Contrary to predictions, there was no disruption to reading for pulses before 225ms. This does not provide evidence in favour of a functional role for LIFGpo/PCG in reading before 225ms in this case, but does extend previous findings in picture stimuli to written Dutch words. Copyright © 2012 Elsevier Inc. All rights reserved.
Diminutives facilitate word segmentation in natural speech: cross-linguistic evidence.
Kempe, Vera; Brooks, Patricia J; Gillis, Steven; Samson, Graham
2007-06-01
Final-syllable invariance is characteristic of diminutives (e.g., doggie), which are a pervasive feature of the child-directed speech registers of many languages. Invariance in word endings has been shown to facilitate word segmentation (Kempe, Brooks, & Gillis, 2005) in an incidental-learning paradigm in which synthesized Dutch pseudonouns were used. To broaden the cross-linguistic evidence for this invariance effect and to increase its ecological validity, adult English speakers (n=276) were exposed to naturally spoken Dutch or Russian pseudonouns presented in sentence contexts. A forced choice test was given to assess target recognition, with foils comprising unfamiliar syllable combinations in Experiments 1 and 2 and syllable combinations straddling word boundaries in Experiment 3. A control group (n=210) received the recognition test with no prior exposure to targets. Recognition performance improved with increasing final-syllable rhyme invariance, with larger increases for the experimental group. This confirms that word ending invariance is a valid segmentation cue in artificial, as well as naturalistic, speech and that diminutives may aid segmentation in a number of languages.
Reasoning about knowledge: Children's evaluations of generality and verifiability.
Koenig, Melissa A; Cole, Caitlin A; Meyer, Meredith; Ridge, Katherine E; Kushnir, Tamar; Gelman, Susan A
2015-12-01
In a series of experiments, we examined 3- to 8-year-old children's (N=223) and adults' (N=32) use of two properties of testimony to estimate a speaker's knowledge: generality and verifiability. Participants were presented with a "Generic speaker" who made a series of 4 general claims about "pangolins" (a novel animal kind), and a "Specific speaker" who made a series of 4 specific claims about "this pangolin" as an individual. To investigate the role of verifiability, we systematically varied whether the claim referred to a perceptually-obvious feature visible in a picture (e.g., "has a pointy nose") or a non-evident feature that was not visible (e.g., "sleeps in a hollow tree"). Three main findings emerged: (1) young children showed a pronounced reliance on verifiability that decreased with age. Three-year-old children were especially prone to credit knowledge to speakers who made verifiable claims, whereas 7- to 8-year-olds and adults credited knowledge to generic speakers regardless of whether the claims were verifiable; (2) children's attributions of knowledge to generic speakers were not detectable until age 5, and only when those claims were also verifiable; (3) children often generalized speakers' knowledge outside of the pangolin domain, indicating a belief that a person's knowledge about pangolins likely extends to new facts. The findings indicate that young children may be inclined to doubt speakers who make claims they cannot verify themselves, and that appreciation for speakers who make general claims increases with development. Copyright © 2015 Elsevier Inc. All rights reserved.
Identifying a Foreign Accent in an Unfamiliar Language
ERIC Educational Resources Information Center
Major, Roy C.
2007-01-01
This study explores the question of whether native and nonnative listeners, some familiar with the language and some not, differ in their accent ratings of native speakers (NSs) and nonnative speakers (NNSs). Although a few studies have employed native and nonnative judges to evaluate native and nonnative speech, the present study is perhaps the…
A Study of Cleft Palate Speakers with Marginal Velopharyngeal Competence.
ERIC Educational Resources Information Center
Hardin, M. A.; And Others
1986-01-01
The study examined a previously hypothesized model for a subgroup of cleft palate speakers with marginal velopharyngeal competence during speech. Evaluation of 52 5- and 6-year-olds with appropriate lateral X-ray results indicated that most met fewer than three of the other five criteria required by the model. (Author/DB)
Jiang, Chenghui; Whitehill, Tara L
2014-04-01
Speech errors associated with cleft palate are well established for English and several other Indo-European languages. Few articles describing the speech of Putonghua (standard Mandarin Chinese) speakers with cleft palate have been published in English language journals. Although methodological guidelines have been published for the perceptual speech evaluation of individuals with cleft palate, there has been no critical review of methodological issues in studies of Putonghua speakers with cleft palate. A literature search was conducted to identify relevant studies published over the past 30 years in Chinese language journals. Only studies incorporating perceptual analysis of speech were included. Thirty-seven articles which met inclusion criteria were analyzed and coded on a number of methodological variables. Reliability was established by having all variables recoded for all studies. This critical review identified many methodological issues. These design flaws make it difficult to draw reliable conclusions about characteristic speech errors in this group of speakers. Specific recommendations are made to improve the reliability and validity of future studies, as well to facilitate cross-center comparisons.
Automated Speech Rate Measurement in Dysarthria.
Martens, Heidi; Dekens, Tomas; Van Nuffelen, Gwen; Latacz, Lukas; Verhelst, Werner; De Bodt, Marc
2015-06-01
In this study, a new algorithm for automated determination of speech rate (SR) in dysarthric speech is evaluated. We investigated how reliably the algorithm calculates the SR of dysarthric speech samples when compared with calculation performed by speech-language pathologists. The new algorithm was trained and tested using Dutch speech samples of 36 speakers with no history of speech impairment and 40 speakers with mild to moderate dysarthria. We tested the algorithm under various conditions: according to speech task type (sentence reading, passage reading, and storytelling) and algorithm optimization method (speaker group optimization and individual speaker optimization). Correlations between automated and human SR determination were calculated for each condition. High correlations between automated and human SR determination were found in the various testing conditions. The new algorithm measures SR in a sufficiently reliable manner. It is currently being integrated in a clinical software tool for assessing and managing prosody in dysarthric speech. Further research is needed to fine-tune the algorithm to severely dysarthric speech, to make the algorithm less sensitive to background noise, and to evaluate how the algorithm deals with syllabic consonants.
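The abstract does not detail the algorithm's internals. As a rough illustration only, the sketch below uses a common baseline for automated speech-rate estimation: counting intensity peaks in a smoothed energy envelope as a proxy for syllable nuclei. The window length, peak-spacing constraint, and test signal are made-up values, not the paper's method.

```python
# A rough illustration only: estimate speech rate by counting intensity peaks in
# a smoothed energy envelope (a common proxy for syllable nuclei). The window
# length, peak spacing, and test signal are assumptions, not the paper's method.
import numpy as np
from scipy.signal import find_peaks

def estimate_speech_rate(samples, sr, min_gap_s=0.12):
    energy = samples.astype(float) ** 2
    win = max(1, int(0.02 * sr))                     # ~20 ms smoothing window
    envelope = np.convolve(energy, np.ones(win) / win, mode="same")
    peaks, _ = find_peaks(envelope,
                          height=envelope.mean(),    # ignore low-energy bumps
                          distance=int(min_gap_s * sr))
    return len(peaks) / (len(samples) / sr)          # "syllables" per second

sr = 16000
t = np.arange(2 * sr) / sr                           # two seconds of fake speech
fake = np.sin(2 * np.pi * 3 * t) * np.random.default_rng(1).normal(size=t.size)
print(round(estimate_speech_rate(fake, sr), 2))      # ~6 nuclei/s for this signal
```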
NASA Astrophysics Data System (ADS)
S. Al-Kaltakchi, Musab T.; Woo, Wai L.; Dlay, Satnam; Chambers, Jonathon A.
2017-12-01
In this study, a speaker identification system is considered that consists of a feature extraction stage utilizing both power-normalized cepstral coefficients (PNCCs) and Mel-frequency cepstral coefficients (MFCCs). Normalization is applied by employing cepstral mean and variance normalization (CMVN) and feature warping (FW), together with acoustic modeling using a Gaussian mixture model-universal background model (GMM-UBM). The main contributions are comprehensive evaluations of the effect of both additive white Gaussian noise (AWGN) and non-stationary noise (NSN) (with and without a G.712-type handset) upon identification performance. In particular, three NSN types with varying signal-to-noise ratios (SNRs) were tested, corresponding to street traffic, a bus interior, and a crowded talking environment. The performance evaluation also considered the effect of late fusion techniques based on score fusion, namely mean, maximum, and linear weighted sum fusion. The databases employed were TIMIT, SITW, and NIST 2008, and 120 speakers were selected from each database to yield 3,600 speech utterances. The study recommends mean fusion, which yields the best overall speaker identification accuracy (SIA) with noisy speech, whereas linear weighted sum fusion is best overall for original database recordings.
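The three late-fusion rules compared in the study (mean, maximum, and linear weighted sum) reduce to simple array operations over per-system score matrices. A minimal sketch follows; the score matrices, speaker counts, and the fusion weight alpha are invented for illustration.

```python
# A minimal sketch of the late score-fusion rules compared in the study (mean,
# maximum, and linear weighted sum), applied to per-system speaker-ID scores.
# The score matrices and the fusion weight are made up for illustration.
import numpy as np

def fuse(scores_a, scores_b, rule="mean", alpha=0.6):
    """Fuse two (n_trials, n_speakers) score matrices into one."""
    if rule == "mean":
        return (scores_a + scores_b) / 2.0
    if rule == "max":
        return np.maximum(scores_a, scores_b)
    if rule == "weighted":
        return alpha * scores_a + (1.0 - alpha) * scores_b
    raise ValueError(rule)

rng = np.random.default_rng(1)
mfcc_scores = rng.normal(size=(100, 120))   # e.g., MFCC/GMM-UBM log-likelihoods
pncc_scores = rng.normal(size=(100, 120))   # e.g., PNCC/GMM-UBM log-likelihoods
for rule in ("mean", "max", "weighted"):
    ident = fuse(mfcc_scores, pncc_scores, rule).argmax(axis=1)
    print(rule, ident[:5])                  # identified speaker index per trial
```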
Speaker Introductions at Internal Medicine Grand Rounds: Forms of Address Reveal Gender Bias.
Files, Julia A; Mayer, Anita P; Ko, Marcia G; Friedrich, Patricia; Jenkins, Marjorie; Bryan, Michael J; Vegunta, Suneela; Wittich, Christopher M; Lyle, Melissa A; Melikian, Ryan; Duston, Trevor; Chang, Yu-Hui H; Hayes, Sharonne N
2017-05-01
Gender bias has been identified as one of the drivers of gender disparity in academic medicine. Bias may be reinforced by gender-subordinating language or differential use of formality in forms of address. Professional titles may influence the perceived expertise and authority of the referenced individual. The objective of this study is to examine how professional titles were used in same- and mixed-gender speaker introductions at Internal Medicine Grand Rounds (IMGR). A retrospective observational study of video-archived speaker introductions at consecutive IMGR was conducted at two different locations (Arizona, Minnesota) of an academic medical center. Introducers and speakers at IMGR were physician and scientist peers holding MD, PhD, or MD/PhD degrees. The primary outcome was whether or not a speaker's professional title was used during the first form of address during speaker introductions at IMGR. As a secondary outcome, we evaluated whether or not the speaker's professional title was used in any form of address during the introduction. Three hundred twenty-one forms of address were analyzed. Female introducers were more likely than male introducers to use professional titles when introducing any speaker during the first form of address (96.2% [102/106] vs. 65.6% [141/215]; p < 0.001). Female dyads utilized formal titles during the first form of address 97.8% of the time (45/46), compared with 72.4% (110/152) for male dyads (p = 0.007). In mixed-gender dyads where the introducer was female and the speaker male, formal titles were used 95.0% (57/60) of the time. Male introducers of female speakers utilized professional titles 49.2% (31/63) of the time (p < 0.001). In this study, women introduced by men at IMGR were less likely to be addressed by professional title than were men introduced by men. Differential formality in speaker introductions may amplify the isolation, marginalization, and professional discomfiture expressed by women faculty in academic medicine.
Speaker verification using committee neural networks.
Reddy, Narender P; Buch, Ojas A
2003-10-01
Security is a major problem in web-based or remote access to databases. In the present study, the technique of committee neural networks was developed for speech-based speaker verification. Speech data from the designated speaker and several imposters were obtained. Several parameters were extracted in the time and frequency domains and fed to neural networks. Several neural networks were trained, and the five best-performing networks were recruited into the committee. The committee decision was based on majority voting of the member networks, and the committee opinion was evaluated with further testing data. The committee correctly identified the designated speaker in 100% of the cases (50 out of 50) and rejected imposters in 100% of the cases (150 out of 150). The committee decision was not unanimous in the majority of the cases tested.
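A minimal sketch of the committee decision rule described above: each member network emits a score for the claimed speaker, scores are thresholded into accept/reject votes, and the committee accepts on a majority. The threshold and the simulated member outputs are assumptions, not the paper's values.

```python
# A minimal sketch of majority voting over committee member networks: each
# member's output in [0, 1] is thresholded into a vote, and the committee
# accepts when a majority of members vote accept. Outputs are simulated.
import numpy as np

def committee_verify(member_scores, threshold=0.5):
    """member_scores: (n_members, n_trials) network outputs in [0, 1]."""
    votes = member_scores >= threshold            # per-member accept votes
    n_members = member_scores.shape[0]
    return votes.sum(axis=0) > n_members // 2     # majority accepts

rng = np.random.default_rng(2)
scores = rng.uniform(size=(5, 10))                # 5 member networks, 10 trials
print(committee_verify(scores))
```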
Methods for examining data quality in healthcare integrated data repositories.
Huser, Vojtech; Kahn, Michael G; Brown, Jeffrey S; Gouripeddi, Ramkiran
2018-01-01
This paper summarizes the content of a workshop focused on data quality. The first speaker (VH) described the data quality infrastructure and data quality evaluation methods currently in place within the Observational Health Data Sciences and Informatics (OHDSI) consortium, presented in detail a data quality tool called Achilles Heel and the latest developments for extending it, and reported interim results of an ongoing data quality study within the OHDSI consortium. The second speaker (MK) described lessons learned and new data quality checks developed by the PEDSnet pediatric research network. The last two speakers (JB, RG) described tools developed by the Sentinel Initiative and the University of Utah's service-oriented framework. Throughout and in a closing discussion, the workshop considered how data quality assessment can be advanced by combining the best features of each network.
Tebb, Kathleen P; Pollack, Lance M; Millstein, Shana; Otero-Sabogal, Regina; Wibbelsman, Charles J
2014-09-01
To explore parental beliefs and attitudes about confidential services for their teenagers, and to develop an instrument to assess these beliefs and attitudes that could be used among English and Spanish speakers. The long-term goal is to use this research to better understand and evaluate interventions to improve parental knowledge of and attitudes toward their adolescents' access to and utilization of comprehensive confidential health services. The instrument was developed using an extensive literature review and theoretical framework, followed by qualitative data from focus groups and in-depth interviews. It was then pilot tested with a random sample of English- and Spanish-speaking parents and further revised. The final instrument was administered to a random sample of 1,000 mothers. The psychometric properties of the instrument were assessed for Spanish and English speakers. The instrument consisted of 12 scales. Most Cronbach alphas were >.70 for Spanish and English speakers. Fewer items "loaded" for Spanish speakers on the Responsibility and Communication scales, and the Parental Control of Health Information scale failed for Spanish speakers. The Parental Attitudes of Adolescent Confidential Health Services Questionnaire (PAACS-Q) contains 12 scales and is a valid and reliable instrument to assess parental knowledge and attitudes toward confidential health services for adolescents among English speakers; all but one scale was applicable for Spanish speakers. More research is needed to understand key constructs with Spanish speakers. Copyright © 2014 Society for Adolescent Health and Medicine. Published by Elsevier Inc. All rights reserved.
The artful dodger: answering the wrong question the right way.
Rogers, Todd; Norton, Michael I
2011-06-01
What happens when speakers try to "dodge" a question they would rather not answer by answering a different question? In 4 studies, we show that listeners can fail to detect dodges when speakers answer similar-but objectively incorrect-questions (the "artful dodge"), a detection failure that goes hand-in-hand with a failure to rate dodgers more negatively. We propose that dodges go undetected because listeners' attention is not usually directed toward a goal of dodge detection (i.e., Is this person answering the question?) but rather toward a goal of social evaluation (i.e., Do I like this person?). Listeners were not blind to all dodge attempts, however. Dodge detection increased when listeners' attention was diverted from social goals toward determining the relevance of the speaker's answers (Study 1), when speakers answered a question egregiously dissimilar to the one asked (Study 2), and when listeners' attention was directed to the question asked by keeping it visible during speakers' answers (Study 4). We also examined the interpersonal consequences of dodge attempts: When listeners were guided to detect dodges, they rated speakers more negatively (Study 2), and listeners rated speakers who answered a similar question in a fluent manner more positively than speakers who answered the actual question but disfluently (Study 3). These results add to the literatures on both Gricean conversational norms and goal-directed attention. We discuss the practical implications of our findings in the contexts of interpersonal communication and public debates.
Ménard, Lucie; Turgeon, Christine; Trudeau-Fisette, Paméla; Bellavance-Courtemanche, Marie
2016-01-01
The impact of congenital visual deprivation on speech production in adults was examined in an ultrasound study of compensation strategies for lip-tube perturbation. Acoustic and articulatory analyses of the rounded vowel /u/ produced by 12 congenitally blind adult French speakers and 11 sighted adult French speakers were conducted under two conditions: normal and perturbed (with a 25-mm diameter tube inserted between the lips). Vowels were produced with auditory feedback and without auditory feedback (masked noise) to evaluate the extent to which both groups relied on this type of feedback to control speech movements. The acoustic analyses revealed that all participants mainly altered F2 and F0 and, to a lesser extent, F1 in the perturbed condition - only when auditory feedback was available. There were group differences in the articulatory strategies recruited to compensate; while all speakers moved their tongues more backward in the perturbed condition, blind speakers modified tongue-shape parameters to a greater extent than sighted speakers.
A pilot study to assess oral health literacy by comparing a word recognition and comprehension tool.
Khan, Khadija; Ruby, Brendan; Goldblatt, Ruth S; Schensul, Jean J; Reisine, Susan
2014-11-18
Oral health literacy is important to oral health outcomes, yet very little has been established on comparing word recognition to comprehension in oral health literacy, especially in older adults. Our goal was to compare methods of measuring oral health literacy in older adults by using the Rapid Estimate of Adult Literacy in Dentistry (REALD-30) tool, covering both word recognition and comprehension, and by assessing comprehension of a brochure about dry mouth. 75 males and 75 females were recruited from the University of Connecticut dental practice. Participants were English speakers and at least 50 years of age. They were asked to read the REALD-30 words out loud (word recognition) and then define them (comprehension). Each correctly pronounced and defined word was scored 1, yielding total REALD-30 word recognition and comprehension scores of 0-30. Participants then read the National Institute of Dental and Craniofacial Research brochure "Dry Mouth" and answered three questions defining dry mouth, its causes, and treatment. Participants also completed a survey on dental behavior. Participants scored higher on REALD-30 word recognition, with a mean of 22.98 (SD = 5.1), than on REALD-30 comprehension, with a mean of 16.1 (SD = 4.3). The mean score on brochure comprehension was 5.1 out of a possible 7 (SD = 1.6). Pearson correlations demonstrated significant associations among the three measures. Multivariate regression showed that females and those with higher education had significantly higher scores on REALD-30 word recognition and the dry mouth brochure questions. Being white was significantly related to higher REALD-30 recognition and comprehension scores but not to scores on the brochure. This pilot study demonstrates the feasibility of using the REALD-30 and a brochure to assess literacy among older adults in a university setting. Participants had higher scores on word recognition than on comprehension, consistent with other studies showing that recognition does not imply understanding.
The Impact of Early Bilingualism on Face Recognition Processes.
Kandel, Sonia; Burfin, Sabine; Méary, David; Ruiz-Tada, Elisa; Costa, Albert; Pascalis, Olivier
2016-01-01
Early linguistic experience has an impact on the way we decode audiovisual speech in face-to-face communication. The present study examined whether differences in visual speech decoding could be linked to a broader difference in face processing. To identify a phoneme we have to do an analysis of the speaker's face to focus on the relevant cues for speech decoding (e.g., locating the mouth with respect to the eyes). Face recognition processes were investigated through two classic effects in face recognition studies: the Other-Race Effect (ORE) and the Inversion Effect. Bilingual and monolingual participants did a face recognition task with Caucasian faces (own race), Chinese faces (other race), and cars that were presented in an Upright or Inverted position. The results revealed that monolinguals exhibited the classic ORE. Bilinguals did not. Overall, bilinguals were slower than monolinguals. These results suggest that bilinguals' face processing abilities differ from monolinguals'. Early exposure to more than one language may lead to a perceptual organization that goes beyond language processing and could extend to face analysis. We hypothesize that these differences could be due to the fact that bilinguals focus on different parts of the face than monolinguals, making them more efficient in other race face processing but slower. However, more studies using eye-tracking techniques are necessary to confirm this explanation.
Processing of Acoustic Cues in Lexical-Tone Identification by Pediatric Cochlear-Implant Recipients
Peng, Shu-Chen; Lu, Hui-Ping; Lu, Nelson; Lin, Yung-Song; Deroche, Mickael L. D.
2017-01-01
Purpose: The objective was to investigate acoustic cue processing in lexical-tone recognition by pediatric cochlear-implant (CI) recipients who are native Mandarin speakers. Method: Lexical-tone recognition was assessed in pediatric CI recipients and listeners with normal hearing (NH) in 2 tasks. In Task 1, participants identified naturally uttered words that were contrastive in lexical tones. For Task 2, a disyllabic word (yanjing) was manipulated orthogonally, varying in fundamental-frequency (F0) contours and duration patterns. Participants identified each token with the second syllable jing pronounced with Tone 1 (a high level tone) as eyes or with Tone 4 (a high falling tone) as eyeglasses. Results: CI participants' recognition accuracy was significantly lower than NH listeners' in Task 1. In Task 2, CI participants' reliance on F0 contours was significantly less than that of NH listeners; their reliance on duration patterns, however, was significantly higher than that of NH listeners. Both CI and NH listeners' performance in Task 1 was significantly correlated with their reliance on F0 contours in Task 2. Conclusion: For pediatric CI recipients, lexical-tone recognition using naturally uttered words is primarily related to their reliance on F0 contours, although duration patterns may be used as an additional cue. PMID:28388709
Arruti, Andoni; Cearreta, Idoia; Álvarez, Aitor; Lazkano, Elena; Sierra, Basilio
2014-01-01
The study of emotions in human–computer interaction is a growing research area. This paper presents an attempt to select the most significant features for emotion recognition in spoken Basque and Spanish, using different methods for feature selection. The RekEmozio database was used as the experimental data set. Several machine learning paradigms were used for the emotion classification task. Experiments were executed in three phases, using different sets of features as classification variables in each phase. Moreover, feature subset selection was applied at each phase in order to seek the most relevant feature subset. The three-phase approach was selected to check the validity of the proposed approach. The results show that an instance-based learning algorithm using feature subset selection techniques based on evolutionary algorithms is the best machine learning paradigm for automatic emotion recognition across all feature sets, obtaining a mean emotion recognition rate of 80.05% in Basque and 74.82% in Spanish. To check the soundness of the proposed process, a greedy search approach (FSS-Forward) was also applied, and a comparison between the two is provided. Based on the achieved results, a set of the most relevant non-speaker-dependent features is proposed for both languages, and new perspectives are suggested. PMID:25279686
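As an illustration of the greedy FSS-Forward baseline mentioned above, the sketch below wraps forward feature-subset selection around a k-NN classifier as a stand-in for the paper's instance-based learner; the data, cross-validation setup, and stopping rule are illustrative assumptions.

```python
# A sketch of greedy forward feature-subset selection (the FSS-Forward baseline
# mentioned above), using a k-NN classifier as a stand-in for the paper's
# instance-based learner. Data, CV setup, and stopping rule are assumptions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=5):
    selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
    while remaining and len(selected) < max_features:
        trial = {f: cross_val_score(KNeighborsClassifier(),
                                    X[:, selected + [f]], y, cv=3).mean()
                 for f in remaining}                 # score each candidate feature
        f_best = max(trial, key=trial.get)
        if trial[f_best] <= best_score:              # no improvement: stop
            break
        best_score = trial[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 20))                       # e.g., prosodic/MFCC features
y = (X[:, 3] + X[:, 7] > 0).astype(int)              # label depends on 2 features
print(forward_selection(X, y))
```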
Orthographic effects in spoken word recognition: Evidence from Chinese.
Qu, Qingqing; Damian, Markus F
2017-06-01
Extensive evidence from alphabetic languages demonstrates a role of orthography in the processing of spoken words. Because alphabetic systems explicitly code speech sounds, such effects are perhaps not surprising. However, it is less clear whether orthographic codes are involuntarily accessed from spoken words in languages with non-alphabetic systems, in which the sound-spelling correspondence is largely arbitrary. We investigated the role of orthography via a semantic relatedness judgment task: native Mandarin speakers judged whether or not spoken word pairs were related in meaning. Word pairs were either semantically related, orthographically related, or unrelated. Results showed that relatedness judgments were made faster for word pairs that were semantically related than for unrelated word pairs. Critically, orthographic overlap on semantically unrelated word pairs induced a significant increase in response latencies. These findings indicate that orthographic information is involuntarily accessed in spoken-word recognition, even in a non-alphabetic language such as Chinese.
Donkin, Christopher; Brown, Scott D; Heathcote, Andrew
2009-02-01
Psychological experiments often collect choice responses using buttonpresses. However, spoken responses are useful in many cases-for example, when working with special clinical populations, or when a paradigm demands vocalization, or when accurate response time measurements are desired. In these cases, spoken responses are typically collected using a voice key, which usually involves manual coding by experimenters in a tedious and error-prone manner. We describe ChoiceKey, an open-source speech recognition package for MATLAB. It can be optimized by training for small response sets and different speakers. We show ChoiceKey to be reliable with minimal training for most participants in experiments with two different responses. Problems presented by individual differences, and occasional atypical responses, are examined, and extensions to larger response sets are explored. The ChoiceKey source files and instructions may be downloaded as supplemental materials for this article from brm.psychonomic-journals.org/content/supplemental.
ERIC Educational Resources Information Center
Engelhardt, Paul E.; Alfridijanta, Oliver; McMullon, Mhairi E. G.; Corley, Martin
2017-01-01
We re-evaluate conclusions about disfluency production in high-functioning forms of autism spectrum disorder (HFA). Previous studies examined individuals with HFA to address a theoretical question regarding speaker- and listener-oriented disfluencies. Individuals with HFA tend to be self-centric and have poor pragmatic language skills, and should…
Beyond Semantic Accuracy: Preschoolers Evaluate a Speaker's Reasons
ERIC Educational Resources Information Center
Koenig, Melissa A.
2012-01-01
Children's sensitivity to the quality of epistemic reasons and their selective trust in the more reasonable of 2 informants was investigated in 2 experiments. Three-, 4-, and 5-year-old children (N = 90) were presented with speakers who stated different kinds of evidence for what they believed. Experiment 1 showed that children of all age groups…
ERIC Educational Resources Information Center
Gaillard, Stéphanie; Tremblay, Annie
2016-01-01
This study investigated the elicited imitation task (EIT) as a tool for measuring linguistic proficiency in a second/foreign (L2) language, focusing on French. Nonnative French speakers (n = 94) and native French speakers (n = 6) completed an EIT that included 50 sentences varying in length and complexity. Three raters evaluated productions on…
ERIC Educational Resources Information Center
Doebel, Sabine; Rowell, Shaina F.; Koenig, Melissa A.
2016-01-01
The reported research tested the hypothesis that young children detect logical inconsistency in communicative contexts that support the evaluation of speakers' epistemic reliability. In two experiments (N = 194), 3- to 5-year-olds were presented with two speakers who expressed logically consistent or inconsistent claims. Three-year-olds failed to…
Newly learned word forms are abstract and integrated immediately after acquisition
Kapnoula, Efthymia C.; McMurray, Bob
2015-01-01
A hotly debated question in word learning concerns the conditions under which newly learned words compete or interfere with familiar words during spoken word recognition. This has recently been described as a key marker of the integration of a new word into the lexicon and was thought to require consolidation (Dumay & Gaskell, Psychological Science, 18, 35–39, 2007; Gaskell & Dumay, Cognition, 89, 105–132, 2003). Recently, however, Kapnoula, Packard, Gupta, and McMurray (Cognition, 134, 85–99, 2015) showed that interference can be observed immediately after a word is first learned, implying very rapid integration of new words into the lexicon. It is an open question whether these kinds of effects derive from episodic traces of novel words or from more abstract and lexicalized representations. Here we addressed this question by testing inhibition for newly learned words using training and test stimuli presented in different talker voices. During training, participants were exposed to a set of nonwords spoken by a female speaker. Immediately after training, we assessed the ability of the novel word forms to inhibit familiar words, using a variant of the visual world paradigm. Crucially, the test items were produced by a male speaker. An analysis of fixations showed that even with a change in voice, newly learned words interfered with the recognition of similar known words. These findings show that lexical competition effects from newly learned words spread across different talker voices, which suggests that newly learned words can be sufficiently lexicalized, and abstract with respect to talker voice, without consolidation. PMID:26202702
Electrophysiology of subject-verb agreement mediated by speakers' gender.
Hanulíková, Adriana; Carreiras, Manuel
2015-01-01
An important property of speech is that it explicitly conveys features of a speaker's identity such as age or gender. This event-related potential (ERP) study examined the effects of social information provided by a speaker's gender, i.e., the conceptual representation of gender, on subject-verb agreement. Despite numerous studies on agreement, little is known about syntactic computations generated by speaker characteristics extracted from the acoustic signal. Slovak is well suited to investigate this issue because it is a morphologically rich language in which agreement involves features for number, case, and gender. Grammaticality of a sentence can be evaluated by checking a speaker's gender as conveyed by his/her voice. We examined how conceptual information about speaker gender, which is not syntactic but rather social and pragmatic in nature, is interpreted for the computation of agreement patterns. ERP responses to verbs disagreeing with the speaker's gender (e.g., a sentence including a masculine verbal inflection spoken by a female person: 'the neighbors were upset because I *stole-MASC plums') elicited a larger early posterior negativity compared to correct sentences. When the agreement was purely syntactic and did not depend on the speaker's gender, a disagreement between a formally marked subject and the verb inflection (e.g., 'the woman-FEM *stole-MASC plums') resulted in a larger P600 preceded by a larger anterior negativity compared to the control sentences. This result is in line with proposals according to which the recruitment of non-syntactic information such as the gender of the speaker results in N400-like effects, while formally marked syntactic features lead to structural integration as reflected in a LAN/P600 complex.
Groenewold, Rimke; Armstrong, Elizabeth
2018-05-14
Previous research has shown that speakers with aphasia rely on enactment more often than non-brain-damaged language users. Several studies have been conducted to explain this observed increase, demonstrating that spoken language containing enactment is easier to produce and more engaging to the conversation partner. This paper examines how the occurrence of enactment in casual conversation involving individuals with aphasia affects the level of conversational assertiveness. To evaluate whether and to what extent the occurrence of enactment in the speech of individuals with aphasia contributes to its conversational assertiveness. Conversations between a speaker with aphasia and his wife (drawn from AphasiaBank) were analysed in several steps. First, the transcripts were divided into moves, and all moves were coded according to the systemic functional linguistics (SFL) framework. Next, all moves were labelled in terms of their level of conversational assertiveness, as defined in the previous literature. Finally, all enactments were identified and their level of conversational assertiveness was compared with that of non-enactments. Throughout the conversations, the non-brain-damaged speaker was more assertive than the speaker with aphasia. However, the speaker with aphasia produced more enactments than the non-brain-damaged speaker, and his moves containing enactment were more assertive than those without enactment. The use of enactment in the conversations under study positively affected the level of conversational assertiveness of the speaker with aphasia, a competence that is important for speakers with aphasia because it contributes to their floor time, their chances of being heard seriously, and their degree of control over the conversation topic. © 2018 The Authors International Journal of Language & Communication Disorders published by John Wiley & Sons Ltd on behalf of Royal College of Speech and Language Therapists.
Goehring, Tobias; Bolner, Federico; Monaghan, Jessica J M; van Dijk, Bas; Zarowski, Andrzej; Bleeck, Stefan
2017-02-01
Speech understanding in noisy environments is still one of the major challenges for cochlear implant (CI) users in everyday life. We evaluated a speech enhancement algorithm based on neural networks (NNSE) for improving speech intelligibility in noise for CI users. The algorithm decomposes the noisy speech signal into time-frequency units, extracts a set of auditory-inspired features and feeds them to the neural network to produce an estimation of which frequency channels contain more perceptually important information (higher signal-to-noise ratio, SNR). This estimate is used to attenuate noise-dominated and retain speech-dominated CI channels for electrical stimulation, as in traditional n-of-m CI coding strategies. The proposed algorithm was evaluated by measuring the speech-in-noise performance of 14 CI users using three types of background noise. Two NNSE algorithms were compared: a speaker-dependent algorithm, that was trained on the target speaker used for testing, and a speaker-independent algorithm, that was trained on different speakers. Significant improvements in the intelligibility of speech in stationary and fluctuating noises were found relative to the unprocessed condition for the speaker-dependent algorithm in all noise types and for the speaker-independent algorithm in 2 out of 3 noise types. The NNSE algorithms used noise-specific neural networks that generalized to novel segments of the same noise type and worked over a range of SNRs. The proposed algorithm has the potential to improve the intelligibility of speech in noise for CI users while meeting the requirements of low computational complexity and processing delay for application in CI devices. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
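The channel-selection step described above (retain speech-dominated channels and attenuate noise-dominated ones, as in n-of-m coding) can be sketched directly; here the neural network's per-channel SNR estimates are replaced by random stand-ins, and the channel count, n, and attenuation factor are made-up values.

```python
# A sketch of the n-of-m selection step described above: given per-channel SNR
# estimates (random stand-ins here for the neural network's output), retain the
# n speech-dominated channels per frame and attenuate the rest.
import numpy as np

def select_channels(envelopes, snr_estimates, n_keep=8, attenuation=0.1):
    """envelopes, snr_estimates: (n_channels, n_frames) arrays."""
    gains = np.full_like(envelopes, attenuation)
    # For each frame, keep the n_keep channels with the highest estimated SNR.
    top = np.argsort(snr_estimates, axis=0)[-n_keep:, :]
    np.put_along_axis(gains, top, 1.0, axis=0)
    return envelopes * gains

rng = np.random.default_rng(4)
env = rng.uniform(size=(22, 100))        # 22 CI channels, 100 time frames
snr = rng.normal(size=(22, 100))         # stand-in for the network's SNR estimate
stimulation = select_channels(env, snr)
```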
Reaching Spanish-speaking smokers online: a 10-year worldwide research program
Muñoz, Ricardo Felipe; Chen, Ken; Bunge, Eduardo Liniers; Bravin, Julia Isabela; Shaughnessy, Elizabeth Annelly; Pérez-Stable, Eliseo Joaquín
2014-01-01
Objective: To describe a 10-year proof-of-concept smoking cessation research program evaluating the reach of online health interventions throughout the Americas. Methods: Recruitment occurred from 2002 to 2011, primarily using Google.com AdWords. Over 6 million smokers from the Americas entered keywords related to smoking cessation; 57 882 smokers (15 912 English speakers and 41 970 Spanish speakers) were recruited into online self-help automated intervention studies. To examine disparities in utilization of methods to quit smoking, cessation aids used by English speakers and Spanish speakers were compared. To determine whether online interventions reduce disparities, abstinence rates were also compared. Finally, the reach of the intervention was illustrated for three large Spanish-speaking countries of the Americas (Argentina, Mexico, and Peru) and the United States of America. Results: Few participants had utilized other methods to stop smoking before coming to the Internet site; most reported using no previous smoking cessation aids (69.2% of Spanish speakers versus 51.8% of English speakers; P < 0.01). The most used method was nicotine gum (13.9%). Nicotine dependence levels were similar to those reported for in-person smoking cessation trials. The overall observed quit rate was 38.1% for English speakers and 37.0% for Spanish speakers; quit rates in which participants with missing data were considered to be smoking were 11.1% and 10.6%, respectively. Neither comparison was significantly different. Conclusions: The systematic use of evidence-based Internet interventions for health problems could have a broad impact throughout the Americas, at little or no cost to individuals or to ministries of health. PMID:25211569
ERIC Educational Resources Information Center
Kobari-Wright, Vissy V.; Miguel, Caio F.
2014-01-01
We evaluated the effects of listener training on the emergence of categorization and speaker behavior (i.e., tacts) using a nonconcurrent multiple baseline design. Four children with autism learned to select pictures given their dictated category names. We assessed whether they could match and tact pictures by category. After training, 3…
Overcoming Gender Bias in Oral Testing: The Effect of Introducing Candidates.
ERIC Educational Resources Information Center
Ferguson, Bonnie
1994-01-01
Examined the effect of gender on English language teachers' evaluations of audiotapes of Japanese students' spoken English. The results showed a slight pro-male bias when the speakers were introduced as students but none when the speakers were introduced as doctors or experts in their field. Teacher gender was found to have no effect on the rating…
Word and Pseudoword Superiority Effects: Evidence From a Shallow Orthography Language.
Ripamonti, Enrico; Luzzatti, Claudio; Zoccolotti, Pierluigi; Traficante, Daniela
2017-08-03
The Word Superiority Effect (WSE) denotes better recognition of a letter embedded in a word than in a pseudoword. Alongside the WSE, a Pseudoword Superiority Effect (PSE) has also been described: it is easier to recognize a letter in a legal pseudoword than in an unpronounceable nonword. To date, both the WSE and the PSE have mainly been tested with English speakers. The present study uses the Reicher-Wheeler paradigm with native speakers of Italian (a language with a shallow orthography). Unlike in English and French, we found a WSE for reaction times (RTs) only, whereas the PSE was significant for both accuracy and RTs. This finding indicates that, in the Reicher-Wheeler task, readers of a shallow-orthography language can effectively rely on both the lexical and the sublexical routes. As to the effect of letter position, a clear advantage for the first letter position emerged, suggesting fine-grained processing of letter strings with coding of letter position and indicating the role of visual acuity and crowding factors.
What is French for déjà vu? Descriptions of déjà vu in native French and English speakers.
Fortier, Jonathan; Moulin, Chris J A
2015-11-01
Little is known about how people characterise and classify the experience of déjà vu. The term déjà vu might capture a range of different phenomena, and people may use it differently. We examined descriptions of déjà vu in two languages, French and English, hypothesising that the use of déjà vu would vary between them. In French, the phrase déjà vu can be used to indicate a veridical experience of recognition - as in "I have already seen this face before". However, the same is not true in English. In an online questionnaire, we found equal rates of déjà vu amongst French and English speakers, but key differences in how the experience was described. As expected, the French group described the experience as being more frequent; unexpectedly, they also found it more troubling. Copyright © 2015 Elsevier Inc. All rights reserved.
Masked priming effects are modulated by expertise in the script.
Perea, Manuel; Abu Mallouh, Reem; García-Orza, Javier; Carreiras, Manuel
2011-05-01
In a recent study using a masked priming same-different matching task, García-Orza, Perea, and Muñoz (2010) found a transposition priming effect for letter strings, digit strings, and symbol strings, but not for strings of pseudoletters (i.e., EPRI-ERPI produced similar response times to the control pair EDBI-ERPI). They argued that the mechanism responsible for position coding in masked priming is not operative with those "objects" whose identity cannot be attained rapidly. To assess this hypothesis, Experiment 1 examined masked priming effects in Arabic for native speakers of Arabic, whereas participants in Experiments 2 and 3 were lower-intermediate learners of Arabic and readers with no knowledge of Arabic, respectively. Results showed a masked priming effect only for readers who were familiar with the Arabic script. Furthermore, transposed-letter priming in native speakers of Arabic only occurred when the order of the root letters was kept intact. In Experiments 3-7, we examined why masked repetition priming is absent for readers who are unfamiliar with the Arabic script. We discuss the implications of these findings for models of visual-word recognition.
Speaker gender identification based on majority vote classifiers
NASA Astrophysics Data System (ADS)
Mezghani, Eya; Charfeddine, Maha; Nicolas, Henri; Ben Amar, Chokri
2017-03-01
Speaker gender identification is considered among the most important tools in several multimedia applications, namely automatic speech recognition, interactive voice response systems, and audio browsing systems. The performance of gender identification systems is closely linked to the selected feature set and the employed classification model. Typical techniques are based on selecting the best-performing classification method or searching for the optimum tuning of one classifier's parameters through experimentation. In this paper, we consider a relevant and rich set of features involving pitch, MFCCs, and other temporal and frequency-domain descriptors. Five classification models, including decision tree, discriminant analysis, naïve Bayes, support vector machine, and k-nearest neighbor, were evaluated. The three best-performing of the five classifiers then contribute to the final decision by majority voting over their scores. Experiments were performed on three datasets spoken in three languages (English, German, and Arabic) in order to validate the language independence of the proposed scheme. Results confirm that the presented system reaches a satisfying accuracy rate and promising classification performance, thanks to the discriminating abilities and diversity of the features used combined with mid-level statistics.
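A minimal sketch of the decision rule described above: of the five trained classifiers, the three with the best validation accuracy vote on each test utterance, and the majority label wins. The accuracy values, label coding, and simulated predictions are illustrative.

```python
# A minimal sketch of the decision rule described above: pick the 3 classifiers
# (of 5) with the best validation accuracy, then take a majority vote of their
# predicted labels per test utterance. All values are illustrative.
import numpy as np

def top3_majority(val_acc, test_preds):
    """val_acc: (5,) accuracies; test_preds: (5, n_trials) labels in {0, 1}."""
    top3 = np.argsort(val_acc)[-3:]               # indices of the best 3 classifiers
    votes = test_preds[top3]                      # (3, n_trials) vote matrix
    return (votes.sum(axis=0) >= 2).astype(int)   # majority of 3 votes wins

val_acc = np.array([0.81, 0.77, 0.88, 0.84, 0.79])   # e.g., DT, LDA, NB, SVM, kNN
rng = np.random.default_rng(5)
test_preds = rng.integers(0, 2, size=(5, 20))
print(top3_majority(val_acc, test_preds))            # 0 = male, 1 = female (say)
```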
Morphological learning in a novel language: A cross-language comparison.
Havas, Viktória; Waris, Otto; Vaquero, Lucía; Rodríguez-Fornells, Antoni; Laine, Matti
2015-01-01
Being able to extract and interpret the internal structure of complex word forms such as the English word dance+r+s is crucial for successful language learning. We examined whether the ability to extract morphological information during word learning is affected by the morphological features of one's native tongue. Spanish and Finnish adult participants performed a word-picture associative learning task in an artificial language where the target words included a suffix marking the gender of the corresponding animate object. The short exposure phase was followed by a word recognition task and a generalization task for the suffix. The participants' native tongues vary greatly in terms of morphological structure, leading to two opposing hypotheses. On the one hand, Spanish speakers may be more effective in identifying gender in a novel language because this feature is present in Spanish but not in Finnish. On the other hand, Finnish speakers may have an advantage as the abundance of bound morphemes in their language calls for continuous morphological decomposition. The results support the latter alternative, suggesting that lifelong experience on morphological decomposition provides an advantage in novel morphological learning.
Knowledge and implicature: modeling language understanding as social cognition.
Goodman, Noah D; Stuhlmüller, Andreas
2013-01-01
Is language understanding a special case of social cognition? To help evaluate this view, we can formalize it as the rational speech-act theory: Listeners assume that speakers choose their utterances approximately optimally, and listeners interpret an utterance by using Bayesian inference to "invert" this model of the speaker. We apply this framework to model scalar implicature ("some" implies "not all," and "N" implies "not more than N"). This model predicts an interaction between the speaker's knowledge state and the listener's interpretation. We test these predictions in two experiments and find good fit between model predictions and human judgments. Copyright © 2013 Cognitive Science Society, Inc.
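The recursion described here (a literal listener, a speaker who chooses utterances approximately optimally, and a pragmatic listener who Bayes-inverts the speaker) is compact enough to sketch. The world space, literal semantics, uniform priors, and rationality parameter below are standard illustrative choices, not the paper's exact setup; the sketch recovers the "some implies not all" implicature.

```python
# A minimal rational speech-act sketch for the scalar pair "some"/"all" over
# worlds 0..3 (how many of 3 objects have the property). The literal semantics,
# uniform priors, and rationality parameter are illustrative assumptions.
import numpy as np

worlds = np.arange(4)                       # 0, 1, 2, or 3 objects have the property
utterances = ["some", "all", "none"]
literal = np.array([
    [0, 1, 1, 1],                           # "some": true if at least one
    [0, 0, 0, 1],                           # "all":  true only if all three
    [1, 0, 0, 0],                           # "none": true if zero
], dtype=float)

def normalize(m, axis):
    return m / m.sum(axis=axis, keepdims=True)

alpha = 4.0                                  # speaker rationality (assumed)
L0 = normalize(literal, axis=1)              # literal listener: P(world | utterance)
S1 = normalize(L0 ** alpha, axis=0)          # speaker: P(utterance | world)
L1 = normalize(S1, axis=1)                   # pragmatic listener: Bayes-inverted speaker

# Hearing "some", the pragmatic listener puts almost no mass on world 3 ("all"):
print(dict(zip(worlds.tolist(), np.round(L1[0], 3))))
```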
2004-03-12
KENNEDY SPACE CENTER, FLA. - Florida Gov. Jeb Bush (left) and Center Director Jim Kennedy attend the luncheon at the 2004 Florida Regional FIRST competition held at the University of Central Florida. Both are featured speakers. The event hosted 41 teams from Canada, Brazil, Great Britain and the United States. FIRST is a nonprofit organization, For Inspiration and Recognition of Science and Technology, that sponsors the event pitting gladiator robots against each other in an athletic-style competition. The FIRST robotics competition is designed to provide students with a hands-on, inside look at engineering and other professional careers, pairing high school students with engineer mentors and corporations.
1983-08-16
34. " .. ,,,,.-j.Aid-is.. ;,,i . -i.t . "’" ’, V ,1 5- 4. 3- kHz 2-’ r 1 r s ’.:’ BOGEY 5D 0 S BOGEY 12D Figure 10. Spectrograms of two versions of the word...MF5852801B 0001 Reviewed by Approved and Released by Ashton Graybiel, M.D. Captain W. M. Houk , MC, USN Chief Scientific Advisor Commanding Officer 16 August...incorporating knowledge about these changes into speech recognition systems. i A J- I. . S , .4, ... ..’-° -- -iii l - - .- - i- . .. " •- - i ,f , i
Early testimonial learning: monitoring speech acts and speakers.
Stephens, Elizabeth; Suarez, Sarah; Koenig, Melissa
2015-01-01
Testimony provides children with a rich source of knowledge about the world and the people in it. However, testimony is not guaranteed to be veridical, and speakers vary greatly in both knowledge and intent. In this chapter, we argue that children encounter two primary types of conflicts when learning from speakers: conflicts of knowledge and conflicts of interest. We review recent research on children's selective trust in testimony and propose two distinct mechanisms supporting early epistemic vigilance in response to the conflicts associated with speakers. The first section of the chapter focuses on the mechanism of coherence checking, which occurs during the process of message comprehension and facilitates children's comparison of information communicated through testimony to their prior knowledge, alerting them to inaccurate, inconsistent, irrational, and implausible messages. The second section focuses on source-monitoring processes. When children lack relevant prior knowledge with which to evaluate testimonial messages, they monitor speakers themselves for evidence of competence and morality, attending to cues such as confidence, consensus, access to information, prosocial and antisocial behavior, and group membership. © 2015 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Karam, Walid; Mokbel, Chafic; Greige, Hanna; Chollet, Gerard
2006-05-01
A GMM-based audio-visual speaker verification system is described, and an Active Appearance Model with a linear speaker transformation system is used to evaluate the robustness of the verification. An Active Appearance Model (AAM) is used to automatically locate and track a speaker's face in a video recording. A Gaussian Mixture Model (GMM) based classifier (BECARS) is used for face verification. GMM training and testing are accomplished on DCT-based features extracted from the detected faces. On the audio side, speech features are extracted and used for speaker verification with the GMM-based classifier. Fusion of the audio and video modalities for audio-visual speaker verification is compared with the face verification and speaker verification systems alone. To improve the robustness of the multimodal biometric identity verification system, an audio-visual imposture system is envisioned. It consists of an automatic voice transformation technique that an impostor may use to assume the identity of an authorized client. Features of the transformed voice are then combined with the corresponding appearance features and fed into the GMM-based system BECARS for training. An attempt is made to increase the acceptance rate of the impostor and to analyze the robustness of the verification system. Experiments are being conducted on the BANCA database, with the prospect of experimenting on the newly developed PDAtabase created within the scope of the SecurePhone project.
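In the spirit of the GMM-based verification described above (the actual BECARS system is not reproduced here), the sketch below scores test frames against a client model and a world model, fuses audio and video log-likelihood ratios linearly, and accepts when the fused score exceeds a threshold; all models, features, and the video-side score are simulated stand-ins.

```python
# A sketch of GMM-based verification with linear audio-visual score fusion.
# Models, data, and the video LLR are simulated stand-ins, not BECARS itself.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def gmm_avg_loglik(X, weights, means):
    """Average per-frame log-likelihood under a GMM with identity covariances."""
    comp = np.stack([w * mvn.pdf(X, mean=m) for w, m in zip(weights, means)])
    return np.log(comp.sum(axis=0)).mean()

rng = np.random.default_rng(6)
dim = 4                                          # tiny stand-in for DCT/MFCC features
client = ([0.5, 0.5], [rng.normal(size=dim), rng.normal(size=dim)])
world = ([0.5, 0.5], [np.zeros(dim), np.zeros(dim)])

frames = rng.normal(loc=client[1][0], size=(50, dim))        # test frames
llr_audio = gmm_avg_loglik(frames, *client) - gmm_avg_loglik(frames, *world)
llr_video = 0.8 * llr_audio                      # stand-in for the video-side LLR
accept = 0.5 * llr_audio + 0.5 * llr_video > 0.0 # fused verification decision
print(accept)
```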
Human phoneme recognition depending on speech-intrinsic variability.
Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger
2010-11-01
The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).
The role of lexical variables in the visual recognition of Chinese characters: A megastudy analysis.
Sze, Wei Ping; Yap, Melvin J; Rickard Liow, Susan J
2015-01-01
Logographic Chinese orthography partially represents both phonology and semantics. By capturing the online processing of a large pool of Chinese characters, we were able to examine the relative salience of specific lexical variables when this nonalphabetic script is read. Using a sample of native mainland Chinese speakers (N = 35), lexical decision latencies for 1560 single characters were collated into a database, before the effects of a comprehensive range of variables were explored. Hierarchical regression analyses determined the unique item-level variance explained by orthographic (frequency, stroke count), semantic (age of learning, imageability, number of meanings), and phonological (consistency, phonological frequency) factors. Orthographic and semantic variables, respectively, accounted for more collective variance than the phonological variables. Significant main effects were further observed for the individual orthographic and semantic predictors. These results are consistent with the idea that skilled readers tend to rely on orthographic and semantic information when processing visually presented characters. This megastudy approach marks an important extension to existing work on Chinese character recognition, which hitherto has relied on factorial designs. Collectively, the findings reported here represent a useful set of empirical constraints for future computational models of character recognition.
The cognitive neuroscience of person identification.
Biederman, Irving; Shilowich, Bryan E; Herald, Sarah B; Margalit, Eshed; Maarek, Rafael; Meschke, Emily X; Hacker, Catrina M
2018-02-14
We compare and contrast five differences between person identification by voice and face. 1. There is little or no cost when a familiar face is to be recognized from an unrestricted set of possible faces, even at Rapid Serial Visual Presentation (RSVP) rates, but the accuracy of familiar voice recognition declines precipitously when the set of possible speakers is increased from one to a mere handful. 2. Whereas deficits in face recognition are typically perceptual in origin, those with normal perception of voices can manifest severe deficits in their identification. 3. Congenital prosopagnosics (CPros) and congenital phonagnosics (CPhon) are generally unable to imagine familiar faces and voices, respectively. Only in CPros, however, is this deficit a manifestation of a general inability to form visual images of any kind. CPhons report no deficit in imaging non-voice sounds. 4. The prevalence of CPhons of 3.2% is somewhat higher than the reported prevalence of approximately 2.0% for CPros in the population. There is evidence that CPhon represents a distinct condition statistically and not just normal variation. 5. Face and voice recognition proficiency are uncorrelated rather than reflecting limitations of a general capacity for person individuation. Copyright © 2018 Elsevier Ltd. All rights reserved.
A model of acoustic interspeaker variability based on the concept of formant-cavity affiliation
NASA Astrophysics Data System (ADS)
Apostol, Lian; Perrier, Pascal; Bailly, Gérard
2004-01-01
A method is proposed to model the interspeaker variability of formant patterns for oral vowels. It is assumed that this variability originates in the differences existing among speakers in the respective lengths of their front and back vocal-tract cavities. In order to characterize, from the spectral description of the acoustic speech signal, these vocal-tract differences between speakers, each formant is interpreted, according to the concept of formant-cavity affiliation, as a resonance of a specific vocal-tract cavity. Its frequency can thus be directly related to the corresponding cavity length, and a transformation model can be proposed from a speaker A to a speaker B on the basis of the frequency ratios of the formants corresponding to the same resonances. To minimize the number of sounds that must be recorded for each speaker to carry out this speaker transformation, the frequency ratios are computed exactly only for the three extreme cardinal vowels [i], [a], and [u], and they are approximated for the remaining vowels through an interpolation function. The method is evaluated through its capacity to transform the (F1,F2) formant patterns of eight oral vowels pronounced by five male speakers into the (F1,F2) patterns of the corresponding vowels generated by an articulatory model of the vocal tract. The resulting formant patterns are compared to those provided by normalization techniques published in the literature. The proposed method is found to be efficient, but a number of limitations are also observed and discussed. These limitations can be associated with the formant-cavity affiliation model itself or with a possible influence of speaker-specific vocal-tract geometry in the cross-sectional direction, which the model might not have taken into account.
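The core of the transformation is a set of speaker-to-speaker formant-frequency ratios anchored at the cardinal vowels. A minimal sketch, assuming made-up cardinal formant values and a simple 1-D interpolation over F1 in place of the paper's unspecified interpolation function:

```python
# Sketch of the frequency-ratio idea: exact A-to-B formant ratios are computed
# on the cardinal vowels [i], [a], [u]; intermediate vowels get interpolated
# ratios. All formant values here are hypothetical.
import numpy as np

cardinals_A = {"i": (270, 2300), "a": (750, 1300), "u": (300, 850)}
cardinals_B = {"i": (300, 2550), "a": (820, 1400), "u": (330, 950)}

def ratio_tables():
    """Per-cardinal F1 and F2 ratios B/A, sorted by speaker A's F1."""
    rows = sorted(
        (cardinals_A[v][0],
         cardinals_B[v][0] / cardinals_A[v][0],
         cardinals_B[v][1] / cardinals_A[v][1])
        for v in cardinals_A)
    f1_anchor, r1, r2 = map(np.array, zip(*rows))
    return f1_anchor, r1, r2

def transform(f1_a, f2_a):
    """Map a speaker-A vowel into speaker-B space via interpolated ratios."""
    f1_anchor, r1, r2 = ratio_tables()
    return (f1_a * np.interp(f1_a, f1_anchor, r1),
            f2_a * np.interp(f1_a, f1_anchor, r2))

print(transform(500, 1700))  # e.g. a mid front vowel of speaker A
```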
Revisiting speech rate and utterance length manipulations in stuttering speakers.
Blomgren, Michael; Goberman, Alexander M
2008-01-01
The goal of this study was to evaluate stuttering frequency across a multidimensional (2x2) hierarchy of speech performance tasks. Specifically, this study examined the interaction between changes in length of utterance and levels of speech rate stability. Forty-four adult male speakers participated in the study (22 stuttering speakers and 22 non-stuttering speakers). Participants were audio and video recorded while producing a spontaneous speech task and four different experimental speaking tasks. The four experimental speaking tasks involved reading a list of 45 words and a list of 45 phrases twice each. One reading of each list involved speaking at a steady habitual rate (habitual rate tasks) and the other involved producing the list at a variable speaking rate (variable rate tasks). For the variable rate tasks, participants were directed to produce words or phrases at randomly ordered slow, habitual, and fast rates. The stuttering speakers exhibited significantly more stuttering on the variable rate tasks than on the habitual rate tasks. In addition, the stuttering speakers exhibited significantly more stuttering on the first word of the phrase-length tasks compared to the single-word tasks. Overall, the results indicated that varying levels of both utterance length and temporal complexity function to modulate stuttering frequency in adult stuttering speakers. Discussion focuses on issues of speech performance according to stuttering severity and possible clinical implications. The reader will learn about and be able to: (1) describe the mediating effects of length of utterance and speech rate on the frequency of stuttering in stuttering speakers; (2) understand the rationale behind multidimensional skill performance matrices; and (3) describe possible applications of motor skill performance matrices to stuttering therapy.
Paats, A; Alumäe, T; Meister, E; Fridolin, I
2018-04-30
The aim of this study was to analyze retrospectively the influence of different acoustic and language models in order to determine the most important effects on the clinical performance of an Estonian language-based non-commercial radiology-oriented automatic speech recognition (ASR) system. An ASR system was developed for the Estonian language in the radiology domain by utilizing open-source software components (Kaldi toolkit, Thrax). The ASR system was trained with real radiology text reports and dictations collected during the development phases. The final version of the ASR system was tested by 11 radiologists who dictated 219 reports in total, in a spontaneous manner in a real clinical environment. The audio files collected in the final phase were used to measure the performance of different versions of the ASR system retrospectively. ASR system versions were evaluated by word error rate (WER) for each speaker and modality and by the WER difference between the first and the last version of the ASR system. Total average WER throughout all material improved from 18.4% for the first version (v1) to 5.8% for the last version (v8), which corresponds to a relative improvement of 68.5%. WER improvement was strongly related to modality and radiologist. In summary, the performance of the final ASR system version was close to optimal, delivering similar results across all modalities and being independent of user, complexity of the radiology reports, user experience, and speech characteristics.
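The WER figures quoted here follow the standard definition: word-level edit distance divided by the number of reference words (and (18.4 − 5.8)/18.4 ≈ 68.5% reproduces the stated relative improvement). A self-contained sketch:

```python
# Minimal word error rate (WER) computation: Levenshtein distance over words,
# divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("no acute intracranial abnormality",
          "no acute cranial abnormality"))  # one substitution -> 0.25
```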
Voices to reckon with: perceptions of voice identity in clinical and non-clinical voice hearers
Badcock, Johanna C.; Chhabra, Saruchi
2013-01-01
The current review focuses on the perception of voice identity in clinical and non-clinical voice hearers. Identity perception in auditory verbal hallucinations (AVH) is grounded in the mechanisms of human (i.e., real, external) voice perception, and shapes the emotional (distress) and behavioral (help-seeking) response to the experience. Yet, the phenomenological assessment of voice identity is often limited, for example to the gender of the voice, and has failed to take advantage of recent models and evidence on human voice perception. In this paper we aim to synthesize the literature on identity in real and hallucinated voices and begin by providing a comprehensive overview of the features used to judge voice identity in healthy individuals and in people with schizophrenia. The findings suggest some subtle, but possibly systematic biases across different levels of voice identity in clinical hallucinators that are associated with higher levels of distress. Next we provide a critical evaluation of voice processing abilities in clinical and non-clinical voice hearers, including recent data collected in our laboratory. Our studies used diverse methods, assessing recognition and binding of words and voices in memory as well as multidimensional scaling of voice dissimilarity judgments. The findings overall point to significant difficulties recognizing familiar speakers and discriminating between unfamiliar speakers in people with schizophrenia, both with and without AVH. In contrast, these voice processing abilities appear to be generally intact in non-clinical hallucinators. The review highlights some important avenues for future research and treatment of AVH associated with a need for care, and suggests some novel insights into other symptoms of psychosis. PMID:23565088
Real-time speech gisting for ATC applications
NASA Astrophysics Data System (ADS)
Dunkelberger, Kirk A.
1995-06-01
Command and control within the ATC environment remains primarily voice-based. Hence, automatic real-time, speaker-independent, continuous speech recognition (CSR) has many obvious applications and implied benefits to the ATC community: automated target tagging, aircraft compliance monitoring, controller training, automatic alarm disabling, display management, and many others. However, while current state-of-the-art CSR systems provide upwards of 98% word accuracy in laboratory environments, recent low-intrusion experiments in ATCT environments demonstrated less than 70% word accuracy in spite of significant investments in recognizer tuning. Acoustic channel irregularities and variations in controller/pilot grammar impact current CSR algorithms at their weakest points. It will be shown herein, however, that real-time context- and environment-sensitive gisting can provide key command phrase recognition rates of greater than 95% using the same low-intrusion approach. The combination of real-time inexact syntactic pattern recognition techniques and a tight integration of CSR, gisting, and ATC database accessor system components is the key to these high phrase recognition rates. A system concept for real-time gisting in the ATC context is presented herein. After establishing an application context, the discussion presents a minimal CSR technology context, then focuses on the gisting mechanism, desirable interfaces into the ATCT database environment, and data and control flow within the prototype system. Results of recent tests for a subset of the functionality are presented together with suggestions for further research.
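The paper does not spell out its matching algorithm, but the general flavor of inexact phrase spotting over noisy recognizer output can be sketched as follows; the command templates, tags, and fuzzy-match cutoff below are hypothetical, not the paper's actual grammar.

```python
# Generic sketch of inexact phrase spotting for gisting: a command template
# matches when its keywords occur in order in the recognizer output, with
# fuzzy per-word matching to absorb recognition errors.
import difflib

TEMPLATES = {
    "turn left heading": "HEADING_CHANGE",
    "descend and maintain": "ALTITUDE_CHANGE",
    "cleared to land": "LANDING_CLEARANCE",
}

def matches_in_order(words, keywords, cutoff=0.8):
    """True if each keyword fuzzily matches some later word, in order."""
    pos = 0
    for kw in keywords:
        hits = [i for i in range(pos, len(words))
                if difflib.SequenceMatcher(None, words[i], kw).ratio() >= cutoff]
        if not hits:
            return False
        pos = hits[0] + 1
    return True

def gist(utterance):
    words = utterance.lower().split()
    return [tag for phrase, tag in TEMPLATES.items()
            if matches_in_order(words, phrase.split())]

# "haeding" simulates a recognizer error that exact matching would miss.
print(gist("delta four two turn left haeding two seven zero"))
```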
Soer, Maggi; Pottas, Lidia
2015-01-01
Background The home language of most audiologists in South Africa is either English or Afrikaans, whereas most South Africans speak an African language as their home language. The use of an English wordlist, the South African Spondaic (SAS) wordlist, which is familiar to the English Second Language (ESL) population, was developed by the author for testing the speech recognition threshold (SRT) of ESL speakers. Objectives The aim of this study was to compare the pure-tone average (PTA)/SRT correlation results of ESL participants when using the SAS wordlist (list A) and the CID W-1 spondaic wordlist (list B – less familiar; list C – more familiar CID W-1 words). Method A mixed-group correlational, quantitative design was adopted. PTA and SRT measurements were compared for lists A, B and C for 101 (197 ears) ESL participants with normal hearing or a minimal hearing loss (<26 dBHL; mean age 33.3). Results The Pearson correlation analysis revealed a strong PTA/SRT correlation when using list A (right 0.65; left 0.58) and list C (right 0.63; left 0.56). The use of list B revealed weak correlations (right 0.30; left 0.32). Paired sample t-tests indicated a statistically significantly stronger PTA/SRT correlation when list A was used, rather than list B or list C, at a 95% level of confidence. Conclusions The use of the SAS wordlist yielded a stronger PTA/SRT correlation than the use of the CID W-1 wordlist, when performing SRT testing on South African ESL speakers with normal hearing, or minimal hearing loss (<26 dBHL). PMID:26304218
ERIC Educational Resources Information Center
Fox, Harrison
The speaker discusses Congressional program evaluation. From the Congressional perspective, good evaluators understand the political, social, and economic processes; are familiar with various evaluation methods; and know how to use authority and power within their roles. Program evaluation serves three major purposes: to anticipate social impact…
Orthographic neighborhood effects in recognition and recall tasks in a transparent orthography.
Justi, Francis R R; Jaeger, Antonio
2017-04-01
The number of orthographic neighbors of a word influences its probability of being retrieved in recognition and free recall memory tests. Even though this phenomenon is well demonstrated for English words, it has yet to be demonstrated for languages with more predictable grapheme-phoneme mappings than English. To address this issue, 4 experiments were conducted to investigate effects of number of orthographic neighbors (N) and effects of frequency of occurrence of orthographic neighbors (NF) on memory retrieval of Brazilian Portuguese words. One hundred twenty-four Brazilian Portuguese speakers performed first a lexical-decision task (LDT) on words that were factorially manipulated according to N and NF, and intermixed with either nonpronounceable nonwords without orthographic neighbors (Experiments 1A and 2A), or with pronounceable nonwords with a large number of orthographic neighbors (Experiments 1B and 2B). The words were later used as probes on either recognition (Experiments 1A and 1B) or recall tests (Experiments 2A and 2B). Words with 1 orthographic neighbor were consistently better remembered than words with several orthographic neighbors in all recognition and recall tests. Notably, whereas in Experiment 1A higher false alarm rates were yielded for words with several rather than 1 orthographic neighbor, in Experiment 1B higher false alarm rates were yielded for words with 1 rather than several orthographic neighbors. Effects of NF, on the other hand, were not consistent among memory tasks. The effects of N on the recognition and recall tests conducted here are interpreted in light of dual process models of recognition. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
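The N and NF variables manipulated here are straightforward to compute from a frequency-tagged lexicon. A minimal sketch with a toy lexicon, using the classic definition of Coltheart's N (same-length words differing by exactly one substituted letter):

```python
# Sketch of neighborhood measures: N is the count of one-letter-substitution
# neighbors in the lexicon; NF here is the mean frequency of those neighbors.
# The toy lexicon and frequencies are illustrative only.
def neighbors(word, lexicon):
    return [w for w in lexicon
            if len(w) == len(word) and w != word
            and sum(a != b for a, b in zip(w, word)) == 1]

lexicon = {"cat": 42.1, "cot": 7.3, "cut": 19.8, "car": 33.0,
           "bat": 12.5, "dog": 55.0}
n = neighbors("cat", lexicon)          # ['cot', 'cut', 'car', 'bat']
N = len(n)                             # neighborhood size
NF = sum(lexicon[w] for w in n) / N    # mean neighbor frequency
print(N, round(NF, 2))
```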
Kong, Anthony Pak-Hin; Whiteside, Janet; Bargmann, Peggy
2016-10-01
Discourse from speakers with dementia and aphasia is associated with comparable but not identical deficits, necessitating appropriate methods to differentiate them. The current study aims to validate the Main Concept Analysis (MCA) for eliciting and quantifying discourse among typical native English speakers, to establish its norms, and to investigate the validity and sensitivity of the MCA in comparing discourse produced by individuals with fluent aphasia, non-fluent aphasia, or dementia of Alzheimer's type (DAT), and unimpaired elderly. Discourse elicited through a sequential picture description task was collected from 60 unimpaired participants to determine the MCA scoring criteria; 12 speakers with fluent aphasia, 12 with non-fluent aphasia, 13 with DAT, and 20 elderly participants from the healthy group were compared on the finalized MCA. Results of MANOVA revealed significant univariate omnibus effects of speaker group as an independent variable on each main concept index. MCA profiles differed significantly between all participant groups except dementia versus fluent aphasia. Correlations between the MCA performances and the Western Aphasia Battery and Cognitive Linguistic Quick Test were found to be statistically significant among the clinical groups. The MCA proved appropriate for use with native speakers of English. The results also provided further empirical evidence of discourse deficits in aphasia and dementia. Practitioners can use the MCA to evaluate discourse production systematically and objectively.
Evolving Spiking Neural Networks for Recognition of Aged Voices.
Silva, Marco; Vellasco, Marley M B R; Cataldo, Edson
2017-01-01
The aging of the voice, known as presbyphonia, is a natural process that can cause great change in the vocal quality of the individual. This is a relevant problem for those who use their voices professionally, and its early identification can help determine a suitable treatment to avoid its progress or even to eliminate the problem. This work focuses on the development of a new model for the identification of aging voices (independently of their chronological age), using as input attributes parameters extracted from the voice and glottal signals. The proposed model, named Quantum binary-real evolving Spiking Neural Network (QbrSNN), is based on spiking neural networks (SNNs), with an unsupervised training algorithm, and a Quantum-Inspired Evolutionary Algorithm that automatically determines the most relevant attributes and the optimal parameters that configure the SNN. The QbrSNN model was evaluated on a database composed of 120 records, containing samples from three groups of speakers. The results obtained indicate that the proposed model provides better accuracy than other approaches, with fewer input attributes. Copyright © 2017 The Voice Foundation. Published by Elsevier Inc. All rights reserved.
Pruitt, John S; Jenkins, James J; Strange, Winifred
2006-03-01
Perception of second language speech sounds is influenced by one's first language. For example, speakers of American English have difficulty perceiving dental versus retroflex stop consonants in Hindi, although English has both dental and retroflex allophones of alveolar stops. Japanese, unlike English, has a contrast similar to Hindi's, specifically the Japanese /d/ versus the flapped /r/, which is sometimes produced as a retroflex. This study compared American and Japanese speakers' identification of the Hindi contrast in CV syllable contexts where C varied in voicing and aspiration. The study then evaluated the participants' improvement in identifying the distinction after training with a computer-interactive program. Training sessions progressively increased in difficulty by decreasing the extent of vowel truncation in stimuli and by adding new speakers. Although all participants improved significantly, Japanese participants were more accurate than Americans in distinguishing the contrast on pretest, during training, and on posttest. Transfer was observed to three new consonantal contexts, a new vowel context, and a new speaker's productions. Some abstract aspect of the contrast was apparently learned during training. It is suggested that allophonic experience with dental and retroflex stops may be detrimental to perception of the new contrast.
Validity of Single-Item Screening for Limited Health Literacy in English and Spanish Speakers.
Bishop, Wendy Pechero; Craddock Lee, Simon J; Skinner, Celette Sugg; Jones, Tiffany M; McCallister, Katharine; Tiro, Jasmin A
2016-05-01
To evaluate 3 single-item screening measures for limited health literacy in a community-based population of English and Spanish speakers. We recruited 324 English and 314 Spanish speakers from a community research registry in Dallas, Texas, enrolled between 2009 and 2012. We used 3 screening measures: (1) How would you rate your ability to read?; (2) How confident are you filling out medical forms by yourself?; and (3) How often do you have someone help you read hospital materials? In analyses stratified by language, we used area under the receiver operating characteristic (AUROC) curves to compare each item with the validated 40-item Short Test of Functional Health Literacy in Adults. For English speakers, no difference was seen among the items. For Spanish speakers, "ability to read" identified inadequate literacy better than "help reading hospital materials" (AUROC curve = 0.76 vs 0.65; P = .019). The "ability to read" item performed the best, supporting use as a screening tool in safety-net systems caring for diverse populations. Future studies should investigate how to implement brief measures in safety-net settings and whether highlighting health literacy level influences providers' communication practices and patient outcomes.
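The AUROC comparison against the 40-item S-TOFHLA criterion can be reproduced in outline as follows; the response data below are synthetic placeholders, and a formal comparison of two AUROCs (as in the reported P = .019) would additionally need a test such as DeLong's.

```python
# Sketch of scoring single-item screeners against a binary criterion with
# AUROC. Each item is an ordinal 1-5 rating; labels and ratings are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
inadequate = rng.integers(0, 2, size=200)              # 1 = inadequate literacy
ability_to_read = np.clip(inadequate * 1.5 + rng.normal(2.5, 1.0, 200), 1, 5)
help_reading = np.clip(inadequate * 0.7 + rng.normal(2.5, 1.2, 200), 1, 5)

for name, item in [("ability to read", ability_to_read),
                   ("help reading materials", help_reading)]:
    print(name, round(roc_auc_score(inadequate, item), 3))
```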
Evaluating acoustic speaker normalization algorithms: evidence from longitudinal child data.
Kohn, Mary Elizabeth; Farrington, Charlie
2012-03-01
Speaker vowel formant normalization, a technique that controls for variation introduced by physical differences between speakers, is necessary in variationist studies to compare speakers of different ages, genders, and physiological makeup in order to understand non-physiological variation patterns within populations. Many algorithms have been established to reduce variation introduced into vocalic data from physiological sources. The lack of real-time studies tracking the effectiveness of these normalization algorithms from childhood through adolescence inhibits exploration of child participation in vowel shifts. This analysis compares normalization techniques applied to data collected from ten African American children across five time points. Linear regressions compare the reduction in variation attributable to age and gender for each speaker for the vowels BEET, BAT, BOT, BUT, and BOAR. A normalization technique is successful if it maintains variation attributable to a reference sociolinguistic variable, while reducing variation attributable to age. Results indicate that normalization techniques which rely on both a measure of central tendency and range of the vowel space perform best at reducing variation attributable to age, although some variation attributable to age persists after normalization for some sections of the vowel space. © 2012 Acoustical Society of America
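One widely used member of the family the authors found most effective (a measure of central tendency plus a measure of vowel-space spread) is Lobanov's z-score normalization. A minimal sketch with hypothetical child formant data:

```python
# Lobanov normalization: rescale each speaker's formants by that speaker's
# own mean and standard deviation, so tokens from physically different
# speakers (or the same child at different ages) become comparable.
import numpy as np

def lobanov(formants):
    """formants: (n_tokens, 2) array of (F1, F2) in Hz for ONE speaker."""
    f = np.asarray(formants, dtype=float)
    return (f - f.mean(axis=0)) / f.std(axis=0)

child_t1 = [(950, 2400), (550, 1500), (400, 2800)]  # hypothetical tokens
child_t2 = [(820, 2100), (470, 1350), (350, 2500)]  # same child, older
print(lobanov(child_t1))
print(lobanov(child_t2))  # comparable z-scores despite a shorter vocal tract
```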
Are there too few women presenting at emergency medicine conferences?
Carley, Simon; Carden, Richard; Riley, Rebecca; May, Natalie; Hruska, Katrin; Beardsell, Iain; Johnston, Michelle; Body, Richard
2016-10-01
There is a perception that women are under-represented as speakers at emergency medicine (EM) conferences. We aimed to evaluate the ratio of male to female speakers and the proportion of presenting time by gender at major international EM conferences. Conference programmes of the major English-speaking EM conferences occurring from 2014 to 2015 were obtained. The number of presentations, the gender of the speaker and the duration of each presentation were recorded. We analysed eight major EM conferences. These included 2382 presentations, of which 29.9% (range 22.5%-40.9%) were given by women. In total, 56 104 min of presentations were analysed, of which 27.6% (range 21%-36.7%) were delivered by women. On average, presentations by women were 95 s shorter than presentations by men (21 min 25 s vs 23 min). Male speakers outnumber female speakers at major EM conferences. The reasons for this imbalance are likely complex and multifactorial and may reflect the gender imbalance within the specialty. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://www.bmj.com/company/products-services/rights-and-licensing/
Bonin, Patrick; Guillemard-Tsaparina, Diana; Méot, Alain
2013-09-01
We report object-naming and object recognition times collected from Russian native speakers for the colorized version of the Snodgrass and Vanderwart (Journal of Experimental Psychology: Human Learning and Memory 6:174-215, 1980) pictures (Rossion & Pourtois, Perception 33:217-236, 2004). New norms for image variability, body-object interaction [BOI], and subjective frequency collected in Russian, as well as new name agreement scores for the colorized pictures in French, are also reported. In both object-naming and object comprehension times, the name agreement, image agreement, and age-of-acquisition variables made significant independent contributions. Objective word frequency was reliable in object-naming latencies only. The variables of image variability, BOI, and subjective frequency were not significant in either object naming or object comprehension. Finally, imageability was reliable in both tasks. The new norms and object-naming and object recognition times are provided as supplemental materials.
Second Language Ability and Emotional Prosody Perception
Bhatara, Anjali; Laukka, Petri; Boll-Avetisyan, Natalie; Granjon, Lionel; Anger Elfenbein, Hillary; Bänziger, Tanja
2016-01-01
The present study examines the effect of language experience on vocal emotion perception in a second language. Native speakers of French with varying levels of self-reported English ability were asked to identify emotions from vocal expressions produced by American actors in a forced-choice task, and to rate their pleasantness, power, alertness and intensity on continuous scales. Stimuli included emotionally expressive English speech (emotional prosody) and non-linguistic vocalizations (affect bursts), and a baseline condition with Swiss-French pseudo-speech. Results revealed effects of English ability on the recognition of emotions in English speech but not in non-linguistic vocalizations. Specifically, higher English ability was associated with less accurate identification of positive emotions, but not with the interpretation of negative emotions. Moreover, higher English ability was associated with lower ratings of pleasantness and power, again only for emotional prosody. This suggests that second language skills may sometimes interfere with emotion recognition from speech prosody, particularly for positive emotions. PMID:27253326
The role of voice input for human-machine communication.
Cohen, P R; Oviatt, S L
1995-01-01
Optimism is growing that the near future will witness rapid growth in human-computer interaction using voice. System prototypes have recently been built that demonstrate speaker-independent, real-time speech recognition and understanding of naturally spoken utterances with vocabularies of 1000 to 2000 words and larger. Already, computer manufacturers are building speech recognition subsystems into their new product lines. However, before this technology can be broadly useful, a substantial knowledge base is needed about human spoken language and performance during computer-based spoken interaction. This paper reviews application areas in which spoken interaction can play a significant role, assesses potential benefits of spoken interaction with machines, and compares voice with other modalities of human-computer interaction. It also discusses information that will be needed to build a firm empirical foundation for the design of future spoken and multimodal interfaces. Finally, it argues for a more systematic and scientific approach to investigating spoken input and performance with future language technology. PMID:7479803
Implementation of the Intelligent Voice System for Kazakh
NASA Astrophysics Data System (ADS)
Yessenbayev, Zh; Saparkhojayev, N.; Tibeyev, T.
2014-04-01
Modern speech technologies are highly advanced and widely used in day-to-day applications. However, this mostly concerns the languages of well-developed countries such as English, German, Japanese, Russian, etc. For Kazakh, the field is far less developed and research is only starting to evolve. In this research and application-oriented project, we introduce an intelligent voice system for the fast deployment of call-centers and information desks supporting Kazakh speech. The demand for such a system is obvious given the country's large territory and small population: landline and cell phones are often the only means of communication for distant villages and suburbs. The system features Kazakh speech recognition and synthesis modules as well as a web GUI for efficient dialog management. For speech recognition we use the CMU Sphinx engine, and for speech synthesis, MaryTTS. The web GUI is implemented in Java, enabling operators to quickly create and manage dialogs in a user-friendly graphical environment. The call routines are handled by Asterisk PBX and JBoss Application Server. The system supports such technologies and protocols as VoIP, VoiceXML, FastAGI, Java SpeechAPI, and J2EE. For the speech recognition experiments we compiled and used the first Kazakh speech corpus, with utterances from 169 native speakers. The performance of the speech recognizer is 4.1% WER on isolated word recognition and 6.9% WER on clean continuous speech recognition tasks. The speech synthesis experiments include the training of male and female voices.
Nasalance measures in Cantonese-speaking women.
Whitehill, T L
2001-03-01
To establish and evaluate stimulus materials for nasalance measurement in Cantonese speakers, to provide normative data for Cantonese-speaking women, and to evaluate session-to-session reliability of nasalance measures. One hundred forty-one Cantonese-speaking women with normal resonance who were students in the Department of Speech and Hearing Sciences, University of Hong Kong. Participants read aloud four speech stimuli: oral sentences, nasal sentences, an oral paragraph (similar to the Zoo Passage), and an oral-nasal paragraph (similar to the Rainbow Passage). Data were collected and analyzed using the Kay Nasometer 6200. Data collection was repeated for a subgroup of speakers (n = 28) on a separate day. Nasalance materials were evaluated by using statistical tests of difference and correlation. Group mean (standard deviation) nasalance scores for oral sentences, nasal sentences, oral paragraph, and oral-nasal paragraph were 16.79 (5.99), 55.67 (7.38), 13.68 (7.16), and 35.46 (6.22), respectively. There was a significant difference in mean nasalance scores for oral versus nasal materials. Correlations between stimuli were as expected, ranging from 0.43 to 0.91. Session-to-session reliability was within 5 points for over 95% of speakers for the oral stimuli but for less than 76% of speakers for the nasal and oral-nasal stimuli. Standard nasalance materials have been developed for Cantonese, and normative data have been established for Cantonese women. Evaluation of materials indicated acceptable differentiation between oral and nasal materials. Two stimuli (nasal sentences and oral paragraph) are recommended for future use. Comparison with findings from other languages showed similarities in scores; possible language-specific differences are discussed. Session-to-session reliability was poorer for nasal than oral stimuli.
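The nasalance score reported by the Nasometer is the ratio of nasal to total (nasal plus oral) acoustic energy, expressed as a percentage. A minimal sketch computing it from two-channel RMS levels, with placeholder signals:

```python
# Standard nasalance definition: nasalance (%) = nasal energy / (nasal + oral
# energy) x 100, computed here from RMS of the two microphone channels.
import numpy as np

def nasalance(nasal, oral):
    """nasal, oral: same-length sample arrays from the two channels."""
    n = np.sqrt(np.mean(np.square(nasal, dtype=float)))
    o = np.sqrt(np.mean(np.square(oral, dtype=float)))
    return 100.0 * n / (n + o)

rng = np.random.default_rng(2)
oral_ch = rng.normal(0, 1.0, 16000)    # placeholder oral-channel signal
nasal_ch = rng.normal(0, 0.2, 16000)   # low nasal energy -> low nasalance
print(round(nasalance(nasal_ch, oral_ch), 1))  # roughly 17 for these levels
```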
How shared reality is created in interpersonal communication.
Echterhoff, Gerald; Schmalbach, Bjarne
2017-12-29
Communication is a key arena and means for shared-reality creation. Most studies explicitly devoted to shared reality have focused on the opening part of a conversation, that is, a speaker's initial message to an audience. The aspect of communication examined by this research is the evaluative adaptation (tuning) of the messages to the audience's attitude or judgment. The speaker's shared-reality creation is typically assessed by the extent to which the speaker's evaluative representation of the topic matches the audience-tuned view expressed in the message. We first review research on such audience-tuning effects, with a focus on shared-reality goals and conditions facilitating the generalization of shared reality. We then review studies using other paradigms that illustrate factors of shared-reality creation in communication, including mere message production, grounding, validation responses, and communication about commonly known information (including stereotypes) in intragroup communication. The different lines of research reveal the potency, but also boundary conditions, of communication effects on shared reality. Copyright © 2017. Published by Elsevier Ltd.
Arctic Visiting Speakers Series (AVS)
NASA Astrophysics Data System (ADS)
Fox, S. E.; Griswold, J.
2011-12-01
The Arctic Visiting Speakers (AVS) Series funds researchers and other arctic experts to travel and share their knowledge in communities where they might not otherwise connect. Speakers cover a wide range of arctic research topics and can address a variety of audiences including K-12 students, graduate and undergraduate students, and the general public. Host applications are accepted on an on-going basis, depending on funding availability. Applications need to be submitted at least 1 month prior to the expected tour dates. Interested hosts can choose speakers from an online Speakers Bureau or invite a speaker of their choice. Preference is given to individuals and organizations hosting speakers that reach a broad audience and the general public. AVS tours are encouraged to span several days, allowing ample time for interactions with faculty, students, local media, and community members. Applications for both domestic and international visits will be considered. Applications for international visits should involve participation of more than one host organization and must include either a US-based speaker or a US-based organization. This is a small but important program that educates the public about Arctic issues. There have been 27 tours since 2007 that have impacted communities across the globe, including: Gatineau, Quebec, Canada; St. Petersburg, Russia; Piscataway, New Jersey; Cordova, Alaska; Nuuk, Greenland; Elizabethtown, Pennsylvania; Oslo, Norway; Inari, Finland; Borgarnes, Iceland; San Francisco, California; and Wolcott, Vermont, to name a few. Tours have included lectures to K-12 schools, college and university students, tribal organizations, Boy Scout troops, science center and museum patrons, and the general public. With approximately 300 attendees at each AVS tour, roughly 4100 people have been reached since 2007. The expectations for each tour are extremely manageable. Hosts must submit a schedule of events and a tour summary to be posted online. Hosts must acknowledge the National Science Foundation Office of Polar Programs and ARCUS in all promotional materials. The host agrees to send ARCUS photographs, fliers, and, if possible, a video of the main lecture. Host and speaker agree to collect data on the number of attendees in each audience to submit as part of a post-tour evaluation. The grants can generally cover all the expenses of a tour, depending on the location. A maximum of $2,000 will be provided for the travel-related expenses of a speaker on a domestic visit. A maximum of $2,500 will be provided for the travel-related expenses of a speaker on an international visit. Each speaker will receive an honorarium of $300.
Design of a digital voice data compression technique for orbiter voice channels
NASA Technical Reports Server (NTRS)
1975-01-01
Candidate techniques were investigated for digital voice compression to a transmission rate of 8 kbps. Good voice quality, speaker recognition, and robustness in the presence of error bursts were considered. The technique of delayed-decision adaptive predictive coding is described and compared with conventional adaptive predictive coding. Results include a set of experimental simulations recorded on analog tape. The two FM broadcast segments produced show the delayed-decision technique to be virtually undegraded or minimally degraded at .001 and .01 Viterbi decoder bit error rates. Preliminary estimates of the hardware complexity of this technique indicate potential for implementation in space shuttle orbiters.
2004-03-12
KENNEDY SPACE CENTER, FLA. - Florida Gov. Jeb Bush (left) and Center Director Jim Kennedy enjoy a humorous break at the luncheon for the 2004 Florida Regional FIRST competition held at the University of Central Florida. Both are featured speakers. The event hosted 41 teams from Canada, Brazil, Great Britain and the United States. FIRST is a nonprofit organization, For Inspiration and Recognition of Science and Technology, that sponsors the event pitting gladiator robots against each other in an athletic-style competition. The FIRST robotics competition is designed to provide students with a hands-on, inside look at engineering and other professional careers, pairing high school students with engineer mentors and corporations.
National Test and Evaluation Conference (26th)
2010-03-04
[Conference-program extraction residue. Recoverable content: Operations Division, Office of the Chief of Naval Operations (OPNAV N41); luncheon speaker BrigGen Mike Dana, USMC, Director of Logistics & Engineering, J4, NORAD and USNORTHCOM; Col Alex Vohr, USMC, Director of Logistics, J4; session "What are the Core Elements of a Curriculum on Contemporary Strategy, and What are the Best Methods of Teaching Them?" with Dr Richard Betts, Arnold A. Saltzman...]
Facial Expression Generation from Speaker's Emotional States in Daily Conversation
NASA Astrophysics Data System (ADS)
Mori, Hiroki; Ohshima, Koh
A framework for generating facial expressions from emotional states in daily conversation is described. It provides a mapping between emotional states and facial expressions, where the former are represented by vectors with psychologically defined abstract dimensions, and the latter are coded by the Facial Action Coding System. In order to obtain the mapping, parallel data with rated emotional states and facial expressions were collected for utterances of a female speaker, and a neural network was trained with the data. The effectiveness of the proposed method is verified by a subjective evaluation test. As a result, the Mean Opinion Score with respect to the suitability of the generated facial expressions was 3.86 for the speaker, which was close to that of hand-made facial expressions.
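The mapping itself can be sketched as a small multi-output regressor from emotion dimensions to action-unit intensities; the dimensions, the three action units, and the training data below are illustrative placeholders, not the paper's corpus or network.

```python
# Sketch of an emotion-to-FACS mapping: a small neural network regresses
# Facial Action Coding System action-unit (AU) intensities from abstract
# emotion dimensions such as valence and arousal.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
emotions = rng.uniform(-1, 1, size=(500, 2))        # (valence, arousal)
# Toy targets: AU12 (lip corner puller) tracks positive valence, AU4 (brow
# lowerer) tracks negative valence, AU5 (upper lid raiser) tracks arousal.
aus = np.column_stack([np.maximum(emotions[:, 0], 0),
                       np.maximum(-emotions[:, 0], 0),
                       np.maximum(emotions[:, 1], 0)])

net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=3000, random_state=0)
net.fit(emotions, aus)
print(net.predict([[0.8, 0.3]]).round(2))  # happy, mildly aroused state
```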
Bottalico, Pasquale; Graetzer, Simone; Hunter, Eric J.
2015-01-01
Speakers adjust their vocal effort when communicating in different room acoustic and noise conditions and when instructed to speak at different volumes. The present paper reports on the effects of voice style, noise level, and acoustic feedback on vocal effort, evaluated as sound pressure level, and self-reported vocal fatigue, comfort, and control. Speakers increased their level in the presence of babble and when instructed to talk in a loud style, and lowered it when acoustic feedback was increased and when talking in a soft style. Self-reported responses indicated a preference for the normal style without babble noise. PMID:26723357
English vowel learning by speakers of Mandarin
NASA Astrophysics Data System (ADS)
Thomson, Ron I.
2005-04-01
One of the most influential models of second language (L2) speech perception and production [Flege, Speech Perception and Linguistic Experience (York, Baltimore, 1995) pp. 233-277] argues that during initial stages of L2 acquisition, perceptual categories sharing the same or nearly the same acoustic space as first language (L1) categories will be processed as members of that L1 category. Previous research has generally been limited to testing these claims on binary L2 contrasts, rather than larger portions of the perceptual space. This study examines the development of 10 English vowel categories by 20 Mandarin L1 learners of English. Imitations of English vowel stimuli by these learners were recorded at six data collection points over the course of one year. Using a statistical pattern recognition model, these productions were then assessed against native speaker norms. The degree to which the learners' perception/production shifted toward the target English vowels and the degree to which they matched L1 categories in ways predicted by theoretical models are discussed. The results of this experiment suggest that previous claims about perceptual assimilation of L2 categories to L1 categories may be too strong.
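The statistical pattern recognition model used to assess productions against native norms is not specified here, but a common instantiation is a per-category Gaussian classifier over (F1, F2). A sketch with hypothetical native tokens:

```python
# Sketch of assessing learner vowels against native norms: fit a Gaussian to
# native (F1, F2) tokens per vowel, then assign each learner token to the
# most likely category. Formant values are illustrative placeholders.
import numpy as np
from scipy.stats import multivariate_normal

native = {                      # hypothetical native (F1, F2) tokens in Hz
    "i": np.array([[280, 2250], [300, 2300], [270, 2400]]),
    "e": np.array([[480, 2000], [500, 1900], [460, 2100]]),
    "ae": np.array([[690, 1650], [710, 1700], [660, 1600]]),
}

# Small ridge added to each covariance for numerical stability.
models = {v: multivariate_normal(t.mean(axis=0), np.cov(t.T) + 100 * np.eye(2))
          for v, t in native.items()}

def classify(token):
    """Return the native vowel category whose Gaussian best fits the token."""
    return max(models, key=lambda v: models[v].logpdf(token))

print(classify([320, 2200]))   # learner token judged closest to native /i/
```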
Voice recognition through phonetic features with Punjabi utterances
NASA Astrophysics Data System (ADS)
Kaur, Jasdeep; Juglan, K. C.; Sharma, Vishal; Upadhyay, R. K.
2017-07-01
This paper deals with perception and disorders of speech in view of the Punjabi language. Given the importance of voice identification, various parameters of speaker identification have been studied. The speech material was recorded with a tape recorder in normal and disguised modes of utterance. Of the recorded speech materials, the utterances free from noise, etc., were selected for auditory and acoustic spectrographic analysis. The comparison of normal and disguised speech of seven subjects is reported. The fundamental frequency (F0) at similar places, plosive duration at certain phonemes, amplitude ratio (A1:A2), etc., were compared in normal and disguised speech. It was found that the formant frequency of normal and disguised speech remains almost similar only if it is compared at the position of the same vowel quality and quantity. If the vowel is more closed or more open in the disguised utterance, the formant frequency will change in comparison to the normal utterance. The amplitude ratio (A1:A2) is found to be speaker dependent and remains unchanged in the disguised utterance. However, this value may shift in the disguised utterance if cross-sectioning is not done at the same location.
Computer-Mediated Assessment of Intelligibility in Aphasia and Apraxia of Speech
Haley, Katarina L.; Roth, Heidi; Grindstaff, Enetta; Jacks, Adam
2011-01-01
Background Previous work indicates that single word intelligibility tests developed for dysarthria are sensitive to segmental production errors in aphasic individuals with and without apraxia of speech. However, potential listener learning effects and difficulties adapting elicitation procedures to coexisting language impairments limit their applicability to left hemisphere stroke survivors. Aims The main purpose of this study was to examine basic psychometric properties for a new monosyllabic intelligibility test developed for individuals with aphasia and/or AOS. A related purpose was to examine clinical feasibility and potential to standardize a computer-mediated administration approach. Methods & Procedures A 600-item monosyllabic single word intelligibility test was constructed by assembling sets of phonetically similar words. Custom software was used to select 50 target words from this test in a pseudo-random fashion and to elicit and record production of these words by 23 speakers with aphasia and 20 neurologically healthy participants. To evaluate test-retest reliability, two identical sets of 50-word lists were elicited by requesting repetition after a live speaker model. To examine the effect of a different word set and auditory model, an additional set of 50 different words was elicited with a pre-recorded model. The recorded words were presented to normal-hearing listeners for identification via orthographic and multiple-choice response formats. To examine construct validity, production accuracy for each speaker was estimated via phonetic transcription and rating of overall articulation. Outcomes & Results Recording and listening tasks were completed in less than six minutes for all speakers and listeners. Aphasic speakers were significantly less intelligible than neurologically healthy speakers and displayed a wide range of intelligibility scores. Test-retest and inter-listener reliability estimates were strong. No significant difference was found in scores based on recordings from a live model versus a pre-recorded model, but some individual speakers favored the live model. Intelligibility test scores correlated highly with segmental accuracy derived from broad phonetic transcription of the same speech sample and a motor speech evaluation. Scores correlated moderately with rated articulation difficulty. Conclusions We describe a computerized, single-word intelligibility test that yields clinically feasible, reliable, and valid measures of segmental speech production in adults with aphasia. This tool can be used in clinical research to facilitate appropriate participant selection and to establish matching across comparison groups. For a majority of speakers, elicitation procedures can be standardized by using a pre-recorded auditory model for repetition. This assessment tool has potential utility for both clinical assessment and outcomes research. PMID:22215933
NASA Astrophysics Data System (ADS)
Peng, Bo; Zheng, Sifa; Liao, Xiangning; Lian, Xiaomin
2018-03-01
In order to achieve sound field reproduction in a wide frequency band, multiple-type speakers are used. The reproduction accuracy is not only affected by the signals sent to the speakers, but also depends on the position and the number of each type of speaker. The method of optimizing a mixed speaker array is investigated in this paper. A virtual-speaker weighting method is proposed to optimize both the position and the number of each type of speaker. In this method, a virtual-speaker model is proposed to quantify the increment of controllability of the speaker array when the speaker number increases. While optimizing a mixed speaker array, the gain of the virtual-speaker transfer function is used to determine the priority orders of the candidate speaker positions, which optimizes the position of each type of speaker. Then the relative gain of the virtual-speaker transfer function is used to determine whether the speakers are redundant, which optimizes the number of each type of speaker. Finally the virtual-speaker weighting method is verified by reproduction experiments of the interior sound field in a passenger car. The results validate that the optimum mixed speaker array can be obtained using the proposed method.
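Read loosely, the selection logic described above is a gain-ranked greedy procedure with a redundancy cutoff. The sketch below is an interpretation under that reading; the gain values, candidate positions, and threshold are placeholders, not the paper's virtual-speaker model.

```python
# Loose sketch of virtual-speaker-weighted array selection: candidate
# positions for one speaker type are ranked by virtual-speaker gain, and
# speakers are added until the next one's relative gain falls below a
# threshold (i.e., it is judged redundant).
def choose_positions(candidate_gains, rel_gain_threshold=0.05):
    """candidate_gains: {position: virtual-speaker gain} for one speaker type."""
    order = sorted(candidate_gains, key=candidate_gains.get, reverse=True)
    chosen, total = [], 0.0
    for pos in order:
        g = candidate_gains[pos]
        if total > 0 and g / total < rel_gain_threshold:
            break                      # further speakers are redundant
        chosen.append(pos)
        total += g
    return chosen

woofer_candidates = {"front-left": 1.00, "front-right": 0.95,
                     "rear-deck": 0.40, "door-sill": 0.03}
print(choose_positions(woofer_candidates))  # door-sill judged redundant
```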
Systematic assessment of noise amplitude generated by toys intended for young children.
Mahboubi, Hossein; Oliaei, Sepehr; Badran, Karam W; Ziai, Kasra; Chang, Janice; Zardouz, Shawn; Shahriari, Shawn; Djalilian, Hamid R
2013-06-01
To systematically evaluate the noise generated by toys targeted for children and to compare the results over the course of 4 consecutive holiday shopping seasons. Experimental study. Academic medical center. During 2008-2011, more than 200 toys marketed for children older than 6 months were screened for loudness. The toys with sound output of more than 80 dBA at speaker level were retested in a soundproof audiometry booth. The generated sound amplitude of each toy was measured at speaker level and at 30 cm away from the speaker. Ninety different toys were analyzed. The mean (SD) noise amplitude was 100 (8) dBA (range, 80-121 dBA) at the speaker level and 80 (11) dBA (range, 60-109 dBA) at 30 cm away from the speaker. Eighty-eight (98%) had more than an 85-dBA noise amplitude at speaker level, whereas 19 (26%) had more than an 85-dBA noise amplitude at a 30-cm distance. Only the mean noise amplitude at 30 cm significantly declined during the studied period (P < .001). There was no significant difference in mean noise amplitude of different toys specified for different age groups. Our findings demonstrate the persistence of extremely loud toys marketed for very young children. Acoustic trauma from toys remains a potential risk factor for noise-induced hearing loss in this age group, warranting promotion of public awareness and regulatory considerations for manufacture and marketing of toys.
Which Features of Spanish Learners' Pronunciation Most Impact Listener Evaluations?
ERIC Educational Resources Information Center
McBride, Kara
2015-01-01
This study explores which features of Spanish as a foreign language (SFL) pronunciation most impact raters' evaluations. Native Spanish speakers (NSSs) from regions with different pronunciation norms were polled: 147 evaluators from northern Mexico and 99 evaluators from central Argentina. These evaluations were contrasted with ratings from…
Shuai, Lan; Malins, Jeffrey G
2017-02-01
Despite its standing as one of the most influential models of spoken word recognition, the TRACE model has yet to be extended to tonal languages such as Mandarin Chinese. A key reason for this is that the model in its current state does not encode lexical tone. In this report, we present a modified version of the jTRACE model in which we built on its existing architecture to code for Mandarin phonemes and tones. Units are coded in a way that is meant to capture the similarity in timing of access to vowel and tone information that has been observed in previous studies of Mandarin spoken word recognition. We validated the model by first simulating a recent experiment that had used the visual world paradigm to investigate how native Mandarin speakers process monosyllabic Mandarin words (Malins & Joanisse, 2010). We then simulated two psycholinguistic phenomena: (1) differences in the timing of resolution of tonal contrast pairs, and (2) the interaction between syllable frequency and tonal probability. In all cases, the model gave rise to results comparable to those of published data with human subjects, suggesting that it is a viable working model of spoken word recognition in Mandarin. It is our hope that this tool will be of use to practitioners studying the psycholinguistics of Mandarin Chinese and will help inspire similar models for other tonal languages, such as Cantonese and Thai.
Reviewing the connection between speech and obstructive sleep apnea.
Espinoza-Cuadros, Fernando; Fernández-Pozo, Rubén; Toledano, Doroteo T; Alcázar-Ramírez, José D; López-Gonzalo, Eduardo; Hernández-Gómez, Luis A
2016-02-20
Obstructive sleep apnea (OSA) is a common sleep disorder characterized by recurring breathing pauses during sleep caused by a blockage of the upper airway (UA). The altered UA structure or function in OSA speakers has motivated the hypothesis that speech can be analyzed automatically for OSA assessment. In this paper we critically review several approaches using speech analysis and machine learning techniques for OSA detection, and discuss the limitations that can arise when using machine learning techniques for diagnostic applications. A large speech database including 426 male Spanish speakers suspected to suffer from OSA and referred to a sleep disorders unit was used to study the clinical validity of several proposals using machine learning techniques to predict the apnea-hypopnea index (AHI) or classify individuals according to their OSA severity. AHI describes the severity of the patient's condition. We first evaluate AHI prediction using state-of-the-art speaker recognition technologies: speech spectral information is modelled using supervector or i-vector techniques, and AHI is predicted through support vector regression (SVR). Using the same database we then critically review several OSA classification approaches previously proposed. The influence and possible interference of other clinical variables or characteristics available for our OSA population (age, height, weight, body mass index, and cervical perimeter) are also studied. The poor results obtained when estimating AHI using supervectors or i-vectors followed by SVR contrast with the positive results reported by previous research. This fact prompted us to a careful review of those approaches, also testing some reported results on our database. Several methodological limitations and deficiencies were detected that may have led to overoptimistic results. The methodological deficiencies observed after critically reviewing previous research can be relevant examples of potential pitfalls when using machine learning techniques for diagnostic applications. We have found two common limitations that can explain the likelihood of false discovery in previous research: (1) the use of prediction models derived from sources, such as speech, that are also correlated with other patient characteristics (age, height, sex, …) that act as confounding factors; and (2) overfitting of feature selection and validation methods when working with a high number of variables compared to the number of cases. We hope this study will not only serve as a useful example of relevant issues when using machine learning for medical diagnosis, but will also help guide further research on the connection between speech and OSA.
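The reviewed regression setup (fixed-length utterance embeddings fed to SVR, scored with cross-validation) can be outlined as follows; the embeddings and AHI values are random placeholders, which is also why the sketch should score near zero, echoing the authors' caution about overoptimistic results.

```python
# Sketch of AHI regression from per-speaker embeddings (supervectors or
# i-vectors in the original work; random placeholders here), with
# cross-validation to guard against the overfitting pitfalls described above.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.normal(size=(426, 400))        # per-speaker embeddings (placeholder)
ahi = rng.gamma(2.0, 10.0, size=426)   # placeholder apnea-hypopnea indices

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
scores = cross_val_score(model, X, ahi, cv=5, scoring="r2")
print(scores.round(3))  # near-zero r2 expected: features are uninformative
```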
Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation
Banks, Briony; Gowen, Emma; Munro, Kevin J.; Adank, Patti
2015-01-01
Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker’s facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants’ eye gaze was recorded to verify that they looked at the speaker’s face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation. PMID:26283946
Qi, Beier; Liu, Bo; Liu, Sha; Liu, Haihong; Dong, Ruijuan; Zhang, Ning; Gong, Shusheng
2011-05-01
To study the effect of cochlear electrode coverage and insertion region on speech recognition, especially tone perception, in cochlear implant (CI) users whose native language is Mandarin Chinese. Seven test conditions were set with the fitting software; each condition was created by switching the respective channels on or off to simulate different insertion positions. Mandarin-speaking CI users then completed four speech tests: a Vowel Identification test, a Consonant Identification test, a Tone Identification test (male speaker), and the Mandarin HINT test (SRS) in quiet and in noise. Across test conditions, the average vowel identification score differed significantly, ranging from 56% to 91% (rank sum test, P < 0.05). The average consonant identification score also differed significantly, from 72% to 85% (ANOVA, P < 0.05). The average tone identification score did not differ significantly (ANOVA, P > 0.05); however, the more channels that were activated, the higher the scores obtained, from 68% to 81%. This study shows a correlation between insertion depth and speech recognition. Because all parts of the basilar membrane can help CI users improve their speech recognition, it is important to enhance the verbal communication and social interaction abilities of CI users by increasing insertion depth and actively stimulating the apical region of the cochlea.
Potgieter, Jenni-Marí; Swanepoel, De Wet; Myburgh, Hermanus Carel; Hopper, Thomas Christopher; Smits, Cas
2015-07-01
The objective of this study was to develop and validate a smartphone-based digits-in-noise hearing test for South African English. Single digits (0-9) were recorded and spoken by a first-language English female speaker. Level corrections were applied to create a set of homogeneous digits with steep speech recognition functions. A smartphone application was created that uses 120 digit-triplets in noise as test material. An adaptive test procedure determined the speech reception threshold (SRT). Experiments were performed to determine headphone effects on the SRT and to establish normative data. Participants consisted of 40 normal-hearing subjects with thresholds ≤15 dB across the frequency spectrum (250-8000 Hz) and 186 subjects with normal hearing in both ears, or normal hearing in the better ear. The results show steep speech recognition functions with a slope of 20%/dB for digit-triplets presented in noise using the smartphone application. The results for five headphone types indicate that the smartphone-based hearing test is reliable and can be conducted using standard Android smartphone headphones or clinical headphones. A digits-in-noise hearing test was developed and validated for South Africa. The mean SRT and speech recognition functions correspond to previously developed telephone-based digits-in-noise tests.
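The adaptive procedure at the core of such digits-in-noise tests can be sketched as a simple 1-up/1-down staircase; whether the South African test uses exactly this rule is not stated here, so the step size, trial count, and simulated listener below are assumptions.

```python
# Minimal sketch of a 1-up/1-down adaptive SRT track for digit triplets.
import math
import random

def srt_staircase(trials=24, start_snr=0.0, step=2.0, true_srt=-10.0, slope=0.20):
    """Simulate an adaptive digits-in-noise track. The listener is modeled by
    a logistic psychometric function with the 20%/dB midpoint slope reported
    in the paper; all other numbers are illustrative assumptions."""
    snr, track = start_snr, []
    for _ in range(trials):
        p_correct = 1.0 / (1.0 + math.exp(-4.0 * slope * (snr - true_srt)))
        correct = random.random() < p_correct
        track.append(snr)
        snr += -step if correct else step  # harder after a hit, easier after a miss
    return sum(track[-10:]) / 10.0         # SRT estimate: mean SNR of last trials

print("estimated SRT (dB SNR):", round(srt_staircase(), 1))
```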
Molineaux, Benjamin J
2017-03-01
Today, virtually all speakers of Mapudungun (formerly Araucanian), an endangered language of Chile and Argentina, are bilingual in Spanish. As a result, the firmness of native speaker intuitions, especially regarding perceptually complex issues such as word stress, has been called into question. Even though native intuitions are unavoidable in the investigation of stress position, efforts can be made to clarify what the actual sources of the intuitions are, and how consistent and 'native' they remain given the language's asymmetrical contact conditions. In this article, the use of non-native speaker intuitions is proposed as a valid means of assessing the position of stress in Mapudungun and evaluating whether it represents the unchanged, 'native' pattern. The alternative, of course, is that the patterns that present variability simply result from overlap of the bilingual speakers' phonological modules, hence displaying a contact-induced innovation. A forced-decision perception task is reported, showing that native and non-native perception of Mapudungun stress converges across speakers of six separate first languages, thus lending greater reliability to native judgements. The relative difference in the perception of Mapudungun stress between Spanish monolinguals and Mapudungun-Spanish bilinguals is also taken to support the diachronic maintenance of the endangered language's stress system.
Formal implementation of a performance evaluation model for the face recognition system.
Shin, Yong-Nyuo; Kim, Jason; Lee, Yong-Jun; Shin, Woochang; Choi, Jin-Young
2008-01-01
Due to its usability, practical applications, and lack of intrusiveness, face recognition technology, based on information derived from individuals' facial features, has recently been attracting considerable attention. Reported recognition rates of commercialized face recognition systems cannot be accepted as official recognition rates, as they are based on assumptions that are beneficial to the specific system and face database. Therefore, performance evaluation methods and tools are necessary to objectively measure the accuracy and performance of any face recognition system. In this paper, we propose and formalize a performance evaluation model for biometric recognition systems and implement an evaluation tool for face recognition systems based on the proposed model. Furthermore, we performed objective evaluations by providing guidelines for the design and implementation of a performance evaluation system and by formalizing the performance test process.
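An objective evaluation of the kind the paper formalizes typically reports error rates at a decision threshold. The sketch below computes false accept and false reject rates from match scores; the score distributions are invented placeholders, not outputs of any particular system or of the authors' tool.

```python
# Hypothetical sketch: FAR/FRR computation for a recognition system.
import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    """False accept rate: impostors scoring at/above threshold.
    False reject rate: genuine pairs scoring below threshold."""
    far = np.mean(np.asarray(impostor_scores) >= threshold)
    frr = np.mean(np.asarray(genuine_scores) < threshold)
    return far, frr

rng = np.random.default_rng(1)
genuine = rng.normal(0.8, 0.1, 1000)    # placeholder similarity scores
impostor = rng.normal(0.4, 0.1, 1000)
for t in (0.5, 0.6, 0.7):
    far, frr = far_frr(genuine, impostor, t)
    print(f"threshold={t}: FAR={far:.3f}, FRR={frr:.3f}")
```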
Expression transmission using exaggerated animation for Elfoid
Hori, Maiya; Tsuruda, Yu; Yoshimura, Hiroki; Iwai, Yoshio
2015-01-01
We propose an expression transmission system using a cellular-phone-type teleoperated robot called Elfoid. Elfoid has a soft exterior that provides the look and feel of human skin and is designed to transmit the speaker's presence to a communication partner using a camera and microphone. To transmit the speaker's presence, Elfoid sends not only the speaker's voice but also the facial expression captured by the camera. In this research, facial expressions are recognized using a machine learning technique. Elfoid cannot, however, display facial expressions itself because of its compactness and the lack of sufficiently small actuator motors. To overcome this problem, facial expressions are displayed using Elfoid's head-mounted mobile projector. In an experiment, we built a prototype system and experimentally evaluated its subjective usability. PMID:26347686
GMM-based speaker age and gender classification in Czech and Slovak
NASA Astrophysics Data System (ADS)
Přibil, Jiří; Přibilová, Anna; Matoušek, Jindřich
2017-01-01
The paper describes an experiment using Gaussian mixture models (GMM) for automatic classification of speaker age and gender. It analyses and compares the influence of different numbers of mixtures and different types of speech features used for GMM gender/age classification. The dependence of computational complexity on the number of mixtures is also analysed. Finally, the GMM classification accuracy is compared with the output of conventional listening tests; the results of these objective and subjective evaluations are in agreement.
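A minimal sketch of the classification scheme described: one GMM is trained per age/gender class, and a test utterance is assigned to the class whose model gives the highest average frame log-likelihood. The feature vectors below are random stand-ins for the spectral features the paper compares, and the class set and mixture count are illustrative assumptions.

```python
# Hypothetical sketch: GMM-based age/gender classification by max likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
classes = ["male_adult", "female_adult", "child"]          # assumed class set
train = {c: rng.normal(loc=i, size=(200, 13)) for i, c in enumerate(classes)}

# One diagonal-covariance GMM per class; 8 mixtures is an arbitrary choice.
models = {c: GaussianMixture(n_components=8, covariance_type="diag",
                             random_state=0).fit(X) for c, X in train.items()}

test_utt = rng.normal(loc=1, size=(50, 13))                # frames of one test utterance
scores = {c: m.score(test_utt) for c, m in models.items()} # mean log-likelihood per frame
print("predicted class:", max(scores, key=scores.get))
```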
Potts, Lisa G; Skinner, Margaret W; Litovsky, Ruth A; Strube, Michael J; Kuk, Francis
2009-06-01
The use of bilateral amplification is now common clinical practice for hearing aid users but not for cochlear implant recipients. In the past, most cochlear implant recipients were implanted in one ear and wore only a monaural cochlear implant processor. There has been recent interest in the benefits of bilateral stimulation for cochlear implant recipients. One option for bilateral stimulation is the use of a cochlear implant in one ear and a hearing aid in the opposite, nonimplanted ear (bimodal hearing). This study evaluated the effect of wearing a cochlear implant in one ear and a digital hearing aid in the opposite ear on speech recognition and localization. A repeated-measures correlational study was completed. Nineteen adult Cochlear Nucleus 24 implant recipients participated in the study. The participants were fit with a Widex Senso Vita 38 hearing aid to achieve maximum audibility and comfort within their dynamic range. Soundfield thresholds, loudness growth, speech recognition, localization, and subjective questionnaires were obtained six to eight weeks after the hearing aid fitting. Testing was completed in three conditions: hearing aid only, cochlear implant only, and cochlear implant and hearing aid (bimodal). All tests were repeated four weeks after the first test session. Repeated-measures analysis of variance was used to analyze the data. Significant effects were further examined using pairwise comparisons of means or, in the case of continuous moderators, regression analyses. The speech-recognition and localization tasks were unique in that a speech stimulus presented from a variety of roaming azimuths (140 degree loudspeaker array) was used. Performance in the bimodal condition was significantly better for speech recognition and localization compared to the cochlear implant-only and hearing aid-only conditions. Performance also differed between these conditions when the location (i.e., the side of the loudspeaker array that presented the word) was analyzed. In the bimodal condition, performance on the speech-recognition and localization tasks was equal regardless of which side of the loudspeaker array presented the word, while performance was significantly poorer in the monaural conditions (hearing aid only and cochlear implant only) when the words were presented on the side with no stimulation. Binaural loudness summation of 1-3 dB was seen in soundfield thresholds and loudness growth in the bimodal condition. Measures of the audibility of sound with the hearing aid, including unaided thresholds, soundfield thresholds, and the Speech Intelligibility Index, were significant moderators of speech recognition and localization. Based on the questionnaire responses, participants showed a strong preference for bimodal stimulation. These findings suggest that a well-fit digital hearing aid worn in conjunction with a cochlear implant is beneficial to speech recognition and localization. The dynamic test procedures used in this study illustrate the importance of bilateral hearing for locating, identifying, and switching attention between multiple speakers. It is recommended that unilateral cochlear implant recipients with measurable unaided hearing thresholds be fit with a hearing aid.
Identifying Effective Signals to Predict Deleted and Suspended Accounts on Twitter across Languages
DOE Office of Scientific and Technical Information (OSTI.GOV)
Volkova, Svitlana; Bell, Eric B.
Social networks have an ephemerality to them: accounts and messages are constantly being edited, deleted, or marked as private. This continuous change stems from concerns around privacy, a potential desire for deception, and spam-like behavior. In this study we analyze multiple large datasets of thousands of active and deleted Twitter accounts to produce a series of predictive features for the removal or shutdown of an account. We selected these accounts from speakers of three languages -- Russian, Spanish, and English -- to evaluate whether speakers of various languages behave differently with regard to deleting accounts. We find that, unlike previously used profile and network features, the discourse of deleted vs. active accounts forms the basis for highly accurate account deletion prediction. More precisely, we observed that the presence of a certain set of terms in user tweets leads to a higher likelihood of that user's account deletion. We show that the predictive power of profile, language, affect, and network features is not consistent across speakers of the three evaluated languages.
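The finding that the discourse of deleted accounts is predictive can be illustrated with a standard text classifier. The sketch below is a generic tf-idf plus logistic regression pipeline on invented toy tweets, not the authors' feature set or data.

```python
# Hypothetical sketch: predicting account deletion from tweet text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy tweets and labels (1 = account later deleted/suspended), purely invented.
tweets = ["buy followers cheap now", "great talk at the conference today",
          "click this link free prize", "enjoying coffee with friends"]
deleted = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(tweets, deleted)
print("P(deletion):", clf.predict_proba(["free prize link"])[0, 1])
```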
Gröschel, J; Philipp, F; Skonetzki, St; Genzwürker, H; Wetter, Th; Ellinger, K
2004-02-01
Precise documentation of medical treatment in emergency medical missions and during resuscitation is essential from a medical, legal and quality assurance point of view [Anästhesiologie und Intensivmedizin, 41 (2000) 737]. All conventional methods of time recording are either too inaccurate or too elaborate for routine application. Automated speech recognition may offer a solution. A dedicated programme for the documentation of all time events was developed. Standard speech recognition software (IBM ViaVoice 7.0) was adapted and installed on two different computer systems. One was a stationary PC (500 MHz Pentium III, 128 MB RAM, Soundblaster PCI 128 soundcard, Win NT 4.0); the other was a mobile pen-PC that had already proven its value during emergency missions [Der Notarzt 16, p. 177] (Fujitsu Stylistic 2300, 230 MHz MMX processor, 160 MB RAM, embedded soundcard ESS 1879 chipset, Win98 2nd ed.). Two different microphones were tested on both computers. One was the standard headset that came with the recognition software; the other was a small microphone (Lavalier-Kondensatormikrofon EM 116 from Vivanco) that could be attached to the operator's collar. Seven women and 15 men spoke a text with 29 phrases to be recognised. Two emergency physicians tested the system in a simulated emergency setting using the collar microphone and the pen-PC with an analogue wireless connection. Overall recognition was best for the PC with a headset (89%), followed by the pen-PC with a headset (85%), the PC with the collar microphone (84%) and the pen-PC with the collar microphone (80%); the differences were not statistically significant. Recognition became significantly worse (89.5% versus 82.3%, P < 0.0001) when numbers had to be recognised. The speaker's gender and the number of words in a sentence had no influence. Average recognition in the simulated emergency setting was 75%. At no time did false recognitions appear. Time recording with automated speech recognition seems to be feasible in emergency medical missions. Although the results show an average recognition of only 75%, missing elements may possibly be reconstructed more precisely. Future technology should integrate a secure wireless connection between the microphone and the mobile computer. The system could then prove its value in real out-of-hospital emergencies.
Is the sagittal postural alignment different in normal and dysphonic adult speakers?
Franco, Débora; Martins, Fernando; Andrea, Mário; Fragoso, Isabel; Carrão, Luís; Teles, Júlia
2014-07-01
Clinical research in the field of voice disorders, in particular functional dysphonia, has suggested abnormal laryngeal posture due to adaptive muscle changes, although specific evidence regarding body posture has been lacking. The aim of our study was to verify whether there were significant differences in sagittal spine alignment between normal (41 subjects) and dysphonic (33 subjects) speakers. Cross-sectional study. Seventy-four adults, 35 males and 39 females, underwent sagittal-plane photography so that spine alignment could be analyzed with the Digimizer (MedCalc Software Ltd) program. Perceptual and acoustic evaluation and nasoendoscopy were used to classify participants as normal or dysphonic speakers. For thoracic length curvature (TL) and for the kyphosis index (KI), a significant effect of dysphonia was observed, with mean TL and KI significantly higher for the dysphonic speakers than for the normal speakers. For the TL variable, a significant effect of sex was found, with mean TL higher for males than females. The interaction between dysphonia and sex had no significant effect on the TL and KI variables. For the lumbar length curvature variable, a significant main effect of sex was demonstrated; there was no significant main effect of dysphonia or significant sex×dysphonia interaction. The findings indicated significant differences in some sagittal spine posture measures between normal and dysphonic speakers. Postural measures can add useful information to voice assessment protocols and should be taken into account when considering particular treatment strategies. Copyright © 2014 The Voice Foundation. Published by Mosby, Inc. All rights reserved.
Robust matching for voice recognition
NASA Astrophysics Data System (ADS)
Higgins, Alan; Bahler, L.; Porter, J.; Blais, P.
1994-10-01
This paper describes an automated method of comparing a voice sample from an unknown individual with samples from known speakers in order to establish or verify the individual's identity. The method is based on a statistical pattern matching approach that employs a simple training procedure, requires no human intervention (transcription, word or phonetic marking, etc.), and makes no assumptions regarding the expected form of the statistical distributions of the observations. The content of the speech material (vocabulary, grammar, etc.) is not assumed to be constrained in any way. An algorithm is described that incorporates frame pruning and channel equalization processes designed to achieve robust performance with reasonable computational resources. An experimental implementation demonstrating the feasibility of the concept is described.
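The paper's channel equalization step is not specified in detail here; a common technique consistent with its goal is cepstral mean subtraction, in which the long-term average cepstrum absorbs a stationary channel filter. The sketch below is a generic illustration under that assumption, not the authors' exact algorithm.

```python
# Generic sketch: channel equalization via cepstral mean subtraction.
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """cepstra: (n_frames, n_coeffs) array of per-frame cepstral features.
    Subtracting the per-utterance mean removes a stationary channel offset."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Fake features with a constant "channel" offset of +2 on every coefficient.
frames = np.random.default_rng(0).normal(size=(300, 13)) + 2.0
equalized = cepstral_mean_subtraction(frames)
print(np.abs(equalized.mean(axis=0)).max())  # residual mean is ~0
```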
Item analysis of three Spanish naming tests: a cross-cultural investigation.
Marquez de la Plata, Carlos; Arango-Lasprilla, Juan Carlos; Alegret, Montse; Moreno, Alexander; Tárraga, Luis; Lara, Mar; Hewlitt, Margaret; Hynan, Linda; Cullum, C Munro
2009-01-01
Neuropsychological evaluations conducted in the United States and abroad commonly include tests translated from English to Spanish. The use of translated naming tests for evaluating predominantly Spanish-speaking patients has recently been challenged on the grounds that translating test items may compromise a test's construct validity. The Texas Spanish Naming Test (TNT) was developed in Spanish specifically for use with Spanish speakers; however, it is unlikely that patients from diverse Spanish-speaking geographical regions will perform uniformly on a naming test. The present study evaluated and compared the internal consistency and the patterns of item difficulty and item discrimination for the TNT and two commonly used translated naming tests in three countries (the United States, Colombia, and Spain). Two hundred fifty-two subjects (136 demented, 116 nondemented) across the three countries were administered the TNT, the Modified Boston Naming Test-Spanish (MBNT-S), and the naming subtest from the CERAD. The TNT demonstrated superior internal consistency to its counterparts, a better item-difficulty pattern than the CERAD naming test, and a better item-discrimination pattern than the MBNT-S across countries. Overall, all three Spanish naming tests differentiated nondemented and moderately demented individuals, but the results suggest that the items of the TNT are the most appropriate for use with Spanish speakers. Preliminary normative data for the three tests in each country are provided.
Kong, Anthony Pak-Hin
2011-02-01
The 1st aim of this study was to further establish the external validity of the main concept (MC) analysis by examining its relationship with the Cantonese Linguistic Communication Measure (CLCM; Kong, 2006; Kong & Law, 2004)-an established quantitative system for narrative production-and the Cantonese version of the Western Aphasia Battery (CAB; Yiu, 1992). The 2nd purpose of the study was to evaluate how well the MC analysis reflects the stability of discourse production among chronic Cantonese speakers with aphasia. Sixteen participants with aphasia were evaluated on the MC analysis, CAB, and CLCM in the summer of 2008 and were subsequently reassessed in the summer of 2009. They encompassed a range of aphasia severity (with an Aphasia Quotient ranging between 30.2/100 and 94.8/100 at the time of the 1st evaluation). Significant associations were found between the MC measures and the corresponding CLCM indices and CAB performance scores that were relevant to the presence, accuracy, and completeness of content in oral narratives. Moreover, the MC analysis was found to yield comparable scores for chronic speakers on 2 occasions 1 year apart. The present study has further established the external validity of MC analysis in Cantonese. Future investigations involving more speakers with aphasia will allow adequate description of its psychometric properties.
Perceptual evaluation and acoustic analysis of pneumatic artificial larynx.
Xu, Jie Jie; Chen, Xi; Lu, Mei Ping; Qiao, Ming Zhe
2009-12-01
To investigate the perceptual and acoustic characteristics of the pneumatic artificial larynx (PAL) and evaluate its speech ability and clinical value. Prospective study. The study was conducted in the Voice Lab, Department of Otorhinolaryngology, The First Affiliated Hospital of Nanjing Medical University. Forty-six laryngectomy patients using the PAL were rated for intelligibility and fluency of speech. The voice signals of sustained vowel /a/ for 40 healthy controls and 42 successful patients using the PAL were measured by a computer system. The acoustic parameters and sound spectrographs were analyzed and compared between the two groups. Forty-two of 46 patients using the PAL (91.3%) acquired successful speech capability. The intelligibility scores of 42 successful PAL speakers ranged from 71 to 95 percent, and the intelligibility range of four unsuccessful speakers was 30 to 50 percent. The fluency was judged as good or excellent in 42 successful patients, and poor or fair in four unsuccessful patients. There was no significant difference in average fundamental frequency, maximum intensity, jitter, shimmer, and normalized noise energy (NNE) between 42 successful PAL speakers and 40 healthy controls, while the maximum phonation time (MPT) of PAL speakers was slightly lower than that of the controls. The sound spectrographs of the patients using the PAL approximated those of the healthy controls. The PAL has the advantage of a high percentage of successful vocal rehabilitation. PAL speech is fluent and intelligible. The acoustic characteristics of the PAL are similar to those of a normal voice.
Intonation and dialog context as constraints for speech recognition.
Taylor, P; King, S; Isard, S; Wright, H
1998-01-01
This paper describes a way of using intonation and dialog context to improve the performance of an automatic speech recognition (ASR) system. Our experiments were run on the DCIEM Maptask corpus, a corpus of spontaneous task-oriented dialog speech. This corpus has been tagged according to a dialog analysis scheme that assigns each utterance to one of 12 "move types," such as "acknowledge," "query-yes/no" or "instruct." Most ASR systems use a bigram language model to constrain the possible sequences of words that might be recognized. Here we use a separate bigram language model for each move type. We show that when the "correct" move-specific language model is used for each utterance in the test set, the word error rate of the recognizer drops. Of course when the recognizer is run on previously unseen data, it cannot know in advance what move type the speaker has just produced. To determine the move type we use an intonation model combined with a dialog model that puts constraints on possible sequences of move types, as well as the speech recognizer likelihoods for the different move-specific models. In the full recognition system, the combination of automatic move type recognition with the move specific language models reduces the overall word error rate by a small but significant amount when compared with a baseline system that does not take intonation or dialog acts into account. Interestingly, the word error improvement is restricted to "initiating" move types, where word recognition is important. In "response" move types, where the important information is conveyed by the move type itself--for example, positive versus negative response--there is no word error improvement, but recognition of the response types themselves is good. The paper discusses the intonation model, the language models, and the dialog model in detail and describes the architecture in which they are combined.
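The core idea of move-specific language models can be sketched simply: one bigram model per move type, with an utterance scored under each and the best-scoring move selected. The add-one smoothing, vocabulary size, and two-sentence training corpora below are toy assumptions, not the DCIEM setup.

```python
# Hypothetical sketch: per-move-type bigram language models.
from collections import Counter
import math

def train_bigram(sentences):
    uni, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split()
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def logprob(uni, bi, sentence, vocab_size=1000):
    """Add-one smoothed bigram log-probability of a word sequence."""
    words = ["<s>"] + sentence.split()
    return sum(math.log((bi[(a, b)] + 1) / (uni[a] + vocab_size))
               for a, b in zip(words, words[1:]))

lms = {"acknowledge": train_bigram(["okay right", "uh huh okay"]),
       "instruct": train_bigram(["go left past the mill", "turn right at the lake"])}
utt = "go left at the mill"
print(max(lms, key=lambda m: logprob(*lms[m], utt)))  # best-scoring move type
```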
Beyond semantic accuracy: preschoolers evaluate a speaker's reasons.
Koenig, Melissa A
2012-01-01
Children's sensitivity to the quality of epistemic reasons and their selective trust in the more reasonable of 2 informants was investigated in 2 experiments. Three-, 4-, and 5-year-old children (N = 90) were presented with speakers who stated different kinds of evidence for what they believed. Experiment 1 showed that children of all age groups appropriately judged looking, reliable testimony, and inference as better reasons for belief than pretense, guessing, and desiring. Experiment 2 showed that 3- and 4-year-old children preferred to seek and accept new information from a speaker who was previously judged to use the "best" way of thinking. The findings demonstrate that children distinguish certain good from bad reasons and prefer to learn from those who showcased good reasoning in the past. © 2012 The Author. Child Development © 2012 Society for Research in Child Development, Inc.
Can non-interactive language input benefit young second-language learners?
Au, Terry Kit-Fong; Chan, Winnie Wailan; Cheng, Liao; Siegel, Linda S; Tso, Ricky Van Yip
2015-03-01
To fully acquire a language, especially its phonology, children need linguistic input from native speakers early on. When interaction with native speakers is not always possible - e.g. for children learning a second language that is not the societal language - audio recordings are commonly used as an affordable substitute. But does such non-interactive input work? Two experiments evaluated the usefulness of audio storybooks in acquiring a more native-like second-language accent. Young children, first- and second-graders in Hong Kong whose native language was Cantonese Chinese, were given take-home listening assignments in a second language, either English or Putonghua Chinese. Accent ratings of the children's story reading revealed measurable benefits of non-interactive input from native speakers. The benefits were far more robust for Putonghua than for English. Implications for second-language accent acquisition are discussed.
The Arctic Visiting Speakers Program
NASA Astrophysics Data System (ADS)
Wiggins, H. V.; Fahnestock, J.
2013-12-01
The Arctic Visiting Speakers Program (AVS) is a program of the Arctic Research Consortium of the U.S. (ARCUS) and funded by the National Science Foundation. AVS provides small grants to researchers and other Arctic experts to travel and share their knowledge in communities where they might not otherwise connect. The program aims to: initiate and encourage arctic science education in communities with little exposure to arctic research; increase collaboration among the arctic research community; nurture communication between arctic researchers and community residents; and foster arctic science education at the local level. Individuals, community organizations, and academic organizations can apply to host a speaker. Speakers cover a wide range of arctic topics and can address a variety of audiences including K-12 students, graduate and undergraduate students, and the general public. Preference is given to tours that reach broad and varied audiences, especially those targeted to underserved populations. Between October 2000 and July 2013, AVS supported 114 tours spanning 9 different countries, including tours in 23 U.S. states. Tours over the past three and a half years have connected Arctic experts with over 6,600 audience members. Post-tour evaluations show that AVS consistently rates high for broadening interest and understanding of arctic issues. AVS provides a case study for how face-to-face interactions between arctic scientists and general audiences can produce high-impact results. Further information can be found at: http://www.arcus.org/arctic-visiting-speakers.
Generative Insights from the Eleanor Chelimsky Forum on Evaluation Theory and Practice
ERIC Educational Resources Information Center
Leviton, Laura C.
2014-01-01
Both speakers at the Eleanor Chelimsky Forum on Theory and Practice in Evaluation pointed out the complexity and messiness of evaluation practice, and thus potential limits on theory and generalizable knowledge. The concept of reflective practice offers one way forward to build evaluation theory. Building generalizable knowledge about practice…
The Role of Native-Language Knowledge in the Perception of Casual Speech in a Second Language
Mitterer, Holger; Tuinman, Annelie
2012-01-01
Casual speech processes, such as /t/-reduction, make word recognition harder. Additionally, word recognition is also harder in a second language (L2). Combining these challenges, we investigated whether L2 learners have recourse to knowledge from their native language (L1) when dealing with casual speech processes in their L2. In three experiments, the production and perception of /t/-reduction were investigated. An initial production experiment showed that /t/-reduction occurred in both languages and patterned similarly in proper nouns but differed when /t/ was a verbal inflection. Two perception experiments compared the performance of German learners of Dutch with that of native speakers for nouns and verbs. Mirroring the production patterns, German learners' performance strongly resembled that of native Dutch listeners when the reduced /t/ was part of a word stem, but deviated where /t/ was a verbal inflection. These results suggest that a casual speech process in a second language is problematic for learners when the process is not known from the learner's native language, similar to what has been observed for phoneme contrasts. PMID:22811675
Advances in audio source separation and multisource audio content retrieval
NASA Astrophysics Data System (ADS)
Vincent, Emmanuel
2012-06-01
Audio source separation aims to extract the signals of individual sound sources from a given recording. In this paper, we review three recent advances which improve the robustness of source separation in real-world challenging scenarios and enable its use for multisource content retrieval tasks, such as automatic speech recognition (ASR) or acoustic event detection (AED) in noisy environments. We present a Flexible Audio Source Separation Toolkit (FASST) and discuss its advantages compared to earlier approaches such as independent component analysis (ICA) and sparse component analysis (SCA). We explain how cues as diverse as harmonicity, spectral envelope, temporal fine structure or spatial location can be jointly exploited by this toolkit. We subsequently present the uncertainty decoding (UD) framework for the integration of audio source separation and audio content retrieval. We show how the uncertainty about the separated source signals can be accurately estimated and propagated to the features. Finally, we explain how this uncertainty can be efficiently exploited by a classifier, both at the training and the decoding stage. We illustrate the resulting performance improvements in terms of speech separation quality and speaker recognition accuracy.
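As a point of reference for the ICA baseline mentioned, the sketch below unmixes two synthetic sources from two instantaneous mixtures with FastICA. Real reverberant audio, which motivates FASST, is much harder; the mixing matrix and sources here are toy assumptions.

```python
# Minimal ICA demo: recover two synthetic sources from two mixtures.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
s1 = np.sign(np.sin(2 * np.pi * 5 * t))   # square-wave "source"
s2 = rng.laplace(size=t.size)             # noise-like source
S = np.c_[s1, s2]
A = np.array([[1.0, 0.6], [0.4, 1.0]])    # assumed instantaneous mixing matrix
X = S @ A.T                               # two observed mixtures

S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)
print(S_hat.shape)  # recovered sources, up to permutation and scaling
```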
Arnold, Denis; Tomaschek, Fabian; Sering, Konstantin; Lopez, Florence; Baayen, R Harald
2017-01-01
Sound units play a pivotal role in cognitive models of auditory comprehension. The general consensus is that during perception listeners break down speech into auditory words and subsequently phones. Indeed, cognitive speech recognition is typically taken to be computationally intractable without phones. Here we present a computational model trained on 20 hours of conversational speech that recognizes word meanings within the range of human performance (model 25%, native speakers 20-44%), without making use of phone or word form representations. Our model also successfully generates predictions about the speed and accuracy of human auditory comprehension. At the heart of the model is a 'wide' yet sparse two-layer artificial neural network with some hundred thousand input units representing summaries of changes in acoustic frequency bands, and proxies for lexical meanings as output units. We believe that our model holds promise for resolving longstanding theoretical problems surrounding the notion of the phone in linguistic theory.
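A toy sketch in the spirit of the described architecture: a single linear mapping from sparse acoustic-summary units to lexical-meaning outputs, trained with the delta rule. Dimensions, data, and learning rate are placeholder assumptions (the actual model has on the order of a hundred thousand input units and is trained on 20 hours of speech).

```python
# Hypothetical sketch: a wide, sparse linear input-to-meaning network.
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_meanings, n_tokens = 500, 20, 1000
X = (rng.random((n_tokens, n_inputs)) < 0.02).astype(float)  # sparse acoustic cues
y = rng.integers(0, n_meanings, n_tokens)                    # meaning index per token

W = np.zeros((n_inputs, n_meanings))
lr = 0.1
for epoch in range(5):
    for x, target in zip(X, y):
        out = x @ W
        t = np.zeros(n_meanings)
        t[target] = 1.0
        W += lr * np.outer(x, t - out)                       # delta-rule update

acc = np.mean((X @ W).argmax(axis=1) == y)
print(f"training accuracy: {acc:.2f}")
```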
Women in cell biology: a seat at the table and a place at the podium
Masur, Sandra Kazahn
2013-01-01
The Women in Cell Biology (WICB) committee of the American Society for Cell Biology (ASCB) was started in the 1970s in response to the documented underrepresentation of women in academia in general and cell biology in particular. By coincidence or causal relationship, I am happy to say that since WICB became a standing ASCB committee, women have been well represented in ASCB's leadership and as symposium speakers at the annual meeting. However, the need to provide opportunities and information useful to women in developing their careers in cell biology is still vital, given the continuing bias women face in the larger scientific arena. With its emphasis on mentoring, many of WICB's activities benefit the development of both men and women cell biologists. The WICB “Career Column” in the monthly ASCB Newsletter is a source of accessible wisdom. At the annual ASCB meeting, WICB organizes the career discussion and mentoring roundtables, childcare awards, Mentoring Theater, career-related panel and workshop, and career recognition awards. Finally, the WICB Speaker Referral Service provides a list of outstanding women whom organizers of scientific meetings, scientific review panels, and university symposia/lecture series can reach out to when facing the proverbial dilemma, “I just don't know any women who are experts.” PMID:23307103
Wong, Raymond
2013-01-01
Voice biometrics is a physiological characteristic: each person's voice is unique. Due to this uniqueness, voice classification has found useful applications in classifying speakers' gender, mother tongue or ethnicity (accent), and emotional state, as well as in identity verification, verbal command control, and so forth. In this paper, we adopt a new preprocessing method named Statistical Feature Extraction (SFX) for extracting important features for training a classification model, based on piecewise transformation that treats an audio waveform as a time series. Using SFX we can faithfully remodel the statistical characteristics of the time series; together with spectral analysis, a substantial number of features are extracted in combination. An ensemble is used to select only the influential features for classification model induction. We focus on comparing the effects of various popular data mining algorithms on multiple datasets. Our experiment consists of classification tests over four typical categories of human voice data: Female and Male, Emotional Speech, Speaker Identification, and Language Recognition. The experiments yield encouraging results supporting the fact that heuristically choosing significant features from both time and frequency domains indeed produces better performance in voice classification than traditional signal processing techniques alone, like wavelets and LPC-to-CC. PMID:24288684
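The piecewise statistical idea can be illustrated as follows: split the waveform into segments and summarize each with simple statistics to form a feature vector for a downstream classifier. This is a generic illustration of treating audio as a time series, not the authors' exact SFX transform; the segment count and statistics chosen are assumptions.

```python
# Generic sketch: piecewise statistical features from a waveform.
import numpy as np

def piecewise_stats(signal, n_segments=10):
    feats = []
    for seg in np.array_split(np.asarray(signal, float), n_segments):
        signs = np.signbit(seg).astype(int)
        feats += [seg.mean(), seg.std(), seg.min(), seg.max(),
                  np.mean(np.abs(np.diff(signs)))]  # zero-crossing rate
    return np.array(feats)

# Toy "waveform": a sine with additive noise.
wave = np.sin(np.linspace(0, 50, 16000))
wave += 0.1 * np.random.default_rng(0).normal(size=wave.size)
print(piecewise_stats(wave).shape)  # (50,) feature vector for a classifier
```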
Color categories are not universal: new evidence from traditional and western cultures
NASA Astrophysics Data System (ADS)
Roberson, Debi D.; Davidoff, Jules; Davies, Ian R. L.
2002-06-01
The evidence presented supports the linguistic relativity of color categories in three different paradigms. First, a series of cross-cultural investigations, which had set out to replicate the seminal work of Rosch Heider with the Dani of New Guinea, failed to find evidence of a set of universal color categories. Instead, we found evidence of linguistic relativity in both populations tested. Neither participants from a Melanesian hunter-gatherer culture nor those from an African pastoral tribe, whose languages both contain five color terms, showed a cognitive organization of color resembling that of English speakers. Further, Melanesian participants showed evidence of Categorical Perception, but only at their linguistic category boundaries. Second, in native English speakers, verbal interference was found to selectively remove the defining features of Categorical Perception: under verbal interference, the greater accuracy normally observed for cross-category judgements compared to within-category judgements disappeared. While both visual and verbal codes may be employed in recognition memory for colors, participants only make use of verbal coding when demonstrating Categorical Perception. Third, in a brain-damaged patient suffering from a naming disorder, the loss of labels radically impaired his ability to categorize colors. We conclude that language affects both the perception of and memory for colors.
Brain systems mediating voice identity processing in blind humans.
Hölig, Cordula; Föcker, Julia; Best, Anna; Röder, Brigitte; Büchel, Christian
2014-09-01
Blind people rely more on vocal cues when they recognize a person's identity than sighted people do. Indeed, a number of studies have reported better voice recognition skills in blind than in sighted adults. The present functional magnetic resonance imaging study investigated changes in the functional organization of neural systems involved in voice identity processing following congenital blindness. A group of congenitally blind individuals and matched sighted control participants were tested in a priming paradigm, in which two voice stimuli (S1, S2) were presented in succession. The prime (S1) and the target (S2) were either from the same speaker (person-congruent voices) or from two different speakers (person-incongruent voices). Participants had to classify the S2 as belonging to either an old or a young person. Person-incongruent voices (S2), compared with person-congruent voices, elicited increased activation in the right anterior fusiform gyrus in congenitally blind individuals but not in matched sighted control participants. In contrast, only matched sighted controls showed higher activation in response to person-incongruent compared with person-congruent voices (S2) in the right posterior superior temporal sulcus. These results provide evidence for crossmodal plastic changes of the person identification system in the brain after visual deprivation. Copyright © 2014 Wiley Periodicals, Inc.
Protopapas, Athanassios; Orfanidou, Eleni; Taylor, J S H; Karavasilis, Efstratios; Kapnoula, Efthymia C; Panagiotaropoulou, Georgia; Velonakis, Georgios; Poulou, Loukia S; Smyrnis, Nikolaos; Kelekis, Dimitrios
2016-03-01
In this study, predictions of the dual-route cascaded (DRC) model of word reading were tested using fMRI. Specifically, patterns of co-localization were investigated: (a) between pseudoword length effects and a pseudowords vs. fixation contrast, to reveal the sublexical grapho-phonemic conversion (GPC) system; and (b) between word frequency effects and a words vs. pseudowords contrast, to reveal the orthographic and phonological lexicon. Forty-four native speakers of Greek were scanned at 3T in an event-related lexical decision task with three event types: (a) 150 words in which frequency, length, bigram and syllable frequency, neighborhood, and orthographic consistency were decorrelated; (b) 150 matched pseudowords; and (c) fixation. Whole-brain analysis failed to reveal the predicted co-localizations. Further analysis with participant-specific regions of interest defined within masks from the group contrasts revealed length effects in left inferior parietal cortex and frequency effects in the left middle temporal gyrus. These findings could be interpreted as partially consistent with the existence of the GPC system and phonological lexicon of the model, respectively. However, there was no evidence in support of an orthographic lexicon, weakening overall support for the model. The results are discussed with respect to the prospect of using neuroimaging in cognitive model evaluation. Copyright © 2016 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Hassanat, Ahmad B. A.; Jassim, Sabah
2010-04-01
In this paper, the automatic lip reading problem is investigated and an innovative approach to solving it is proposed. This new visual speech recognition (VSR) approach depends on the signature of the word itself, obtained from a hybrid feature extraction method based on geometric, appearance, and image transform features. The proposed VSR approach is termed "visual words". It consists of two main parts: 1) feature extraction/selection, and 2) visual speech feature recognition. After localizing the face and lips, several visual features of the lips were extracted: the height and width of the mouth; the mutual information and a quality measure between the DWT of the current ROI and the DWT of the previous ROI; the ratio of vertical to horizontal features taken from the DWT of the ROI; the ratio of vertical to horizontal edges of the ROI; the appearance of the tongue; and the appearance of teeth. Each spoken word is represented by 8 signals, one per feature. These signals preserve the dynamics of the spoken word, which carry a good portion of the information. The system is then trained on these features using KNN and DTW. This approach has been evaluated using a large database of different people and large experiment sets. The evaluation has demonstrated the efficiency of the visual words approach and shown that VSR is a speaker-dependent problem.
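The recognition stage pairs dynamic time warping with nearest-neighbour matching. Below is a minimal sketch: a DTW distance between one-dimensional feature signals and a 1-NN decision over word templates. Real "visual words" use eight signals per word; the single-signal templates here are toy assumptions.

```python
# Hypothetical sketch: DTW distance + 1-nearest-neighbour word matching.
import numpy as np

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy feature signals for two word templates of different lengths.
templates = {"yes": np.sin(np.linspace(0, 3, 40)),
             "no": np.cos(np.linspace(0, 3, 55))}
query = np.sin(np.linspace(0, 3, 47))  # unknown word's feature signal
print(min(templates, key=lambda w: dtw(query, templates[w])))  # -> 'yes'
```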
Severity-Based Adaptation with Limited Data for ASR to Aid Dysarthric Speakers
Mustafa, Mumtaz Begum; Salim, Siti Salwah; Mohamed, Noraini; Al-Qatab, Bassam; Siong, Chng Eng
2014-01-01
Automatic speech recognition (ASR) is currently used in many assistive technologies, such as helping individuals with speech impairment in their communication ability. One challenge in ASR for speech-impaired individuals is the difficulty of obtaining a good speech database of impaired speakers for building an effective speech acoustic model. Because there are very few existing databases of impaired speech, which are also limited in size, the obvious way to build a speech acoustic model of impaired speech is to employ adaptation techniques. However, two issues have not been addressed in existing studies on adaptation for speech impairment: (1) identifying the most effective adaptation technique for impaired speech; and (2) the use of suitable source models to build an effective impaired-speech acoustic model. This research investigates these two issues for dysarthria, a type of speech impairment affecting millions of people. We applied both unimpaired and impaired speech as the source model with well-known adaptation techniques such as maximum likelihood linear regression (MLLR) and constrained MLLR (C-MLLR). The recognition accuracy of each impaired-speech acoustic model is measured in terms of word error rate (WER), with further assessments including phoneme insertion, substitution and deletion rates. Unimpaired speech, when combined with limited high-quality speech-impaired data, improves the performance of ASR systems in recognising severely impaired dysarthric speech. The C-MLLR adaptation technique was also found to be better than MLLR in recognising mildly and moderately impaired speech, based on statistical analysis of the WER. Phoneme substitution was found to be the biggest contributor to WER in dysarthric speech at all levels of severity. The results show that speech acoustic models derived from suitable adaptation techniques improve the performance of ASR systems in recognising impaired speech with limited adaptation data. PMID:24466004
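Word error rate, the study's primary metric, is the Levenshtein alignment cost (substitutions, insertions, deletions) between reference and hypothesis transcripts divided by the reference length. A minimal sketch follows; the example sentences are invented.

```python
# Minimal sketch: word error rate via Levenshtein alignment over words.
import numpy as np

def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    D = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    D[:, 0] = np.arange(len(r) + 1)   # deletions
    D[0, :] = np.arange(len(h) + 1)   # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = D[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            D[i, j] = min(sub, D[i - 1, j] + 1, D[i, j - 1] + 1)
    return D[-1, -1] / len(r)

# One substitution ("call" -> "all") plus one deletion ("now") over 5 words.
print(wer("please call the nurse now", "please all the nurse"))  # 0.4
```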
NASA Technical Reports Server (NTRS)
Carlson, Randal D.
1994-01-01
The Teacher Enhancement Institute (TEI) at NASA Langley Research Center was developed in response to Executive Order 12821, which mandates national laboratories to 'assist in the mathematics and science education of our Nation's students, teachers, parents, and the public by establishing programs at their agency to provide for training elementary and secondary school teachers to improve their knowledge of mathematics and science. Such programs, to the maximum extent possible, shall involve partnerships with universities, state and local elementary and secondary school authorities, corporations and community based organizations'. The faculty worked closely with one another and with the invited speakers to ensure that the sessions supported the objectives. Speakers were informed of the objectives and given guidance concerning the form and function of each session. Faculty members monitored sessions to assist speakers and to provide quality control. Faculty provided feedback to speakers concerning the accomplishment of the general objectives, along with participant comments when applicable. Post-TEI surveys asked for specific comments about each TEI session. During the second of the two two-week institutes, daily critiques were provided to the participants for their reflection; this seemed to provide much improved feedback to speakers and faculty because the sessions were fresh in each participant's mind. Between sessions one and two, some changes were made to the program as a result of the formative evaluation process. Those changes, though, were minor in nature and amounted to fine-tuning a well-conceived and well-implemented program. After the objectives were written, an assessment instrument was developed to test the accomplishment of the objectives. This instrument was actually two surveys, one given before the TEI and one given after. In using such a series, it was expected that changes in the participants induced by attendance at TEI could be discovered. Because the institute was limited in time and depth of exposure, attitudinal changes (self-assessment of ability and confidence) were chosen to be surveyed. On the pre-survey, seven general categories of questions were asked. The post-survey repeated three of these categories, providing pre and post evaluations of the same questions, and added a fourth category that asked participants to self-assess objective accomplishment. The assessment process was valuable when one looks at the final accomplishments of the TEI. A number of aspects stand out: (1) formative evaluation during project development allowed the goals and objectives to guide the development of the institute; (2) formative evaluation provided positive guidance to presenters in developing and implementing their sessions; (3) formative evaluation helped presenters improve or focus their sessions; (4) summative evaluation provided managers a way to gauge the success of the institute; and (5) summative evaluation provided a benchmark against which future programs can be measured.
a Study of Multiplexing Schemes for Voice and Data.
NASA Astrophysics Data System (ADS)
Sriram, Kotikalapudi
Voice traffic variations are characterized by the on/off transitions of voice calls and the talkspurt/silence transitions of speakers in conversations. A speaker is known to be silent for more than half the time during a telephone conversation. In this dissertation, we study schemes that exploit speaker silences for efficient utilization of transmission capacity in integrated voice/data multiplexing and in digital speech interpolation. We study two voice/data multiplexing schemes. In each scheme, any time slots momentarily unutilized by voice traffic are made available to data. In the first scheme, the multiplexer does not use speech activity detectors (SAD), so the voice traffic variations are due to call on/off transitions only. In the second scheme, the multiplexer detects speaker silences using SAD and transmits voice only during talkspurts. The multiplexer with SAD performs digital speech interpolation (DSI) as well as dynamic channel allocation to voice and data. The performance of the two schemes is evaluated using discrete-time modeling and analysis. The data delay performance for English speech is compared with that for Japanese speech. A closed-form expression for the mean data message delay is derived for the single-channel, single-talker case. In a DSI system, occasional speech losses occur whenever the number of speakers in simultaneous talkspurt exceeds the number of TDM voice channels. In a buffered DSI system, speech loss is further reduced at the cost of delay. We propose a novel fixed-delay buffered DSI scheme. In this scheme, speech fill-in/hangover is not required because there are no variable delays; hence, all silences that naturally occur in speech are fully utilized. Consequently, a substantial improvement in DSI performance is made possible. The scheme is modeled and analyzed in discrete time, and its performance is evaluated in terms of the probability of speech clipping, packet rejection ratio, DSI advantage, and delay.
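The speech-clipping behaviour analysed in the dissertation can be sketched with a discrete-time simulation: each speaker alternates between talkspurt and silence as a two-state Markov chain with roughly 40% activity, and speech is clipped whenever the number of active speakers exceeds the available TDM channels. The transition probabilities and channel counts below are illustrative assumptions, not the dissertation's parameters.

```python
# Hypothetical sketch: discrete-time talkspurt/silence DSI simulation.
import numpy as np

rng = np.random.default_rng(0)
N, C, steps = 24, 12, 100_000          # speakers, TDM voice channels, time slots
p_start, p_stop = 0.02, 0.03           # silence->talk, talk->silence per slot
talking = np.zeros(N, dtype=bool)
clipped = active_total = 0
for _ in range(steps):
    u = rng.random(N)
    # Talkers keep talking with prob 1-p_stop; silent speakers start with prob p_start.
    talking = np.where(talking, u >= p_stop, u < p_start)
    k = int(talking.sum())
    active_total += k
    clipped += max(0, k - C)           # talkspurt slots that find no free channel
print("mean speaker activity:", active_total / (steps * N))   # ~0.4
print("fraction of speech slots clipped:", clipped / max(active_total, 1))
```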
A dynamic multi-channel speech enhancement system for distributed microphones in a car environment
NASA Astrophysics Data System (ADS)
Matheja, Timo; Buck, Markus; Fingscheidt, Tim
2013-12-01
Supporting multiple active speakers in automotive hands-free or speech dialog applications is an interesting issue, not least for reasons of comfort. We therefore present a multi-channel system for the enhancement of speech signals captured by distributed distant microphones in a car environment. Each of the potential speakers in the car has a dedicated directional microphone close to his or her position that captures the corresponding speech signal. The aim of the resulting overall system is twofold. On the one hand, an arbitrary pre-defined subset of speakers' signals can be combined, e.g., to create an output signal for the far-end communication partner in a hands-free telephone conference call. On the other hand, annoying cross-talk components from interfering sound sources that occur in multiple mixed output signals are to be eliminated, motivated by the possibility of other hands-free applications running in parallel. The system includes several signal processing stages. A dedicated signal processing block for interfering speaker cancellation attenuates the cross-talk components of undesired speech. Further signal enhancement comprises the reduction of residual cross-talk and background noise. Subsequently, a dynamic signal combination stage merges the processed single-microphone signals to obtain appropriate mixed signals at the system output, which may be passed to applications such as telephony or a speech dialog system. Based on the signal power ratios between the particular microphone signals, an appropriate speaker activity detection, and thereby a robust control mechanism for the whole system, is presented. The proposed system can be dynamically configured and has been evaluated for a car setup with four speakers sitting in the car cabin under various noise conditions.
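The power-ratio control idea can be sketched simply: with one directional microphone per seat, a speaker is flagged as active when the short-term power of their microphone dominates the others by a margin. The margin, frame length, and toy signals below are assumptions, not the paper's detector.

```python
# Hypothetical sketch: power-ratio-based speaker activity detection.
import numpy as np

def active_speaker(frames, margin_db=6.0):
    """frames: (n_mics, frame_len) array holding one time frame per microphone.
    Returns the index of the clearly dominant microphone, or None if ambiguous."""
    power_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    order = np.argsort(power_db)[::-1]
    if power_db[order[0]] - power_db[order[1]] >= margin_db:
        return int(order[0])
    return None  # ambiguous: cross-talk or diffuse noise only

rng = np.random.default_rng(0)
# Four seats; only seat 1 carries strong speech-like energy in this frame.
frames = rng.normal(scale=[[0.05], [1.0], [0.05], [0.05]], size=(4, 512))
print(active_speaker(frames))  # -> 1
```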
Cannito, Michael P; Chorna, Lesya B; Kahane, Joel C; Dworkin, James P
2014-05-01
This study evaluated the hypotheses that sentence production by speakers with adductor (AD) and abductor (AB) spasmodic dysphonia (SD) may be differentially influenced by consonant voicing and manner features, in comparison with healthy, matched, nondysphonic controls. This was a prospective, single-blind study using a between-groups, repeated-measures design for the independent variables of perceived voice quality and sentence duration. Sixteen subjects with ADSD and 10 subjects with ABSD, as well as 26 matched healthy controls, produced four short, simple sentences that were systematically loaded with voiced or voiceless consonants of either obstruent or continuant manner categories. Experienced voice clinicians, who were blind to the speakers' group affiliations, used visual analog scaling to judge the overall voice quality of each sentence. Acoustic sentence durations were also measured. Speakers with ABSD or ADSD demonstrated significantly poorer than normal voice quality on all sentences. Speakers with ABSD exhibited longer than normal durations for voiceless consonant sentences. Speakers with ADSD had poorer voice quality for voiced than for voiceless consonant sentences. Speakers with ABSD had longer durations for voiceless than for voiced consonant sentences. The two subtypes of SD thus exhibit differential performance on the basis of consonant voicing in short, simple sentences; however, each subgroup manifested voicing-related differences on a different variable (voice quality vs. sentence duration). The findings suggest different underlying pathophysiological mechanisms for ABSD and ADSD. They also support the inclusion of short, simple sentences containing voiced or voiceless consonants in the diagnostic protocol for SD, with measurement of sentence duration in addition to judgments of voice quality severity. Copyright © 2014 The Voice Foundation. Published by Mosby, Inc. All rights reserved.
Segal, Osnat; Kishon-Rabin, Liat
2017-12-20
The stressed word in a sentence (narrow focus [NF]) conveys information about the intent of the speaker and is therefore important for processing spoken language and in social interactions. The ability of participants with severe-to-profound prelingual hearing loss to comprehend NF has rarely been investigated. The purpose of this study was to assess the recognition and comprehension of NF by young adults with prelingual hearing loss compared with participants with normal hearing (NH). The participants included young adults with hearing aids (HA; n = 10), cochlear implants (CI; n = 12), and NH (n = 18). The test material included the Hebrew Narrow Focus Test (Segal, Kaplan, Patael, & Kishon-Rabin, in press), with three subtests, which was used to assess the recognition and comprehension of NF in different contexts. The following results were obtained: (a) CI and HA users successfully recognized the stressed word, with the poorest performance among CI users; (b) HA and CI users comprehended NF less well than NH participants; and (c) the comprehension of NF was associated with verbal working memory and expressive vocabulary in CI users. Most CI and HA users were able to recognize the stressed word in a sentence but had considerable difficulty understanding it. Different factors may contribute to this difficulty, including the memory load of the task itself as well as linguistic and pragmatic abilities. https://doi.org/10.23641/asha.5572792.
Schneider, Jill E; Deviche, Pierre
2017-12-01
Life history strategies are composed of multiple fitness components, each of which incurs costs and benefits. Consequently, organisms cannot maximize all fitness components simultaneously. This situation results in a dynamic array of trade-offs in which some fitness traits prevail at the expense of others, often depending on context. The identification of specific constraints and trade-offs has helped elucidate physiological mechanisms that underlie variation in behavioral and physiological life history strategies. There is general recognition that trade-offs are made at the individual and population level, but much remains to be learned concerning the molecular neuroendocrine mechanisms that underlie trade-offs. For example, we still do not know whether the mechanisms that underlie trade-offs at the individual level relate to trade-offs at the population level. To advance our understanding of trade-offs, we organized a group of speakers who study neuroendocrine mechanisms at the interface of traits that are not maximized simultaneously. Speakers were invited to represent research from a wide range of taxa including invertebrates (e.g., worms and insects), fish, nonavian reptiles, birds, and mammals. Three general themes emerged. First, the study of trade-offs requires that we investigate traditional endocrine mechanisms that include hormones, neuropeptides, and their receptors, and in addition, other chemical messengers not traditionally included in endocrinology. The latter group includes growth factors, metabolic intermediates, and molecules of the immune system. Second, the nomenclature and theory of neuroscience that has dominated the study of behavior is being re-evaluated in the face of evidence for the peripheral actions of so-called neuropeptides and neurotransmitters and the behavioral repercussions of these actions. Finally, environmental and ecological contexts continue to be critical in unmasking molecular mechanisms that are hidden when study animals are housed in enclosed spaces, with unlimited food, without competitors or conspecifics, and in constant ambient conditions. © The Author 2017. Published by Oxford University Press on behalf of the Society for Integrative and Comparative Biology.
The Communication of Public Speaking Anxiety: Perceptions of Asian and American Speakers.
ERIC Educational Resources Information Center
Martini, Marianne; And Others
1992-01-01
Finds that U.S. audiences perceive Asian speakers to have more speech anxiety than U.S. speakers, even though Asian speakers do not self-report higher anxiety levels. Confirms that speech state anxiety is not communicated effectively between speakers and audiences for Asian or U.S. speakers. (SR)
An Investigation of Syntactic Priming among German Speakers at Varying Proficiency Levels
ERIC Educational Resources Information Center
Ruf, Helena T.
2011-01-01
This dissertation investigates syntactic priming in second language (L2) development among three speaker populations: (1) less proficient L2 speakers; (2) advanced L2 speakers; and (3) L1 speakers. Using confederate scripting, this study examines how German speakers choose certain word orders in locative constructions (e.g., "Auf dem Tisch…
Modeling Speaker Proficiency, Comprehensibility, and Perceived Competence in a Language Use Domain
ERIC Educational Resources Information Center
Schmidgall, Jonathan Edgar
2013-01-01
Research suggests that listener perceptions of a speaker's oral language use, or a speaker's "comprehensibility," may be influenced by a variety of speaker-, listener-, and context-related factors. Primary speaker factors include aspects of the speaker's proficiency in the target language such as pronunciation and…
ERIC Educational Resources Information Center
Cotos, Elena
2010-01-01
This dissertation presents an innovative approach to the development and empirical evaluation of Automated Writing Evaluation (AWE) technology used for teaching and learning. It introduces IADE (Intelligent Academic Discourse Evaluator), a new web-based AWE program that analyzes research article Introduction sections and generates immediate,…
Developing Appreciation for Sarcasm and Sarcastic Gossip: It Depends on Perspective.
Glenwright, Melanie; Tapley, Brent; Rano, Jacqueline K S; Pexman, Penny M
2017-11-09
Speakers use sarcasm to criticize others and to be funny; the indirectness of sarcasm protects the addressee's face (Brown & Levinson, 1987). Thus, appreciation of sarcasm depends on the ability to consider perspectives. We investigated development of this ability from late childhood into adulthood and examined effects of interpretive perspective and parties present. We presented 9- to 10-year-olds, 13- to 14-year-olds, and adults with sarcastic and literal remarks in three parties-present conditions: private evaluation, public evaluation, and gossip. Participants interpreted the speaker's attitude and humor from the addressee's perspective and, when appropriate, from the bystander's perspective. Children showed no influence of interpretive perspective or parties present on appreciation of the speaker's attitude or humor. Adolescents and adults, however, shifted their interpretations, judging that addressees have less favorable views of criticisms than bystanders. Further, adolescents and adults differed in their perceptions of the social functions of gossip, with adolescents showing more positive attitudes than adults toward sarcastic gossip. We suggest that adults' disapproval of sarcastic gossip shows a deeper understanding of the utility of sarcasm's face-saving function. Thus, the ability to modulate appreciation of sarcasm according to interpretive perspective and parties present continues to develop in adolescence and into adulthood.
The perception of FM sweeps by Chinese and English listeners.
Luo, Huan; Boemio, Anthony; Gordon, Michael; Poeppel, David
2007-02-01
Frequency-modulated (FM) signals are an integral acoustic component of ecologically natural sounds and are analyzed effectively in the auditory systems of humans and animals. Linearly frequency-modulated tone sweeps were used here to evaluate two questions. First, how rapid a sweep can listeners accurately perceive? Second, is there an effect of native language insofar as the language (phonology) is differentially associated with processing of FM signals? Speakers of English and Mandarin Chinese were tested to evaluate whether being a speaker of a tone language altered the perceptual identification of non-speech tone sweeps. In two psychophysical studies, we demonstrate that Chinese subjects perform better than English subjects in FM direction identification, but not in an FM discrimination task, in which English and Chinese speakers show similar detection thresholds of approximately 20 ms duration. We suggest that the better FM direction identification in Chinese subjects is related to their experience with FM direction analysis in the tone-language environment, even though supra-segmental tonal variation occurs over a longer time scale. Furthermore, the observed common discrimination temporal threshold across two language groups supports the conjecture that processing auditory signals at durations of approximately 20 ms constitutes a fundamental auditory perceptual threshold.
On Short-Time Estimation of Vocal Tract Length from Formant Frequencies
Lammert, Adam C.; Narayanan, Shrikanth S.
2015-01-01
Vocal tract length is highly variable across speakers and determines many aspects of the acoustic speech signal, making it an essential parameter to consider for explaining behavioral variability. A method for accurate estimation of vocal tract length from formant frequencies would afford normalization of interspeaker variability and facilitate acoustic comparisons across speakers. A framework for considering estimation methods is developed from the basic principles of vocal tract acoustics, and an estimation method is proposed that follows naturally from this framework. The proposed method is evaluated using acoustic characteristics of simulated vocal tracts ranging from 14 to 19 cm in length, as well as real-time magnetic resonance imaging data with synchronous audio from five speakers whose vocal tracts range from 14.5 to 18.0 cm in length. Evaluations show improvements in accuracy over previously proposed methods, with 0.631 and 1.277 cm root mean square error on simulated and human speech data, respectively. Empirical results show that the effectiveness of the proposed method is based on emphasizing higher formant frequencies, which seem less affected by speech articulation. Theoretical predictions of formant sensitivity reinforce this empirical finding. Moreover, theoretical insights are explained regarding the reason for differences in formant sensitivity. PMID:26177102
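To make the underlying acoustics concrete: in the first-order model behind such estimators, the vocal tract is a uniform tube closed at the glottis, with resonances at odd quarter-wavelength frequencies, F_n = (2n - 1)c / (4L). Inverting gives one length estimate per formant, and weighting higher formants more strongly mirrors the finding above that they are less affected by articulation. A minimal Python sketch, assuming this uniform-tube model and an illustrative linear weighting (not the exact published method):

    import numpy as np

    SPEED_OF_SOUND = 35000.0  # cm/s in warm, moist air (approximate)

    def estimate_vtl(formants_hz, weights=None):
        """Estimate vocal tract length (cm) from formant frequencies.

        Assumes a uniform tube closed at the glottis and open at the lips,
        whose resonances are F_n = (2n - 1) * c / (4 * L); each formant then
        yields the per-formant estimate L_n = (2n - 1) * c / (4 * F_n).
        """
        formants = np.asarray(formants_hz, dtype=float)
        n = np.arange(1, len(formants) + 1)
        per_formant_length = (2 * n - 1) * SPEED_OF_SOUND / (4.0 * formants)
        if weights is None:
            # Emphasize higher formants, which the paper reports are less
            # affected by articulation (a linear ramp is an illustrative choice).
            weights = n / n.sum()
        return float(np.average(per_formant_length, weights=weights))

    # Example: typical adult male formants for a neutral vowel (Hz).
    print(round(estimate_vtl([500.0, 1500.0, 2500.0, 3500.0]), 1))  # ~17.5 cm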
Comparison of different speech tasks among adults who stutter and adults who do not stutter
Ritto, Ana Paula; Costa, Julia Biancalana; Juste, Fabiola Staróbole; de Andrade, Claudia Regina Furquim
2016-01-01
OBJECTIVES: In this study, we compared the performance of both fluent speakers and people who stutter in three different speaking situations: monologue speech, oral reading and choral reading. This study follows the assumption that the neuromotor control of speech can be influenced by external auditory stimuli in both speakers who stutter and speakers who do not stutter. METHOD: Seventeen adults who stutter and seventeen adults who do not stutter were assessed in three speaking tasks: monologue, oral reading (solo reading aloud) and choral reading (reading in unison with the evaluator). Speech fluency and rate were measured for each task. RESULTS: The participants who stuttered had a lower frequency of stuttering during choral reading than during monologue and oral reading. CONCLUSIONS: According to the dual premotor system model, choral speech enhanced fluency by providing external cues for the timing of each syllable compensating for deficient internal cues. PMID:27074176
Speech rate reduction and "nasality" in normal speakers.
Brancewicz, T M; Reich, A R
1989-12-01
This study explored the effects of reduced speech rate on nasal/voice accelerometric measures and nasality ratings. Nasal/voice accelerometric measures were obtained from normal adults for various speech stimuli and speaking rates. Stimuli included three sentences (one obstruent-loaded, one semivowel-loaded, and one containing a single nasal) and /pv/ syllable trains. Speakers read the stimuli at their normal rate, at half their normal rate, and as slowly as possible. In addition, a computer program paced each speaker at rates of 1, 2, and 3 syllables per second. The nasal/voice accelerometric values revealed significant stimulus effects but no rate effects. The nasality ratings of experienced listeners, evaluated as a function of stimulus and speaking rate, were compared to the accelerometric measures. The nasality scale values demonstrated small, but statistically significant, stimulus and rate effects. However, the nasality percepts were poorly correlated with the nasal/voice accelerometric measures.
Bergstra, Myrthe; DE Mulder, Hannah N M; Coopmans, Peter
2018-04-06
This study investigated how speaker certainty (a rational cue) and speaker benevolence (an emotional cue) influence children's willingness to learn words in a selective learning paradigm. In two experiments four- to six-year-olds learnt novel labels from two speakers and, after a week, their memory for these labels was reassessed. Results demonstrated that children retained the label-object pairings for at least a week. Furthermore, children preferred to learn from certain over uncertain speakers, but they had no significant preference for nice over nasty speakers. When the cues were combined, children followed certain speakers, even if they were nasty. However, children did prefer to learn from nice and certain speakers over nasty and certain speakers. These results suggest that rational cues regarding a speaker's linguistic competence trump emotional cues regarding a speaker's affective status in word learning. However, emotional cues were found to have a subtle influence on this process.
Laukka, Petri; Elfenbein, Hillary Anger; Thingujam, Nutankumar S; Rockstuhl, Thomas; Iraki, Frederick K; Chui, Wanda; Althoff, Jean
2016-11-01
This study extends previous work on emotion communication across cultures with a large-scale investigation of the physical expression cues in vocal tone. In doing so, it provides the first direct test of a key proposition of dialect theory, namely that greater accuracy of detecting emotions from one's own cultural group-known as in-group advantage-results from a match between culturally specific schemas in emotional expression style and culturally specific schemas in emotion recognition. Study 1 used stimuli from 100 professional actors from five English-speaking nations vocally conveying 11 emotional states (anger, contempt, fear, happiness, interest, lust, neutral, pride, relief, sadness, and shame) using standard-content sentences. Detailed acoustic analyses showed many similarities across groups, and yet also systematic group differences. This provides evidence for cultural accents in expressive style at the level of acoustic cues. In Study 2, listeners evaluated these expressions in a 5 × 5 design balanced across groups. Cross-cultural accuracy was greater than expected by chance. However, there was also in-group advantage, which varied across emotions. A lens model analysis of fundamental acoustic properties examined patterns in emotional expression and perception within and across groups. Acoustic cues were used relatively similarly across groups both to produce and judge emotions, and yet there were also subtle cultural differences. Speakers appear to have a culturally nuanced schema for enacting vocal tones via acoustic cues, and perceivers have a culturally nuanced schema in judging them. Consistent with dialect theory's prediction, in-group judgments showed a greater match between these schemas used for emotional expression and perception. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
Improvements of ModalMax High-Fidelity Piezoelectric Audio Device
NASA Technical Reports Server (NTRS)
Woodard, Stanley E.
2005-01-01
ModalMax audio speakers have been enhanced by innovative means of tailoring the vibration response of thin piezoelectric plates to produce a high-fidelity audio response. The ModalMax audio speakers are 1 mm in thickness. The device completely supplants the need for a separate driver and speaker cone. ModalMax speakers can serve the same applications as cone speakers, but unlike cone speakers, they can function in harsh environments such as high humidity or extreme wetness. New design features allow the speakers to be completely submersed in salt water, making them well suited for maritime applications. The sound produced by ModalMax audio speakers has spatial resolution that is readily discernible for headset users.
ITEM ANALYSIS OF THREE SPANISH NAMING TESTS: A CROSS-CULTURAL INVESTIGATION
de la Plata, Carlos Marquez; Arango-Lasprilla, Juan Carlos; Alegret, Montse; Moreno, Alexander; Tárraga, Luis; Lara, Mar; Hewlitt, Margaret; Hynan, Linda; Cullum, C. Munro
2009-01-01
Neuropsychological evaluations conducted in the United States and abroad commonly include the use of tests translated from English to Spanish. The use of translated naming tests for evaluating predominantly Spanish-speaking patients has recently been challenged on the grounds that translating test items may compromise a test’s construct validity. The Texas Spanish Naming Test (TNT) has been developed in Spanish specifically for use with Spanish speakers; however, it is unlikely that patients from diverse Spanish-speaking geographical regions will perform uniformly on a naming test. The present study evaluated and compared the internal consistency and patterns of item difficulty and item discrimination for the TNT and two commonly used translated naming tests in three countries (i.e., United States, Colombia, Spain). Two hundred fifty-two subjects (126 demented, 126 nondemented) across three countries were administered the TNT, the Modified Boston Naming Test-Spanish, and the naming subtest from the CERAD. The TNT demonstrated internal consistency superior to that of its counterparts, a more favorable item-difficulty pattern than the CERAD naming test, and a more favorable item-discrimination pattern than the MBNT-S across countries. Overall, all three Spanish naming tests differentiated nondemented and moderately demented individuals, but the results suggest the items of the TNT are most appropriate for use with Spanish speakers. Preliminary normative data for the three tests examined in each country are provided. PMID:19208960
Partially supervised speaker clustering.
Tang, Hao; Chu, Stephen Mingyu; Hasegawa-Johnson, Mark; Huang, Thomas S
2012-05-01
Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment to the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm—linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical model-based distance metrics, 2) our advocated use of the cosine distance metric yields consistent increases in the speaker clustering performance as compared to the commonly used euclidean distance metric, 3) our partially supervised speaker clustering concept and strategies significantly improve the speaker clustering performance over the baselines, and 4) our proposed LSDA algorithm further leads to state-of-the-art speaker clustering performance.
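A minimal sketch of the clustering step, assuming the GMM mean supervectors (stacked MAP-adapted component means, one row per utterance) are already computed; it uses average-linkage agglomerative clustering with the cosine distance and omits the paper's LSDA metric learning, so names and sizes are illustrative:

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_supervectors(supervectors, n_speakers):
        """Agglomeratively cluster GMM mean supervectors with the cosine metric.

        `supervectors` is an (n_utterances, dim) array; each row is the stacked
        MAP-adapted GMM component means for one utterance (assumed precomputed).
        """
        # Cosine distance ignores vector magnitude, exploiting the directional
        # scattering of supervectors described above.
        dists = pdist(supervectors, metric="cosine")
        tree = linkage(dists, method="average")
        return fcluster(tree, t=n_speakers, criterion="maxclust")

    # Toy example: 6 utterances from 2 "speakers" whose supervectors differ
    # mainly in direction, not magnitude.
    rng = np.random.default_rng(0)
    a = rng.normal(size=512)
    b = rng.normal(size=512)
    utts = np.stack([s * rng.uniform(0.5, 2.0) + 0.05 * rng.normal(size=512)
                     for s in (a, a, a, b, b, b)])
    print(cluster_supervectors(utts, n_speakers=2))  # e.g., [1 1 1 2 2 2]

Because the cosine distance depends only on direction, rescaling an utterance's supervector leaves its cluster assignment largely unchanged, which is one way to read the directional scattering observation above.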
Tip of the Tongue States Increase under Evaluative Observation
ERIC Educational Resources Information Center
James, Lori E.; Schmank, Christopher J.; Castro, Nichol; Buchanan, Tony W.
2018-01-01
We tested the frequent assumption that the difficulty of word retrieval increases when a speaker is being observed and evaluated. We modified the Trier Social Stress Test (TSST) so that participants believed that its evaluative observation components continued throughout the duration of a subsequent word retrieval task, and measured participants'…
ERIC Educational Resources Information Center
Rodriguez-Parra, Maria J.; Adrian, Jose A.; Casado, Juan C.
2011-01-01
Purpose: This study evaluates the effectiveness of two different programs of voice-treatment on a heterogeneous group of dysphonic speakers and the stability of therapeutic progress for longterm follow-up post-treatment period, using a limited multidimensional protocol of evaluation. Method: Forty-two participants with voice disorders were…
Deaf-And-Mute Sign Language Generation System
NASA Astrophysics Data System (ADS)
Kawai, Hideo; Tamura, Shinichi
1984-08-01
We have developed a system which can recognize speech and generate the corresponding animation-like sign language sequence. The system is implemented on a popular personal computer. It has three video-RAMs and a voice recognition board which can recognize only the registered voice of a specific speaker. Presently, forty sign language patterns and fifty finger spellings are stored on two floppy disks. Each sign pattern is composed of one to four sub-patterns. That is, if the pattern is composed of one sub-pattern, it is displayed as a still pattern. If not, it is displayed as a motion pattern. This system will help communication between deaf-and-mute persons and hearing persons. To display at high speed, most of the programs are written in machine language.
ERIC Educational Resources Information Center
Subtirelu, Nicholas Close; Lindemann, Stephanie
2016-01-01
While most research in applied linguistics has focused on second language (L2) speakers and their language capabilities, the success of interaction between such speakers and first language (L1) speakers also relies on the positive attitudes and communication skills of the L1 speakers. However, some research has suggested that many L1 speakers lack…
Temporal and acoustic characteristics of Greek vowels produced by adults with cerebral palsy
NASA Astrophysics Data System (ADS)
Botinis, Antonis; Orfanidou, Ioanna; Fourakis, Marios
2005-09-01
The present investigation examined the temporal and spectral characteristics of Greek vowels as produced by speakers with intact (NO) versus cerebral palsy affected (CP) neuromuscular systems. Six NO and six CP native speakers of Greek produced the Greek vowels [i, e, a, o, u] in the first syllable of CVCV nonsense words in a short carrier phrase. Stress could be on either the first or second syllable. There were three female and three male speakers in each group. In terms of temporal characteristics, the results showed that: vowels produced by CP speakers were longer than vowels produced by NO speakers; stressed vowels were longer than unstressed vowels; vowels produced by female speakers were longer than vowels produced by male speakers. In terms of spectral characteristics the results showed that the vowel space of the CP speakers was smaller than that of the NO speakers. This is similar to the results recently reported by Liu et al. [J. Acoust. Soc. Am. 117, 3879-3889 (2005)] for CP speakers of Mandarin. There was also a reduction of the acoustic vowel space defined by unstressed vowels, but this reduction was much more pronounced in the vowel productions of CP speakers than NO speakers.
Consistency between verbal and non-verbal affective cues: a clue to speaker credibility.
Gillis, Randall L; Nilsen, Elizabeth S
2017-06-01
Listeners are exposed to inconsistencies in communication; for example, when speakers' words (i.e. verbal) are discrepant with their demonstrated emotions (i.e. non-verbal). Such inconsistencies introduce ambiguity, which may render a speaker to be a less credible source of information. Two experiments examined whether children make credibility discriminations based on the consistency of speakers' affect cues. In Experiment 1, school-age children (7- to 8-year-olds) preferred to solicit information from consistent speakers (e.g. those who provided a negative statement with negative affect), over novel speakers, to a greater extent than they preferred to solicit information from inconsistent speakers (e.g. those who provided a negative statement with positive affect) over novel speakers. Preschoolers (4- to 5-year-olds) did not demonstrate this preference. Experiment 2 showed that school-age children's ratings of speakers were influenced by speakers' affect consistency when the attribute being judged was related to information acquisition (speakers' believability, "weird" speech), but not general characteristics (speakers' friendliness, likeability). Together, findings suggest that school-age children are sensitive to, and use, the congruency of affect cues to determine whether individuals are credible sources of information.
Ben-David, Boaz M; Icht, Michal
2017-05-01
Oral-diadochokinesis (oral-DDK) tasks are extensively used in the evaluation of motor speech abilities. Currently, validated normative data for older adults (aged 65 years and older) are missing in Hebrew. The effect of task stimuli (non-word versus real-word repetition) is also unclear in the population of older adult Hebrew speakers. (1) To establish a norm for oral-DDK rate for older adult (aged 65 years and older) Hebrew speakers, and to investigate the possible effect of age and gender on performance rate; and (2) to examine the effects of stimuli (non-word versus real word) on oral-DDK rates. In experiment 1, 88 healthy older Hebrew speakers (60-95 years, 48 females and 40 males) were audio-recorded while performing an oral-DDK task (repetition of /pataka/), and repetition rates (syllables/s) were coded. In experiment 2, the effect of real-word repetition was evaluated. Sixty-eight older Hebrew speakers (aged 66-95 years, 43 females and 25 males) were asked to repeat 'pataka' (non-word) and 'bodeket' (Hebrew real word). Experiment 1: Oral-DDK performance for older adult Hebrew speakers was 5.07 syllables/s (SD = 1.16 syllables/s), across age groups and gender. Comparison of these data with Hebrew norms for younger adults (and equivalent data in English) shows the following gradient of oral-DDK rates: ages 15-45 > 65-74 > 75-86 years. Gender was not a significant factor in our data. Experiment 2: Repetition of real words was faster than that of non-words, by 13.5%. The paper provides normative values for oral-DDK rates for older Hebrew speakers. The data show the large impact of ageing on oro-motor functions. The analysis further indicates that speech and language pathologists should consider separate norms for clients of 65-74 years and those of 75-86 years. Hebrew rates were found to be different from English norms for the oldest group, shedding light on the impact of language on these norms. Finally, the data support using a dual protocol (real- and non-word repetition) with older adults to improve differential diagnosis of normal and pathological ageing in this task. © 2016 Royal College of Speech and Language Therapists.
Potts, Lisa G.; Skinner, Margaret W.; Litovsky, Ruth A.; Strube, Michael J; Kuk, Francis
2010-01-01
Background The use of bilateral amplification is now common clinical practice for hearing aid users but not for cochlear implant recipients. In the past, most cochlear implant recipients were implanted in one ear and wore only a monaural cochlear implant processor. There has been recent interest in benefits arising from bilateral stimulation that may be present for cochlear implant recipients. One option for bilateral stimulation is the use of a cochlear implant in one ear and a hearing aid in the opposite nonimplanted ear (bimodal hearing). Purpose This study evaluated the effect of wearing a cochlear implant in one ear and a digital hearing aid in the opposite ear on speech recognition and localization. Research Design A repeated-measures correlational study was completed. Study Sample Nineteen adult Cochlear Nucleus 24 implant recipients participated in the study. Intervention The participants were fit with a Widex Senso Vita 38 hearing aid to achieve maximum audibility and comfort within their dynamic range. Data Collection and Analysis Soundfield thresholds, loudness growth, speech recognition, localization, and subjective questionnaires were obtained six–eight weeks after the hearing aid fitting. Testing was completed in three conditions: hearing aid only, cochlear implant only, and cochlear implant and hearing aid (bimodal). All tests were repeated four weeks after the first test session. Repeated-measures analysis of variance was used to analyze the data. Significant effects were further examined using pairwise comparison of means or in the case of continuous moderators, regression analyses. The speech-recognition and localization tasks were unique, in that a speech stimulus presented from a variety of roaming azimuths (140 degree loudspeaker array) was used. Results Performance in the bimodal condition was significantly better for speech recognition and localization compared to the cochlear implant–only and hearing aid–only conditions. Performance was also different between these conditions when the location (i.e., side of the loudspeaker array that presented the word) was analyzed. In the bimodal condition, the speech-recognition and localization tasks were equal regardless of which side of the loudspeaker array presented the word, while performance was significantly poorer for the monaural conditions (hearing aid only and cochlear implant only) when the words were presented on the side with no stimulation. Binaural loudness summation of 1–3 dB was seen in soundfield thresholds and loudness growth in the bimodal condition. Measures of the audibility of sound with the hearing aid, including unaided thresholds, soundfield thresholds, and the Speech Intelligibility Index, were significant moderators of speech recognition and localization. Based on the questionnaire responses, participants showed a strong preference for bimodal stimulation. Conclusions These findings suggest that a well-fit digital hearing aid worn in conjunction with a cochlear implant is beneficial to speech recognition and localization. The dynamic test procedures used in this study illustrate the importance of bilateral hearing for locating, identifying, and switching attention between multiple speakers. It is recommended that unilateral cochlear implant recipients, with measurable unaided hearing thresholds, be fit with a hearing aid. PMID:19594084
Multi-Lingual Deep Neural Networks for Language Recognition
2016-08-08
training configurations for the NIST 2011 and 2015 language recognition evaluations (LRE11 and LRE15). The best performing multi-lingual BN-DNN...very effective approach in the NIST 2015 language recognition evaluation (LRE15) open training condition [4, 5]. In this work we evaluate the impact...language are summarized in Table 2. Two language recognition tasks are used for evaluating the multi-lingual bottleneck systems. The first is the NIST
"Chrysalis -- A Time for Change."
ERIC Educational Resources Information Center
Orloff, Jeffrey H., Comp.
Presented is an evaluation of the first annual Northern Virginia Conference on Gifted/Talented Education held April 30 - May 1, 1976. Listed are details of the agenda, keynote speakers, mini-lab leaders, mini-lab sessions, conference participants, budget, and geographical areas represented. Evaluation information is provided on specific program…
The Influence of Social Style in Evaluating Academic Presentations of Engineering Projects
ERIC Educational Resources Information Center
Ortiz València, Héctor; García Carrillo, Águeda; González Benítez, Margarita
2012-01-01
An individual's social style is determined by behavioral patterns in the interactions with their peers. Some studies suggest that social style may influence the way in which an individual's performance is evaluated. We studied the effects that speakers' and evaluators' social styles have on the marks given for end-of-term presentations in a…
The effect of tonal changes on voice onset time in Mandarin esophageal speech.
Liu, Hanjun; Ng, Manwa L; Wan, Mingxi; Wang, Supin; Zhang, Yi
2008-03-01
The present study investigated the effect of tonal changes on voice onset time (VOT) in normal laryngeal (NL) and superior esophageal (SE) speakers of Mandarin Chinese. VOT values were measured from the syllables /pha/, /tha/, and /kha/ produced at four tone levels by eight NL and seven SE speakers who were native speakers of Mandarin. Results indicated that Mandarin tones were associated with significantly different VOT values for NL speakers, in which the high-falling tone was associated with significantly shorter VOT values than the mid-rising tone and the falling-rising tone. Regarding speaker group, SE speakers showed significantly shorter VOT values than NL speakers across all tone levels. This may be related to their use of the pharyngoesophageal (PE) segment as an alternative sound source. SE speakers appear to take a shorter time to start PE-segment vibration than NL speakers, who use the vocal folds for vibration.
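VOT itself is simply the interval from the release burst of the stop to the onset of voicing. In practice it is usually measured by hand from waveforms and spectrograms; the Python sketch below is a rough automatic approximation under stated assumptions, not the authors' procedure. It locates the burst as the first energy jump above the initial noise floor and voicing as the first later window with strong periodicity at a plausible F0.

    import numpy as np

    def estimate_vot_ms(x, sr, hop_ms=2.0, win_ms=30.0):
        """Rough automatic voice onset time (ms) for an isolated CV syllable.

        Assumes the recording starts with ~20 ms of silence before the burst;
        if no clear burst is found, the result is unreliable (argmax falls
        back to frame 0). Published VOT studies verify landmarks by hand.
        """
        hop = int(sr * hop_ms / 1000.0)
        win = int(sr * win_ms / 1000.0)
        starts = range(0, len(x) - win, hop)
        # Short-time energy per hop; burst = first jump well above the floor.
        energy = np.array([np.sum(x[s:s + hop] ** 2) for s in starts])
        floor = energy[:10].mean() + 1e-12
        burst = int(np.argmax(energy > 20.0 * floor))
        for j, s in enumerate(starts):
            if j <= burst:
                continue
            seg = x[s:s + win] - x[s:s + win].mean()
            # Autocorrelation peak in the 70-400 Hz lag range signals voicing.
            ac = np.correlate(seg, seg, mode="full")[win - 1:]
            peak = ac[int(sr / 400):int(sr / 70)]
            if ac[0] > 0 and peak.size and peak.max() / ac[0] > 0.4:
                return (j - burst) * hop_ms
        return None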
Ng, Manwa L; Chen, Yang
2011-12-01
The present study examined English sentence stress produced by native Cantonese speakers who were speaking English as a second language (ESL). Cantonese ESL speakers' proficiency in English stress production, as perceived by English-speaking listeners, was also studied. Acoustical parameters associated with sentence stress, including fundamental frequency (F0), vowel duration, and intensity, were measured from the English sentences produced by 40 Cantonese ESL speakers. Data were compared with those obtained from 40 native speakers of American English. The speech samples were also judged by eight listeners who were native speakers of American English for placement, degree, and naturalness of stress. Results showed that Cantonese ESL speakers were able to use F0, vowel duration, and intensity to differentiate sentence stress patterns. Yet, both female and male Cantonese ESL speakers exhibited consistently higher F0 in stressed words than English speakers. Overall, Cantonese ESL speakers were found to be proficient in using duration and intensity to signal sentence stress, in a way comparable with English speakers. In addition, F0 and intensity were found to correlate closely with perceptual judgements, and the degree of stress correlated with the naturalness of stress.
Voice emotion recognition by cochlear-implanted children and their normally-hearing peers
Chatterjee, Monita; Zion, Danielle; Deroche, Mickael L.; Burianek, Brooke; Limb, Charles; Goren, Alison; Kulkarni, Aditya M.; Christensen, Julie A.
2014-01-01
Despite their remarkable success in bringing spoken language to hearing impaired listeners, the signal transmitted through cochlear implants (CIs) remains impoverished in spectro-temporal fine structure. As a consequence, pitch-dominant information such as voice emotion, is diminished. For young children, the ability to correctly identify the mood/intent of the speaker (which may not always be visible in their facial expression) is an important aspect of social and linguistic development. Previous work in the field has shown that children with cochlear implants (cCI) have significant deficits in voice emotion recognition relative to their normally hearing peers (cNH). Here, we report on voice emotion recognition by a cohort of 36 school-aged cCI. Additionally, we provide for the first time, a comparison of their performance to that of cNH and NH adults (aNH) listening to CI simulations of the same stimuli. We also provide comparisons to the performance of adult listeners with CIs (aCI), most of whom learned language primarily through normal acoustic hearing. Results indicate that, despite strong variability, on average, cCI perform similarly to their adult counterparts; that both groups’ mean performance is similar to aNHs’ performance with 8-channel noise-vocoded speech; that cNH achieve excellent scores in voice emotion recognition with full-spectrum speech, but on average, show significantly poorer scores than aNH with 8-channel noise-vocoded speech. A strong developmental effect was observed in the cNH with noise-vocoded speech in this task. These results point to the considerable benefit obtained by cochlear-implanted children from their devices, but also underscore the need for further research and development in this important and neglected area. PMID:25448167
Measures to Evaluate the Effects of DBS on Speech Production
Weismer, Gary; Yunusova, Yana; Bunton, Kate
2011-01-01
The purpose of this paper is to review and evaluate measures of speech production that could be used to document effects of Deep Brain Stimulation (DBS) on speech performance, especially in persons with Parkinson disease (PD). A small set of evaluative criteria for these measures is presented first, followed by consideration of several speech physiology and speech acoustic measures that have been studied frequently and reported on in the literature on normal speech production, and speech production affected by neuromotor disorders (dysarthria). Each measure is reviewed and evaluated against the evaluative criteria. Embedded within this review and evaluation is a presentation of new data relating speech motions to speech intelligibility measures in speakers with PD, amyotrophic lateral sclerosis (ALS), and control speakers (CS). These data are used to support the conclusion that at the present time the slope of second formant transitions (F2 slope), an acoustic measure, is well suited to make inferences to speech motion and to predict speech intelligibility. The use of other measures should not be ruled out, however, and we encourage further development of evaluative criteria for speech measures designed to probe the effects of DBS or any treatment with potential effects on speech production and communication skills. PMID:24932066
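The F2 slope mentioned above is simply the rate of change of the second formant across a vowel transition; given a tracked F2 contour, it can be obtained from a straight-line fit. A minimal sketch, assuming an F2 track in Hz sampled at known times (illustrative values, not the authors' data or exact analysis):

    import numpy as np

    # Times (s) and F2 values (Hz) across a vowel transition,
    # e.g., from a formant tracker.
    t = np.array([0.00, 0.01, 0.02, 0.03, 0.04])
    f2 = np.array([1200.0, 1350.0, 1480.0, 1600.0, 1700.0])

    # First-order least-squares fit; slope converted from Hz/s to Hz/ms.
    slope_hz_per_ms = np.polyfit(t, f2, 1)[0] / 1000.0
    print(round(slope_hz_per_ms, 1))  # 12.5 Hz/ms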
Holistic Speaker Evaluation--A Review and Discussion.
ERIC Educational Resources Information Center
Shelton, Karen; Shelton, Michael W.
The question of what variables affect success in debate has long been an area of interest and concern in the forensic community. For many years, it was thought that traditional performance variables--delivery, reasoning, organization, analysis, refutation and use of evidence--were the key factors influencing evaluations of debaters. Some…
The Speaker Gender Gap at Critical Care Conferences.
Mehta, Sangeeta; Rose, Louise; Cook, Deborah; Herridge, Margaret; Owais, Sawayra; Metaxa, Victoria
2018-06-01
To review women's participation as faculty at five critical care conferences over 7 years. Retrospective analysis of five scientific programs to identify the proportion of females and each speaker's profession based on conference conveners, program documents, or internet research. Three international (European Society of Intensive Care Medicine, International Symposium on Intensive Care and Emergency Medicine, Society of Critical Care Medicine) and two national (Critical Care Canada Forum, U.K. Intensive Care Society State of the Art Meeting) annual critical care conferences held between 2010 and 2016. Female faculty speakers. None. Male speakers outnumbered female speakers at all five conferences, in all 7 years. Overall, women represented 5-31% of speakers, and female physicians represented 5-26% of speakers. Nursing and allied health professional faculty represented 0-25% of speakers; in general, more than 50% of allied health professionals were women. Over the 7 years, Society of Critical Care Medicine had the highest representation of female (27% overall) and nursing/allied health professional (16-25%) speakers; notably, male physicians substantially outnumbered female physicians in all years (62-70% vs 10-19%, respectively). Women's representation on conference program committees ranged from 0% to 40%, with Society of Critical Care Medicine having the highest representation of women (26-40%). The female proportions of speakers, physician speakers, and program committee members increased significantly over time at the Society of Critical Care Medicine and U.K. Intensive Care Society State of the Art Meeting conferences (p < 0.05), but there was no temporal change at the other three conferences. There is a speaker gender gap at critical care conferences, with male faculty outnumbering female faculty. This gap is more marked among physician speakers than those speakers representing nursing and allied health professionals. Several organizational strategies can address this gender gap.
Reflecting on Native Speaker Privilege
ERIC Educational Resources Information Center
Berger, Kathleen
2014-01-01
The issues surrounding native speakers (NSs) and nonnative speakers (NNSs) as teachers (NESTs and NNESTs, respectively) in the field of teaching English to speakers of other languages (TESOL) are a current topic of interest. In many contexts, the native speaker of English is viewed as the model teacher, thus putting the NEST into a position of…
ERIC Educational Resources Information Center
Kersten, Alan W.; Meissner, Christian A.; Lechuga, Julia; Schwartz, Bennett L.; Albrechtsen, Justin S.; Iglesias, Adam
2010-01-01
Three experiments provide evidence that the conceptualization of moving objects and events is influenced by one's native language, consistent with linguistic relativity theory. Monolingual English speakers and bilingual Spanish/English speakers tested in an English-speaking context performed better than monolingual Spanish speakers and bilingual…
Evitts, Paul M; Starmer, Heather; Teets, Kristine; Montgomery, Christen; Calhoun, Lauren; Schulze, Allison; MacKenzie, Jenna; Adams, Lauren
2016-11-01
There is currently minimal information on the impact of dysphonia secondary to phonotrauma on listeners. Considering the high incidence of voice disorders among professional voice users, it is important to understand the impact of a dysphonic voice on their audiences. Ninety-one healthy listeners (39 men, 52 women; mean age = 23.62 years) were presented with speech stimuli from 5 healthy speakers and 5 speakers diagnosed with dysphonia secondary to phonotrauma. Dependent variables included processing speed (reaction time [RT] ratio), speech intelligibility, and listener comprehension. Voice quality ratings were also obtained for all speakers from 3 expert listeners. Statistical results showed significant differences in RT ratio and number of speech intelligibility errors between healthy and dysphonic voices. There was not a significant difference in listener comprehension errors. Multiple regression analyses showed that voice quality ratings from the Consensus Auditory-Perceptual Evaluation of Voice (Kempster, Gerratt, Verdolini Abbott, Barkmeier-Kraemer, & Hillman, 2009) were able to predict RT ratio and speech intelligibility but not listener comprehension. Results of the study suggest that although listeners require more time to process, and make more intelligibility errors on, speech stimuli from speakers with dysphonia secondary to phonotrauma, listener comprehension may not be affected.
Performance enhancement for audio-visual speaker identification using dynamic facial muscle model.
Asadpour, Vahid; Towhidkhah, Farzad; Homayounpour, Mohammad Mehdi
2006-10-01
The science of human identification using physiological characteristics, or biometrics, has been of great concern in security systems. However, robust multimodal identification systems based on audio-visual information have not been thoroughly investigated yet. Therefore, the aim of this work is to propose a model-based feature extraction method that employs the physiological characteristics of the facial muscles producing lip movements. This approach adopts intrinsic muscle properties such as viscosity, elasticity, and mass, which are extracted from a dynamic lip model. These parameters are exclusively dependent on the neuromuscular properties of the speaker; consequently, imitation of valid speakers could be reduced to a large extent. The parameters are applied to a hidden Markov model (HMM) audio-visual identification system. In this work, a combination of audio and video features has been employed by adopting a multistream pseudo-synchronized HMM training method. Noise-robust audio features such as Mel-frequency cepstral coefficients (MFCC), spectral subtraction (SS), and relative spectral perceptual linear prediction (J-RASTA-PLP) were used to evaluate the performance of the multimodal system when efficient audio feature extraction methods are utilized. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits, along with a sentence that is phonetically rich. To evaluate the robustness of the algorithms, some experiments were performed on genetically identical twins. Furthermore, changes in speaker voice were simulated with drug inhalation tests. At a 3 dB signal-to-noise ratio (SNR), the dynamic muscle model improved the identification rate of the audio-visual system from 91 to 98%. Results on identical twins revealed an apparent performance improvement for the dynamic muscle model-based system, whose audio-visual identification rate was enhanced from 87 to 96%.
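As a small illustration of the acoustic front end mentioned above, the following Python lines compute 13 MFCCs with librosa and apply cepstral mean normalization as a simple channel-robustness step. The file name is a placeholder, and the normalization stands in for, rather than reproduces, the paper's spectral subtraction and J-RASTA-PLP processing.

    import librosa

    # Load an utterance; the path is illustrative.
    y, sr = librosa.load("digits_utterance.wav", sr=16000)

    # 13 MFCCs per frame: the standard acoustic front end such systems build on.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Cepstral mean normalization: subtracting the per-coefficient mean removes
    # stationary channel effects (a simple stand-in for SS / J-RASTA-PLP).
    mfcc_cmn = mfcc - mfcc.mean(axis=1, keepdims=True)
    print(mfcc_cmn.shape)  # (13, n_frames)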
Linguistic Stereotyping in Older Adults' Perceptions of Health Care Aides.
Rubin, Donald; Coles, Valerie Berenice; Barnett, Joshua Trey
2016-07-01
The cultural and linguistic diversity of the U.S. health care provider workforce is expanding. Diversity among health care personnel such as paraprofessional health care assistants (HCAs), many of whom are immigrants, means that intimate, high-stakes cross-cultural and cross-linguistic contact characterizes many health interactions. In particular, nonmainstream HCAs may face negative patient expectations because of patients' language stereotypes. In other contexts, reverse linguistic stereotyping has been shown to result in negative speaker evaluations and even reduced listening comprehension quite independently of the actual language performance of the speaker. The present study extends the language and attitude paradigm to older adults' perceptions of HCAs. Listeners heard the identical speaker of Standard American English as they watched interactions between an HCA and an older patient. Ethnolinguistic identities, either an Anglo native speaker of English or a Mexican nonnative speaker, were ascribed to HCAs by means of fabricated personnel files. Dependent variables included measures of perceived HCA language proficiency, personal characteristics, and professional competence, as well as listeners' comprehension of a health message delivered by the putative HCA. For most of these outcomes, moderate effect sizes were found such that the HCA with an ascribed Anglo identity, relative to the Mexican guise, was judged more proficient in English, socially superior, interpersonally more attractive, more dynamic, and a more satisfactory home health aide. No difference in listening comprehension emerged, but the Anglo guise tended to engender a more compliant listening mind set. Results of this study can inform both provider-directed and patient-directed efforts to improve health care services for members of all linguistic and cultural groups.
Reasoning about knowledge: Children’s evaluations of generality and verifiability
Koenig, Melissa A.; Cole, Caitlin A.; Meyer, Meredith; Ridge, Katherine E.; Kushnir, Tamar; Gelman, Susan A.
2015-01-01
In a series of experiments, we examined 3- to 8-year-old children’s (N = 223) and adults’ (N = 32) use of two properties of testimony to estimate a speaker’s knowledge: generality and verifiability. Participants were presented with a “Generic speaker” who made a series of 4 general claims about “pangolins” (a novel animal kind), and a “Specific speaker” who made a series of 4 specific claims about “this pangolin” as an individual. To investigate the role of verifiability, we systematically varied whether the claim referred to a perceptually-obvious feature visible in a picture (e.g., “has a pointy nose”) or a non-evident feature that was not visible (e.g., “sleeps in a hollow tree”). Three main findings emerged: (1) Young children showed a pronounced reliance on verifiability that decreased with age. Three-year-old children were especially prone to credit knowledge to speakers who made verifiable claims, whereas 7- to 8-year-olds and adults credited knowledge to generic speakers regardless of whether the claims were verifiable; (2) Children’s attributions of knowledge to generic speakers was not detectable until age 5, and only when those claims were also verifiable; (3) Children often generalized speakers’ knowledge outside of the pangolin domain, indicating a belief that a person’s knowledge about pangolins likely extends to new facts. Findings indicate that young children may be inclined to doubt speakers who make claims they cannot verify themselves, as well as a developmentally increasing appreciation for speakers who make general claims. PMID:26451884
Real-Time Control of an Articulatory-Based Speech Synthesizer for Brain Computer Interfaces
Bocquelet, Florent; Hueber, Thomas; Girin, Laurent; Savariaux, Christophe; Yvert, Blaise
2016-01-01
Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real-time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real-time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real-time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the position of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained as assessed by perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open to future speech BCI applications using such articulatory-based speech synthesizer. PMID:27880768
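The core of such a synthesizer is a learned frame-by-frame regression from articulator positions to acoustic parameters. The PyTorch sketch below shows that mapping in miniature; the layer sizes, input dimensionality (six 2-D EMA sensors), output dimensionality, and random stand-in data are illustrative assumptions, not the published configuration.

    import torch
    from torch import nn

    # Feedforward network mapping one frame of EMA sensor coordinates
    # (e.g., 6 sensors x 2-D = 12 inputs) to vocoder parameters
    # (e.g., 25 spectral features); sizes are illustrative.
    model = nn.Sequential(
        nn.Linear(12, 256), nn.Tanh(),
        nn.Linear(256, 256), nn.Tanh(),
        nn.Linear(256, 25),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    # ema / acoustic would come from synchronized EMA-audio recordings of the
    # reference speaker; random tensors stand in here.
    ema = torch.randn(1024, 12)
    acoustic = torch.randn(1024, 25)

    for step in range(200):  # minimal training loop
        optimizer.zero_grad()
        loss = loss_fn(model(ema), acoustic)
        loss.backward()
        optimizer.step()

    # At run time, each incoming EMA frame is mapped to acoustic parameters
    # and handed to a vocoder; a new speaker would first be aligned to the
    # reference speaker during a short calibration phase.
    with torch.no_grad():
        params = model(torch.randn(1, 12))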
Gonzalez, Laura; Negrón, Rosalyn; Berry, Donna L.
2014-01-01
Spanish speakers in the United States encounter numerous communication barriers during cancer treatment. Communication-focused interventions may help Spanish speakers communicate better with healthcare providers and manage symptoms and quality of life issues (SQOL). For this study, we developed a Spanish version of the electronic self-report assessment for cancer (ESRA-C), a web-based program that helps people with cancer report, track, and manage cancer-related SQOL. Four methods were used to evaluate the Spanish version. Focus groups and cognitive interviews were conducted with 51 Spanish-speaking individuals to elicit feedback. Readability was assessed using the Fry readability formula. The cultural sensitivity assessment tool was applied by three bilingual, bicultural reviewers. Revisions were made to personalize the introduction using a patient story and photos and to simplify language. Focus group participants endorsed changes to the program in a second round of focus groups. Cultural sensitivity of the program was scored unacceptable (x̄ = 3.0) for audiovisual material and acceptable (x̄ = 3.0) for written material. Fry reading levels ranged from 4th to 10th grade. Findings from this study provide several next steps to refine ESRA-C for Spanish speakers with cancer. PMID:25045535
Evaluating language environment analysis system performance for Chinese: a pilot study in Shanghai.
Gilkerson, Jill; Zhang, Yiwen; Xu, Dongxin; Richards, Jeffrey A; Xu, Xiaojuan; Jiang, Fan; Harnsberger, James; Topping, Keith
2015-04-01
The purpose of this study was to evaluate performance of the Language Environment Analysis (LENA) automated language-analysis system for the Chinese Shanghai dialect and Mandarin (SDM) languages. Volunteer parents of 22 children aged 3-23 months were recruited in Shanghai. Families provided daylong in-home audio recordings using LENA. A native speaker listened to 15 min of randomly selected audio samples per family to label speaker regions and provide Chinese character and SDM word counts for adult speakers. LENA segment labeling and counts were compared with rater-based values. LENA demonstrated good sensitivity in identifying adult and child speech, comparable to that of American English validation samples. Precision was strong for adults but less so for children. The LENA adult word count correlated strongly with both Chinese character and SDM word counts. LENA conversational turn counts correlated similarly with rater-based counts after the exclusion of three unusual samples. Performance was related to some degree to child age. LENA adult word counts and conversational turns provided reasonably accurate estimates for SDM over the age range tested. Theoretical and practical considerations regarding LENA performance in non-English languages are discussed. Despite the pilot nature and other limitations of the study, the results are promising for broader cross-linguistic applications.